Page MenuHomePhabricator

Categorymembers namespace filtering is inefficient, uses ugly hack in miser mode
Closed, DeclinedPublic

Description

On r53052 domas disabled querying specific namespace if miser mode is enabled.

The query now produce erroneus results.
*It silently skips the page_namespace where clause instead of giving an error.
*It's not documented on api.php nor www.mediawiki.org

Either 'fix' the above issues or implement whatever domas told about how to have efficient subcategory access.


Version: 1.16.x
Severity: normal

Details

Reference
bz19640

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 10:38 PM
bzimport set Reference to bz19640.

cbm.wikipedia wrote:

This change is now live on the WMF servers, giving results that conflict with the documentation dynamically generated by api.php.

(In reply to comment #0)

Either 'fix' the above issues or implement whatever domas told about how to
have efficient subcategory access.

It would be nice if any of the rest of us could even '''know''' whatever domas told.

It would be nice if any of the rest of us could even '''know''' whatever domas
told.

I only know what is in r53052 comment.

I made a couple tweaks to the disabling in r53087 so that it would die and give an error, rather than silently ignoring it, but it hasn't yet been synced to Wikimedia.

herd wrote:

(In reply to comment #3)

It would be nice if any of the rest of us could even '''know''' whatever domas
told.

I only know what is in r53052 comment.

From http://toolserver.org/~amidaniel/chanlogs/%23mediawiki/20090710.txt :

[19:42:35] <domas> that was ages ago
[19:42:49] <domas> there're few ways how to provide subcategories
[19:43:04] <domas> it is simple
[19:43:09] <domas> you just structure data that way

(more in the log, well... more of domas' brand of humor, heh)

(In reply to comment #5)

From http://toolserver.org/~amidaniel/chanlogs/%23mediawiki/20090710.txt :

[19:42:35] <domas> that was ages ago
[19:42:49] <domas> there're few ways how to provide subcategories
[19:43:04] <domas> it is simple
[19:43:09] <domas> you just structure data that way

(more in the log, well... more of domas' brand of humor, heh)

Isn't the data already structured "that way", i.e. each page is specifically tagged with a namespace number? And in that log, when asked about it, he says the nothing quoted above and then starts complaining about multi-language templates on commons, which seems to have nothing to do with the API's cmnamespace.

I must be missing something big in there, hopefully someone who understands this will come by to enlighten me. And I still have to wonder how something that's been in the code since '''June 2007''' is suddenly such a problem that requires breaking things in the hastiest possible way.

Created attachment 6341
Patch against r53286 to apply cmnamespace in php rather than sql (and rather than just ignoring it with a warning)

Rather than just ignoring cmnamespace (even with a warning), why not filter by cmnamespace in the PHP code that builds the result dataset when $wgMiserMode is enabled?

While this could result in the categorymembers node being empty if the requested namespaces are particularly sparse in the category, that sort of thing can happen anyway since r46845 so it's much less of a breaking change. It also would save bandwidth in the same circumstances, by not returning 5000 results the client doesn't want; when categorymembers is being used as a generator, it would also save having the generated modules querying those unwanted pages.

attachment diff ignored as obsolete

well, proper fix would be prefixing categories sortkey with something what sorts high up.
brion thought it was too dirty, so the change didn't get much traction, and we have current situation.

Brad's PHP filtering is... oh... well... I won't comment ;-)

(In reply to comment #8)

well, proper fix would be prefixing categories sortkey with something what
sorts high up.
brion thought it was too dirty, so the change didn't get much traction, and we
have current situation.

Brad's PHP filtering is... oh... well... I won't comment ;-)

...a dirty, temporary hack, but good enough for now. Committed in r53304. Of course this filtering should be moved to the database side soon.

  • Bug 19816 has been marked as a duplicate of this bug. ***

Okay, this is just FUBAR. If we don't currently have an efficient way of getting the subcategories of a category, then one should be created. Unlike Domas, I'm no DB expert, but I'd suggest adding a cl_fromns (?) field to the categorylinks table and creating new indexes to use it. (Unfortunately, I don't think MySQL is smart enough to use those indexes for queries that don't involve the namespace, so we'd either have to keep some redundant indexes or use the UNION hack. Domas can probably say which one he thinks would be the lesser evil.) Or just use the sortkey hack that Domas originally suggested.

By the way, doesn't the CategoryTree extension have the exact same performance problem, or is the fact that it uses memcached enough to avoid any issues there?

herd wrote:

*** Bug 20072 has been marked as a duplicate of this bug. ***

  • Bug 20072 has been marked as a duplicate of this bug. ***

Anyway, I'm reopening this (with a slightly changed summary) because the temporary hack applied in r53304 really is ugly and inefficient and should be replaced by a better fix (e.g. adding new fields and indexes to the categorylinks table, or using the sortkey hack suggested by Domas).

  • Bug 24354 has been marked as a duplicate of this bug. ***

mysql> DESCRIBE SELECT /* ApiQueryCategoryMembers::run */ cl_from,cl_sortkey,page_namespace,page_title,page_id FROM mw_page,mw_categorylinks FORCE INDEX (cl_sortkey) WHERE (cl_from=page_id) AND cl_to = 'Oh_rly' ORDER BY cl_sortkey, cl_from LIMIT 11\G
+----+-------------+------------------+--------+---------------+------------+---------+---------------------------------+------+------------------------------------------+

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra

+----+-------------+------------------+--------+---------------+------------+---------+---------------------------------+------+------------------------------------------+

1SIMPLEmw_categorylinksrefcl_sortkeycl_sortkey257const2Using where; Using index; Using filesort
1SIMPLEmw_pageeq_refPRIMARYPRIMARY4wikidb.mw_categorylinks.cl_from1

+----+-------------+------------------+--------+---------------+------------+---------+---------------------------------+------+------------------------------------------+
2 rows in set (0.00 sec)

mysql>

Was wondering.

Just removing the order by removes the filesort ;)

Bryan.TongMinh wrote:

The proper order by for the cl_sortkey index would be cl_to,cl_type,cl_sortkey

IF that is all that's required to fix it... It's worth adding an index (though, I've got a feeling adding the index might be quite slow especially on wmf et al. And is it still right post the sorting changes?), and backporting...

(In reply to comment #19)

IF that is all that's required to fix it... It's worth adding an index (though,
I've got a feeling adding the index might be quite slow especially on wmf et
al. And is it still right post the sorting changes?), and backporting...

That's not the problem. The problem is that you have to join in the page table to filter by namespace, because the namespace isn't in the categorylinks table.

*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*

RESOLVED LATER

There's nothing actionable by the API at the moment

If there does become something in future, feel free to reopen the bug at that point

Do we have a tracking bug for that or something?

Mentioned in SAL (#wikimedia-operations) [2019-01-24T22:11:55Z] <mutante> wikitech-static attempted to use certbot with --authenticator webroot and --installer apache to make it properly work with certbot renew in the future. it created account in /etc/letsencrypt/ made backup in /root/; challenge fails though because all domains need to serve out of a webroot and there is status.wikimedia.org here as well. (T21640)