Page MenuHomePhabricator

dealing with large categories
Closed, ResolvedPublic

Description

Author: hemanshu_desai

Description:
there is no limit currently to the number of pages that can be in a category.
There is also no limit to how big the category page itself is. this leads to
attempts to list thousands of articles on one page.

I want to suggest that it be possible to use the first letter of the article
title as a variable. So it would be possible to edit the template and change it
from [[Category:XYZ]] to [[Category:XYZ:
{{VARIABLE_CONTAINING_FIRST_LETTER_OF_TITLE}}]] This will separate the category
by first letter... alternately this could be extended to first 2 letters.... all
that is required is to create the variable.

The choice of first letter or first few letters may be arbitrary but this could
be seen as a temporary fix until some other solution can be found (perhaps until
category pages can be rendered better?)

The advantage is that only the templates need to be changed once the variable is
available.


Version: 1.3.x
Severity: normal

Details

Reference
bz1058

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 8:07 PM
bzimport set Reference to bz1058.

bugzilla_wikipedia_org.to.jamesd wrote:

This resulted from some discussion of how to handle large categories (which are
typically used from within templates). Largest categories in some big wikis are:

en

Disambiguation16338
GFDL_images13884
Public_domain_images8181
People_stubs6983
Geography_stubs4771

de

Mann22669
Begriffsklärung12601
GFDL-Bild5913
Deutscher5047
Frau3632
Autor3442

fr

Wikipédia:ébauche11090
Homonymie4183
Années2759

(from use jawiki; select cl_to, count(*) as c from categorylinks group by cl_to
order by c desc limit 10;)

In addition, in conjunction with other load, this query took over 140 seconds on
Ariel:

SELECT DISTINCT cur_title,cur_namespace,cl_sortkey FROMcur,categorylinks WHERE
cl_to='Disambiguation' and cl_from=cur_id ORDER (truncated)

The de Mann query took over 160 seconds. It's typically less than this in run
time but it should be switched to run on slaves, ideally before the contemplated
switch to suda as master. LIMIT in this case may be ineffective if the order by
isn't in the order of the index fields being used in the where.

bugzilla_wikipedia_org.to.jamesd wrote:

The proposed variable won't be sufficient. Assuming first letter reduces the
size to 1/10th the current size, a 22,000 entry category will be reduced to an
uncomforably large 2,200 entry category. de has a category "mann" for all men.
Commons has a category for GFDL images, likely to contain half or more of all
images on that site and perhaps more than half of all images on all projects.
For this reason, we need some better approach to large categories.

As a partial solution, variables for the first letter, 2 letters, 3 letters, 4
letters and so on to 9 letters should help. Assuming only 500,000 GFDL images on
Commons and use of these categories the first 4-5 letters would take it to a
tolerable range of members but it's good to be prepared with the rest...:) Not
an ideal solution: better approaches are welcome.

Restructuring the schema in MediaWiki 1.5 will help but probably not enough long
term because the cur and category tables are frequently edited, disabling the
query cache (not database page cache) for those tables. One approach might be to
have reporting copies of categories and cur updated periodically and report from
them. Since the query cache will then work for a while, total load would be
reduced. Allpages subpages would also benefit from this reporting table approach
for cur.

For those unfamiliar with it, the MySQL query cache does an exact string
comparison on a query and if that query has happened recently enough to be in
cache, the cached result is returned, instead of executing the query normally.
All cached results for a query are flushed whenever any table used in the query
is changed.

bugzilla_wikipedia_org.to.jamesd wrote:

This is now substantially improved in MediaWiki 1.4 beta 5. Not
closing yet because there are still cases where these pages take a
long time and I'm observing the change to drop the cur_namespace
index to see if that assists.

Reassigning to jamesday as he is monitoring the issue :)

Bumping this. I know cur is long since gone, so is this as big of a problem with the current schema?

matthew.britton wrote:

(In reply to comment #5)

Bumping this. I know cur is long since gone, so is this as big of a problem
with the current schema?

Categories are only listed 200 pages at a time, so this bug presumably isn't a problem any more. (Though of course there are still many problems with large categories in general).

ayg wrote:

This was long ago fixed by paginating category pages, AFAICT.