
Divide wikis into database lists by approximate size for performance engineering
Closed, Resolved, Public

Description

There are a number of bugs in which small wikis are unfairly impacted by the performance constraints of large wikis. For example, many Special pages have been disabled across all Wikimedia wikis (cf. bug 15434). A small wiki such as ch.wikipedia.org, with 151 content pages, is treated the same as a wiki with over four million content pages. This doesn't make any sense.

This situation is unacceptable. A small wiki should not see a reduced user experience because of the existence of (almost entirely unrelated) wikis that have millions of content pages. We know the approximate sizes involved, so we should be able to safely and sanely tier these wikis (and then periodically check those tiers for accuracy and appropriateness). While we all wish that every wiki could be treated equally, it doesn't make any sense to punish small wikis indefinitely due to circumstances over which they have no control or involvement (i.e., an explosion in growth on a sibling project).

Some stats are available at https://wiki.toolserver.org/view/Wiki_server_assignments. There are other lists at Meta-Wiki, I believe. And I can query the *links tables for size if that's deemed necessary.

As far as I understand this, step one would be to make a set of groupings and then create individual wiki lists. Or perhaps just have a small.dblist or a large.dblist and add conditional statements based on that?

It looks like a small.dblist may already exist, even? Is that a list of small wikis (https://noc.wikimedia.org/conf/small.dblist doesn't load for me)?


Version: unspecified
Severity: enhancement

Details

Reference
bz39667

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:10 AM
bzimport set Reference to bz39667.
bzimport added a subscriber: Unknown Object (MLST).

This looks useful: http://meta.wikimedia.org/wiki/List_of_Wikimedia_projects_by_size

Where should the line be between a large and a small wiki?

(In reply to comment #1)

Where should the line be between a large and a small wiki?

Any number is going to be arbitrary. Maybe the actual first step is to write a maintenance script that can evaluate the size of each wiki in the cluster and then output a file based on their sizes (with a --size flag or something). So it'd be something like "php measureWikis.php --size=10000 > large.dblist"?

Measuring the number of content pages is probably easiest, as it's a stored value (in site_stats) and it gives a decent comparison between wikis (or it should in theory, at least).
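The step proposed above could be sketched like this. This is a minimal, hypothetical Python stand-in for the suggested measureWikis.php; the real script would read ss_good_articles from each wiki's site_stats table, which is simulated here with a plain dict, and the function name and sample counts are illustrative only.

```python
# Hypothetical sketch of the "measure wikis, emit a dblist" step.
# In production this would query site_stats (ss_good_articles or
# ss_total_pages) on each wiki's database; here the counts are a dict.

def make_dblist(article_counts, size):
    """Return dbnames whose content-page count meets the --size
    threshold, one per line, sorted like the *.dblist files."""
    large = [db for db, count in article_counts.items() if count >= size]
    return "\n".join(sorted(large)) + "\n"

counts = {
    "enwiki": 4_000_000,  # over four million content pages
    "dewiki": 1_500_000,
    "chwiki": 151,        # ch.wikipedia.org, from the task description
}

# Analogous to: php measureWikis.php --size=10000 > large.dblist
print(make_dblist(counts, 10000), end="")
```

Writing the script against site_stats (rather than scraping a wiki page) keeps the tiers cheap to regenerate periodically, which matters for the "check those tiers for accuracy" part of the proposal.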

(In reply to comment #1)

This looks useful:
http://meta.wikimedia.org/wiki/List_of_Wikimedia_projects_by_size

Where should the line be between a large and a small wiki?

That Meta page is auto-generated from Special:Statistics, which in turn just reads from the site_stats database table. So (not to be nitpicky), just to be clear: if and when we use a server-side script to create dblist[1] groups by page count, it can query the database directly; there's no need to go through that wiki page.

[1] https://gerrit.wikimedia.org/r/gitweb?p=operations/mediawiki-config.git;a=tree

btw, for technical aspects we should probably use total page count as opposed to article count. That way file pages, categories, and user pages are also taken into account, because as far as the database is concerned, pages and revisions are all the same whether they are articles or not.

Fortunately both total page count and article count are tracked in site_stats.

Marking this as easy. Writing a maintenance script to query the cluster and output the dblist(s) should be trivial.

// Disable all the query pages that take more than about 15 minutes to update
// wgDisableQueryPageUpdate @{
'wgDisableQueryPageUpdate' => array(
	'enwiki' => array(
		'Ancientpages',
		// 'CrossNamespaceLinks', # disabled by hashar - bug 16878
		'Deadendpages',
		'Lonelypages',
		'Mostcategories',
		'Mostlinked',
		'Mostlinkedcategories',
		'Mostlinkedtemplates',
		'Mostrevisions',
		'Fewestrevisions',
		'Uncategorizedcategories',
		'Wantedtemplates',
		'Wantedpages',
	),
	'default' => array(
		'Ancientpages',
		'Deadendpages',
		'Mostlinked',
		'Mostrevisions',
		'Wantedpages',
		'Fewestrevisions',
		// 'CrossNamespaceLinks', # disabled by hashar - bug 16878
	),
),
// @} end of wgDisableQueryPageUpdate

Source: http://noc.wikimedia.org/conf/InitialiseSettings.php.txt. Just pasting this here so I don't lose it.

(In reply to comment #5)

Marking this as easy. Writing a maintenance script to query the cluster and
output the dblist(s) should be trivial.

I've actually just restored small.dblist from the history books.

It's VERY out of date

https://gerrit.wikimedia.org/r/gitweb?p=operations/mediawiki-config.git;a=blob;f=small.dblist;h=5b0a78abf7fe1018576518382cae7a4f5342e422;hb=HEAD

(In reply to comment #7)

(In reply to comment #5)

Marking this as easy. Writing a maintenance script to query the cluster and
output the dblist(s) should be trivial.

I've actually just restored small.dblist from the history books.

I'm not sure what value that provides other than nostalgia. It's a very out-of-date list that needs a maintenance script of some kind to be able to regenerate (update) it. If you want to use "small.dblist" as the name of the small-wikis list for nostalgia's sake (and continuity's sake as well, I suppose), that's fine, I guess. But we're really no closer to resolving this bug.

Created attachment 11366
Sizes!

attachment sizes.txt ignored as obsolete

(In reply to comment #9)

Created attachment 11366 [details]
Sizes!

That's using the value of select ss_good_articles from site_stats

attachment sizes.txt ignored as obsolete

Basic script (work in progress!) to dump all the wikis sorted by ss_good_articles in https://gerrit.wikimedia.org/r/#/c/33694

Created attachment 11379
ss_total_pages


(In reply to comment #13)

Updated https://gerrit.wikimedia.org/r/#/c/33694 moar and added the dblists
to noc conf etc

This change has now been merged.

I wonder what more is needed to resolve this bug.

(In reply to comment #13 by Reedy)

Updated https://gerrit.wikimedia.org/r/#/c/33694 moar and added the dblists
to noc conf etc

Reedy: Any idea what else is needed to resolve this request completely?

Personally (let Max chime in), I would've thought that this was enough.

We've now got a script to make size-related dblists (the parameters might want changing at a later date, but that's trivial). Those dblists have been created and are exposed via noc.
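Once such dblists exist, consuming them is straightforward. A minimal sketch of that pattern (hypothetical function names; the actual wmf-config gating code is not shown in this task, and the dblist contents below are illustrative):

```python
# Hypothetical sketch of consuming a *.dblist file (one dbname per
# line, possibly with blank lines) to gate per-wiki behaviour.

def read_dblist(text):
    """Parse dblist file contents into a set of dbnames."""
    return {line.strip() for line in text.splitlines() if line.strip()}

def query_pages_enabled(dbname, large_wikis):
    """Re-enable expensive query pages only on wikis NOT listed in
    a large.dblist-style set."""
    return dbname not in large_wikis

large = read_dblist("enwiki\ndewiki\n")
print(query_pages_enabled("chwiki", large))  # small wiki: True
print(query_pages_enabled("enwiki", large))  # large wiki: False
```

This is exactly the shape of change bug 15434 would need: the expensive-query-page restrictions stay in place for wikis in the large list, while small wikis get the features back.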

The next task is to potentially do something for bug 15434 using those new lists.

Marking this bug resolved/fixed now that bug 43668 ("Re-enable disabled Special pages on small wikis (wikis in small.dblist)") exists. Thanks again, Reedy!