
Special:Allpages is too slow!
Closed, ResolvedPublic

Description

The following query was pulled from ishmael:

/* SpecialAllpages::showToplevel */ select page_title from page where page_namespace = ? and page_is_redirect = ? and (page_title >= ?) and (page_title >= ?) order by page_title limit ?

Real sample:

SELECT /* SpecialAllpages::showToplevel 108.35.125.176 */ page_title FROM page WHERE page_namespace = '0' AND page_is_redirect = '0' AND (page_title >= '1887') AND (page_title >= 'Centennial_Trail_State_Park') ORDER BY page_title ASC LIMIT 86786,2

Hundreds of concurrent instances of this have been seen. They touched ~9M rows each, took over 100s, and had problematic clauses like that amazing LIMIT. They are not duplicates, having different page_title values and limit offsets (yet always a limit count of 2).

Furthermore batches of this query have been increasing in frequency.

The query itself comes from:
http://git.wikimedia.org/blob/mediawiki%2Fcore.git/ceba5987b0d36f134ffcd9e4f704d1608fe88b79/includes%2Fspecials%2FSpecialAllpages.php#L225
including that wonderful 'limit ...,2'.

It has this wonderful comment (at http://git.wikimedia.org/blob/mediawiki%2Fcore.git/ceba5987b0d36f134ffcd9e4f704d1608fe88b79/includes%2Fspecials%2FSpecialAllpages.php#L175):

  TODO: Either make this *much* faster or cache the title index points
  in the querycache table.

...so that should probably be done.

My best guess is that some third-party crawler is trying to obtain the titles of all articles by paging through Special:Allpages --- and, as we found out the hard way --- generating the list that way is O(N^3) where N is how many pages into the list you are.

If we "cache the title index points", we can get the complexity down to O(N^2), which is still bad, but an improvement.

To make the page request (amortized) O(1) we need to generate the index points for the entire list of articles in a single traversal of all the articles (possibly done in chunks, possibly done by a background task).
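For illustration, a minimal sketch of what generating all of the index points in a single traversal could look like in SQL, assuming a MariaDB/MySQL version with window functions; the chunk size of 345 and the idea of persisting the result (e.g. in querycache) are illustrative assumptions, not an actual implementation:

  -- One pass over the sorted titles; keep every 345th title as a section
  -- boundary ("index point"). 345 is an arbitrary, illustrative chunk size.
  SELECT page_title
  FROM (
    SELECT page_title,
           ROW_NUMBER() OVER (ORDER BY page_title) AS rn
    FROM page
    WHERE page_namespace = 0 AND page_is_redirect = 0
  ) AS numbered
  WHERE (rn - 1) % 345 = 0;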


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=65159

Details

Reference
bz56840

Event Timeline

bzimport raised the priority of this task to Unbreak Now!. Nov 22 2014, 2:24 AM
bzimport set Reference to bz56840.
bzimport added a subscriber: Unknown Object (MLST).

Urgency highest, as this is actively producing problems on production right now and, in fact, the above query was an outlier in yesterday's incident investigation.

(In reply to comment #0)

My best guess is that some third-party crawler is trying to obtain the titles of all articles by paging through Special:Allpages --- and, as we found out the hard way --- generating the list that way is O(N^3) where N is how many pages into the list you are.

No crawler or similarly automated process should be hitting index.php?title=Special:AllPages. MediaWiki has a robust Web API at api.php?action=query&list=allpages.

If a particular crawler or two are disrupting production, blocking those crawlers by IP or User-Agent would be the best temporary fix. Though obviously we'll need to evaluate and address the underlying query issue as well.

You can probably make the query a lot faster by removing the offset and adding a WHERE page_title > $last_title clause. It changes the feature slightly, but iterating on something unique that you are sorted on and have indexed is generally pretty quick, and it prevents the duplicates you can get with OFFSET clauses. I think this is worth doing even if we don't want stuff crawling the endpoint, because it closes up an attack vector.
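For illustration, a minimal sketch of that keyset-style rewrite against the sample query from the description; the starting title is assumed to be the last title returned by the previous batch, and this is not the patch that was eventually merged:

  -- Seeks straight to the last title already seen and reads the next rows,
  -- instead of scanning and discarding 86,786 rows via the offset.
  SELECT page_title
  FROM page
  WHERE page_namespace = 0
    AND page_is_redirect = 0
    AND page_title > 'Centennial_Trail_State_Park'
  ORDER BY page_title
  LIMIT 2;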

After the outage mentioned in comment #1, there are temporary pt-kill jobs running on S1 slaves to snipe this query if it runs over 30s or appears in batches. This is to keep the slaves alive.

So if real users as well as crawlers are affected, that's why.

The hierarchical contents list feature is a cute nostalgic reference to the idea of volumes of paper encyclopedias, but it is difficult to implement efficiently. I think it's doubtful that anyone derives a significant benefit from it on a wiki the size of en.wikipedia.org. We could just replace it with a simple AlphabeticPager, like the one on Special:ListUsers.

That is, unless someone has an idea of how to make this truly efficient, like O(1) in the total number of articles on the wiki? Surely we would set the bar that high if we were introducing the feature today.

Change 94690 had a related patch set uploaded by Tim Starling:
In Special:AllPages, limit the size of hierarchical lists

https://gerrit.wikimedia.org/r/94690

Change 94690 merged by jenkins-bot:
In Special:AllPages, limit the size of hierarchical lists

https://gerrit.wikimedia.org/r/94690

Tim's patch resolves the immediate operational issue. I moved the feature discussion to wikitech-l: http://lists.wikimedia.org/pipermail/wikitech-l/2013-November/073052.html.

Change 95084 had a related patch set uploaded by Ori.livneh:
In Special:AllPages, limit the size of hierarchical lists

https://gerrit.wikimedia.org/r/95084

Change 95085 had a related patch set uploaded by Ori.livneh:
In Special:AllPages, limit the size of hierarchical lists

https://gerrit.wikimedia.org/r/95085

Change 95084 merged by jenkins-bot:
In Special:AllPages, limit the size of hierarchical lists

https://gerrit.wikimedia.org/r/95084

Change 95085 merged by jenkins-bot:
In Special:AllPages, limit the size of hierarchical lists

https://gerrit.wikimedia.org/r/95085

Don't see any remaining open patches here. Reclosing.

Out of curiosity, do we actually know why the section query seems to touch all pages? From what I can tell the offset is always <# of pages in wiki> / 100, and the second >= page_title clause is incremented on each iteration. Maybe the MySQL optimizer is not clever enough to figure out that it can drop the first page_title clause:

(page_title >= '1887') AND (page_title >= 'Centennial_Trail_State_Park')

If that is the case, then changing the query to only use the second clause should bring the complexity down to O(N) per page for O(N^2) total.

I think it's actually the OFFSET clause (in this case, the first element of the LIMIT tuple) which is slowing things down. There isn't an effective index that can quickly give you the "86786th article after Centennial_Trail_State_Park". So instead it's sequentially scanning through 86,786 entries (and the offset grows the further you go).
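A hedged way to check that on a replica, using the sample query from the description: the rows estimate that EXPLAIN reports for this plan reflects the index range the server expects to walk, not the two rows ultimately returned.

  EXPLAIN SELECT page_title
  FROM page
  WHERE page_namespace = 0
    AND page_is_redirect = 0
    AND page_title >= '1887'
    AND page_title >= 'Centennial_Trail_State_Park'
  ORDER BY page_title
  LIMIT 86786, 2;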

I'm curious how this seemingly suddenly became a problem. Was this a MariaDB performance regression?

(In reply to comment #16)

I'm curious how this seemingly suddenly became a problem. Was this a MariaDB performance regression?

See the original report:

My best guess is that some third-party crawler is trying to obtain the titles of all articles by paging through Special:Allpages --- and, as we found out the hard way --- generating the list that way is O(N^3) where N is how many pages into the list you are.

(In reply to comment #15)

I think it's actually the OFFSET clause (in this case, the first element of the LIMIT tuple) which is slowing things down. There isn't an effective index that can quickly give you the "86786th article after Centennial_Trail_State_Park". So instead it's sequentially scanning through 86,786 entries

Yeah, that's 1/100 of 8.6 million pages. Very close to the nine million rows from the original report.

(and the offset grows the further you go).

Are we sure that this is the case? My reading of the code suggests that it remains constant per wiki.

(In reply to comment #18)

(In reply to comment #15)

(and the offset grows the further you go).

Are we sure that this is the case? My reading of the code suggests that it remains constant per wiki.

Re-reading, you're right. Still O(N^3) to fetch all titles from a wiki with N titles because the offset, although constant, is O(N). O(N) 'pages' * O(N) 'queries per page' * O(N) 'rows touched per query due to the offset'.

(Hopefully $dbr->estimateRowCount() executes in O(1) time!)

(In reply to comment #19)

(Hopefully $dbr->estimateRowCount() executes in O(1) time!)

For MySQL, PostgreSQL, and MSSQL it is O(1) because it uses an EXPLAIN. Looks like SQLite and Oracle do it with a COUNT(*) though. I know Oracle hides its explain output behind obtuse views that you may or may not have access to. I'm not sure about SQLite.
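For illustration, a minimal sketch of the two strategies described above, using the page table from the sample query (the general idea only, not the exact SQL that MediaWiki's Database classes emit):

  -- EXPLAIN-based estimate: roughly O(1), reads the optimizer's row estimate
  -- instead of the rows themselves.
  EXPLAIN SELECT 1 FROM page WHERE page_namespace = 0 AND page_is_redirect = 0;

  -- COUNT(*)-based fallback: scans every matching row.
  SELECT COUNT(*) FROM page WHERE page_namespace = 0 AND page_is_redirect = 0;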

(In reply to comment #19)

(In reply to comment #18)

(In reply to comment #15)

(and the offset grows the further you go).

Are we sure that this is the case? My reading of the code suggests that it remains constant per wiki.

Re-reading, you're right. Still O(N^3) to fetch all titles from a wiki with N titles because the offset, although constant, is O(N). O(N) 'pages' * O(N) 'queries per page' * O(N) 'rows touched per query due to the offset'.

O(N/100) (titles per section) times 100 (sections) is still O(N), so O(N^2) total. Note that the number of sections is constant, not O(N).