The following query was pulled from ishmael:
/* SpecialAllpages::showToplevel */ select page_title from page where page_namespace = ? and page_is_redirect = ? and (page_title >= ?) and (page_title >= ?) order by page_title limit ?
Real sample:
SELECT /* SpecialAllpages::showToplevel 108.35.125.176 */ page_title FROM page WHERE page_namespace = '0' AND page_is_redirect = '0' AND (page_title >= '1887') AND (page_title >= 'Centennial_Trail_State_Park') ORDER BY page_title ASC LIMIT 86786,2
Hundreds of concurrent instances of this have been seen. They touched ~9M rows each, took over 100s, and had problematic clauses like that amazing LIMIT. They are not duplicates, having different page_title values and limit offsets (yet always a limit count of 2).
Furthermore batches of this query have been increasing in frequency.
The query itself comes from:
http://git.wikimedia.org/blob/mediawiki%2Fcore.git/ceba5987b0d36f134ffcd9e4f704d1608fe88b79/includes%2Fspecials%2FSpecialAllpages.php#L225
including that wonderful 'limit ...,2'.
It has this wonderful comment (at http://git.wikimedia.org/blob/mediawiki%2Fcore.git/ceba5987b0d36f134ffcd9e4f704d1608fe88b79/includes%2Fspecials%2FSpecialAllpages.php#L175):
- TODO: Either make this *much* faster or cache the title index points
- in the querycache table.
...so that should probably be done.
My best guess is that some third-party crawler is trying to obtain the titles of all articles by paging through Special:Allpages --- and, as we found out the hard way --- generating the list that was is O(N^3) where N is how many pages into the list you are.
If we "cach[ing] the title index points" we can get the complexity down to O(N^2). Which is still bad, but an improvement.
To make the page request (amortized) O(1) we need to generate the index points for the entire list of articles in a single traversal of all the articles (possibly done in chunks, possibly done by a background task).
Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=65159