Page MenuHomePhabricator

Special:WantedPages should use simpler query (so that it can be enabled on WMF wikis)
Open, LowPublicFeature

Description

Author: darklama

Description:
simplify query for faster performance

Updates for Special:WantedPage have been disabled on en.wikibooks. I tried asking on #mediawiki-tech, yesterday and was told it was probably due to being inefficient. Since updates to Special:WantedCategories are still being allowed currently, I propose simplifying Special:WantedPages since that would appear to be the reason.

I have included an attached patch which I believe does simplify the query and reduces the performance impact, without reducing functionality.


Version: unspecified
Severity: enhancement
URL: http://en.wikibooks.org/wiki/Special:WantedPages

attachment simplify_wantedpages.patch ignored as obsolete

Details

Reference
bz14786

Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 10:12 PM
bzimport set Reference to bz14786.
bzimport added a subscriber: Unknown Object (MLST).

ayg wrote:

The patch 1) functionally changes the query, apparently unintentionally; and 2) includes tons of completely unrelated changes that I'm not even going to look at. It might improve performance, but only by accident and by introducing bugs.

This query should be okay with correct join order. On MySQL 5.x I'm seeing that with STRAIGHT_JOIN, it's

  • 1. row ******* id: 1 select_type: SIMPLE table: pagelinks type: index

possible_keys: pl_from,pl_namespace

    key: pl_namespace
key_len: 265
    ref: NULL
   rows: 254989417
  Extra: Using where; Using index
  • 2. row ******* id: 1 select_type: SIMPLE table: page type: eq_ref

possible_keys: name_title

    key: name_title
key_len: 261
    ref: enwiki.pagelinks.pl_namespace,enwiki.pagelinks.pl_title
   rows: 1
  Extra: Using where; Using index
  • 3. row ******* id: 1 select_type: SIMPLE table: page type: eq_ref

possible_keys: PRIMARY,name_title

    key: PRIMARY
key_len: 4
    ref: enwiki.pagelinks.pl_from
   rows: 1
  Extra: Using where

So just a table scan of pagelinks, not a filesort. This should run in under 15 minutes even on enwiki, hopefully, and surely on enwikibooks. IIRC, that was the criterion Tim used for shutting off queries. I'm trying it out on the toolserver now.

Without the STRAIGHT_JOIN, as now, MySQL 5 tries to put pg2 first in the join order, for some ungodly reason.

ayg wrote:

Oops, I forgot about the ORDER BY. That adds the filesort again. If update_specialpages.php could be rewritten to write an unsorted result to disk, close the transaction, sort the result on disk, and cache that, it wouldn't have to keep open horribly long database transactions and might be more useful.

Even with the order by and without the straight join, actually, this runs in less than a minute on the toolserver for enwikibooks. Was this really taking a long time on the real servers? This query takes 16.90s on toolserver's enwikibooks_p:

EXPLAIN SELECT 'Wantedpages' as type, pl_namespace AS namespace, pl_title AS title, COUNT(*) AS value FROM pagelinks LEFT JOIN page AS pg1 ON pl_namespace = pg1.page_namespace AND pl_title = pg1.page_title LEFT JOIN page AS pg2 ON pl_from = pg2.page_id WHERE pg1.page_namespace IS NULL AND pl_namespace NOT IN ( 2, 3 ) AND pg2.page_namespace != 8 GROUP BY 1,2,3 HAVING COUNT(*) > 0 ORDER BY value DESC LIMIT 5000;

Maybe MySQL 4 is optimizing it poorly?

darklama wrote:

Isn't MediaWiki suppose to work with both MySQL and PostgreSQL? I don't think PostgreSQL supports STRAIGHT_JOIN, which is why I didn't use that.

ayg wrote:

MySQL conditional comments can be used to hide MySQL-specific things:

http://dev.mysql.com/doc/refman/4.1/en/comments.html

The abstraction layer can also be used. Either way, it's not a problem.

Fwiw: Removing the second JOIN to remove links from MW messages seems to get rid of the filesort.

I know that was added (bug 5723) to prevent listing of $1, $2 in messages (which showed up as wanted). However, removing the join doesn't seem to be adding these fake links to the results. Has something else changed to keep them out of link tables, rendering this extra join useless?

ayg wrote:

Removing the extra join would serve more or less the same purpose as using STRAIGHT_JOIN. Both should remove the filesort, if what I said above is correct.

sumanah wrote:

-patch since there is currently no patch.

Aklapper changed the subtype of this task from "Task" to "Feature Request".Feb 4 2022, 11:01 AM