
Very slow query for Special:WhatLinksHere limited to a namespace when the page has a large number of backlinks
Closed, Resolved (Public)

Description

Special:WhatLinksHere on Commons occasionally gives Database errors. For example:

https://commons.wikimedia.org/w/index.php?title=Special%3AWhatLinksHere&target=Template%3ALocation%2Flayout&namespace=10

and

https://commons.wikimedia.org/w/index.php?title=Special%3AWhatLinksHere&target=Module%3AFallbacklist&namespace=828

gave me:

Database error
A database query error has occurred. This may indicate a bug in the software.
  Function: SpecialWhatLinksHere::showIndirectLinks
  Error: 0

Antoine "hashar" Musso checked on that and wrote:
"The first URL worked for me, the second throws a database error.

Thu Jan 30 14:11:19 UTC 2014 /* query */ Thu Jan 30 14:19:33 UTC 2014 /* query */ Connection lost and reconnected after 59.754s".

This might be related to the fact that, within the last month or two, both pages went through changes that require updating the link tables for millions of pages. Those link tables are slow to update, resulting in a state where the results of Special:WhatLinksHere do not match the actual linking dependencies.


Version: 1.23.0
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=58157

Details

Reference
bz60618

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:56 AM
bzimport set Reference to bz60618.

The underlying cause seems to be the query killer mentioned in bug 58157 comment 5. The query here appears to be along the lines of:

SELECT /*! STRAIGHT_JOIN */ page_id, page_namespace, page_title, rd_from
FROM templatelinks, page
LEFT JOIN redirect
   ON ((rd_from = page_id)
      AND rd_namespace = '828'
      AND rd_title = 'Fallbacklist'
      AND (rd_interwiki = '' OR rd_interwiki IS NULL))
WHERE (page_id = tl_from)
   AND tl_namespace = '828'
   AND tl_title = 'Fallbacklist'
   AND page_namespace = '828'
ORDER BY tl_from
LIMIT 51;

Presumably it's slow because Module:Fallbacklist probably has something like 25 million transclusions, few of which are in the Module namespace. What might be done about that, though, I have no idea.

Removing the /*! STRAIGHT_JOIN */ would probably help, as there are only 163 pages in the Module namespace on Commons (on Tool Labs, the query without STRAIGHT_JOIN took 0.16 seconds), so it is much more efficient to join the other way around.

However, that doesn't really help the general case of a template with millions of transclusions, none of which are in the namespace being looked for, when that namespace also has millions of pages. I'm not sure if anything can help with that case short of duplicating the namespace of tl_from into the templatelinks table. (Wouldn't it be nice if indexes could cross table boundaries?)
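A minimal sketch of the "duplicate the namespace of tl_from" idea, using Python's sqlite3 as a stand-in for MariaDB. The column name tl_from_namespace, the index name tl_target_from_ns, and the toy data are assumptions for illustration, not the actual MediaWiki schema. With the source namespace stored in the link table, one composite index can cover both the target and the namespace filter, so the page table is not needed to narrow the rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE templatelinks (
    tl_from           INTEGER NOT NULL,  -- page_id of the transcluding page
    tl_from_namespace INTEGER NOT NULL,  -- assumed denormalised copy of that page's namespace
    tl_namespace      INTEGER NOT NULL,  -- namespace of the transcluded target
    tl_title          TEXT    NOT NULL   -- title of the transcluded target
);
-- Hypothetical composite index: target first, then source namespace, then source id.
CREATE INDEX tl_target_from_ns
    ON templatelinks (tl_namespace, tl_title, tl_from_namespace, tl_from);
""")

# Toy data: 1000 backlinks from namespace 6, only two from namespace 828.
rows = [(i, 6, 828, "Fallbacklist") for i in range(1, 1001)]
rows += [(2001, 828, 828, "Fallbacklist"), (2002, 828, 828, "Fallbacklist")]
conn.executemany("INSERT INTO templatelinks VALUES (?, ?, ?, ?)", rows)


def backlinks_in_namespace(conn, ns, title, from_ns, limit=51):
    """Return source page ids in from_ns that link to (ns, title), without touching page."""
    cur = conn.execute(
        "SELECT tl_from FROM templatelinks "
        "WHERE tl_namespace = ? AND tl_title = ? AND tl_from_namespace = ? "
        "ORDER BY tl_from LIMIT ?",
        (ns, title, from_ns, limit),
    )
    return [r[0] for r in cur]


module_backlinks = backlinks_in_namespace(conn, 828, "Fallbacklist", 828)
print(module_backlinks)
```

The trade-off is that the copied namespace has to be kept in sync whenever a page moves between namespaces.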


https://commons.wikimedia.org/w/index.php?title=Special%3AWhatLinksHere&target=Module%3AFallbacklist&namespace=828

If you really wanted to know, the modules that link to Fallbacklist are:

+------------------+
| page_title       |
+------------------+
| File             |
| Coordinates      |
| Fallback         |
| Fallbacklist     |
| Fallbacklist/doc |
| Coordinates/doc  |
| Fallback/sandbox |
| File/doc         |
+------------------+

*** Bug 60838 has been marked as a duplicate of this bug. ***

Change 114070 had a related patch set uploaded by Aaron Schulz:
Removed STRAIGHT_JOIN; the is slower when a namespace has a few pages

https://gerrit.wikimedia.org/r/114070

Change 114070 merged by jenkins-bot:
Removed STRAIGHT_JOIN; the is slower when a namespace has a few pages

https://gerrit.wikimedia.org/r/114070

(In reply to Gerrit Notification Bot from comment #5)

> Change 114070 merged by jenkins-bot:
> Removed STRAIGHT_JOIN; the is slower when a namespace has a few pages
>
> https://gerrit.wikimedia.org/r/114070

This both helped and harmed some cases. More work is needed here.

Aaron pinged me on IRC. Some observations made during that discussion follow.

The pain points are now mostly mid-sized wikis (e.g., *wiktionary, metawiki) that have:

  • large numbers of page or template links (hundreds of millions)
  • comparatively few pages (less than 10 million)
  • data skewed toward one or two namespaces
  • data skewed toward a small set of titles

enwiktionary> select tl_namespace, count(*) as links from templatelinks group by tl_namespace;
+--------------+-----------+
| tl_namespace | links     |
+--------------+-----------+
|            0 |      1696 |
|            1 |        56 |
|            2 |     14901 |
|            3 |       129 |
|            4 |      1462 |
|            5 |        12 |
|            8 |       103 |
|           10 |  23967908 |
|           11 |         1 |
|           14 |         5 |
|           15 |         4 |
|           90 |         3 |
|          100 |        60 |
|          101 |         2 |
|          104 |      1546 |
|          106 |         6 |
|          110 |         1 |
|          828 | 138128211 |
+--------------+-----------+

When hitting namespace 10 or 828 and a title with millions of links, MariaDB may fall back on an index scan on page. These mid-sized wikis liked having STRAIGHT_JOIN even if others didn't :-)

Could we tolerate the possibility of stale links and skip the JOIN on page altogether? That might need a denormalised tl_from_namespace field. Then pull out the page fields in a second batch with WHERE page_id IN (...), or use a sub-query to do the same thing:

SELECT page_id, page_namespace, page_title, rd_from
FROM (
   SELECT tl_from, rd_from
   FROM `templatelinks`
   LEFT JOIN `redirect`
      ON rd_from = tl_from
         AND rd_namespace = tl_namespace
         AND rd_title = tl_title
         AND (rd_interwiki = '' OR rd_interwiki IS NULL)
   WHERE tl_namespace = '828'
      AND tl_title = 'languages/data3/i'
   ORDER BY tl_from
   LIMIT 100
) tmp
JOIN page ON tl_from = page_id
ORDER BY page_id
LIMIT 51;

Increase the inner LIMIT if stale links are a problem; 500 or 1000 would be fine.
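The "second batch" variant mentioned above could look roughly like the following sketch, again using Python's sqlite3 as a stand-in for MariaDB. The function name what_links_here, the overfetch parameter, the tl_target index, and the toy data are assumptions for illustration, and the redirect join is left out for brevity. It over-fetches link rows first, then looks up page rows by id, so stale links (link rows whose page no longer exists) simply drop out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT);
CREATE TABLE templatelinks (tl_from INTEGER, tl_namespace INTEGER, tl_title TEXT);
CREATE INDEX tl_target ON templatelinks (tl_namespace, tl_title, tl_from);
""")

# Toy data: 60 link rows, but pages 3 and 7 are missing, so those two links are stale.
conn.executemany(
    "INSERT INTO page VALUES (?, 0, ?)",
    [(i, f"Page_{i}") for i in range(1, 61) if i not in (3, 7)],
)
conn.executemany(
    "INSERT INTO templatelinks VALUES (?, 828, 'languages/data3/i')",
    [(i,) for i in range(1, 61)],
)


def what_links_here(conn, ns, title, want=51, overfetch=100):
    """Fetch up to `want` backlink pages in two batches instead of one JOIN."""
    # Batch 1: link rows only, over-fetched to leave headroom for stale links.
    ids = [r[0] for r in conn.execute(
        "SELECT tl_from FROM templatelinks "
        "WHERE tl_namespace = ? AND tl_title = ? ORDER BY tl_from LIMIT ?",
        (ns, title, overfetch),
    )]
    if not ids:
        return []
    # Batch 2: primary-key lookups on page; ids with no page row are skipped.
    placeholders = ",".join("?" * len(ids))
    return conn.execute(
        "SELECT page_id, page_namespace, page_title FROM page "
        f"WHERE page_id IN ({placeholders}) ORDER BY page_id LIMIT ?",
        (*ids, want),
    ).fetchall()


result = what_links_here(conn, 828, "languages/data3/i")
print(len(result), result[0], result[-1])
```

With overfetch equal to want, the two stale links in this toy data would leave the page short of rows, which is the reason for the larger inner LIMIT.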

This works now as long as the NS selector is not used. If it is, then we'd need an rd_from_namespace field to make this work.

(In reply to Aaron Schulz from comment #8)

> This works now as long as the NS selector is not used. If it is, then we'd
> need an rd_from_namespace field to make this work.

Sorry, not rd_from_namespace, but rather *_from_namespace for templatelinks, imagelinks, and pagelinks.

Change 117373 had a related patch set uploaded by Aaron Schulz:
Redid WhatLinksHere query and added a _from_namespace field

https://gerrit.wikimedia.org/r/117373

Change 117373 had a related patch set uploaded by Krinkle:
Redo WhatLinksHere query and add a *_from_namespace field to link tables

https://gerrit.wikimedia.org/r/117373

I am testing MariaDB 5.5.36 on db1034 with a patch that adds an innodb_min_scan_time variable, which allows ha_innodb::scan_time to be controlled and, by extension, the apparent cost of an index scan (InnoDB table scans are index scans on the clustered primary key). Historically this sort of thing could be done with MyISAM and max_seeks_for_key, but that isn't so effective for InnoDB.

SpecialWhatLinksHere::showIndirectLinks queries on skewed data are one group of queries this setting helps. A number of other queries using FORCE INDEX could likely benefit too. So far there has been no adverse impact on other traffic or disk IO patterns.

Not necessarily a reason to abandon Aaron's patch; just an update because the patch is blocked on a schema change that won't be scheduled until after the TechOps meet.

Change 117373 merged by jenkins-bot:
Redo WhatLinksHere query and add a *_from_namespace field to link tables

https://gerrit.wikimedia.org/r/117373