Page MenuHomePhabricator

API query for template usage involving multiple titles and namespace filter times out
Closed, ResolvedPublic

Description

Author: carnildo

Description:
An API query on multiple titles for template usage with a namespace restriction gets an HTTP timeout.

Steps to reproduce:

Visit the URL http://en.wikipedia.org/w/api.php?action=query&format=xml&prop=templates&tlnamespace=10&titles=User:Carnildo/sandbox3|User:Carnildo/sandbox6

Removing either the "tlnamespace" term or either of the titles eliminates the error.

Expected result:

<?xml version="1.0" encoding="utf-8"?><api><query><pages><page pageid="9945938" ns="2" title="User:Carnildo/sandbox3" /><page pageid="16027416" ns="2" title="User:Carnildo/sandbox6"><templates><tl ns="10" title="Template:Administrators&#039; noticeboard navbox" /><tl ns="10" title="Template:Archiveline" /><tl ns="10" title="Template:Editabuselinks" /><tl ns="10" title="Template:Nowrap" /><tl ns="10" title="Template:Nowrap begin" /><tl ns="10" title="Template:Nowrap end" /><tl ns="10" title="Template:Purge" /><tl ns="10" title="Template:Shortcut" /><tl ns="10" title="Template:Tnavbar" /><tl ns="10" title="Template:•w" /><tl ns="10" title="Template:•wrap" /></templates></page></pages></query></api>

Actual result:

504 Gateway Timeout


Version: unspecified
Severity: normal

Details

Reference
bz14102

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 10:08 PM
bzimport set Reference to bz14102.

Looks like this should be fast, not sure why it is slow.

Maybe this is another insane index choice, it may be scanning the (namespace, title, from) index and filesorting...

Not having this doing the query on TS

It's using the correct index, it just doesn't like it when you do:

SELECT stuff FROM templatelinks WHERE tl_title IN ('98', '107) AND tl_namespace=0 ORDER BY tl_from, tl_namespace, tl_title

It doesn't like the fact that you're sorting by tl_namespace while it's a constant in the WHERE clause. Dropping tl_namespace from the ORDER BY (it's pointless anyway) fixes that. The same happens for tl_title when there's only one title.

Fixed in r34720

(In reply to comment #4)

It's using the correct index, it just doesn't like it when you do:

SELECT stuff FROM templatelinks WHERE tl_title IN ('98', '107) AND
tl_namespace=0 ORDER BY tl_from, tl_namespace, tl_title

It doesn't like the fact that you're sorting by tl_namespace while it's a
constant in the WHERE clause. Dropping tl_namespace from the ORDER BY (it's
pointless anyway) fixes that. The same happens for tl_title when there's only
one title.

Fixed in r34720

If you say X = 'a' and sort by X,Y that shouldn't go that slow, unless there is a bug. Also, I did the two part query on TS myself, and it was fast (only like 9 results). If it was using the right index, it should be fast. The query took like an hour according to profiling, which strongly suggests wrong index usage. I hope the ORDER BY tweak avoids that perhaps.

"If you say X = 'a' and sort by X,Y that shouldn't go that slow". Of course, I meant to also say "in comparison to sorting by Y only".

(In reply to comment #5)

If you say X = 'a' and sort by X,Y that shouldn't go that slow, unless there is
a bug. Also, I did the two part query on TS myself, and it was fast (only like
9 results). If it was using the right index, it should be fast. The query took
like an hour according to profiling, which strongly suggests wrong index usage.
I hope the ORDER BY tweak avoids that perhaps.

I know that. Anyway, MySQL behaves very stupidly and insists on filesorting when it encounters pl_namespace twice. According to my EXPLAINs, WHERE pl_namespace=0 ORDER BY pl_from, pl_namespace, pl_title filesorts, but omitting pl_namespace from the ORDER BY doesn't. The right index (pl_from, pl_namespace, pl_title) is automatically picked. If this query still has issues after the server sync, we probably need to force the index.

Even if all it was doing was using the pl_from part of the right index, doing a WHERE and a filesort, it should still be fast since it was only for those two pages, and they didn't seem to have any huge link lists. I'm still inclined that it was doing the wrong index in this case, as nothing else seems to explain the slowness.

I'll add a FORCE INDEX just to be sure.