Page MenuHomePhabricator

Wikimedia redirect tables are missing many entries
Closed, ResolvedPublic

Description

Author: rainman

Description:
The redirect table for en.wiki has about ~1m entries, however,
parsing the xml dump gives around ~2m redirects. So we are
missing a lot of redirects. Having all the redirects in the
table is, among other things, needed for the lucene incremental
updater.

To fix this, run maintenance/refreshLinks.php --redirects-only
on all wikis. However, I urge someone smarter than me to review
if this script is doing what it's ought to be doing :)


Version: unspecified
Severity: normal

Details

Reference
bz10931

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 9:53 PM
bzimport set Reference to bz10931.
bzimport added a subscriber: Unknown Object (MLST).

robchur wrote:

*** Bug 9799 has been marked as a duplicate of this bug. ***

john.lehmann wrote:

I can confirm that the 2 million redirects from the XML dump are valid in the sense that they do resolve to existing, normal pages in the page table.

ayg wrote:

It would possibly be best to contact Brion on IRC about this, if you can find a rare moment when he's not busy. Or e-mail him, brion@pobox.com. He's (AFAIK) the sole maintainer of the DB dumps, as of so many other things.

The redirect table is updated on demand. Redirects which have not been changed since its introduction won't be listed in it.

brian wrote:

What's wrong? Why is this bug still open?

Probably because it would be nice if the table were up to date. :)

broken.arrow wrote:

*** Bug 12182 has been marked as a duplicate of this bug. ***

  • Bug 12507 has been marked as a duplicate of this bug. ***

ayg wrote:

It seems like refreshLinks.php could use a "fix only old redirects" mode. It's currently assuming that it has to check every single page for redirect-ness. If a mode were added so it just checked pages from

SELECT page_id FROM page LEFT JOIN redirect ON page_id=rd_from WHERE page_is_redirect=1 AND rd_from IS NULL

then it would be considerably faster, I imagine. It would still have to scan the page table, but at least it wouldn't have to do it a row at a time and send the page text for every row over the wire.

Maybe this should even be added to update.php.

brian wrote:

Bug 12507 was reopened with a comment that included the statement "Bug 10931 is about missing entries in the redirect table in a
particular wiki ..." The summary for this bug says that a redirect table is missing half its entries, without stating which table (though the description says it's en.wiki).

However, the most recent comments are about the software and not any particular wiki.1

What's this bug really about?

ayg wrote:

The most recent comments were about Wikimedia wikis, which is what this bug is about (note Product=Wikimedia). My comment remarked that currently the available scripts are not ideal for this and they should be improved for this to be done reasonably. I also added offhandedly that if that was done, update.php should maybe run that script, but that was a side point and not really related to this bug.

brian wrote:

I've updated the summary to remove the ambiguity, but I still think that the most recent "nontrivial" comments (comments 4 and 9) appear to be purely about the software. Given that the summary, description and product fields all state that this bug is about Wikimedia wikis, shouldn't the software issues be addressed in another bug?

brian wrote:

Why is the component field set to "Downloads"? This bug doesn't seem to have anything to do with downloads.

ayg wrote:

That's the only user-visible effect.

brian wrote:

According to bug 9799, this affects Special:DoubleRedirects. Is viewing a special page considered a "download" (which is apparently defined here as anything to do with "public data dumps at download.wikimedia.org")?

ayg wrote:

Okay, so it has other (indirect) user-visible effects too. Is any of this really relevant to fixing the problem?

brian wrote:

(In reply to comment #12)

I've updated the summary to remove the ambiguity, but I still think that the
most recent "nontrivial" comments (comments 4 and 9) appear to be purely about
the software. Given that the summary, description and product fields all state
that this bug is about Wikimedia wikis, shouldn't the software issues be
addressed in another bug?

Bugzilla's guidelines state, as one of their 5 principles, "one bug per report." I always took this to mean that issues with Wikimedia wikis and associated MediaWiki issues should be covered by separate reports.

ayg wrote:

Spamming ten people's e-mail boxes with this discussion is not very productive. If you really care, please e-mail any followups directly to me. Yes, a bug could be created for the software issue (blocking this) if someone wanted. It doesn't have to be.

We should be discussing solutions here, not BugZilla guidelines.

Anyway, why don't we change the procedure for viewing redirects? Currently it's:

  1. Check the page_is_redirect field
  2. If it's 1, fetch the page's content
  3. See if it contains #REDIRECT
  4. If so, check whether the redirect actually points somewhere
  5. If so, go there.

We could change that to:

  1. Check the page_is_redirect field
  2. If it's 1, query the redirect table (faster than fetching the page's content)
  3. If there is no redirect table entry, do the old page content thingy AND ADD A ROW TO THE REDIRECT TABLE

Thoughts?

I approve this idea. This will resolve all these redirect problems in the future. Are there any problems in implementing it?

ayg wrote:

(In reply to comment #19)

That sounds like an excellent idea.

Thanks for the praise so far. I'll submit a patch tomorrow (I think it's best that I wrote this thing, 'cause I wanna integrate it into the API too, see bug 13651)

aaron wrote:

sounds fine to me, and i look forward to your patch, roan.

Created attachment 4798
Proposed patch

The attached patch does quite a range of things:

Stuff related to this bug:

  • Introduce Article::getRedirectTarget(), which queries the redirect table to get the article's redirect target (if it has one). If the article doesn't have an entry in the redirect table, insertRedirect() is called
  • Introduce Article::insertRedirect(), which obtains the redirect target from the page text and inserts a row into the redirect table. Not meant to be called directly
  • Use getRedirectTarget() in some places in Article.php (except those
  • In ApiPageSet::getRedirectTargets(), call Article::insertRedirect() for every redirect that wasn't found in the redirect table

Stuff not related to this bug (sorry for these, but it's nothing complicated):

  • Kill a redundant DB query in ApiPageSet::getRedirectTargets()
  • Introduce ApiMain::scheduleCommit() which schedules a $dbw->immediateCommit() just before the end of ApiMain::execute()
  • Use scheduleCommit() in all API edit modules and ApiPageSet::getRedirectTargets()

Numbers of DB queries in various situations:
If the page is in the redirect table already: 1 SELECT
If the page isn't in the redirect table yet: 3 SELECTs, 1 INSERT
If the page isn't in the redirect table yet and is queried through the API: 6 SELECTs and 1 INSERT per page + 1 SELECT (yes, that sucks, caused by crappy code in the Article class which is full of FIXMEs)

If the page is in the redirect table already:
SELECT /* Article::getRedirectTarget */ rd_namespace,rd_title FROM redirect WHERE rd_from = '123'

If the page isn't in the redirect table yet:
SELECT /* Article::getRedirectTarget */ rd_namespace,rd_title FROM redirect WHERE rd_from = '123'
/* Revision::fetchRow */ SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,page_namespace,page_title,page_latest FROM page,revision WHERE (page_id=rev_page) AND rev_id = '466' LIMIT 1
/* Revision::loadText 127.0.0.1 */ SELECT old_text,old_flags FROM text WHERE old_id = '256' LIMIT 1
INSERT /* Database::insert 127.0.0.1 */ INTO redirect (rd_from,rd_namespace,rd_title) VALUES ('123','0','Foo')

If the page isn't in the redirect table yet and is queried through the API:
SELECT /* ApiPageSet::getRedirectTargets 127.0.0.1 */ rd_from,rd_namespace,rd_title FROM redirect WHERE rd_from = '123'
/* LinkCache::addLinkObj */ SELECT page_id,page_len,page_is_redirect FROM page WHERE page_namespace = '0' AND page_title = 'Foo' LIMIT 1
/* Article::pageData */ SELECT page_id,page_namespace,page_title,page_restrictions,page_counter,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len FROM page WHERE page_namespace = '0' AND page_title = 'Foo' LIMIT 1
SELECT /* Title::loadRestrictions */ * FROM page_restrictions WHERE pr_page = '123'
/* Title::loadRestrictionsFromRow */ SELECT page_restrictions FROM page WHERE page_id = '123' LIMIT 1
/* Revision::fetchRow */ SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,page_namespace,page_title,page_latest FROM page,revision WHERE (page_id=rev_page) AND rev_id = '123' LIMIT 1
/* Revision::loadText */ SELECT old_text,old_flags FROM text WHERE old_id = '102' LIMIT 1
INSERT /* Database::insert */ INTO redirect (rd_from,rd_namespace,rd_title) VALUES ('123','345','Foo')

Attached:

webboy wrote:

Fix disabled by r33381, reopening bug