Page MenuHomePhabricator

redirect, pagelinks and page.page_is_redirect table information out of sync
Open, MediumPublic

Description

Author: wikibugs

Description:
I imported the page, pagelinks and redirect tables from the 20120702 dump in order to analyze the pagelink structure. For this i need to resolve redirects and found some weirdness in the process:

There are some pages in the redirect table which don't have the page.page_is_redirect flag set:

select * from redirect join page on rd_from=page_id where page_is_redirect != 1 limit 100;

Results in 23 rows: http://p.defau.lt/?smGJYbXXlqwLPhreqbe8VA

Another weirdness: there are pages marked as page.page_is_redirect but don't appear in the redirect table, even though that might be explained here: http://www.mediawiki.org/wiki/Manual:Redirect_table (see the NOTE as of Aug 2007)

select * from page left join redirect on page_id=rd_from where page_is_redirect != 0 and rd_from is null limit 100;

Results in 5 rows: http://p.defau.lt/?pwJZMIVtTGGZYb69o8YWxg

Matthew_ from #wikimedia-tech ran these queries on the toolserver and seems the same problems exist in production.

Yet another weirdness can be found in the pagelinks table: there are ~25000 pagelinks for redirect pages which don't just link to the redirected page but to other pages alone in the article namespace:

select count(1) from page inner join redirect on page_id = rd_from inner join pagelinks on rd_from=pl_from and rd_namespace=pl_namespace where pl_namespace = 0 and rd_title != pl_title;

+----------+

count(1)

+----------+

25168

+----------+
1 row in set (1 hour 44 min 35.16 sec)

Would be cool if someone could have a look.


Version: unspecified
Severity: normal

Details

Reference
bz39011

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 12:55 AM
bzimport set Reference to bz39011.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to comment #0)

Matthew_ from #wikimedia-tech ran these queries on the toolserver and seems the
same problems exist in production.

s/production/the TS replica/

It's going to be a bit longer before Sam can get to this. Unassigning, though Sam is primarily responsible for "shell" bugs and will still have this implicitly on his list.

glciampagl wrote:

Hi,

looks like the second point has been fixed in the prod db:

mysql:giovanni@db1047.eqiad.wmnet [enwiki]> select count(*) from page left join redirect on page_id = rd_from where page_is_redirect != 0 and rd_from is null;

+----------+

count(*)

+----------+

0

+----------+
1 row in set (39.33 sec)

mysql:giovanni@db1047.eqiad.wmnet [enwiki]>

(In reply to comment #3)

Hi,

looks like the second point has been fixed in the prod db:

mysql:giovanni@db1047.eqiad.wmnet [enwiki]> select count(*) from page left
join
redirect on page_id = rd_from where page_is_redirect != 0 and rd_from is
null;

+----------+

count(*)

+----------+

0

+----------+
1 row in set (39.33 sec)

mysql:giovanni@db1047.eqiad.wmnet [enwiki]>

Agreed.

First one done too now. It only had 5 rows now. 1 was an erroneous redirect, presumably just never deleted. 1 was a redirect in user js files which I don't think ever worked. 2 were non existent pages.

mysql:wikiadmin@db60 [enwiki]> select * from redirect join page on rd_from=page_id where page_is_redirect != 1 limit 100;
Empty set (19.58 sec)

What actually wants doing for the 3rd query?

I'm currently running it in a screen session against a quiet slave (selecting all fields, not just counting), so will have a resultset to actually look at. I will see about attaching it to this bug when done, though I guess it's going to take upto a couple of hours to run

tomasz set Security to None.

Guess someone else will need to run that query again then.