
possible to create duplicate sitelinks
Closed, DuplicatePublic

Description

reported at http://www.wikidata.org/wiki/Wikidata:Contact_the_development_team#True_duplicate

Currently both Q12863749 and Q2618201 link to ka:დიდი ბრიტანეთი 1960 წლის ზაფხულის ოლიმპიურ თამაშებზე.

Same with Q12863758/Q146146/ka:დიდი ბრიტანეთი 1996 წლის ზაფხულის ოლიმპიურ თამაშებზე.


Version: master
Severity: critical

Details

Reference
bz48260

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:21 AM
bzimport set Reference to bz48260.
bzimport added a subscriber: Unknown Object (MLST).

pinkampersand.wikimedia wrote:

I should note that it's not possible to add either ka.wp link to any other items. (Try it for yourself @ [[d:Q4115189]].) Furthermore, it's not possible to edit any other fields on either item, as doing so generates a "Site link [[lang:page]] already used on [[Q####]]" error, even if you're just trying to set a label/description/alias. (Same if you try to use special pages instead of editing directly.)

byrial wrote:

Q12340897 and Q12343899 also both have [[da:Viborgvej (Aarhus)]] as links. I can edit labels and description in one of them (Q12343899), but not the other.

I have tried to investigate Q12863749 and Q2618201 a bit. Here is what I found:

  • Q12863749 was created on May 6 2013, with the ka link in place.
  • Q2618201 got the ka link two days later, on May 8 2013.

The edit to Q2618201 that added this link should not have worked; it should have been prevented by the uniqueness constraint implemented using the database table wb_items_per_site. However, this table has an entry for the ka link on Q2618201, but not for Q12863749. This means that Q2618201 now essentially "owns" that link.

Consequently, Q2618201 can still be edited, while edits to Q12863749 will fail due to the uniqueness constraint.

The cause of the problem is probably that the edit that created Q12863749 was not fully completed, but failed for some reason half way through the process, after saving the primary data blob but before registering the site links in wb_items_per_site, causing an inconsistency in the database.
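The failure mode described above can be reproduced in a small toy model. This is only a sketch under stated assumptions, not Wikibase code: the `ToyRepo` class, its `save` method, the `fail_midway` flag, the site ID `kawiki` and the page title are all hypothetical names introduced for illustration. The only things taken from the report are the two-step save (primary data blob first, lookup table second), the uniqueness check against the lookup table, and the wording of the error message.

```python
class IncompleteSave(Exception):
    """Simulates a save that dies between the two writes."""


class SiteLinkConflict(Exception):
    """A site link is already registered to a different item."""


class ToyRepo:
    def __init__(self):
        self.blobs = {}           # item id -> {site: title}; the primary data
        self.items_per_site = {}  # (site, title) -> item id; the lookup table

    def save(self, item_id, sitelinks, fail_midway=False):
        # The uniqueness check consults the lookup table only, never the blobs.
        for site, title in sitelinks.items():
            owner = self.items_per_site.get((site, title))
            if owner is not None and owner != item_id:
                raise SiteLinkConflict(
                    f"Site link {site}:{title} already used on {owner}"
                )
        # First write: the primary data blob.
        self.blobs[item_id] = dict(sitelinks)
        # Hypothetical crash point between the two writes.
        if fail_midway:
            raise IncompleteSave(item_id)
        # Second write: register the links in the lookup table.
        for site, title in sitelinks.items():
            self.items_per_site[(site, title)] = item_id


repo = ToyRepo()
link = {"kawiki": "Example page"}  # hypothetical site ID and title

# The save that creates Q12863749 stores the blob but never registers the link.
try:
    repo.save("Q12863749", link, fail_midway=True)
except IncompleteSave:
    pass

# Q2618201 can now take the same link, because the table has no entry for it.
repo.save("Q2618201", link)

# Any later edit to Q12863749 trips the uniqueness check.
try:
    repo.save("Q12863749", link)
    conflict = None
except SiteLinkConflict as err:
    conflict = str(err)

print(conflict)
```

In this model both blobs end up carrying the link while the lookup table names only Q2618201 as owner, which matches the observed behaviour: one item stays editable and the other does not.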

Note that the risk of such inconsistencies is considered acceptable in MediaWiki design, since enforcing full consistency using transactions would make it very hard to make page updates scale to the level we need on Wikipedia.

Marking wontfix, because we can't fix this without rewriting most of MediaWiki.

As to the issue at hand, Q12863749 should probably just be deleted, since it consists only of the duplicate links.

pinkampersand.wikimedia wrote:

{{Deleted|color=pink}} :)

Out of curiosity, does the fact that the first 2 were only 9 Q#s apart (and, to a lesser extent, that the third was only 500k away from them) mean anything? That is to say, is there any particular reason that the inconsistency occurred in so narrow a range?

(In reply to comment #4)

That is to say, is there any particular reason that the inconsistency
occurred in so narrow a range?

I don't think so, except for the fact that this is more likely to happen during a time when several bots are working on importing the same set of pages.

For the record: this kind of thing should be *rare*. We can't avoid it completely, but it really shouldn't happen often.

If this happens frequently, please re-open this bug.

pinkampersand.wikimedia wrote:

(In reply to comment #5)

For the record: this kind of thing should be *rare*. We can't avoid it
completely, but it really shouldn't happen often.

If this happens frequently, please re-open this bug.

Okay. I've created [[d:Wikidata:True duplicates]] to monitor how often this happens. Thanks for figuring this out. :)

byrial wrote:

I wrote all 30,536,568 links in the 2013-05-27 database dump to a file, and then used sort(1) and uniq(1) to find all duplicate links. The result is 38,764 duplicates, which is 1.3 per 1,000 links, so this is not a rare thing.

byrial wrote:

Correction: My default collation order sorted some different characters as the same. With LC_COLLATE=C there are only 3,182 duplicates (0.10 duplicates per 1,000 links), but I think that is still many.
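For anyone who wants to repeat the count without depending on locale collation, here is a minimal sketch in Python. It is not the sort(1)/uniq(1) pipeline used above; it swaps in `collections.Counter`, which groups strings by exact equality, so characters that a locale-aware sort treats as equivalent stay distinct. The function names and the four-entry sample are assumptions made for illustration; the link strings themselves appear elsewhere in this report.

```python
from collections import Counter


def count_duplicates(links):
    """Return {link: occurrences} for every link that occurs more than once."""
    counts = Counter(links)
    return {link: n for link, n in counts.items() if n > 1}


def rate_per_thousand(duplicates, total):
    """Duplicates per 1,000 links."""
    return 1000 * duplicates / total


# Tiny sample: one real duplicate plus two similar-looking but different links.
sample = [
    "da:Viborgvej (Aarhus)",
    "da:Viborgvej (Aarhus)",
    "ar:يحيى الفخراني",
    "arz:يحيى الفخرانى",
]
dupes = count_duplicates(sample)

# The two rates quoted in this thread, recomputed from the reported totals.
first_rate = rate_per_thousand(38764, 30536568)
corrected_rate = rate_per_thousand(3182, 30536568)
print(dupes, round(first_rate, 1), round(corrected_rate, 2))
```

Only the da link is reported as a duplicate in the sample; the ar and arz entries stay separate because they are compared exactly.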

(In reply to comment #8)

I wrote all 30,536,568 links in the 2013-05-27 database dump to a file, and then
used sort(1) and uniq(1) to find all duplicate links.

To my knowledge, there can be no dupes in the wb_items_per_site table, because there is a primary key covering the relevant fields.

Can you show exactly what you did? What exactly does the query look like? Can you give some example duplicates?

As far as I can see, the problem described in this report occurs when there are things *missing* from wb_items_per_site, and thus conflicts fail to be detected.

byrial wrote:

(Reply to comment #8)
I did not use the wb_items_per_site table (that would be impossible for me, as it is not available in the public database dumps of Wikidata).

I downloaded http://dumps.wikimedia.org/wikidatawiki/20130527/wikidatawiki-20130527-pages-articles.xml.bz2, and parsed the stored JSON-formatted page text for each item (that is, all pages in namespace 0). There are 12,565,377 items in the file and they contain 30,536,568 links. 3,160 of the links occur twice, for example:

als:Vorlage:Navigationsleiste Schweizer Gebirgspässe
an:Piedra (desambigación)
ar:تصنيف:بريطانيا
ar:تصنيف:تاريخ الشام
ar:تصنيف:تعليقات للرد
ar:تصنيف:جامعة محمد الخامس
ar:تصنيف:ولاية أريانة
ar:تصنيف:ولاية قابس
ar:تصنيف:ويكيبيديون رجال
ar:يحيى الفخراني
arz:تصنيف:بريطانيا
arz:يحيى الفخرانى
ast:Categoría:Botsuana
ast:Categoría:Llioneses
ast:La Caleya

22 of the links occur three times, for example:

az:Kateqoriya:1802-ci ildəki hadisələr
be-x-old:Катэгорыя:Падзеі 1802 году
be:Катэгорыя:Горад Мар'іна Горка
eo:Kategorio:139° U
map-bms:Kategori:Bangsalsari, Jember
map-bms:Kategori:Cibitung, Bekasi
map-bms:Kategori:Cilebak, Kuningan
os:Категори:139° н. д.
sr:Категорија:Босанскохерцеговачки вајари
ta:இடைக்குன்றூர் கிழார்
ta:உறையூர் மருத்துவன் தாமோதரனார்

I will later prepare a complete list of the items which contain the duplicate links so they can be deleted.

Thanks for investigating, Byrial!

(In reply to comment #10)

(Reply to comment #8)
I did not use the wb_items_per_site table (that would be impossible for me, as
it is not available in the public database dumps of Wikidata).

Ah - I guess we should fix that.
It's available on the toolserver though. I assumed you were using that.

There are 12,565,377 items in the file and they contain
30,536,568 links. 3,160 of the links occur twice, for example:

Could you please include the item IDs in that list? I can find one of the items in the database easily, but (by the nature of the bug) not the other (or the third).

I will later prepare a complete list of the items which contain the duplicate
links so they can be deleted.

That would be awesome, thank you!

Re-opening, so we can track the investigation and deletion of further duplicates.

byrial wrote:

Please see http://www.wikidata.org/wiki/User:Byrial/Duplicates

Each line contains one of the 3,182 duplicate links and the 2 or 3 items which contain the link. (NB: The links in the list are not the original links as they appear in the database dump file, because I changed the original localized namespace names for some (but not all) languages when I originally parsed the database dump.)

One item may be on the list several times when it contains several duplicated links for different languages. The list is sorted by item number.
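A list of this shape can be produced by grouping item numbers by link. The sketch below is an assumption about one way to do it, not the script that generated the page: the function name `items_by_link` and the input format (pairs of numeric item ID and link string) are hypothetical, and the `en:Hypothetical unique page` entry on item 1 is made up to show that unique links are left out. The other pairs are the duplicates named in this report.

```python
from collections import defaultdict


def items_by_link(item_links):
    """Group item numbers by link; keep links held by more than one item.

    item_links: iterable of (item_number, link) pairs.
    Returns a list of (sorted_item_numbers, link), ordered by item number.
    """
    groups = defaultdict(set)
    for item_number, link in item_links:
        groups[link].add(item_number)
    rows = [(sorted(ids), link) for link, ids in groups.items() if len(ids) > 1]
    rows.sort()  # orders by the lowest item number of each row first
    return rows


pairs = [
    (12863749, "ka:დიდი ბრიტანეთი 1960 წლის ზაფხულის ოლიმპიურ თამაშებზე"),
    (2618201, "ka:დიდი ბრიტანეთი 1960 წლის ზაფხულის ოლიმპიურ თამაშებზე"),
    (12863758, "ka:დიდი ბრიტანეთი 1996 წლის ზაფხულის ოლიმპიურ თამაშებზე"),
    (146146, "ka:დიდი ბრიტანეთი 1996 წლის ზაფხულის ოლიმპიურ თამაშებზე"),
    (12340897, "da:Viborgvej (Aarhus)"),
    (12343899, "da:Viborgvej (Aarhus)"),
    (1, "en:Hypothetical unique page"),  # made-up entry, not a duplicate
]
rows = items_by_link(pairs)
for ids, link in rows:
    print(" ".join(f"Q{n}" for n in ids), link)
```

Using a set per link means an item that lists the same link twice is counted once, so only links shared between different items are reported.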

We can't prevent this from happening, but we could:

  • try harder to detect incomplete saves
  • have a maintenance script that removes sitelinks from entities that don't have that sitelink stored in the database table.
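The second idea could look roughly like the sketch below. It is a toy model and not an existing Wikibase maintenance script: the function names, the in-memory dictionaries standing in for the entity blobs and for wb_items_per_site, and the site ID `kawiki` are all assumptions made for illustration.

```python
def find_unregistered_sitelinks(entities, items_per_site):
    """List sitelinks that an entity carries but the table does not credit to it.

    entities: {item_id: {site: title}}, standing in for the stored blobs.
    items_per_site: {(site, title): item_id}, standing in for the table.
    Returns a sorted list of (item_id, site, title).
    """
    stale = []
    for item_id, sitelinks in entities.items():
        for site, title in sitelinks.items():
            if items_per_site.get((site, title)) != item_id:
                stale.append((item_id, site, title))
    return sorted(stale)


def remove_sitelinks(entities, stale):
    """Drop the listed sitelinks from the entity data in place."""
    for item_id, site, _title in stale:
        del entities[item_id][site]


title = "დიდი ბრიტანეთი 1960 წლის ზაფხულის ოლიმპიურ თამაშებზე"
entities = {
    "Q12863749": {"kawiki": title},
    "Q2618201": {"kawiki": title},
}
items_per_site = {("kawiki", title): "Q2618201"}

stale = find_unregistered_sitelinks(entities, items_per_site)
remove_sitelinks(entities, stale)
print(stale)
```

Note that this sketch also flags a sitelink that has no table row at all. In that case registering the link might be a better repair than removing it, so a real script would probably need to tell the two cases apart.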

*** This bug has been marked as a duplicate of bug 42325 ***