Prevent creation of items having the same sitelinks (duplicates)
Open, High | Public | 8 Estimated Story Points

Description

Main components:

  • Wikibase Repo

Problem:
Due to replication lag, or to writes that happen after the request has finished, multiple edits made in the same second can end up creating duplicate sitelinks (see examples in the comments).

Suggested solution:
An approach is discussed in T44325#6976484 and T44325#6981283: take an additional lock on sitelinks while they are being added, held until the secondary store has been updated.

Steps to reproduce:
See comment from Lucas (T44325#7355099).

Acceptance criteria:

  • implement suggested solution so that multiple edits cannot create duplicate sitelinks anymore (even if made in the same second)

Details

Reference
bz42325

Related Objects

Event Timeline

We're still getting quite a few duplicates. In the 2016-02-15 dump I found 951 sitelinks that appear more than once, in the 2016-10-10 dump there are 3914. I haven't checked all of them, but I've already come across a bunch of examples with quite a bit of time between the creations, e.g.:

https://www.wikidata.org/wiki/Q23889992
https://www.wikidata.org/wiki/Q23890002
https://www.wikidata.org/wiki/Q23890013
https://www.wikidata.org/wiki/Q23890309

These are from April 2016. There's nearly half an hour between the first and last one.

https://www.wikidata.org/wiki/Q21066942
https://www.wikidata.org/wiki/Q23760872

The sitelink on the first item was updated in March 2016, the second item was created almost a month later.

https://www.wikidata.org/wiki/Q19656655
https://www.wikidata.org/wiki/Q19972345

The first item was created in March 2015, the second was created over two months later.

Also:

https://www.wikidata.org/wiki/Q3778334
https://www.wikidata.org/wiki/Q3778170
https://www.wikidata.org/wiki/Q3778109
https://www.wikidata.org/wiki/Q3777949
https://www.wikidata.org/wiki/Q3778306
https://www.wikidata.org/wiki/Q3777922

It seems that in May 2015 the pages were combined together and the histories merged. Those didn't involve creating a new item... should there be a different ticket for that?

Beta16 raised the priority of this task from Medium to High. Oct 14 2016, 7:27 AM

The problem here is that we do these constraint checks based on the wb_items_per_site table. This table is (or should be) updated immediately after an edit happens, so the checks only work correctly once the change has been picked up by all database replicas. Until then, a second edit can be checked against a replica that does not yet contain the new sitelink.
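To make the timing window concrete, here is a toy model of the race, assuming the uniqueness check reads from a copy of the table that lags behind the writes. The class and function names are hypothetical and this is not Wikibase code; it only illustrates why two creations in quick succession can both pass the check.

```javascript
// Toy model: uniqueness is checked against a lagging replica.
// All names here are hypothetical; this is not Wikibase code.
class SitelinkStore {
	constructor() {
		this.primary = new Map(); // sitelink key -> item ID, written immediately
		this.replica = new Map(); // copy that only catches up on replicate()
	}

	// Simulate the replica catching up with the primary.
	replicate() {
		this.replica = new Map( this.primary );
	}

	// Returns the conflicting item ID as seen by the replica, or null.
	findConflict( key ) {
		return this.replica.has( key ) ? this.replica.get( key ) : null;
	}

	write( key, itemId ) {
		this.primary.set( key, itemId );
	}
}

// Check-then-write, as a single request would do it.
function createItem( store, itemId, siteId, title, created ) {
	const key = `${ siteId }|${ title }`;
	if ( store.findConflict( key ) !== null ) {
		return false; // rejected: the sitelink is already in use
	}
	store.write( key, itemId );
	created.push( itemId );
	return true;
}

const store = new SitelinkStore();
const created = [];
const first = createItem( store, 'Q1', 'examplewiki', 'Example', created );
// The replica has not caught up yet, so the check sees no conflict.
const second = createItem( store, 'Q2', 'examplewiki', 'Example', created );
store.replicate();
// Now the replica knows about the sitelink and the creation is rejected.
const third = createItem( store, 'Q3', 'examplewiki', 'Example', created );
```

In this model `created` ends up holding both `Q1` and `Q2`, which mirrors the duplicates reported here; only the third attempt, made after replication, is rejected.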

I've just fixed another occurrence of this: Q37637657 and Q37637661.

I think that was repo/maintenance/rebuildItemsPerSite.php which reports problematic items.

We can re-run it, but should probably only do so after the DC switchback (given the script runs for quite some time). I'll kick it off on Monday (in case everything is fine).

Mentioned in SAL (#wikimedia-operations) [2018-10-15T11:29:24Z] <hoo> Started rebuildItemsPerSite on mwmaint1002 (T44325). Can be killed at any time, if necessary.

@hoo did we ever get a result from that script run?

A list of more examples, with a description of how such duplicates were generated unintentionally, is now available at: https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2018/10#Items_created_unintentionally_twice_on_several_occasions

> @hoo did we ever get a result from that script run?

I forgot about this… most of this seems to still be unresolved, so I put it up on https://www.wikidata.org/wiki/Wikidata:True_duplicates#Items_with_conflicting_sitelinks.

Today duplicate items were again created on Wikidata, on two occasions:

2019-04-25T06:23:34 diff hist +466 N Category:2012 in Judaism (Q63323044) Created a new Item Tag: PHP7
2019-04-25T06:23:34 diff hist +466 N Category:2012 in Judaism (Q63323045) Created a new Item current Tag: PHP7

2019-04-25T06:18:31 diff hist +463 N Category:1896 in Poland (Q63323035) Created a new Item Tag: PHP7
2019-04-25T06:18:31 diff hist +463 N Category:1896 in Poland (Q63323036) Created a new Item current Tag: PHP7

Today duplicate items were again created unintentionally on Wikidata, on 7 occasions:

2019-05-15T23:02:42 diff hist +466 N Category:1894 in Judaism (Q63864072) Created a new Item current
2019-05-15T23:02:42 diff hist +466 N Category:1894 in Judaism (Q63864074) Created a new Item current

2019-05-15T23:01:47 diff hist +466 N Category:1891 in Judaism (Q63864036) Created a new Item current
2019-05-15T23:01:47 diff hist +466 N Category:1891 in Judaism (Q63864037) Created a new Item current

2019-05-15T22:59:02 diff hist +466 N Category:1890 in Judaism (Q63863914) Created a new Item
2019-05-15T22:59:02 diff hist +466 N Category:1890 in Judaism (Q63863915) Created a new Item current

2019-05-15T00:41:11 diff hist +466 N Category:1861 in Judaism (Q63852412) Created a new Item Tag: PHP7
2019-05-15T00:41:11 diff hist +466 N (Q63852413) Created a new Item Tag: PHP7

2019-05-15T00:38:09 diff hist +466 N Category:1863 in Judaism (Q63852406) Created a new Item Tag: PHP7
2019-05-15T00:38:09 diff hist +466 N (Q63852407) Created a new Item Tag: PHP7

2019-05-15T00:35:59 diff hist +466 N Category:1864 in Judaism (Q63852400) Created a new Item Tag: PHP7
2019-05-15T00:35:59 diff hist +466 N (Q63852401) Created a new Item Tag: PHP7

2019-05-15T00:30:44 diff hist +466 N Category:1867 in Judaism (Q63852214) Created a new Item Tag: PHP7
2019-05-15T00:30:44 diff hist +466 N (Q63852215) Created a new Item Tag: PHP7

Each of these happened when I proceeded as follows: on a category page on Commons, I clicked the "In Wikipedia Add links" entry in the sidebar for a category that was not yet linked to any Wikidata item, and linked it to a category on English Wikipedia that was also not yet linked to any Wikidata item. When the system then creates the corresponding Wikidata item with links to Commons and to English Wikipedia, the item is created twice.

Maybe this can help a developer reproduce the issue.

And here are another 3 examples where creating Wikidata items based on categories on Commons led to the creation of 2 Wikidata items linked to the same pages on Commons and on English Wikipedia:

2019-05-15T23:02:42 diff hist +466 N Category:1894 in Judaism (Q63864072) Created a new Item
2019-05-15T23:02:42 diff hist +466 N (Q63864074) Created a new Item

2019-05-15T23:01:47 diff hist +466 N Category:1891 in Judaism (Q63864036) Created a new Item
2019-05-15T23:01:47 diff hist +466 N (Q63864037) Created a new Item

2019-05-15T22:59:02 diff hist +466 N Category:1890 in Judaism (Q63863914) Created a new Item
2019-05-15T22:59:02 diff hist +466 N (Q63863915) Created a new Item

A few more examples of double creations on Wikidata that I did not intend:

2019-06-28T23:26:49 diff hist +469 N Adolph Schreiber House (Q64864961) Created a new Item current
2019-06-28T23:26:49 diff hist +469 N Adolph Schreiber House (Q64864962) Created a new Item current

2019-06-26T23:47:59 diff hist +436 N Klehm House (Q64833963) Created a new Item current
2019-06-26T23:47:58 diff hist +436 N Klehm House (Q64833962) Created a new Item current

2019-06-26T23:42:05 diff hist +466 N Kieldson Double House (Q64833830) Created a new Item current
2019-06-26T23:42:04 diff hist +466 N Kieldson Double House (Q64833829) Created a new Item current

2019-06-05T22:35:56 diff hist +469 N Category:Earls of Balfour (Q64409304) Created a new Item current Tag: PHP7
2019-06-05T22:35:55 diff hist +469 N (Q64409303) Created a new Item Tag: PHP7

2019-05-22T20:07:25 diff hist +466 N Category:1886 in Judaism (Q63985592) Created a new Item
2019-05-22T20:07:25 diff hist +466 N (Q63985594) Created a new Item

2019-05-19T19:26:35 diff hist +478 N Category:Fort Meade, Florida (Q63955820) Created a new Item current
2019-05-19T19:26:34 diff hist +478 N Category:Fort Meade, Florida (Q63955819) Created a new Item

As I create most of my Wikidata items the same way, and as this does not occur systematically, I imagine that it only happens when certain conditions apply in the database.

Unfortunately I am not able to reproduce the phenomenon; I can only locate the results in my contributions list.

Concerning the duplicates mentioned in T229391: Two wikidata items with the same sitelink.

I noted from the respective revision histories on Wikidata that their creation is recorded as follows:

Q64008110: 2019-05-24T17:49:07 TAKASUGI Shinji talk contribs 595 bytes +595 Created a new Item
Q64008111: 2019-05-24T17:49:07 TAKASUGI Shinji talk contribs 595 bytes +595 Created a new Item

So for these 2 items, as for the ones I listed in various previous comments, 2 items were created within the same second with the same content.

Unfortunately my knowledge of databases and the tools to analyse them does not allow me to create reports to find more such examples in the Wikidata database, but I am quite convinced that there are many more such double creations on Wikidata, either in the same second or within a few seconds.
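For anyone wanting to build such a report, a minimal sketch follows. It assumes the entities are available as objects shaped like the Wikibase JSON format (`id` plus a `sitelinks` map whose values carry `site` and `title`); the function name and the sample data are made up for illustration.

```javascript
// Sketch: group items by (site, title) and report sitelinks used by more than one item.
// Assumes entities shaped like the Wikibase JSON format:
// { id: 'Q…', sitelinks: { <siteId>: { site, title } } }
function findDuplicateSitelinks( entities ) {
	const itemsBySitelink = new Map();
	for ( const entity of entities ) {
		for ( const { site, title } of Object.values( entity.sitelinks || {} ) ) {
			// "|" cannot occur in MediaWiki page titles, so it is a safe separator.
			const key = `${ site }|${ title }`;
			if ( !itemsBySitelink.has( key ) ) {
				itemsBySitelink.set( key, [] );
			}
			itemsBySitelink.get( key ).push( entity.id );
		}
	}
	// Keep only sitelinks that appear on two or more items.
	return [ ...itemsBySitelink ].filter( ( [ , ids ] ) => ids.length > 1 );
}

// Made-up sample data, not real items.
const sampleEntities = [
	{ id: 'Q1', sitelinks: { examplewiki: { site: 'examplewiki', title: 'Category:Example' } } },
	{ id: 'Q2', sitelinks: { examplewiki: { site: 'examplewiki', title: 'Category:Example' } } },
	{ id: 'Q3', sitelinks: { examplewiki: { site: 'examplewiki', title: 'Other page' } } },
	{ id: 'Q4' }
];
const duplicates = findDuplicateSitelinks( sampleEntities );
```

On a full dump the entities would have to be streamed rather than held in one array, and the map itself would be large, so this only shows the grouping logic.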

I found another unintentional multiple creation of Wikidata items in my contributions:

2020-08-19T14:54:38 diff hist +530 N Category:19th century in Castilla–La Mancha (Q98489876) Created a new Item current
2020-08-19T14:54:38 diff hist +530 N Category:19th century in Castilla–La Mancha (Q98489877) Created a new Item current
2020-08-19T14:54:38 diff hist +530 N Category:19th century in Castilla–La Mancha (Q98489875) Created a new Item current
2020-08-19T14:54:37 diff hist +530 N Category:19th century in Castilla–La Mancha (Q98489873) Created a new Item

So 4 items were created within 1 second with just one click.

I would like to add that this only happens to me every now and then, and I am not able to reproduce it deliberately.
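Since these cases are easiest to spot in a contributions list, here is a rough sketch of how pasted list lines could be scanned for bursts of item creations. The helper name is hypothetical, and it assumes English-language lines in the shape shown in this thread (an ISO timestamp first, the item ID in parentheses, and the text "Created a new Item"). A burst is only a candidate; whether the items share a sitelink still has to be checked.

```javascript
// Hypothetical helper: find runs of "Created a new Item" entries in which each
// creation follows the previous one by at most maxGapSeconds.
function findCreationBursts( lines, maxGapSeconds = 1 ) {
	const pattern = /^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}).*\((Q\d+)\).*Created a new Item/;
	const creations = lines
		.map( ( line ) => {
			const match = pattern.exec( line );
			// Append "Z" so the timestamp is parsed as UTC regardless of local time zone.
			return match ? { time: Date.parse( match[ 1 ] + 'Z' ), id: match[ 2 ] } : null;
		} )
		.filter( ( entry ) => entry !== null )
		.sort( ( a, b ) => a.time - b.time || a.id.localeCompare( b.id ) );

	const bursts = [];
	let current = [];
	for ( const entry of creations ) {
		const previous = current[ current.length - 1 ];
		if ( previous && entry.time - previous.time > maxGapSeconds * 1000 ) {
			// Gap too large: close the current run, keeping it only if it has 2+ items.
			if ( current.length > 1 ) {
				bursts.push( current.map( ( e ) => e.id ) );
			}
			current = [];
		}
		current.push( entry );
	}
	if ( current.length > 1 ) {
		bursts.push( current.map( ( e ) => e.id ) );
	}
	return bursts;
}

// Lines taken from the examples in this thread.
const sampleLines = [
	'2020-08-19T14:54:38 diff hist +530 N Category:19th century in Castilla–La Mancha (Q98489876) Created a new Item current',
	'2020-08-19T14:54:38 diff hist +530 N Category:19th century in Castilla–La Mancha (Q98489877) Created a new Item current',
	'2020-08-19T14:54:38 diff hist +530 N Category:19th century in Castilla–La Mancha (Q98489875) Created a new Item current',
	'2020-08-19T14:54:37 diff hist +530 N Category:19th century in Castilla–La Mancha (Q98489873) Created a new Item',
	'2019-06-26T23:47:59 diff hist +436 N Klehm House (Q64833963) Created a new Item current',
	'2019-06-26T23:47:58 diff hist +436 N Klehm House (Q64833962) Created a new Item current',
	'2019-06-26T23:42:05 diff hist +466 N Kieldson Double House (Q64833830) Created a new Item current'
];
const bursts = findCreationBursts( sampleLines );
```

With these sample lines, the Klehm House pair and the four Castilla–La Mancha items each form one burst, while the single Kieldson Double House line is ignored.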

So this seems to mainly be an issue with creating an item with a sitelink immediately after another item has been created with the same sitelink.
The description of this ticket is probably out of date now; it talks about a 30-minute window, which should no longer happen.

So assuming we are looking at the time immediately after item creation, before the sitelinks enter the secondary index, we could add an extra layer of checks for this specific case.
This could, for example, mean storing the sitelinks of newly created items in memcached for a short period of time (covering the time before they are in the SQL index).
This cache would then also be checked when new sitelinks are added.

Thoughts, @hoo?
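A rough sketch of that extra layer, using an in-memory map with expiry as a stand-in for memcached. The class, the function, the 60-second lifetime and the sample keys are all assumptions made for illustration, not existing Wikibase code.

```javascript
// Hypothetical in-memory stand-in for a short-lived cache such as memcached.
class RecentSitelinkCache {
	constructor( ttlMs, now = () => Date.now() ) {
		this.ttlMs = ttlMs;
		this.now = now; // injectable clock, so expiry can be demonstrated
		this.entries = new Map(); // sitelink key -> { itemId, expiresAt }
	}

	remember( key, itemId ) {
		this.entries.set( key, { itemId, expiresAt: this.now() + this.ttlMs } );
	}

	// Returns the item that recently claimed this sitelink, or null.
	lookup( key ) {
		const entry = this.entries.get( key );
		if ( !entry ) {
			return null;
		}
		if ( entry.expiresAt <= this.now() ) {
			this.entries.delete( key );
			return null;
		}
		return entry.itemId;
	}
}

// Conflict check that consults the SQL index first and the recent cache second.
function findConflict( key, sqlIndex, recentCache ) {
	return sqlIndex.get( key ) ?? recentCache.lookup( key );
}

let clock = 0;
const recentCache = new RecentSitelinkCache( 60000, () => clock );
const sqlIndex = new Map(); // stands in for the lagging secondary index, still empty

recentCache.remember( 'examplewiki|Example', 'Q1' );
const conflictDuringLag = findConflict( 'examplewiki|Example', sqlIndex, recentCache );

clock = 61000; // the cache entry has expired by now
const conflictAfterExpiry = findConflict( 'examplewiki|Example', sqlIndex, recentCache );
```

Here the check finds `Q1` while the index is still empty, and nothing once the entry has expired. Note that a plain lookup followed by a write is not atomic, so two truly simultaneous requests could still both pass this check.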

> So assuming we are looking at the time immediately after item creation, before the sitelinks enter the secondary index, we could add an extra layer of checks for this specific case.
> This could, for example, mean storing the sitelinks of newly created items in memcached for a short period of time (covering the time before they are in the SQL index).
> This cache would then also be checked when new sitelinks are added.

I had a similar idea, but I think using the object cache alone is not going to entirely solve this, as we can't truly get an exclusive lock. Using IDatabase::lock alone might not work reliably either, as those locks get dropped if the connection is closed. The following should work though:

Entity creation:

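// Sketch only: $dbw is assumed to be an IDatabase handle, $cache a BagOStuff instance.
foreach ( $sitelinks as $sitelink ) {
	// Take the named DB lock, then check that nobody has claimed the sitelink yet.
	if ( $dbw->lock( $sitelink, __METHOD__ ) && !$cache->get( $sitelink ) ) {
		$cache->set( $sitelink, true, 1234 ); // claim expires after 1234 seconds
	} else {
		// give up and clean up
	}
}

// release all IDatabase::lock locks (automatically happens once the connection closes)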

When writing the sitelinks we can now simply remove the BagOStuff cache entries.

One potential problem is that all requests creating an entity at the same time might fail (as no single request might manage to get all locks), but I guess this is pretty pathological and not going to happen much.
The more relevant problem is probably going to be that we (successfully) claim all sitelinks, but then never actually commit them to the table (as the edit fails for some reason). We might be able to mitigate this by only placing the locks after all or most other checks are done (so we can be fairly sure the edit is going through).
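The "give up and clean up" and "remove the cache entries" steps can be sketched with a small in-memory model. This is a toy, not MediaWiki code: the class and method names are hypothetical, requests run one after another, and in a real implementation each check-and-set would have to happen under a named database lock as described above.

```javascript
// Toy model of all-or-nothing sitelink claims with rollback.
class SitelinkClaims {
	constructor() {
		this.claims = new Map(); // sitelink key -> owner of the pending claim
	}

	// Try to claim every key for `owner`; undo partial claims on the first conflict.
	claimAll( owner, keys ) {
		const claimedHere = [];
		for ( const key of keys ) {
			if ( this.claims.has( key ) ) {
				// Another request holds a claim: give up and clean up.
				claimedHere.forEach( ( claimed ) => this.claims.delete( claimed ) );
				return false;
			}
			this.claims.set( key, owner );
			claimedHere.push( key );
		}
		return true;
	}

	// Called once the sitelinks are written to the table, or when the edit fails.
	release( owner, keys ) {
		for ( const key of keys ) {
			if ( this.claims.get( key ) === owner ) {
				this.claims.delete( key );
			}
		}
	}
}

const claims = new SitelinkClaims();
const keysA = [ 'commonswiki|Category:Example', 'enwiki|Category:Example' ];
const keysB = [ 'dewiki|Kategorie:Beispiel', 'enwiki|Category:Example' ];

const requestA = claims.claimAll( 'A', keysA );
const requestB = claims.claimAll( 'B', keysB ); // conflicts on the enwiki sitelink
const leftoverClaim = claims.claims.has( 'dewiki|Kategorie:Beispiel' );

claims.release( 'A', keysA );
const retryB = claims.claimAll( 'B', keysB );
```

Request B fails without leaving its partial claim behind, and succeeds on retry once A has released its claims. Releasing on edit failure as well as on success is one way to limit the "claimed but never committed" case; the expiry time on the cache entries covers requests that die before they can release.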

> I found another unintentional multiple creation of Wikidata items in my contributions:
>
> 2020-08-19T14:54:38 diff hist +530 N Category:19th century in Castilla–La Mancha (Q98489876) Created a new Item current
> 2020-08-19T14:54:38 diff hist +530 N Category:19th century in Castilla–La Mancha (Q98489877) Created a new Item current
> 2020-08-19T14:54:38 diff hist +530 N Category:19th century in Castilla–La Mancha (Q98489875) Created a new Item current
> 2020-08-19T14:54:37 diff hist +530 N Category:19th century in Castilla–La Mancha (Q98489873) Created a new Item
>
> So 4 items were created within 1 second with just one click.
>
> I would like to add that this only happens to me every now and then, and I am not able to reproduce it deliberately.

Are these all created using Special:NewItem?
Is this literally just the button being clicked 4 times?
That sounds bad, and like something that should be fixed in another way, separate from this ticket.

Addshore set the point value for this task to 8. Sep 15 2021, 10:55 AM

This is fairly simple to reproduce locally even without breakpoints; on your local repo wiki, create a new wiki page and then run the following code in the browser console:

const params = { action: 'wbeditentity', new: 'item', data: JSON.stringify( { sitelinks: { [ mw.config.get( 'wgDBname' ) ]: { site: mw.config.get( 'wgDBname' ), title: mw.config.get( 'wgTitle' ) } } } ) };
const api = new mw.Api();
( await Promise.all( [ api.postWithEditToken( params ), api.postWithEditToken( params ) ] ) ).map( response => response.entity.id );

(Assuming the repo is configured to link to itself using its db name, otherwise you may need to tweak the sitelinks in the payload.)
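If the sitelinks in the payload do need tweaking, a small helper along these lines may be convenient. The function name and the sample site ID and title are made up; the parameter shape simply follows the snippet above.

```javascript
// Hypothetical helper: build wbeditentity parameters for a new item with one sitelink.
function buildCreateItemParams( siteId, title ) {
	return {
		action: 'wbeditentity',
		new: 'item',
		data: JSON.stringify( { sitelinks: { [ siteId ]: { site: siteId, title } } } )
	};
}

// 'examplewiki' and 'Sandbox' are placeholders; use a site ID your repo actually knows.
const params = buildCreateItemParams( 'examplewiki', 'Sandbox' );
```

In the browser console, the resulting `params` can then be passed to the two `api.postWithEditToken( params )` calls exactly as in the snippet above.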

>> I found another unintentional multiple creation of Wikidata items in my contributions:
>>
>> 2020-08-19T14:54:38 diff hist +530 N Category:19th century in Castilla–La Mancha (Q98489876) Created a new Item current
>> 2020-08-19T14:54:38 diff hist +530 N Category:19th century in Castilla–La Mancha (Q98489877) Created a new Item current
>> 2020-08-19T14:54:38 diff hist +530 N Category:19th century in Castilla–La Mancha (Q98489875) Created a new Item current
>> 2020-08-19T14:54:37 diff hist +530 N Category:19th century in Castilla–La Mancha (Q98489873) Created a new Item
>>
>> So 4 items were created within 1 second with just one click.
>>
>> I would like to add that this only happens to me every now and then, and I am not able to reproduce it deliberately.
>
> Are these all created using Special:NewItem?
> Is this literally just the button being clicked 4 times?
> That sounds bad, and like something that should be fixed in another way, separate from this ticket.

All of these were created using the "Add links" button on Commons.

I can rule out that the button was clicked 4 times.

Another recent example:

2021-09-19T21:45:49 diff hist +444 N Category:Pilling (Q108603047) Created a new Item Tag: Sitelink Change from Connected Wiki
2021-09-19T21:45:47 diff hist +444 N Category:Pilling (Q108603046) Created a new Item current Tag: Sitelink Change from Connected Wiki

It was also created by clicking the "Add links" button on Commons.

I am adding the example from T291502:

This bnwiki article (https://bn.wikipedia.org/wiki/পত্রালি_চট্টোপাধ্যায়) is connected with two Wikidata items:
#Q108406003 (https://wikidata.org/wiki/Q108406003)
#Q108406004 (https://wikidata.org/wiki/Q108406004)
If I click the link named ‘Wikidata item’ in the sidebar of the article, it takes me to the item Q108406003.

When investigating further at: https://www.wikidata.org/w/index.php?target=Yahya+%28flood%29&namespace=all&tagfilter=&newOnly=1&start=&end=&limit=1000&title=Special%3AContributions

I found these items there as follows:

2021-09-04T14:50:24 diff hist +843 N (Q108406004) Created a new Item: batch #63427 Tag: quickstatements [2.0]
2021-09-04T14:50:24 diff hist +843 N (Q108406003) Created a new Item: batch #63427 Tag: quickstatements [2.0]

I noticed that there was also the following duplicate:

2021-09-04T14:47:33 diff hist +777 N (Q108405827) Created a new Item: batch #63427 current Tag: quickstatements [2.0]
2021-09-04T14:47:33 diff hist +777 N (Q108405826) Created a new Item: batch #63427 current Tag: quickstatements [2.0]

and moreover:

2021-09-04T14:46:05 diff hist +815 N (Q108405681) Created a new Item: batch #63427 current Tag: quickstatements [2.0]
2021-09-04T14:46:05 diff hist +815 N (Q108405682) Created a new Item: batch #63427 current Tag: quickstatements [2.0]

and finally:

2021-09-04T14:46:04 diff hist +783 N (Q108405679) Created a new Item: batch #63427 current Tag: quickstatements [2.0]
2021-09-04T14:46:04 diff hist +783 N (Q108405678) Created a new Item: batch #63427 current Tag: quickstatements [2.0]

These 3 duplicates were created during a batch run, so we can rule out that a human clicked twice on an icon or a button.

For each pair, I verified that both Wikidata items point to the same page on Bengali Wikipedia.

The first part of the mitigations in T291377 should be deployed now, and the second part will be deployed over the next week (with the usual MediaWiki train, i.e. on Wednesday if nothing goes wrong). It would be interesting to know if the first part alone already eliminates most of the true duplicates, though I’m not sure how we can determine that.

Addshore claimed this task.

I'm going to go ahead and mark this as resolved for now.
Please do let us know if it comes up again.

This comment was removed by Manuel.

The items were created in 2022, so probably not a current problem?