json dumps have duplicate items (one for the redirect, one for the target)
Closed, Resolved · Public

Description

from project chat:

https://www.wikidata.org/wiki/Wikidata:Project_chat#JSON_dump_has_duplicates

I've been working with the JSON dumps and noticed that they contain identical duplicate entries. For example, in the latest dump [3], lines 921522 and 16155575 are identical entries for the item Turi railway station (Q17100180). There are dozens of these duplicates. Should these be treated in a special way when processing the data dump? Jefft0 (talk) 01:17, 29 October 2014 (UTC)

It looks like another item page [4] redirects to Turi railway station (Q17100180). I don't think the redirect should be in the dump as a duplicate, so this seems like a bug. But the redirect should probably be represented somewhere, in some form. Aude (talk) 07:18, 29 October 2014 (UTC)


Version: unspecified
Severity: normal
Whiteboard: u=dev c=backend p=0

Details

Reference
bz72678

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:44 AM
bzimport set Reference to bz72678.
bzimport added a subscriber: Unknown Object (MLST).

I wonder whether we want to include information about redirects in there or simply leave redirects out? Just leaving them out will be easier to implement and won't change the schema (thus not breaking b/c), so I'd suggest going down that road.

(In reply to Marius Hoch from comment #1)

I wonder whether we want to include information about redirects in there or simply leave redirects out? Just leaving them out will be easier to implement and won't change the schema (thus not breaking b/c), so I'd suggest going down that road.

An item may have a property value that is a redirected item. Does the JSON dump "resolve" such values and dump the item ID of the redirect target? If yes, then the JSON dump can leave out redirect information. But if the JSON dump does not resolve redirects, then the redirects must be dumped somewhere (maybe in the JSON dump or in another file).
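
One way to probe that from the dump alone, without a separate redirect list, is to compare the IDs that have their own entry in the dump against the item IDs used as main statement values. A rough Python sketch, assuming the usual one-entity-per-line dump layout and a local file name; it only looks at main snaks (not qualifiers or references) and keeps all IDs in memory:

import json

entity_ids = set()       # every ID that has its own line in the dump
referenced_ids = set()   # every item ID used as a main statement value

with open("latest-all.json", encoding="utf-8") as dump:   # file name assumed
    for line in dump:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue
        entity = json.loads(line)
        entity_ids.add(entity["id"])
        for statements in entity.get("claims", {}).values():
            for statement in statements:
                datavalue = statement.get("mainsnak", {}).get("datavalue", {})
                if datavalue.get("type") != "wikibase-entityid":
                    continue
                value = datavalue["value"]
                if value.get("entity-type", "item") != "item":
                    continue
                # older dumps only carry "numeric-id", newer ones also carry "id"
                referenced_ids.add(value.get("id") or "Q%d" % value["numeric-id"])

# Item IDs used as values but absent from the dump would be redirects that were
# not resolved, or items deleted after the statement was made.
print(len(referenced_ids - entity_ids))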

(In reply to Ori Livneh from comment #3)

See: http://tools.ietf.org/html/rfc6901

<hoo> thinking about it: That would horribly not work, although it might be semantically nice
<hoo> our JSON is close to 40 GB now (or so), thus people usually read it line by line and not as one JSON document
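
For reference, the line-by-line reading works because the dump is a single JSON array with exactly one entity object per line. A minimal Python sketch, with the file name assumed:

import json

def read_entities(path):
    # The dump is one big JSON array, but with one entity object per line,
    # so it can be streamed instead of being parsed as a single 40+ GB document.
    with open(path, encoding="utf-8") as dump:
        for line in dump:
            line = line.strip()
            if line in ("[", "]", ""):   # skip the array brackets and blank lines
                continue
            yield json.loads(line.rstrip(","))   # each entity line ends with a comma

for entity in read_entities("latest-all.json"):
    print(entity["id"], entity.get("labels", {}).get("en", {}).get("value"))
    break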

Maybe we want a simple { id: "Q123", redirect: "Q1" } or something like this? People only care (if at all) about the ID of the redirected entity and the target of the redirect.
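
Purely as an illustration of that proposed shape (nothing like it ended up in the dumps, see the decision below), a consumer could tell such stubs apart from full entities by the extra key:

# Hypothetical redirect stub in the shape proposed above; illustrative only.
stub = {"id": "Q123", "redirect": "Q1"}

def is_redirect_stub(entity):
    # Full entities have no top-level "redirect" key, so a consumer streaming
    # the dump could split stubs from real entities this way.
    return "redirect" in entity

redirect_targets = {stub["id"]: stub["redirect"]} if is_redirect_stub(stub) else {}
print(redirect_targets)   # {'Q123': 'Q1'}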

Ok, we just talked about this and decided to simply leave redirects out of the dump for now.

Having a separate json dump with redirect information might be doable though, if there is desire to have this (please open a new bug then).

Change 173664 had a related patch set uploaded by Hoo man:
Don't include redirects in json dumps

https://gerrit.wikimedia.org/r/173664

Change 173664 merged by jenkins-bot:
Don't include redirects in json dumps

https://gerrit.wikimedia.org/r/173664

daniel claimed this task.

Fix merged.

On http://dumps.wikimedia.org/other/wikidata, JSON dumps are about 3 GB. But today's dump is about half the size at only 763 MB. This issue was about removing redirects, but I doubt that would make the dump half the size. Is all the data in the dump?

Something obviously failed; I'm looking into it right now. The incomplete dump has been removed.

I'm afraid there may still be a few million duplicates in the Wikidata JSON dump.

According to its main page, there are 17,209,354 items on Wikidata. I downloaded a Wikidata entities JSON dump a few days ago and expected to find the same number of items there. However, I counted about 20,568,190 lines, 20,565,957 of which are items (i.e. entities with an ID starting with "Q"). I suppose this excess of about 3 million items is made up of duplicates.

Anyway, I haven't been able to find any actual duplicate, but it's hard to search for duplicates in a 69 GB file.

If it matters, the dump I used was latest-all.json.bz2 from https://dumps.wikimedia.org/wikidatawiki/entities/, dated March 28, 2016.
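
For what it's worth, a rough way to count entities and look for duplicated IDs without decompressing the whole file first, sketched in Python with the file name assumed (keeping ~20 million IDs in a Counter needs a few GB of RAM):

import bz2
import json
from collections import Counter

counts = Counter()
with bz2.open("latest-all.json.bz2", "rt", encoding="utf-8") as dump:
    for line in dump:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue
        counts[json.loads(line)["id"]] += 1

items = sum(1 for eid in counts if eid.startswith("Q"))
duplicates = {eid: n for eid, n in counts.items() if n > 1}
print(len(counts), "entities,", items, "items,", len(duplicates), "duplicated IDs")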

That is not a bug: the item count on the main page only counts items with at least one sitelink or at least one statement (or something along those lines), whereas the dump contains all items. Also, the number on the main page is slightly off all the time (due to caching and the way it is incremented/decremented).

Thank you. Then the problem is just that the Wikidata statistics (or the dump files) should be better documented. It's not just the Wikidata main page: https://www.wikidata.org/wiki/Special:Statistics doesn't give any clue either about how many pages are in Wikidata's main namespace.