Wikidata dumps contain old-style serialization.
Closed, InvalidPublic1 Estimated Story Points

Description

Some time ago, we changed the serialization format of wikidata items. For consistency, we implemented on-the-fly conversion to the new format in the exporter (using the ContentHandler::exportTransform facility).

This seems to work fine with Special:Export, and when I try it with dumpBackup.php locally. However, dumps like wikidatawiki-20141009-pages-articles.xml.bz2 still contain revisions in the old style format.

Is this because new revisions get stitched into old dumps? That's the only explanation I currently have. If this is the case, how do we reset this, so all revisions get re-exported? If this is not the case, how can we investigate what is going wrong?

One alternative explanation would be if the host that generates the dump was running an old version of wikibase, I suppose.


Version: unspecified
Severity: major
Whiteboard: u=dev c=backend p=0

Details

Reference
bz72348

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:56 AM
bzimport set Reference to bz72348.

Bumping to critical, since it may result in data loss for clients that cannot process the old style format. We really do not want them to implement that; we changed it for a reason...

Btw: In order to check for old style serializations, grep for "entity". To detect new style serialization, check for "descriptions" (plural).
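For illustration, one way to count both markers directly in a compressed dump without extracting it. This is only a minimal sketch: the filename is an example, the JSON inside the XML dump is HTML-escaped (see the samples further down), and &quot;entity&quot; also appears in redirect entries, as discussed later, so the counts are only a rough indicator.

# Count old-style ("entity") vs. new-style ("descriptions") markers in a
# bzip2-compressed XML dump. The filename below is just an example.
import bz2

old_style = new_style = 0
with bz2.open("wikidatawiki-20141009-pages-articles.xml.bz2", "rt", encoding="utf-8") as dump:
    for line in dump:
        # The JSON inside <text> elements is escaped, so quotes appear as &quot;
        old_style += line.count("&quot;entity&quot;")
        new_style += line.count("&quot;descriptions&quot;")

print("old-style markers:", old_style)
print("new-style markers:", new_style)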

Just confirming, this only applies to XML dumps, and not the new JSON dumps?

The reason seems to be backupTextPass.inc, see bug 72361.

  • Bug 72613 has been marked as a duplicate of this bug.

Can I please have a status update on this? Do we know why it is happening?

As far as I know, the problem is that during dump creation, content from the previous dump is reused when nothing has changed. That's probably fine for wikitext, but of course it bypasses our on-the-fly serialization conversion.

Old revisions are indeed read from the old dump, as long as the length of the revision text is correct. And indeed this is a necessity; the db servers cannot handle requests for all revisions anew, and even if they could, the dumps would take many times longer to generate as well. The only thing that can be done is a manual run of the specific pass without prefetch, which will take... as long as it takes. I need to check with Sean (DBA) about it before doing so.

@Ariel: My patch for T74361 hooks into TextPassDumper::getText() and applies the on-the-fly transformation regardless of whether the text comes from a previous dump (prefetch) or from the database. Can you confirm that this will indeed fix the issue? The relevant diff is here: https://gerrit.wikimedia.org/r/#/c/168126/15/maintenance/backupTextPass.inc
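For readers who don't want to dig through the Gerrit change: the idea is simply to apply the export transform after the text has been fetched, whichever source it came from. A rough sketch in Python with hypothetical names (the actual patch is PHP, in backupTextPass.inc):

def get_text(revision, prefetch, db, export_transform):
    # Hypothetical sketch: 'prefetch' reads from the previous dump, 'db' from
    # the live database, and 'export_transform' stands in for what
    # ContentHandler::exportTransform does (old-style -> current JSON).
    text = prefetch.get(revision.id)          # reuse the previous dump if possible
    if text is None or len(text) != revision.expected_length:
        text = db.fetch_text(revision.id)     # fall back to the database
    # The crucial bit: the transform runs on *both* code paths, so prefetched
    # revisions no longer bypass the on-the-fly conversion.
    return export_transform(text)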

Thanks for the patch! I will check it out in the next couple of days. I'm really sorry for the long delay; I've been out for medical reasons and am now trying to get caught up on everything.

I ran a series of tests locally and also checked production output. I can verify that the transform is actually applied; the output looks good to me both for prefetch and from the database, but a consumer of the data should probably look at it for 5 seconds to verify that the output format is the way you want it.

Hello? Any wikidata dumps consumers on this ticket? Otherwise I'll ask on xmldatadumps-l.

@hoo: could you have a look?

Just kicked off the download of a dump, I'll verify some old revisions once that's done (later today).

hoo@tools-dev:~$ grep -c '"entity"' wikidatawiki-20150207-pages-articles.xml 
129630

:(

right. this is what you want; the old style 'entity' is gone, the new style 'descriptions' is present. or am I missing something?

To me it seems like the old style entity is still present.

ugh, I stared at it for an hour and I'm still blind. Let me look at it for another hour... sorry.

OK, I no longer feel as stupid. The number of items with the 'entity' format is small in comparison to the total number of entities; we would expect the opposite if old revisions were being kept as-is. And as I said, I had checked with local testing that the export transform is indeed being called and changing the content. So I had a look at the problematic entries. It turns out that all but 27 are of the form

<text xml:space="preserve">{&quot;entity&quot;:&quot;Q547932&quot;,&quot;redirect&quot;:&quot;Q6150957&quot;}</text>

so I guess the serialization of redirects needs work. I checked that newly added redirects are dumped with this format. The few remaining matches are likely discussions that happen to include the string; I spot-checked some and found that to be the case.

Um, "with this format" means new redirects are dumped with {&quot;entity&quot; ... etc.

Is anyone looking at the redirects serialization?

@daniel: Could you have a quick look at this? Looks fixed to me, but I think you're the only one who can tell for sure.

For redirects, the encoding {&quot;entity&quot;} is correct. There is no "old" encoding for redirects; entity redirects didn't exist when we used the old serialization format.

So, searching for &quot;entity&quot; is not a good indicator for detecting old-style serialization.

Assigning to hoo, who said he'd look into this some more.

What pattern can one search for to find old serialization?

@JanZerebecki: Redirects are serialized like this:

{"entity":"Q23","redirect":"Q42"}

Old style serialization ends with this:

,"entity":"q207"}

So, if you egrep for ,&quot;entity&quot;:&quot;[qQpP][0-9]*&quot;\}, you should find only old style serializations.

Also, old style serialization will contain &quot;label&quot;:{, while new style should contain &quot;labels&quot;:{ (using labels, plural).
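Putting those hints together, a dump consumer could classify a single revision's JSON along these lines. This is only an illustrative sketch (the function is hypothetical, and it assumes the blob has already been unescaped from the XML):

import json

def classify(blob: str) -> str:
    # Classify a Wikibase JSON blob extracted (and unescaped) from the dump.
    data = json.loads(blob)
    if set(data) == {"entity", "redirect"}:
        return "redirect"        # e.g. {"entity":"Q23","redirect":"Q42"}
    if "label" in data or "entity" in data:
        return "old-style"       # singular "label", trailing "entity":"q207"
    if "labels" in data or "descriptions" in data:
        return "new-style"       # plural keys used by the current format
    return "unknown"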

Btw, if someone can tell me where to find a full history dump of wikidata, I'd be happy to check this myself. The annoying part here is to download and store the behemoth...

@Jimkont: broken serialization of empty lists is a separate issue, unrelated to unconverted old-style serializations.

I'm now running the following on tool labs to find "old" serializations:

daniel@tools-bastion-01:/public/dumps/public/wikidatawiki/20150330$ bzgrep ',&quot;entity&quot;:&quot;[qQpP][0-9]*&quot;\}' wikidatawiki-20150330-pages-meta-history.xml.bz2 | tee ~/wikidatawiki-20150330-pages-meta-history.bad-serialization.txt

The grep run did not turn up any old style serialization in the dump, so I'm closing this as "invalid". For good measure, I'm now double-checking by looking for the other pattern I suggested above, &quot;label&quot;:{:

bzgrep '&quot;label&quot;:{' wikidatawiki-20150330-pages-meta-history.xml.bz2

Note that you may encounter the following when importing XML dumps:

  • redirects encoded as JSON
  • broken serialization of empty maps as lists ([] instead of {}).
  • entity serializations missing fields (e.g. no badges in sitelinks).

Generally, code processing old revisions should be robust, since fields may be serialized in a different order, fields may become optional, or fields can be added. But the overall structure should always be the same. You should however no longer encounter "old style serialization", which has a completely different structure.
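To illustrate that advice, a consumer might defensively normalize each entity blob before processing. This is a hypothetical helper covering only the quirks listed above, not part of any existing tool:

def normalize_entity(data: dict) -> dict:
    # Surface redirects explicitly; they have a different shape than entities.
    if "redirect" in data:
        return {"type": "redirect", "from": data.get("entity"), "to": data["redirect"]}

    normalized = dict(data)
    for key in ("labels", "descriptions", "aliases", "claims", "sitelinks"):
        value = normalized.get(key)
        # Tolerate missing fields and empty maps serialized as [] instead of {}.
        # Individual entries may also lack optional fields (e.g. badges in sitelinks).
        normalized[key] = value if value else {}
    return normalized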

The double-check didn't turn anything up either. The dump seems to be clean.