
Work on XML backup/export formats
Closed, DeclinedPublic

Description

So, I've been poking around a bit, looking at the database dumps and Special:Export for wikidatawiki.

For example, for Obama, we get something that starts like:

<text xml:space="preserve" bytes="8538">{&quot;label&quot;:{&quot;en&quot;:&quot;Barack Obama&quot;,&quot;fr&quot;:&quot;Barack Obama&quot;,&quot;ar&quot;:&quot;\u0628\u0627\u0631\u0627\u0643 \u0623\u0648\u0628\u0627\u0645\u0627&quot;,&quot;ru&quot;:&quot;\u0411\u0430\u0440\u0430\u043a \u041e\u0431\u0430\u043c\u0430&quot;,&quot;nb&quot;:&quot;Barack Obama&quot;,&quot;it&quot;:&quot;Barack Obama&quot;,&quot;de&quot;:&quot;Barack Obama&quot;,&quot;be-tarask&quot;:&quot;\u0411\u0430\u0440\u0430\u043a \u0410\u0431\u0430\u043c\u0430&quot;,&quot;nan&quot;:&quot;Barack Obama&quot;,&quot;ca&quot;:&quot;Barack Obama&quot;},&quot;description&quot;:{&quot;en&quot;:&quot;President of the United States of America

Full history for Q1-Q100 is currently 76.1MB. 7z turns that into 887KB

I'm going to have a poke around at some other larger exports done via shell.

I'm just wondering/thinking there might be a better way to represent this and override the backup handlers and produce a better backup format. Not high priority, but something to think about...


Version: unspecified
Severity: enhancement

Details

Reference
bz41790

Event Timeline

bzimport raised the priority of this task from to Low. Nov 22 2014, 1:14 AM
bzimport set Reference to bz41790.
bzimport added a subscriber: Unknown Object (MLST).

Well, JSON in XML will need to have quotes escaped... I can think of two ways to make this less painful:

  • use PHP serialization instead of JSON when generating XML. This only needs a small code change, since EntityHandler already supports PHP serialization. That sucks for portability, though - people want to process the dumps with Java, Python, etc.
  • use CDATA to wrap the JSON, instead of quoting. Nice and easy, the question is just: how does the exporter know when to do this? Or should we always use CDATA? But this may confuse tools that use regular expressions to process dumps, instead of properly parsing XML. But then, I guess such code is broken by design.

So... other ideas?

What if we replace the " in the JSON with '?

So, no, ' cannot be used instead of ". Stupid JSON spec.
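
For reference, a minimal Python check of this (not part of the original discussion; the sample strings and the helper name `parses_as_json` are made up for illustration). Python's standard `json` module follows the spec and rejects single-quoted strings:

```python
import json


def parses_as_json(text: str) -> bool:
    """Return True if `text` is accepted by Python's spec-following JSON parser."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical samples, shaped like the label data in the dump.
double_quoted = '{"en": "Barack Obama"}'
single_quoted = "{'en': 'Barack Obama'}"

print(parses_as_json(double_quoted))  # True
print(parses_as_json(single_quoted))  # False
```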

PHP serialization: Denny says no.

CDATA is yucky too, but I am afraid this is the best way probably. :(

Any other ideas? Otherwise, we could go for CDATA.

A side problem is that Unicode characters don't need to be escaped per the JSON spec, so the bit could be rewritten as

<text xml:space="preserve"
bytes="8538">{&quot;label&quot;:{&quot;en&quot;:&quot;Barack
Obama&quot;,&quot;fr&quot;:&quot;Barack
Obama&quot;,&quot;ar&quot;:&quot;باراك
أوباما&quot;,&quot;ru&quot;:&quot;Барак
Обама&quot;,&quot;nb&quot;:&quot;Barack
Obama&quot;,&quot;it&quot;:&quot;Barack Obama&quot;,&quot;de&quot;:&quot;Barack
Obama&quot;,&quot;be-tarask&quot;:&quot;Барак
Абама&quot;,&quot;nan&quot;:&quot;Barack
Obama&quot;,&quot;ca&quot;:&quot;Barack
Obama&quot;},&quot;description&quot;:{&quot;en&quot;:&quot;President of the
United States of America

which is about 1.2 times smaller in bytes
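
A rough Python sketch of where that saving comes from, assuming a small hypothetical sample of labels rather than the full entity. `json.dumps` with `ensure_ascii=False` stands in here for whatever the PHP serializer would do:

```python
import json

# Hypothetical sample: three of the labels from the example above.
labels = {"label": {"en": "Barack Obama", "ar": "باراك أوباما", "ru": "Барак Обама"}}

# Default behaviour: every non-ASCII character becomes a 6-byte \uXXXX escape.
escaped = json.dumps(labels)
# With ensure_ascii=False the characters are written as-is (2 bytes each in UTF-8 here).
raw = json.dumps(labels, ensure_ascii=False)

escaped_size = len(escaped.encode("utf-8"))
raw_size = len(raw.encode("utf-8"))

print(escaped_size, raw_size)
# Both forms decode to the same data.
print(json.loads(escaped) == json.loads(raw))
```

The actual ratio depends on how much non-ASCII text an entity contains, so the 1.2 figure will not hold for every item.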

but the big win is of course wrapping in CDATA

<text xml:space="preserve"
bytes="8538"><![CDATA[{"label":{"en":"Barack
Obama","fr":"Barack
Obama","ar":"باراك
أوباما","ru":"Барак
Обама","nb":"Barack
Obama","it":"Barack Obama","de":"Barack
Obama","be-tarask":"Барак
Абама","nan":"Barack
Obama","ca":"Barack
Obama"},"description":{"en":"President of the
United States of America

which is about half the size

If the CDATA approach is chosen, one should take care to transform any ]]> sequence into ]]]]><![CDATA[> as explained here:
http://stackoverflow.com/questions/223652/is-there-a-way-to-escape-a-cdata-end-token-in-xml
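
A minimal Python sketch of that transformation, for illustration only. The function name `wrap_cdata` and the sample payload are assumptions, and this is not the exporter's actual code. The idea is to end the CDATA section in the middle of every `]]>` and immediately open a new one:

```python
import xml.etree.ElementTree as ET


def wrap_cdata(text: str) -> str:
    """Wrap `text` in CDATA, splitting each ']]>' so it cannot end the section early."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Hypothetical payload containing the CDATA end token plus characters
# that would otherwise need XML escaping.
payload = '{"en":"tricky ]]> value & <b>"}'
xml_doc = '<text xml:space="preserve">' + wrap_cdata(payload) + "</text>"

# A real XML parser joins the adjacent CDATA sections back into the original text.
parsed = ET.fromstring(xml_doc).text
print(parsed == payload)  # True
```

Tools that process the dump with regular expressions instead of an XML parser would see the split sections and would have to undo the transformation themselves.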

Another possibility for embedding JSON is YAML https://en.wikipedia.org/wiki/YAML but the tools for parsing it are less widespread

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).

We should compare the compressed size of the current encoding with the CDATA encoding to see if this actually makes a significant difference. If yes, we should indeed start using CDATA in XML exports. Otherwise, I don't think it's worth the trouble.
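
One way such a comparison could be scripted, as a sketch under stated assumptions: the entity below is a made-up sample, the same revision is simply repeated 500 times (real history has varying revisions), and Python's `lzma` module is used as a stand-in for 7z. Numbers from this toy input say nothing definitive about real dumps.

```python
import json
import lzma
from xml.sax.saxutils import escape


def compressed_size(text: str) -> int:
    """Size in bytes of `text` after UTF-8 encoding and LZMA compression."""
    return len(lzma.compress(text.encode("utf-8")))


# Hypothetical sample entity, serialized compactly.
entity = {
    "label": {"en": "Barack Obama", "fr": "Barack Obama", "ru": "Барак Обама"},
    "description": {"en": "President of the United States of America"},
}
blob = json.dumps(entity, ensure_ascii=False, separators=(",", ":"))

# Current style: quotes (and &, <, >) replaced by entities.
quoted = "<text>" + escape(blob, {'"': "&quot;"}) + "</text>"
# Proposed style: CDATA section, with any ']]>' split as described earlier in the thread.
cdata = "<text><![CDATA[" + blob.replace("]]>", "]]]]><![CDATA[>") + "]]></text>"

revisions = 500  # assumption: crude stand-in for a page history
quoted_dump = quoted * revisions
cdata_dump = cdata * revisions

print("raw bytes:       ", len(quoted_dump.encode("utf-8")), len(cdata_dump.encode("utf-8")))
print("compressed bytes:", compressed_size(quoted_dump), compressed_size(cdata_dump))
```

Running the same measurement over an actual export, such as the Q1-Q100 full history mentioned in the description, would give the figures needed to decide.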

Lydia_Pintscher subscribed.

I am going to close this because no-one has worked on it since 2014 and there doesn't seem to be a huge demand.