
Data corruption apparently related to recompressTracked.php on wikis with $wgLegacyEncoding set
Closed, Resolved, Public

Description

Double UTF-8 conversion is turning up on a large number of edits on Danish wikis which have just been run through recompressTracked.php. Examples shown to me:

A widely-used template:
http://da.wikipedia.org/w/index.php?title=Skabelon:Standardstub&diff=prev&oldid=1936801

Various articles such as:
http://da.wikipedia.org/w/index.php?title=Lake_Torrens&diff=prev&oldid=1790478
http://da.wikipedia.org/w/index.php?title=Lind%C3%A5&diff=prev&oldid=1478894

I've stopped the jobs running on Hume pending Tim's investigation and fix. The job was partway into dewiki at the time.

If only wikis using $wgLegacyEncoding and running the recompressTracked script are affected, then dawiki and dawiktionary need cleanup.


Version: unspecified
Severity: critical
URL: http://da.wikipedia.org/w/index.php?title=Skabelon:Standardstub&diff=prev&oldid=1936801

Details

Reference
bz16841

Event Timeline

bzimport raised the priority of this task to Unbreak Now! (Nov 21 2014, 10:26 PM)
bzimport set Reference to bz16841.

Yeah, $wgLegacyEncoding.

I *think* I see the issue. It calls Revision::LoadRevisionText(), which converts to utf-8, then saves that blob in the concatenated diff blob but doesn't go back and mark old_flags with 'utf-8'. MW still thinks it is in legacy encoding then, and double encodes.
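A minimal sketch of that failure mode, in Python rather than MediaWiki's PHP. The `load_text` helper is hypothetical, and windows-1252 is only an assumed value for $wgLegacyEncoding; the point is just that a blob which is already UTF-8 but lacks the 'utf-8' flag gets decoded as legacy text a second time.

```python
# Assumed legacy encoding; the thread does not say what dawiki actually uses.
LEGACY_ENCODING = "windows-1252"


def load_text(blob: bytes, old_flags: str) -> str:
    """Hypothetical, simplified stand-in for a loader on a legacy-encoding wiki."""
    if "utf-8" in old_flags.split(","):
        # Blob is marked as UTF-8 already: decode it as such.
        return blob.decode("utf-8")
    # No 'utf-8' flag: assume the blob predates the conversion.
    return blob.decode(LEGACY_ENCODING)


original = "Lindå"
stored = original.encode("utf-8")  # what the recompressed blob contains

correct = load_text(stored, "external,utf-8")
mangled = load_text(stored, "external")  # flag missing, so decoded as legacy

print(correct)  # Lindå
print(mangled)  # LindÃ¥
```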

The affected revisions are after the conversion, and their old_flags includes 'utf8':

+---------+-------------------+------------------+---------------+
| rev_id  | old_id            | old_text         | old_flags     |
+---------+-------------------+------------------+---------------+
| 1478894 | rECIR146874158bea | DB://rc1/43592/0 | external,utf8 |
| 1790478 | 1777869           | DB://rc1/3210/0  | external,utf8 |
| 1936801 | 1923062           | DB://rc1/26644/4 | external,utf8 |
+---------+-------------------+------------------+---------------+

They're clearly getting run through without the flags at some step, though...

I've locked dawiki and dawiktionary to editing (wgReadOnly in InitialiseSettings) per Wegge's request until we get this sorted out, since any further edits on broken revisions are going to be pretty nasty and won't get automatically fixed by something that rolls back to the original ES entries for the old revs.

It needs to be 'utf-8', not 'utf8'

ahh, line 489 of recompressTracked.php:

$dbw->update( 'text',
    array( // set
        'old_text' => $url,
        'old_flags' => 'external,utf8',
    ),

...it *does* in fact try to set utf-8, it just has a typo :)

Aaron fixed the code typo in r45205.

Should be possible to clean up the entries, then clear all the cache entries. :P

Revision cache, diff cache, parser cache, squid cache.......

dawiki:
+----------+---------------------+
| count(*) | old_flags           |
+----------+---------------------+
|     3785 |                     |
|    10483 | external            |
|     6983 | external,gzip       |
|     2714 | external,object     |
|   461676 | external,utf-8      |
|   336780 | external,utf8       | <- borken
|     1094 | gzip                |
|    29477 | object              |
|    39973 | utf-8,gzip          |
|  1783011 | utf-8,gzip,external |
+----------+---------------------+

dawiktionary:
+----------+---------------------+
| count(*) | old_flags           |
+----------+---------------------+
|     1818 |                     |
|     1620 | external,utf-8      |
|     3744 | external,utf8       | <- borken
|        5 | gzip                |
|     2382 | object              |
|     1631 | utf-8,gzip          |
|    25576 | utf-8,gzip,external |
+----------+---------------------+
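For illustration only, the flag cleanup implied by these counts could look roughly like the sketch below. It runs against an in-memory SQLite table from Python, not the production MySQL `text` table, and the three rows are made up; how the real cleanup was performed is not recorded in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "text" (old_id INTEGER PRIMARY KEY, old_text TEXT, old_flags TEXT)'
)
# Hypothetical rows: one with the misspelled flag, two that are fine.
conn.executemany(
    'INSERT INTO "text" VALUES (?, ?, ?)',
    [
        (1, "DB://rc1/1/0", "external,utf8"),
        (2, "DB://rc1/2/0", "external,utf-8"),
        (3, "DB://rc1/3/0", "utf-8,gzip,external"),
    ],
)

# Rewrite only the exact broken flag string; other rows stay untouched.
cur = conn.execute(
    "UPDATE \"text\" SET old_flags = 'external,utf-8' "
    "WHERE old_flags = 'external,utf8'"
)
fixed = cur.rowcount

remaining = conn.execute(
    "SELECT COUNT(*) FROM \"text\" WHERE old_flags = 'external,utf8'"
).fetchone()[0]

print(fixed, remaining)  # 1 0
```

Matching on the exact string keeps the update narrow: per the counts above, 'external,utf8' is the only misspelled combination present.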

As an alternative to the DB cleanup, we could hack the loader to accept 'utf8' as well as 'utf-8'. That would still require the cache cleanup...
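A rough sketch of what such a tolerant check might look like, written in Python for illustration. `is_utf8_flagged` is a hypothetical name, not an existing MediaWiki function.

```python
def is_utf8_flagged(old_flags: str) -> bool:
    """Return True if the comma-separated flags mark the blob as UTF-8.

    Accepts both the correct 'utf-8' and the misspelled 'utf8'.
    """
    flags = old_flags.split(",")
    return "utf-8" in flags or "utf8" in flags


print(is_utf8_flagged("external,utf8"))        # True
print(is_utf8_flagged("utf-8,gzip,external"))  # True
print(is_utf8_flagged("external,gzip"))        # False
```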

Ok, all the automated cleanup should be done at this point. However pages which were edited from the corrupted views need to be fixed up, since they "legitimately" contain the broken chars.
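For those pages, reversing the double conversion on a given string is mechanically simple. A sketch follows, in Python, with windows-1252 again assumed as the legacy encoding and `undo_double_encoding` a hypothetical helper. It leaves text alone when the round trip fails, which is what happens for text that was never double-encoded and contains non-ASCII characters.

```python
def undo_double_encoding(text: str, legacy: str = "windows-1252") -> str:
    """Try to reverse one round of legacy-to-UTF-8 double conversion.

    Re-encodes the text in the (assumed) legacy encoding to recover the
    original UTF-8 bytes, then decodes those bytes as UTF-8. If either
    step fails, the input is returned unchanged.
    """
    try:
        return text.encode(legacy).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(undo_double_encoding("LindÃ¥"))        # Lindå
print(undo_double_encoding("Lindå"))         # Lindå (already fine, left alone)
print(undo_double_encoding("Lake Torrens"))  # Lake Torrens
```

A page edited after the corruption can mix broken and intact text, so applying this blindly to whole revisions is not safe; it would need to be applied per affected passage and reviewed.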

marco wrote:

You mentioned that the job was partly into dewiki - what about this issue in de?

de does not have $wgLegacyEncoding.