
GWtoolset uploaded a file with non-normalized unicode characters causing subtle breakage
Closed, ResolvedPublic

Description

https://commons.wikimedia.org/wiki/File:A_new_and_accurate_plan_of_Blenheim_Palace_-_L--39-Art_de_Cre%CC%81er_les_Jardins_-1835--_pl.1_-_BL.jpg was uploaded as a part of the first batch upload with "GWToolset Batch Upload" tool. It has several strange properties:

  1. The original file page did not have any associated media or description and was deleted as a page "with no valid content". However, the page did have a thumbnail, and there was a full-size image (https://upload.wikimedia.org/wikipedia/commons/f/f8/A_new_and_accurate_plan_of_Blenheim_Palace_-_L--39-Art_de_Cre%CC%81er_les_Jardins_-1835--_pl.1_-_BL.jpg) which was not accessible from the file page.
  2. The file had no history, but in the user contribution log one could find edits with full metadata (https://commons.wikimedia.org/w/index.php?oldid=118131591)
  3. The deleted file is still in several categories like https://commons.wikimedia.org/wiki/Category:Pages_using_Artwork_template_with_incorrect_parameter and cannot be removed: a deleted file cannot be edited, and tools like Cat-a-lot or HotCat crash spectacularly when used with this file.
  4. The file is a duplicate of https://commons.wikimedia.org/wiki/File:A_new_and_accurate_plan_of_Blenheim_Palace_-_L%27Art_de_Cr%C3%A9er_les_Jardins_%281835%29,_pl.1_-_BL.jpg . It is picked up by the "Process Duplicates" tool; however, the tool crashes when applied.

See also discussion https://commons.wikimedia.org/wiki/Commons:Village_pump/Archive/2014/03#Disappearing_image.

Can someone delete the image for good (it was reuploaded under the correct name) so it does not show up in categories or in the duplicate tool?


Version: unspecified
Severity: normal

Details

Reference
bz62870

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 3:05 AM
bzimport set Reference to bz62870.
bzimport added a subscriber: Unknown Object (MLST).

Kind of like bug 32551 in some ways.

Like in that bug:

  1. incorrect negative entry in file memcache (the upload seems to abort somewhere in the middle of the doEdit function, which is before the cache gets saved, so the file doesn't show up for a little while unless someone does action=purge)

Unlike in that bug:

  1. actual edits for the file exist, although perhaps not associated entirely correctly
  2. links still exist, which means the page had to be saved at some point, and there is still an entry in the page table.
  3. the entry in the image table still exists.

So some sort of referential integrity issue. Not sure if much else could be said without seeing original db records which are gone now.

Further investigation.

Note: the page is still accessible at https://commons.wikimedia.org/?curid=31451688

Basically, it appears that somewhere along the line gwtoolset didn't normalize the page title, and so created it with the letter 'é' written using combining characters (a U+0065 followed by a U+0301) instead of the precomposed 'é' (U+00E9). Titles are supposed to be in NFC, so various things subtly explode when the non-NFC U+0065 U+0301 is used.
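The difference between the two forms is easy to see in code. This is a minimal sketch in JavaScript using the built-in String.prototype.normalize; the variable names are illustrative only and have nothing to do with gwtoolset or MediaWiki internals:

```javascript
// Two ways to write the same word: decomposed (e + combining acute accent)
// and precomposed (a single accented code point).
const decomposed = 'Cre\u0301er';  // as in the broken title
const precomposed = 'Cr\u00e9er';  // the NFC form

// The strings render identically but compare as different.
const sameBeforeNormalizing = decomposed === precomposed;

// NFC composes the base letter and the combining mark into one code point.
const sameAfterNormalizing = decomposed.normalize('NFC') === precomposed;

console.log(sameBeforeNormalizing, sameAfterNormalizing);
console.log(decomposed.length, precomposed.length);
```

Anything that looks a title up by exact string or byte comparison will treat the two spellings as different pages, which fits the symptoms described in this report.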

All the symptoms mentioned are consistent with an incorrectly normalized db entry, except maybe symptom 1, which seems to imply there was a page at one point using the other form of the é. It is kind of unclear what happened there, given the page is now moved/deleted. Perhaps there were page entries for both variants, but the proper variant was broken (e.g. it was fully uploaded to the wrong é, but as part of the process, it was partially uploaded to the correct é too). Hard to know.

My previous comment (comment 1) seems to have been incorrect, and this has nothing to do with bug 32551.

(In reply to Bawolff (Brian Wolff) from comment #2)

Kind of unclear what happened there,
given the page is now moved/deleted. Perhaps there were page entries for
both variants, but the proper variant was broken (e.g. It was fully uploaded
to the wrong é, but as part of the process, it was partially uploaded to the
correct é too). Hard to know.

Sorry about "crime-scene contamination", I guess I was trying to fix the problems without calling the cavalry. Let me try to recall some of the actions related to this file:

*After half a day User:Jheald moved the file to its present name with the correct é
*Afterwards I deleted the redirect page associated with the old name.

(In reply to Jarek Tuszynski from comment #3)

(In reply to Bawolff (Brian Wolff) from comment #2)

Kind of unclear what happened there,
given the page is now moved/deleted. Perhaps there were page entries for
both variants, but the proper variant was broken (e.g. It was fully uploaded
to the wrong é, but as part of the process, it was partially uploaded to the
correct é too). Hard to know.

Sorry about "crime-scene contamination", I guess I was trying to fix the
problems without calling the cavalry.

No worries. There's enough here to reproduce the problem if need be. If it turns out we really need to know exactly what happened, we could just try to make gwtoolset upload a non-normalized title and see.

So in MediaWiki, we generally prefer to normalize unicode at input.

That means input should be run through $wgContLang->normalize() directly as it comes out of the XML file. So that would be in methods like XmlDetectHandler::createExampleDOMElement, XmlDetectHandler::findExampleDOMNodes, and XmlMappingHandler::getFilteredNodeValue.
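The idea of normalizing at the input boundary can be sketched as follows. This is not the gwtoolset patch (that would be PHP calling $wgContLang->normalize()); it is only an illustration in JavaScript, and the record shape and the normalizeRecord helper are hypothetical:

```javascript
// Hypothetical record, as it might look right after one XML element is parsed.
const rawRecord = {
  title: "L'Art de Cre\u0301er les Jardins", // decomposed accent from the XML
  description: 'pl. 1',
  year: 1835,
};

// Normalize every string value to NFC once, at the point where the data
// enters the program, so that later code never sees a non-NFC string.
function normalizeRecord(record) {
  const out = {};
  for (const [key, value] of Object.entries(record)) {
    out[key] = typeof value === 'string' ? value.normalize('NFC') : value;
  }
  return out;
}

const record = normalizeRecord(rawRecord);
console.log(record.title === "L'Art de Cr\u00e9er les Jardins");
```

Doing this in one place, right after parsing, avoids having to remember to normalize at every later point where a value is turned into a title.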

Change 121097 had a related patch set uploaded by Dan-nl:
make sure unicode characters are normalized

https://gerrit.wikimedia.org/r/121097

(In reply to Jarek Tuszynski from comment #8)

https://commons.wikimedia.org/wiki/File:Rencontres_Wikim%C3%A9dia_et_%C3%89ducation_2012_-_De_la_production_a%CC%80_l--39-utilisation_de_ressources_e%CC%81ducatives_libres_-_-1.webm.webm
is another file that seems to have the same issue. Can someone fix or delete this file?

I don't have filemover rights to move the file. However, someone with filemover or admin rights and some knowledge of the API can fix these files by using the API action=move module with the fromid parameter (fromid takes the page id number, which is the same as the curid parameter on normal requests). Similarly, they can be deleted through the API too. (Deletion is actually also possible via the normal web interface, but you need to do fancy stuff with something like Firebug to add curid to the POST parameters of the confirmation screen.)

The id for File:Rencontres Wikimédia et Éducation 2012 - De la production à l--39-utilisation de ressources éducatives libres - -1.webm.webm ( https://commons.wikimedia.org/wiki/?curid=31747120 ) is 31747120.
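As a rough sketch of the move request described above, the POST body could be built like this in JavaScript. The target title (the same name in NFC), the reason text, and the token placeholder are assumptions for illustration; obtaining a real token and actually sending the request to api.php as a user with the necessary rights are not shown:

```javascript
// Page id quoted in this thread (the same number as curid in the URL).
const pageId = 31747120;

// Current title, written with escapes so the decomposed accents are visible.
const brokenTitle =
  'File:Rencontres Wikim\u00e9dia et \u00c9ducation 2012 - De la production ' +
  'a\u0300 l--39-utilisation de ressources e\u0301ducatives libres - -1.webm.webm';

// Assumption: the target is simply the same title converted to NFC.
const targetTitle = brokenTitle.normalize('NFC');

// Build the form-encoded body for a POST to api.php.
// Addressing the page by id avoids having to send the non-NFC title.
function buildMoveParams(fromId, toTitle, token) {
  return new URLSearchParams({
    action: 'move',
    fromid: String(fromId),
    to: toTitle,
    reason: 'Normalize title to NFC', // illustrative reason text
    token: token,                     // placeholder, not a real token
    format: 'json',
  });
}

const moveParams = buildMoveParams(pageId, targetTitle, 'TOKEN_PLACEHOLDER');
console.log(moveParams.toString());
```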

Interestingly enough, for that title, the first é and É are fine; it's the last two, à and é, that are the issue.
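One way to check which characters in a title are decomposed is to decode the URL and look for combining marks. A small JavaScript sketch, assuming it is enough to look at the Combining Diacritical Marks block (U+0300 to U+036F); the variable names are illustrative:

```javascript
// Percent-encoded file name from the URL quoted above.
const encodedName =
  'Rencontres_Wikim%C3%A9dia_et_%C3%89ducation_2012_-_De_la_production_a%CC%80_' +
  'l--39-utilisation_de_ressources_e%CC%81ducatives_libres_-_-1.webm.webm';
const fileName = decodeURIComponent(encodedName);

// Format a single character as U+XXXX.
function toCodePoint(ch) {
  return 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
}

// Find every base character that is followed by a combining diacritical mark.
const decomposedPairs = [];
const pairPattern = /.[\u0300-\u036f]/gu;
let match;
while ((match = pairPattern.exec(fileName)) !== null) {
  decomposedPairs.push({
    index: match.index,
    codePoints: Array.from(match[0]).map(toCodePoint).join(' '),
  });
}

console.log(decomposedPairs);
console.log(fileName === fileName.normalize('NFC'));
```

The precomposed é and É do not match the pattern, so only the two decomposed pairs near the end of the name are reported.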


Until the patch for this bug gets reviewed and deployed to Commons (which should happen quite soon), may I suggest converting XML files to NFC before uploading them to gwtoolset. On Linux, if you have the libicu-dev package installed, you can do this with the command:

uconv -x any-NFC -o output.xml input.xml

(I have no idea how to do this on other operating systems)

(In reply to Bawolff (Brian Wolff) from comment #9)

(I have no idea how to do this on other operating systems)

You can use node.js with https://github.com/walling/unorm

C:\Users\XXX> npm install unorm

Create a script named "ps.js" at "C:\Users\XXX" with the following content

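// File to normalize in place (assumed to be UTF-8).
var fileName = 'sample.txt',
    fs = require('fs'),
    unorm = require('unorm');

fs.readFile(fileName, { encoding: 'utf-8' }, function (err, stData) {
    if (err) throw err;
    // Convert to Unicode Normalization Form C and overwrite the original file.
    stData = unorm.nfc(stData);
    fs.writeFileSync(fileName, stData, { encoding: 'utf-8' });
});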

(assuming that "C:\Users\XXX\sample.txt" is the file you'd like to process)

and run node.js:

C:\Users\XXX> node ps.js

Change 121097 merged by jenkins-bot:
make sure unicode characters are normalized

https://gerrit.wikimedia.org/r/121097

The fix for this issue is scheduled to be deployed on Commons on Tuesday, 8 April 2014.

Gilles triaged this task as Unbreak Now! priority. Dec 4 2014, 10:11 AM
Gilles moved this task from Untriaged to Done on the Multimedia board.
Gilles lowered the priority of this task from Unbreak Now! to Needs Triage. Dec 4 2014, 11:20 AM