API mangles certain UTF characters when querying 16 or more pages
Closed, Resolved (Public)

Description

This is a fairly bizarre error: it only appears when querying 16 or more pages through the API. If a page name contains a certain UTF-8 character ("–", the en dash U+2013, a.k.a. E28093 in UTF-8 hex), the API mangles the title and then reports that the mangled title doesn't exist.
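
For reference, a minimal Python sketch of how the byte sequence E28093 relates to the en dash and to the `%E2%80%93` escape used in the query URLs. The last step is an assumption: the report does not say which encoding the bytes were misread as, so Windows-1252 is used here only as a common example of how such mangling arises.

```python
from urllib.parse import quote

EN_DASH = "\u2013"  # the en dash used in the affected file name

# UTF-8 encodes U+2013 as the three bytes E2 80 93.
utf8_bytes = EN_DASH.encode("utf-8")
hex_form = utf8_bytes.hex().upper()

# In a URL, each of those bytes is percent-encoded.
percent_form = quote(EN_DASH)

# Assumption for illustration only: reading the same three bytes as
# Windows-1252 instead of UTF-8 yields three unrelated characters.
mangled = utf8_bytes.decode("cp1252")

print(hex_form)       # E28093
print(percent_form)   # %E2%80%93
print(repr(mangled))  # '\xe2\u20ac\u201c' when shown with escapes
```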

Here is the reproduction:
*Mangled result, 16 files being queried: http://commons.wikimedia.org/w/api.php?action=query&titles=File:Krizius%204.jpg|File:Abraszewski%20Bayview.jpg|File:Abraszewski%20flowers%20small.jpg|File:Abraszewski%20gray%20mansion%20small.jpg|File:Saasveld04.jpg|File:Oyama-jinja%20004.jpg|File:Sila%20o%20Tonga%20-%20Coat%20of%20arms%20of%20the%20Kingdom%20of%20Tonga.svg|File:Ft-Banks-1946-1953-C.pdf|File:Ounguicularis.jpg|File:Royal%20Dublin%20Fusileers.jpg|File:Edsim%20Vascular.jpg|File:Flag%20Dubrovnik%E2%80%93Neretva%20County.gif|File:1824%20laver%20coral.jpg|File:1928%20new%20chambers.jpg|File:1933%20Thicknesse%20w480.jpg|File:2-10%20Armoured%20Regt%20(AWM%20043801).jpg&prop=imageinfo|revisions|templates&iiprop=sha1
Notice that at the top of the response the API returns: <page ns="6" title="File:Flag Dubrovnik–Neretva County.gif" missing="" imagerepository="" />
Thus the en dash has been mangled into mojibake, and the page is reported as missing.
*Now, to get a non-mangled result, remove any one of the other files from the above query (so that only 15 files are queried):
Removing the last of the files from the list (File:...(AWM%20043801).jpg): http://commons.wikimedia.org/w/api.php?action=query&titles=File:Krizius%204.jpg|File:Abraszewski%20Bayview.jpg|File:Abraszewski%20flowers%20small.jpg|File:Abraszewski%20gray%20mansion%20small.jpg|File:Saasveld04.jpg|File:Oyama-jinja%20004.jpg|File:Sila%20o%20Tonga%20-%20Coat%20of%20arms%20of%20the%20Kingdom%20of%20Tonga.svg|File:Ft-Banks-1946-1953-C.pdf|File:Ounguicularis.jpg|File:Royal%20Dublin%20Fusileers.jpg|File:Edsim%20Vascular.jpg|File:Flag%20Dubrovnik%E2%80%93Neretva%20County.gif|File:1824%20laver%20coral.jpg|File:1928%20new%20chambers.jpg|File:1933%20Thicknesse%20w480.jpg&prop=imageinfo|revisions|templates&iiprop=sha1
Removing the first of the files from the list (File:Krizius%204.jpg): http://commons.wikimedia.org/w/api.php?action=query&titles=File:Abraszewski%20Bayview.jpg|File:Abraszewski%20flowers%20small.jpg|File:Abraszewski%20gray%20mansion%20small.jpg|File:Saasveld04.jpg|File:Oyama-jinja%20004.jpg|File:Sila%20o%20Tonga%20-%20Coat%20of%20arms%20of%20the%20Kingdom%20of%20Tonga.svg|File:Ft-Banks-1946-1953-C.pdf|File:Ounguicularis.jpg|File:Royal%20Dublin%20Fusileers.jpg|File:Edsim%20Vascular.jpg|File:Flag%20Dubrovnik%E2%80%93Neretva%20County.gif|File:1824%20laver%20coral.jpg|File:1928%20new%20chambers.jpg|File:1933%20Thicknesse%20w480.jpg|File:2-10%20Armoured%20Regt%20(AWM%20043801).jpg&prop=imageinfo|revisions|templates&iiprop=sha1
**In both of the above cases, the API correctly returns: <page pageid="25721149" ns="6" title="File:Flag Dubrovnik–Neretva County.gif" imagerepository="local">...</page>
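
The reproduction URLs above could also be generated from a plain list of titles, which makes it easier to drop one file at a time. This is only a sketch: `API_URL`, `TITLES` and `build_query_url` are names made up for illustration, and the code builds the URLs without sending any request.

```python
from urllib.parse import quote

API_URL = "http://commons.wikimedia.org/w/api.php"

# The 16 titles from the mangled query, in the same order.
TITLES = [
    "File:Krizius 4.jpg",
    "File:Abraszewski Bayview.jpg",
    "File:Abraszewski flowers small.jpg",
    "File:Abraszewski gray mansion small.jpg",
    "File:Saasveld04.jpg",
    "File:Oyama-jinja 004.jpg",
    "File:Sila o Tonga - Coat of arms of the Kingdom of Tonga.svg",
    "File:Ft-Banks-1946-1953-C.pdf",
    "File:Ounguicularis.jpg",
    "File:Royal Dublin Fusileers.jpg",
    "File:Edsim Vascular.jpg",
    "File:Flag Dubrovnik\u2013Neretva County.gif",  # contains the en dash
    "File:1824 laver coral.jpg",
    "File:1928 new chambers.jpg",
    "File:1933 Thicknesse w480.jpg",
    "File:2-10 Armoured Regt (AWM 043801).jpg",
]


def build_query_url(titles):
    """Build an action=query URL for the given page titles (hypothetical helper)."""
    # Percent-encode each title as UTF-8, leaving ':', '(' and ')' unescaped
    # as in the URLs quoted in this report, then join the titles with '|'.
    encoded = "|".join(quote(title, safe=":()") for title in titles)
    return (
        f"{API_URL}?action=query&titles={encoded}"
        "&prop=imageinfo|revisions|templates&iiprop=sha1"
    )


url_16 = build_query_url(TITLES)           # all 16 files (mangled case)
url_15_last = build_query_url(TITLES[:-1])  # last file removed
url_15_first = build_query_url(TITLES[1:])  # first file removed
```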

I have literally never encountered this error for any other file, and my bot has queried a LOT of files, so I don't know how many different UTF-8 characters the API will mangle.
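
One way a bot could survey which titles are affected is to look for pages the API reports as missing under a title that was never requested. A rough sketch follows, assuming the response has already been parsed into (title, is_missing) pairs. The function name and input shape are hypothetical, and the mangled title in the example is simulated with Windows-1252, which is an assumption rather than something the API is known to do.

```python
def suspect_titles(requested, response_pages):
    """Return missing, non-ASCII titles from the response that were never requested."""
    requested_set = set(requested)
    suspects = []
    for title, is_missing in response_pages:
        # A missing page whose non-ASCII title matches none of the requested
        # titles is a candidate for having been mangled on the way in.
        if is_missing and title not in requested_set and not title.isascii():
            suspects.append(title)
    return suspects


requested = [
    "File:Flag Dubrovnik\u2013Neretva County.gif",
    "File:Saasveld04.jpg",
]

# Simulated mangling (assumption): UTF-8 bytes misread as Windows-1252.
mangled = requested[0].encode("utf-8").decode("cp1252")

response_pages = [
    (mangled, True),                  # reported as missing
    ("File:Saasveld04.jpg", False),   # returned normally
]

suspects = suspect_titles(requested, response_pages)
```

The API also normalizes titles (for example, underscores become spaces), so a real check would have to account for that before treating a mismatch as mangling.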


Version: 1.20.x
Severity: minor

Details

Reference
bz36799

Event Timeline

bzimport raised the priority of this task to Low. (Nov 22 2014, 12:21 AM)
bzimport set Reference to bz36799.
bzimport added a subscriber: Unknown Object (MLST).

... and the same issue is occurring with File:José de Ribera-St Sebastian.jpg (http://en.wikipedia.org/wiki/File:Jos%C3%A9_de_Ribera-St_Sebastian.jpg). Is there maybe a new feature that is bugging up English Wikipedia?

Maybe it depends on the length of the request? See bug 36839.

I know the dupe should really be the other way around, but it looks like everyone's attention is on the other bug, so I'll mark this one as duped instead.

*** This bug has been marked as a duplicate of bug 36839 ***