
Trimmed multibyte characters result in invalid XML
Closed, Resolved (Public)

Description

Author: bdanee88

Description:
I've just started writing a statistics program for Hungarian Wikipedia. While downloading the deletion log from January 2008, my program threw an exception: the XML loaded from the API was badly encoded. I wondered why, so I checked, and there really is an error:

http://hu.wikipedia.org/w/api.php?format=xml&action=query&list=logevents&letype=delete&lestart=2008-01-25T22:12:03Z&lelimit=30

In the 'item' element with logid 142820, the comment ends with an invalid character. It was probably a multibyte UTF-8 character that got trimmed partway through. The problem is not very serious for me, since I don't need the comment attribute and can drop it by using &leprop= in the URL, but anyone who does need it won't be able to load the file.

The bad line (see also in the link):
<item logid="142820" pageid="0" ns="0" title="Borisz Szpasszkij" type="delete" action="delete" user="Bináris" timestamp="2008-01-25T21:19:30Z" comment="[[Wikipédia:Homokozó|teszt]]: a lap tartalma: „Boris Vasilievich Spassky [szerkesztés] A Wikipédiából, a szabad lexikonból. Ugrás: <small>NAVIGÁCIÓ</small>, <small>KERESÉS</small> Boris V Spassky () szovjet később francia...” (és csak �"/>
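
A minimal Python sketch of how this kind of damage can arise, assuming the comment was cut at a fixed byte length without regard to character boundaries (the sample string and the 10-byte limit are hypothetical, chosen only for illustration):

```python
def truncate_bytes_naive(text: str, max_bytes: int) -> bytes:
    """Cut the UTF-8 encoding at a fixed byte count, ignoring character boundaries."""
    return text.encode("utf-8")[:max_bytes]


def truncate_bytes_safe(text: str, max_bytes: int) -> bytes:
    """Cut at a byte count, then drop any incomplete trailing character."""
    cut = text.encode("utf-8")[:max_bytes]
    # errors="ignore" discards the dangling lead byte left behind by the cut.
    return cut.decode("utf-8", errors="ignore").encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample: "és csak " is 9 bytes, and "„" adds 3 more (E2 80 9E).
sample = "és csak „"

naive = truncate_bytes_naive(sample, 10)   # keeps only the E2 lead byte of "„"
safe = truncate_bytes_safe(sample, 10)     # drops the incomplete character

print(naive, is_valid_utf8(naive))
print(safe, is_valid_utf8(safe))
```

The naive cut leaves a lone lead byte at the end of the string, which is the kind of byte a strict UTF-8 decoder rejects.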


Version: unspecified
Severity: normal
URL: http://hu.wikipedia.org/w/api.php?format=xml&action=query&list=logevents&letype=delete&lestart=2008-01-25T22:12:03Z&lelimit=30

Details

Reference
bz15261

Event Timeline

bzimport raised the priority of this task to Medium (Nov 21 2014, 10:19 PM).
bzimport set Reference to bz15261.

I don't see the problem. I opened the link in Firefox (which automatically parses XML and screams if there's something wrong with it), and I got no errors. I also confirmed that logid 142820 is in there. That means it's probably your XML parser's fault; closing as WORKSFORME.

http://validator.w3.org/check?uri=http%3A%2F%2Fhu.wikipedia.org%2Fw%2Fapi.php%3Fformat%3Dxmlfm%26action%3Dquery%26list%3Dlogevents%26letype%3Ddelete%26lestart%3D2008-01-25T22%3A12%3A03Z%26lelimit%3D30&charset=%28detect+automatically%29&doctype=Inline&group=0&user-agent=W3C_Validator%2F1.591

"Sorry, I am unable to validate this document because on line 44 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.

The error was: utf8 "\xE2" does not map to Unicode"

Many XML parsers choke on broken UTF-8 sequences. Of course, this is mostly a database problem, but the fact remains that the API returns ill-formed data.
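
As an illustration of a strict parser rejecting such data, here is a small Python sketch using the standard library's xml.etree.ElementTree. The two payloads are hypothetical stand-ins for the API response, not the actual output:

```python
import xml.etree.ElementTree as ET


def parses_as_xml(payload: bytes) -> bool:
    """Return True if the payload is accepted by the strict XML parser."""
    try:
        ET.fromstring(payload)
        return True
    except ET.ParseError:
        return False


# Hypothetical payloads: the first is well-formed UTF-8, the second ends the
# attribute value with a lone 0xE2 lead byte, as reported by the validator.
good = '<api><item comment="(és csak „"/></api>'.encode("utf-8")
bad = b'<api><item comment="(\xc3\xa9s csak \xe2"/></api>'

print(parses_as_xml(good))
print(parses_as_xml(bad))
```

A lenient consumer may silently paper over the bad byte, which would explain why one client reports no error while another fails.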

*** Bug 16101 has been marked as a duplicate of this bug. ***

Should be fixed in r45749: invalid UTF-8 sequences are replaced with the Unicode replacement character (U+FFFD).
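
The replacement approach can be sketched in Python as follows. This is not the code from r45749, only an illustration of the same idea using the built-in "replace" error handler; the function name clean_utf8 is hypothetical:

```python
def clean_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for any invalid sequence."""
    return raw.decode("utf-8", errors="replace")


# Hypothetical truncated comment ending in a lone 0xE2 lead byte.
damaged = b"(\xc3\xa9s csak \xe2"

cleaned = clean_utf8(damaged)
print(cleaned)

# The cleaned text now encodes to valid UTF-8 that an XML parser can accept.
cleaned.encode("utf-8").decode("utf-8")
```

The information in the trimmed character is still lost, but the output is well-formed again.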

(In reply to comment #7)

Not fixed:
http://en.wikipedia.org/w/api.php?action=query&format=xml&iiprop=comment&prop=imageinfo&titles=Image:Shakerredraider.jpg
still outputs invalid UTF-8.

Argh, array_walk_recursive() doesn't work the way I expected it to. Fixed in r47090.
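
The comment does not say what went wrong with array_walk_recursive(), and the change in r47090 is not shown here. As a rough Python sketch of the general idea, cleaning every string value in a nested result before output, assuming the result is a tree of dicts and lists with raw byte strings at the leaves (the function name clean_tree and the sample data are hypothetical):

```python
from typing import Any


def clean_tree(node: Any) -> Any:
    """Return a copy of a nested structure with all byte strings made valid.

    Dicts and lists are walked recursively; byte-string leaves are decoded
    with U+FFFD substituted for invalid UTF-8; other values pass through.
    Dict keys are left unchanged in this sketch.
    """
    if isinstance(node, dict):
        return {key: clean_tree(value) for key, value in node.items()}
    if isinstance(node, list):
        return [clean_tree(value) for value in node]
    if isinstance(node, bytes):
        return node.decode("utf-8", errors="replace")
    return node


# Hypothetical result structure with one damaged comment.
result = {
    "query": {
        "logevents": [
            {"logid": 142820, "comment": b"(\xc3\xa9s csak \xe2"},
        ]
    }
}

cleaned_result = clean_tree(result)
print(cleaned_result["query"]["logevents"][0]["comment"])
```

Returning a new structure rather than modifying values in place avoids depending on how the walker passes elements to its callback.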