
Accessing bug 9444 via XML RPC API crashes due to invalid byte sequence: "not well-formed (invalid token)"
Closed, Resolved (Public)

Description

[...]
body: '<?xml version="1.0" encoding="UTF-8"?><methodResponse><params><param><value><struct><member><name>bugs</name><value><struct><member><name>9444</name><value><struct><member><name>comments</name><value><array><data><value><struct><member><name>is_private</name><value><boolean>0</boolean></value></member><member><name>count</name><value><int>0</int></value></member><member><name>creator</name><value><string>papadako@csd.uoc.gr</string></value></member><member><name>time</name><value><dateTime.iso8601>20070329T08:11:13</dateTime.iso8601></value></member><member><name>bug_id</name><value><int>9444</int></value></member><member><name>author</name><value><string>papadako@csd.uoc.gr</string></value></member><member><name>text</name><value><string>A database error has occurred Query: SELECT\nmath_outputhash,math_html_conservativeness,math_html,math_mathml FROM math WHERE\nmath_inputhash = \'\xef\xbf\xbd\xef\xbf\xbd\xd7\xbe\xef\xbf\xbd\x1f\x11\xef\xbf\xbd\xef\xbf\xbd\x12@\x01\xcb\xb5\' LIMIT 1 Function: MathRenderer::_recall Error: 1\nERROR: invalid byte sequence for encoding "UTF8": 0xebc3d'
Traceback (most recent call last):

File "minimal.py", line 64, in <module>
  fetch(i)
File "minimal.py", line 49, in fetch
  com = server.Bug.comments(kwargs)['bugs'][bugid]['comments']
File "/usr/lib/python2.7/xmlrpclib.py", line 1224, in __call__
  return self.__send(self.__name, args)
File "/usr/lib/python2.7/xmlrpclib.py", line 1578, in __request
  verbose=self.__verbose
File "/usr/lib/python2.7/xmlrpclib.py", line 1264, in request
  return self.single_request(host, handler, request_body, verbose)
File "/usr/lib/python2.7/xmlrpclib.py", line 1297, in single_request
  return self.parse_response(response)
File "/usr/lib/python2.7/xmlrpclib.py", line 1467, in parse_response
  p.feed(data)
File "/usr/lib/python2.7/xmlrpclib.py", line 557, in feed
  self._parser.Parse(data, 0)

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 3, column 22


Version: wmf-deployment
Severity: major
See Also:
https://bugzilla.mozilla.org/show_bug.cgi?id=1055629

Details

Reference
bz69747

Event Timeline

bzimport raised the priority of this task to High (Nov 22 2014, 3:43 AM).
bzimport set Reference to bz69747.

We should probably drop the offending chars, for example via

$string =~ tr/\xea-\xef/-/;

somewhere before

text       => $self->type('string', $comment->body_full),

in

http://bzr.mozilla.org/bugzilla/4.4/view/head:/Bugzilla/WebService/Bug.pm#L296

I guess.
Late uneducated comment that might be blatantly wrong tomorrow morning.

[Mostly making comments here for myself.]

One problem here is that we have not identified with certainty which characters are actually offending; we are only guessing.
Another problem is that I cannot easily create a local testcase.
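
A rough sketch of a local testcase on the client side, in Python 3 rather than the Python 2.7 of the traceback. It assumes the offenders are raw C0 control characters such as the \x1f, \x11, \x12 and \x01 visible in the response body, which XML 1.0 does not allow. The names find_illegal_chars and parses_as_xml and the sample string are made up for illustration; only the expat calls are real.

```python
import re
from xml.parsers import expat

# C0 control characters that XML 1.0 does not allow in a document.
# Tab (\x09), line feed (\x0a) and carriage return (\x0d) are legal.
XML_ILLEGAL = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def find_illegal_chars(text):
    """Return (offset, repr) pairs for every XML-illegal control character."""
    return [(m.start(), repr(m.group())) for m in XML_ILLEGAL.finditer(text)]


def parses_as_xml(data):
    """Return None if expat accepts the bytes, else the error message."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except expat.ExpatError as exc:
        return str(exc)
    return None


# Hypothetical stand-in for a comment text containing raw control characters.
sample = "math_inputhash = '\x1f\x11' LIMIT 1"
document = (
    '<?xml version="1.0" encoding="UTF-8"?><string>%s</string>' % sample
).encode("utf-8")

print(find_illegal_chars(sample))  # where the suspicious characters sit
print(parses_as_xml(document))     # the expat error message, if any
```

If the guess is right, the second print shows the same "not well-formed (invalid token)" message as the traceback, and the first one lists the characters to look at.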

Workaround in https://bugzilla.mozilla.org/show_bug.cgi?id=839023#c10 : Use
$initial =~ s/([\x01-\x08\x0b\x0c\x0f-\x1f])/sprintf "\\x%02x", ord($1)/ge;
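
For anyone testing this outside Perl, a minimal Python sketch of the same substitution. It copies the character class exactly as quoted above, so it is an illustration of the workaround and not the code that runs in Bugzilla; the function name escape_control_chars is made up.

```python
import re

# Same character class as the quoted Perl substitution. As written it does
# not cover \x00 or \x0e; tab, LF and CR are skipped because XML allows them.
CONTROL_CHARS = re.compile(r'[\x01-\x08\x0b\x0c\x0f-\x1f]')


def escape_control_chars(text):
    """Replace each matched control character with a literal \\xNN sequence."""
    return CONTROL_CHARS.sub(lambda m: "\\x%02x" % ord(m.group()), text)


# A raw \x1f becomes the four visible characters \x1f; the tab is kept.
print(escape_control_chars("hash = '\x1f\x11'\tLIMIT 1"))
```

The result contains only printable text, so the XML parser on the client side no longer sees the control characters.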

http://perldoc.perl.org/perlebcdic.html#Quoted-Printable-encoding-and-decoding lists a similar example (its range also includes \x80-\xFF, which covers non-ASCII entirely):
$qp_string =~ s/([=\x00-\x1F\x80-\xFF])/sprintf("=%02X",ord($1))/ge;

The above workaround is overkill, though: if you replaced \x61 (letter: a), you would end up with "Wrong/unsupported datatype 'boole\\x61n' specified" in the XMLRPC response. Hence I am slightly concerned about unwanted side effects, but the above character range is nothing that should be in use anyway.

So I tested the two-liner hack with the less commonly used letter \xc4\x8d (letter: č) in some comments, and the char replacement worked as expected in the XMLRPC response.

Helpful tables for conversion: http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=string-literal

Change 155732 had a related patch set uploaded by Aklapper:
When exporting Bugzilla tickets via Chase's script we run into an API bug with specific Unicode letters for https://bugzilla.wikimedia.org/show_bug.cgi?id=9444#c0. This is applying a hackish upstream workaround described in https://bugzilla.mozilla.org/sh

https://gerrit.wikimedia.org/r/155732

Change 156100 had a related patch set uploaded by Aklapper:
Work around Bugzilla XML RPC bug with special Unicode characters

https://gerrit.wikimedia.org/r/156100

Change 155732 merged by Dzahn:
Create copy of upstream file (for followup custom change)

https://gerrit.wikimedia.org/r/155732

Change 156100 merged by Dzahn:
Work around Bugzilla XML RPC bug with special Unicode characters

https://gerrit.wikimedia.org/r/156100

A script querying the XML RPC API no longer aborts at ticket #9444, the XML still looks valid, and I have not seen any other explosions or incidents so far.

Closing as FIXED, crossing fingers it'll stay like that.

Note: As this workaround is applied to *any* output, it also damages binary attachment data. See https://phabricator.wikimedia.org/T815
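
A small Python sketch of one plausible way such damage happens, assuming the substitution runs over raw attachment bytes; I have not verified that this is the exact mechanism behind T815. The helper names and the type check in sanitize_value are hypothetical, not what Bugzilla does.

```python
import re

# Byte-level version of the workaround's character class.
CONTROL_BYTES = re.compile(rb'[\x01-\x08\x0b\x0c\x0f-\x1f]')


def escape_control_bytes(data):
    """Apply the workaround to raw bytes, as happens if it runs on any output."""
    return CONTROL_BYTES.sub(lambda m: b"\\x%02x" % m.group()[0], data)


def sanitize_value(value):
    """Hypothetical guard: escape text values only, pass binary data through."""
    if isinstance(value, bytes):
        return value
    return re.sub(
        r'[\x01-\x08\x0b\x0c\x0f-\x1f]',
        lambda m: "\\x%02x" % ord(m.group()),
        value,
    )


# The 8-byte PNG signature contains \x1a, which falls inside the range.
png_signature = b"\x89PNG\r\n\x1a\n"
damaged = escape_control_bytes(png_signature)
print(len(png_signature), len(damaged))  # the escaped copy is longer
print(sanitize_value(png_signature) == png_signature)
```

The escaped copy is no longer a valid PNG header, which is why the substitution would need to be restricted to text fields.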