
JSON encoding errors for characters outside the BMP
Closed, ResolvedPublic

Description

Consider the following query: http://localhost/w/api.php?action=query&format=xml&action=expandtemplates&text=%ef%bf%bd%f0%90%80%80%f3%b0%80%8fzzz

It contains 6 characters: U+fffd, U+10000, U+f000f, U+007a, U+007a, and U+007a. In JSON encoding, they should be \ufffd\ud800\udc00\udb80\udc0fzzz (U+10000 and U+f000f lie outside the BMP, so each must be encoded as a UTF-16 surrogate pair).
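For reference, the expected escaping can be reproduced with a few lines of arithmetic. This is a minimal sketch in Python, not the MediaWiki code; `json_escape` is a hypothetical helper, and it deliberately ignores quotes, backslashes, and control characters.

```python
def json_escape(text: str) -> str:
    """Escape non-ASCII characters as \\uXXXX, using surrogate pairs above U+FFFF."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # Sketch only: ASCII is passed through unchanged.
            out.append(ch)
        elif cp < 0x10000:
            # Inside the BMP: a single \uXXXX escape is enough.
            out.append(f"\\u{cp:04x}")
        else:
            # Outside the BMP: split the 20-bit offset into two 10-bit halves.
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            out.append(f"\\u{high:04x}\\u{low:04x}")
    return "".join(out)


sample = "\ufffd\U00010000\U000f000fzzz"
print(json_escape(sample))  # \ufffd\ud800\udc00\udb80\udc0fzzz
```

For U+f000f the offset is 0xe000f, which gives a high surrogate of 0xd800 + 0x380 = 0xdb80 and a low surrogate of 0xdc00 + 0x00f = 0xdc0f.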

If I change the format to jsonfm, the first three characters are instead encoded as \ufffd\ud800dc00\udb80dc0fzzz: the \u prefix of each low surrogate is missing, so the result cannot be decoded correctly. This should be relatively simple to fix, I think.
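To see why the jsonfm output is unusable, compare how a decoder reads the two forms. The sketch below uses Python's standard `json` module, which happens to tolerate lone surrogates; stricter decoders may reject the malformed input outright.

```python
import json

# Well-formed pair: high and low surrogate each carry their own \u prefix.
good = json.loads('"\\ud800\\udc00"')

# Malformed form from the report: the low surrogate lost its \u prefix.
bad = json.loads('"\\ud800dc00"')

print(len(good), hex(ord(good[0])))  # 1 0x10000
print(len(bad), hex(ord(bad[0])))    # 5 0xd800
print(bad[1:])                       # dc00
```

The malformed string decodes to an unpaired surrogate U+d800 followed by the four literal characters "dc00", rather than to the single character U+10000.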

If I change the format to json, it's even worse: the first two characters are output correctly as \ufffd\ud800\udc00, but the output is silently truncated right after them. Apparently PHP's built-in json_encode silently screws up anything above U+1ffff: U+20000-U+3ffff, U+80000-U+bffff, and U+100000-U+10ffff seem to be incorrectly encoded as U+10000-U+1ffff, while U+40000-U+7ffff and U+c0000-U+fffff seem to cause the silent truncation described above. The only fix I can think of is to detect whether these characters are present and use the fallback code instead.
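The detect-and-fall-back idea could look roughly like this. It is a sketch in Python under stated assumptions: `needs_fallback` and `encode_string` are hypothetical names, and the two encoders are passed in as plain callables rather than being the real PHP functions.

```python
from typing import Callable

Encoder = Callable[[str], str]


def needs_fallback(text: str) -> bool:
    """True if the text contains a code point above U+1FFFF (the affected range)."""
    return any(ord(ch) > 0x1FFFF for ch in text)


def encode_string(text: str, builtin: Encoder, fallback: Encoder) -> str:
    """Use the fast built-in encoder unless the text contains affected characters."""
    if needs_fallback(text):
        return fallback(text)
    return builtin(text)


# Stand-in encoders that only report which path was taken.
print(encode_string("\ufffd\U00010000", lambda s: "builtin", lambda s: "fallback"))
print(encode_string("\U000f000fzzz", lambda s: "builtin", lambda s: "fallback"))
```

The first call stays on the built-in path because U+10000 is encoded correctly; the second is routed to the fallback because U+f000f is above U+1ffff.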

I'll see about posting a patch later on.


Version: 1.14.x
Severity: normal

Details

Reference
bz16798

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:24 PM
bzimport set Reference to bz16798.

Created attachment 5625
Patch

The PHP bug has been reported at http://bugs.php.net/bug.php?id=46944

This patch adjusts the fallback JSON encoder to be able to handle UTF-16 surrogate pairs, and removes some of the support for invalid UTF-8 encoded characters above U+10FFFF.

It also adds a check to see if the PHP built-in json_encode is affected by PHP bug 46944, and uses our fallback code if so.
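A check of this kind can be done once by encoding a known probe character and comparing the result with the expected surrogate pair. The following is only a sketch in Python of that general technique, not the code from the attached patch; `encoder_is_broken`, `pick_encoder`, and the choice of U+f0000 as the probe are assumptions made for illustration.

```python
import json
from typing import Callable

Encoder = Callable[[str], str]

# U+F0000 lies in one of the ranges reported as affected.
PROBE = "\U000f0000"
# Offset 0xE0000 -> high 0xD800 + 0x380 = 0xDB80, low 0xDC00 + 0 = 0xDC00.
EXPECTED = '"\\udb80\\udc00"'


def encoder_is_broken(encode: Encoder) -> bool:
    """Return True if the encoder does not produce the expected surrogate pair."""
    try:
        # Compare case-insensitively, since hex digits may be upper or lower case.
        return encode(PROBE).lower() != EXPECTED
    except Exception:
        return True


def pick_encoder(builtin: Encoder, fallback: Encoder) -> Encoder:
    """Prefer the built-in encoder, but only if it passes the probe."""
    return fallback if encoder_is_broken(builtin) else builtin


print(encoder_is_broken(json.dumps))       # False
print(encoder_is_broken(lambda s: '""'))   # True (simulates silent truncation)
```

Probing the encoder itself, rather than checking a version number, means a fixed build is picked up automatically and an unfixed one keeps using the fallback.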


Will try to review this soon.

On a side note, PHP reports this as being fixed now.

(In reply to comment #4)
> On a side note, PHP reports this as being fixed now.

That's nice, but it means that older versions of PHP still have broken JSON formatters. At a quick glance, the patch seems to account for that and only falls back to our own JSON formatter if PHP's is broken.

Slightly modified patch applied in r45674.