Page MenuHomePhabricator

api can't output binary data, like icu sortkeys
Closed, ResolvedPublic

Description

When doing an api query that could return binary data (icu sortkeys for example), if the binary data isn't valid utf-8 (which will happen most of the time) it will output an empty string instead.

For example:
http://translatewiki.net/w/api.php?action=query&list=categorymembers&cmprop=sortkey&cmtitle=Category:User%20en&cmlimit=1

Expected output is a big long binary string.

This is rather confusing when trying to debug stuff, although arguably much of the time this behaviour does kind of make sense. Anyways, caused by ApiResult::cleanUpUTF8.


Version: 1.18.x
Severity: enhancement
URL: http://translatewiki.net/w/api.php?action=query&list=categorymembers&cmprop=sortkey&cmtitle=Category:User%20en&cmlimit=1

Details

Reference
bz28541

Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 11:27 PM
bzimport set Reference to bz28541.

This is rather confusing when trying to debug stuff, although arguably much of
the time this behaviour does kind of make sense. Anyways, caused by
ApiResult::cleanUpUTF8.

For reference, this was partially caused by a bug in UtfNormal::cleanUp when the intl extension was enabled (It returned false instead of the fixed string). Thats fixed in r86130. The larger issue of the api normalizing binary data (which reminds me, are you even allowed to put things like nulls into entity references in an xml document?) is still there.

If you need to output binary data in a structured text tree, it should be as hex strings (as with SHA-1 hashes of image files) or perhaps occasionally as base64.

It sounds here that something is overriding the category sort key (a manually-specified piece of text that replaces the page's own title for sorting purposes in category lists, which if not specified will take the original title's value) with a normalized binary sort key that's generated from it and used at a the low-level for sorting.

That should probably not be returned as cmsortkey (a well-known text field), but as an additional parameter which is specifically a post-processed binary key.

Probably whatever's changed the sort key is doing it directly in the database instead of adding a new field?

(If it's an extension that overrides saving of cl_sortkey with a value generated from the original value passed in, then it could probably also override the API output of the sort key; converting the output into a hex string will double the length of the returned data, but should lead to consistent behavior of clients that attempt to sort things based on the key values. Be careful of any other attempted use of the sort key in this situation...)

Actually both can be output now. the actual sortkey as specified by humans is called sortkey_prefix (may or may not have the underscore in the api, can't remember). The question is, what are users of the api using the original sortkey for - to show what its sorted under, or to sort items. One of the rationales for doing it as it is currently (binary data in sortkey field) is if people are using the info from the api to sort things, they would want to sort by comparing the binary field. So doing it this way somewhat preserves the semantics of the original field in a sense See bug 24650

hex might be the way to go here.

Fixed in r86257 by hex-encoding the sortkeys.

I experimented with armoring binary data and managed to get it to output correctly in JSON, but XML explicitly forbids non-printable ASCII characters. Even escaping them as  is forbidden.

Since sending binary data for sortkey presentation wasn't gonna work, that also presented a problem for sortkeys used as part of the cmcontinue parameter. For that reason I decided to hex-encode all sortkeys (in list=categorymembers and prop=categories output, and in cmcontinue) and decode hex sortkeys in the cmcontinue back to binary when using them in the SQL query.