
Incorrect decoding of QSON
Closed, Resolved (Public)

Description

16:43 <DarTar> ori-l: SELECT event_targetTitle FROM GettingStarted_5243394 WHERE uuid = "51c6149554d85e77b665f303f28adf25";
16:43 <DarTar> Héctor Elizondo


Version: unspecified
Severity: major

Details

Reference
bz45262

Event Timeline

bzimport raised the priority of this task to High (Nov 22 2014, 1:16 AM).
bzimport set Reference to bz45262.

Event in question is in all-events-json.log-20130215.gz

It turns out this is not a bug in json2sql but in the instrumentation of GettingStarted; updating the ticket accordingly.

I tested again with a page called 'Some contrivéd page name!'()*~' (no quotes).

The JSON is:

{"event":{"action":"gettingstarted-click","funnel":"gettingstarted","targetTitle":"Some contrivéd page name!'()*~","experimentId":"ob3","userId":1,"isNew":false},"isValid":true,"revision":5219269,"schema":"GettingStarted","webHost":"127.0.0.1","wiki":"testwiki"}

Note that the logging for GettingStarted is in E3Experiments. So if it were a client-side bug, it would probably be there.

For the record, the page in question above is https://en.wikipedia.org/wiki/H%C3%A9ctor_Elizondo

à is http://www.fileformat.info/info/unicode/char/00c3/index.htm
© is http://www.fileformat.info/info/unicode/char/00a9/index.htm
é (the correct one) is http://www.fileformat.info/info/unicode/char/00e9/index.htm

If you follow the last link, you will see the UTF-8 is:

UTF-8 (hex) 0xC3 0xA9 (c3a9)

So it looks like the two UTF-8 bytes (0xC3, 0xA9) are being separated and each one widened to its own code point, U+00C3 (Ã) and U+00A9 (©), which is the four-hex-digit form that site happens to use in its URLs.
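
For reference, the corruption described above can be reproduced in a few lines. This is only a sketch of the suspected mechanism (each UTF-8 byte treated as its own code point, as a Latin-1 decode would do), not the actual EventLogging code path; the variable names are made up for illustration.

```javascript
// Sketch: reproduce the "Ã©" corruption by reading UTF-8 bytes one at a time.
// TextEncoder and TextDecoder are standard in browsers and Node.js.
const original = 'H\u00e9ctor Elizondo'; // "Héctor Elizondo"

// Encode to UTF-8: "é" (U+00E9) becomes the two bytes 0xC3 0xA9.
const utf8Bytes = new TextEncoder().encode(original);

// Wrong: map every byte to its own code point (what a Latin-1 decode does).
const garbled = Array.from(utf8Bytes, (b) => String.fromCharCode(b)).join('');

// Right: decode the same bytes as UTF-8.
const roundTripped = new TextDecoder('utf-8').decode(utf8Bytes);

console.log(garbled);      // HÃ©ctor Elizondo
console.log(roundTripped); // Héctor Elizondo
```

The garbled string matches what the SELECT in the description returned, which is consistent with a byte-wise decode somewhere in the pipeline.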

But for now, back to EventLogging.

Nope, it wasn't GettingStarted. Fixed in change I0f4ea76b911e572405bcfbde23be74d29f7fd783.

Adding a bit of documentation for future reference. If we run into Unicode / URL issues in the future, we can try replacing all code points above the ASCII range with Unicode escape sequences:

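function escapeChar( ch ) {
	// Zero-pad the UTF-16 code unit to four hex digits, e.g. "é" becomes "\u00e9".
	var codePoint = '0000' + ch.charCodeAt( 0 ).toString( 16 );
	return '\\u' + codePoint.slice( -4 );
}

function toSafeJSON( obj ) {
	// $.toJSON is provided by the jquery-json plugin.
	var json = $.toJSON( obj );
	// Escape DEL (U+007F) and everything above it; printable ASCII is left as-is.
	return json.replace( /[\u007f-\uffff]/g, escapeChar );
}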
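
A quick way to sanity-check the escaping idea outside MediaWiki is sketched below. It assumes JSON.stringify as a stand-in for $.toJSON so that it runs without jQuery, and the `event` object is a cut-down, hypothetical version of the test event from earlier in this ticket.

```javascript
// Sketch: same escaping idea, with JSON.stringify substituted for $.toJSON
// (an assumption made so the example is self-contained).
function escapeChar(ch) {
  const hex = '0000' + ch.charCodeAt(0).toString(16);
  return '\\u' + hex.slice(-4);
}

function toSafeJSON(obj) {
  return JSON.stringify(obj).replace(/[\u007f-\uffff]/g, escapeChar);
}

// Hypothetical, trimmed-down event using the contrived test title.
const event = { targetTitle: "Some contriv\u00e9d page name!'()*~" };
const safe = toSafeJSON(event);

console.log(safe);
// {"targetTitle":"Some contriv\u00e9d page name!'()*~"}

// The output is pure ASCII, and parsing it restores the original string.
console.log(/^[\x00-\x7e]*$/.test(safe));                        // true
console.log(JSON.parse(safe).targetTitle === event.targetTitle); // true
```

Because the escaped output contains only ASCII, it should survive URL encoding and any byte-wise decoding step unchanged, which is the point of the workaround.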

If this problem does crop up again, let's try to figure out the underlying cause before trying something like toSafeJSON.

[moving from MediaWiki extensions to Analytics product - see bug 61946]