
Bot encoding messed up: unicode characters (åö etc.) broken
Closed, Resolved, Public

Description

For instance, from the logs:

[17:55:16] <wikibugs_> \x0303(mod)\x03 Wrong language code for Norwegian Bokm�l (Android) - \x0310https://bugzilla.wikimedia.org/49340\x03 +comment (\x02\x0310Niklas Laxström\x03\x02)
http://bots.wmflabs.org/~wm-bot/logs/%23mediawiki/20130608.txt

which however for Niklas (with automatic encoding detection) showed up as:

[20:54:42] wikibugs_> (mod) Wrong language code for Norwegian Bokmål (Android) - https://bugzilla.wikimedia.org/49340 +comment (Niklas Laxström)


Version: unspecified
Severity: normal

Details

Reference
bz49342

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 2:01 AM
bzimport set Reference to bz49342.

[18:15:27] <wikibugs_> \x0303(NEW)\x03 Bot encoding messed up: �� unicode characters broken - \x0310https://bugzilla.wikimedia.org/49342\x03 major; \x02Wikimedia\x02: wikibugs IRC bot; (\x02\x02)

so it doesn't need both summary and username to have non-ASCII characters as first suspected.

The bot has been semi-randomly messing up Unicode since ever. The "ń" in my last name is sometimes mangled as well (but not always, for me at least). I have been unable to determine the cause of this behavior.

(In reply to comment #2)

The bot has been semi-randomly messing up Unicode since ever.

To me it seems mostly a recent thing.

The "ń" in my last name is sometimes mangled as well (but not always, for me at least). I have been unable to determine the cause of this behavior.

The most obvious reason would be that the summary and the other headers have different encodings. The bot reads email, which is now HTML by default, so this may be the reason, and it would be fixed by upgrading Bugzilla to 4.4: https://bugzilla.mozilla.org/show_bug.cgi?id=777685
To test it, it may be enough for a Bugzilla admin to change wikibugs' preferences so that it receives plain text notifications (unless that bug also affects those).

Scratch that, it's still broken due to Bugzilla not conforming to the e-mail RFC:

Wikimedia / wikibugs IRC bot: Bot encoding messed up: unicode characters (=?UTF-8?Q?=C3=A5=C3=B6=20etc=2E?=) broken

(=?UTF-8?Q? is only allowed to start after whitespace, not after a '(', so the Python email parser fails to recognise it by default)
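For illustration only, here is a minimal sketch of what that encoded word stands for. It decodes RFC 2047 encoded words wherever they occur and deliberately ignores the whitespace rule, so it is not what the Python email parser does. The helper name `decode_encoded_words` and the regular expression are assumptions made for this example.

```python
import binascii
import re

# Matches an encoded word anywhere in the text, with no requirement
# that it be preceded or followed by whitespace (unlike the RFC).
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([QqBb])\?([^?]*)\?=")


def decode_encoded_words(text):
    """Hypothetical helper: decode every encoded word found in text."""

    def _decode(match):
        charset, encoding, payload = match.groups()
        if encoding.upper() == "Q":
            # header=True also maps "_" to a space, as the Q encoding requires.
            raw = binascii.a2b_qp(payload.encode("ascii"), header=True)
        else:
            raw = binascii.a2b_base64(payload.encode("ascii"))
        return raw.decode(charset, errors="replace")

    return ENCODED_WORD.sub(_decode, text)


subject = ("Bot encoding messed up: unicode characters "
           "(=?UTF-8?Q?=C3=A5=C3=B6=20etc=2E?=) broken")
print(decode_encoded_words(subject))
# Bot encoding messed up: unicode characters (åö etc.) broken
```

A lenient decoder like this could also rewrite text that merely looks like an encoded word, which is presumably why a strict parser insists on the surrounding whitespace.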

Any testcase for such a Bugzilla notification? How common is that?

(In reply to Andre Klapper from comment #5)

Any testcase for such a Bugzilla notification?

This bug report is designed to be the test case for itself. :) Just edit something and check its summary on IRC.

Thanks. I'm stupid. :)

Subject: [Bug 49342] Bot encoding messed up: unicode characters
(=?UTF-8?Q?=C3=A5=C3=B6=20etc=2E?=) broken

Either an issue in Perl's Email::MIME or somewhere around http://bzr.mozilla.org/bugzilla/4.4/view/head:/Bugzilla/Mailer.pm#L74

Upstream I only found https://bugzilla.mozilla.org/show_bug.cgi?id=387860 which is different and fixed.

This is about the old wikibugs bot, right? Try how pywikibugs handles it.

*** Bug 64354 has been marked as a duplicate of this bug. ***

No, this is currently an issue with pywikibugs -- Bugzilla sends non-RFC-compliant mails and the Python 3.4 mail parser does not handle this gracefully. I special-cased the "=?UTF-8?Q?=20" case (replace with " =?UTF-8?Q?"), but this doesn't take care of the (=?UTF-8?Q? and "=?UTF-8?Q? cases.
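A rough sketch of that special case, assuming the intended start delimiter is "=?" (the one used in the subject quoted earlier). The helper name `move_leading_space_out` and the first sample subject are made up for this example; the actual pywikibugs code may look different.

```python
def move_leading_space_out(raw_subject):
    """Hypothetical helper: turn an encoded leading space (=20) into a
    real space in front of the encoded word, so the word follows whitespace."""
    return raw_subject.replace("=?UTF-8?Q?=20", " =?UTF-8?Q?")


# Made-up subject where the encoded word starts with an encoded space.
fixed = move_leading_space_out("characters=?UTF-8?Q?=20=C3=A5=C3=B6?= broken")
print(fixed)  # characters =?UTF-8?Q?=C3=A5=C3=B6?= broken

# The subject of this bug is left untouched: the encoded word still
# directly follows "(", so a strict parser will still not recognise it.
same = move_leading_space_out(
    "characters (=?UTF-8?Q?=C3=A5=C3=B6=20etc=2E?=) broken")
print(same)
```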

Fixed by fully monkey-patching get_unstructured; the patch at bugs.python.org adds the following lines:

if "=?" in tok and not tok.startswith("=?"):
    tok, rest = tok.split("=?", 1)
    remainder.insert(0, "=?" + rest)

which makes sure that any "=?" in the middle of a token is split off and parsed as the start of an encoded word.
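To show the effect of those lines in isolation, here is a simplified stand-alone sketch. It only mimics the token loop; the real get_unstructured does considerably more, and the function name `split_tokens` is an assumption for this example.

```python
def split_tokens(value):
    """Hypothetical stand-in for the token loop: split on whitespace, then
    split again wherever "=?" starts in the middle of a token."""
    remainder = value.split()
    tokens = []
    while remainder:
        tok = remainder.pop(0)
        if "=?" in tok and not tok.startswith("=?"):
            # Keep the text before "=?" as its own token and push the
            # rest back so it is examined as a possible encoded word.
            tok, rest = tok.split("=?", 1)
            remainder.insert(0, "=?" + rest)
        tokens.append(tok)
    return tokens


print(split_tokens("characters (=?UTF-8?Q?=C3=A5=C3=B6=20etc=2E?=) broken"))
# ['characters', '(', '=?UTF-8?Q?=C3=A5=C3=B6=20etc=2E?=)', 'broken']
```

The "(" ends up as a separate token, so the next token starts with "=?" and can be tried as an encoded word.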

(In reply to Merlijn van Deen from comment #4)

Scratch that, it's still broken due to Bugzilla not conforming to the e-mail RFC

FYI, I upstreamed it as https://bugzilla.mozilla.org/show_bug.cgi?id=1000988