
Incorrect Subject in e-mail notifications (=?UTF-8?Q?=)
Closed, Invalid, Public

Description

Author: shembel

Description:
My wiki is sending e-mail notifications with a Subject like this (example):

Subject: =?UTF-8?Q?=D0=A1=D1=82=D1=...

Maybe in my settings there is an error?

I use:
Linux wiki.* 2.6.28.10-vs2.3.0.36.11 #2 SMP Mon May 25 18:55:50 MSD 2009 i686 GNU/Linux
(Debian)

MediaWiki 1.14.0
$wgDBTableOptions = "ENGINE=MyISAM, DEFAULT CHARSET=binary";
$wgLanguageCode = "ru";

PHP 5.2.9-4 (apache2handler)
MySQL 5.0.51a-24

Extensions:
Confirm Users Email (v 2.0)
Password Reset (v 1.7)
Renameuser (v r39841)
User Merge and Delete (v 1.6)
CharInsert (v r36357)
ParserFunctions (v 1.1.1)
StringFunctions (v 2.0)

Other:
News (v r34306)
Newuserlog (v r36653)

Functionality addons
efConfirmUsersEmail, efUserMerge, wfNewsExtension, wfSetupParserFunctions and wfStringFunctions

Tags
<charinsert>, <news>, <newsfeed>, <newsfeedlink> and <pre>

Parser function hooks ??? :-)
anchorencode, defaultsort, displaytitle, explode, expr, filepath, formatnum, fullurl, fullurle, grammar, if, ifeq, iferror, ifexist, ifexpr, int, language, lc, lcfirst, len, localurl, localurle, ns, numberingroup, numberofadmins, numberofarticles, numberofedits, numberoffiles, numberofpages, numberofusers, numberofviews, pad, padleft, padright, pagesincategory, pagesize, plural, pos, rel2abs, replace, rpos, special, sub, switch, tag, time, timel, titleparts, uc, ucfirst, urldecode and urlencode

Please, help!


Version: 1.14.x
Severity: enhancement
OS: Linux

Details

Reference
bz19001

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:37 PM
bzimport added a project: MediaWiki-Email.
bzimport set Reference to bz19001.
bzimport added a subscriber: Unknown Object (MLST).

shembel wrote:

PEAR settings:

pear list

Installed packages, channel pear.php.net:

Package           Version  State
Archive_Tar       1.3.3    stable
Console_Getopt    1.2.3    stable
Mail              1.1.14   stable
Net_SMTP          1.3.2    stable
Net_Socket        1.0.9    stable
PEAR              1.8.1    stable
Structures_Graph  1.0.2    stable
XML_Util          1.2.1    stable

shembel wrote:

Newuserlog (v r36653) extension was removed, because it was included in 1.14.0.

This is NOT a bug, but the standard way to encode email subject lines that contain non-ASCII characters. Here the Quoted-Printable encoding is used, and the "=?UTF-8?Q?" markup only specifies that the encoded bytes must be decoded as UTF-8 and not with another charset (which cannot be specified anywhere else, given that historically subject lines supported only ASCII).
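As a minimal sketch (not part of the original report), this markup can be decoded with Python's standard email.header module. The helper name decode_subject is hypothetical, and the sample value uses only the complete leading bytes of the subject quoted in the report, not the truncated remainder.

```python
from email.header import decode_header


def decode_subject(raw):
    """Decode a Subject value that may contain =?charset?Q?...?= or =?charset?B?...?= markup."""
    parts = []
    for data, charset in decode_header(raw):
        if isinstance(data, bytes):
            # Encoded parts carry their charset; unencoded parts are plain ASCII.
            parts.append(data.decode(charset or "ascii", errors="replace"))
        else:
            # A header without any markup comes back as a plain string.
            parts.append(data)
    return "".join(parts)


# The four bytes D0 A1 D1 82 are the UTF-8 encoding of two Cyrillic letters.
print(decode_subject("=?UTF-8?Q?=D0=A1=D1=82?="))  # prints "Ст"
```

A mail client that implements this decoding shows readable Cyrillic text instead of the raw markup.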

Note that the email body is encoded separately, with another email header (the "Content-Type:" header) that applies ONLY to the email body. The body does not need the Quoted-Printable or Base64 transfer syntax, as almost all SMTP and POP3 servers used today are 8-bit clean, as are all IMAP servers.
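To illustrate the separation between header and body encoding, here is a small sketch using Python's standard email.message module. It is not how MediaWiki builds its notifications; build_notification and the sample texts are hypothetical.

```python
from email.message import EmailMessage


def build_notification(subject, body):
    """Build a message whose body is sent as raw 8-bit UTF-8."""
    msg = EmailMessage()
    msg["Subject"] = subject
    # The charset of the body is declared in the Content-Type header.
    msg.set_content(body, charset="utf-8", cte="8bit")
    return msg


msg = build_notification("Привет", "Текст страницы изменён.")

# The serialized Subject header is pure ASCII with =?utf-8?...?= markup.
subject_line = next(
    line for line in msg.as_bytes().splitlines() if line.startswith(b"Subject:")
)
print(subject_line)
print(msg["Content-Type"])
print(msg["Content-Transfer-Encoding"])
```

The Subject header is the only place where the markup appears; the body relies on its own headers.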

Remember that throughout SMTP email headers, only the ASCII subset is safe (that's why Quoted-Printable and Base64 are still used today). Ideally, headers should be 8-bit clean too, but there is still no standard way to specify which encoding is actually used in the email subject line, notably because the Subject line is NOT sent to SMTP servers using MIME headers, but within SMTP command lines that are still NOT 8-bit clean. There are still a lot of SMTP servers deployed worldwide (including at major ISPs) that use the older RFCs for this protocol and not its ESMTP revision; they are afraid to deploy ESMTP due to possible additional security risks.

Note that it is IMPOSSIBLE to safely determine whether the target SMTP server understands the ESMTP protocol or is 8-bit clean before starting a negotiation, and most SMTP servers do not support negotiation of the protocol. Emails can only be sent in a single request that is accepted or rejected immediately, without a machine-readable response explaining why it failed, and any failure will generally cause the SMTP server to close the connection immediately.

If your email agent does not support the standard Quoted-Printable (or Base64) encoding with this markup in the emails you receive, change it or upgrade it, as it is really out of date and does not support the standard MIME RFCs.

Note that non-ASCII bytes in a subject line that are not properly re-encoded with a disambiguating transfer syntax like Quoted-Printable or Base64 should be assumed today to be encoded as UTF-8 by default. But many legacy email user agents do not make this assumption, and just assume their own local system encoding. The result is mojibake, where Cyrillic or Chinese text gets displayed as if it were Windows-1252 or ISO-8859-1, or the reverse.
The result is clearly unpredictable with old email agents.
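The mojibake effect is easy to reproduce. The following sketch is only an illustration; the helper name misread and the sample word are hypothetical.

```python
def misread(text, sent_as="utf-8", read_as="cp1252"):
    """Show what a reader sees when bytes are decoded with the wrong charset."""
    return text.encode(sent_as).decode(read_as, errors="replace")


original = "Привет"
garbled = misread(original)
# Each Cyrillic letter is two UTF-8 bytes, so six letters become twelve characters.
print(garbled)
```

Which garbage appears depends entirely on the local encoding the agent assumes, which is why the outcome differs from one old agent to another.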

Unfortunately, the same old email agents (including the webmail services of various ISPs) frequently do not correctly decode Quoted-Printable and Base64 either!

In all cases you get unpredictable mojibake with old user agents. It's time for you to upgrade yours (or to change your webmail provider). I really think that all modern email agents should be able to use UTF-8 as the default encoding of MIME headers (including subject lines) for all incoming mail, should allow the user to force another encoding (because guessing the encoding from a short subject line really does not work as well as it does on email bodies and web pages), and should also support the explicit Quoted-Printable and Base64 markup.

And in your case, where you receive many emails in Russian with Cyrillic letters (and not Latin) making up most characters of the subject lines, the Quoted-Printable encoding is a bad choice. MediaWiki would probably do better to use Base64 (which will be shorter), even if this still appears as mojibake for you. MediaWiki could test the string to see which of Base64 or Quoted-Printable is shorter, and should avoid multiple Quoted-Printable sections in the same subject line (when it contains spaces or other ASCII characters between Cyrillic words).
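The length test suggested here could look like the sketch below. It is an assumption-laden illustration, not MediaWiki code: the function names are hypothetical, the Q encoder conservatively escapes everything except ASCII letters and digits, and the 75-character limit on a single encoded section is ignored.

```python
import base64


def encode_b(text, charset="utf-8"):
    """Wrap the whole text in one Base64 section."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?B?{payload}?="


def encode_q(text, charset="utf-8"):
    """Wrap the whole text in one Quoted-Printable style section."""
    out = []
    for byte in text.encode(charset):
        ch = chr(byte)
        if ch.isascii() and ch.isalnum():
            out.append(ch)             # safe ASCII letter or digit
        elif ch == " ":
            out.append("_")            # a space is written as an underscore
        else:
            out.append(f"={byte:02X}")  # everything else is escaped
    return f"=?{charset}?Q?{''.join(out)}?="


def shorter_encoding(text, charset="utf-8"):
    """Return whichever of the two encodings is shorter."""
    return min(encode_b(text, charset), encode_q(text, charset), key=len)


print(shorter_encoding("Привет"))    # mostly Cyrillic: Base64 is shorter
print(shorter_encoding("Re: test"))  # mostly ASCII: Quoted-Printable is shorter
```

Encoding the whole subject as a single section also avoids the multiple Quoted-Printable sections mentioned above.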

Google Mail uses another strategy when sending emails: not only does it try both transfer syntaxes, but it also checks which characters are used so that it can use some common ISO-8859 or CJK encodings, and then re-encodes the text with one of the two transfer syntaxes (if there are non-ASCII characters).

Google Mail uses various tricks to detect target ISPs in order to select an encoding that their webmail will support and display properly, and it monitors the emails received from people in your contact list, so that you reply to them using the same encoding they used when sending emails to you. Unfortunately, this technique cannot be used by MediaWiki, which does not have a database recording what the various ISPs in the world support, and does not have access to your web contact list.

All that MediaWiki COULD do is include a preference in your user account to specify an encoding that you can read with YOUR email agent, which would be used by default if the subject line contains only characters from your preferred charset. Otherwise it would still fall back to UTF-8, using Base64 or Quoted-Printable according to your preferences, or to a transliteration into your preferred charset if that is possible without excessive losses.
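A rough sketch of such a preference, with the fallback to UTF-8, is shown below. No such setting exists in MediaWiki as far as this thread shows; the function name and the KOI8-R default are hypothetical, and the transliteration step is left out.

```python
def pick_subject_charset(subject, preferred="koi8-r"):
    """Use the reader's preferred charset when it can represent the whole subject."""
    try:
        subject.encode(preferred)
        return preferred
    except UnicodeEncodeError:
        # At least one character is outside the preferred charset.
        return "utf-8"


print(pick_subject_charset("Привет"))  # Cyrillic fits into KOI8-R
print(pick_subject_charset("日本語"))   # CJK does not, so UTF-8 is used
```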

shembel wrote:

Philippe Verdy! Thanks a lot! Very detailed answer!

Closing as invalid based on Verdy's answer and the fact that the reporter seems satisfied with it.