
Use only ASCII characters in email confirmation links
Closed, Resolved (Public)

Description

Localized versions of Special:Confirmemail might contain non-ASCII characters, and not all email clients and/or browsers handle such characters reliably. When this happens, the link opens a new article in the main namespace, and the user is unable to register (and might even end up writing junk articles while trying to do so). Thus either use the unlocalized name of Special:Confirmemail in the link in the email, or use proper urlencoding. (The latter seems less reliable, because a misconfigured client may still decode it, and the browser may interpret it as something other than UTF-8.)
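
As a rough illustration of the urlencoding option, here is a minimal Python sketch (not MediaWiki's actual code). The base URL, the localized title "Spécial:Confirmer", the token, and the helper name `confirmation_link` are all hypothetical. It percent-encodes the UTF-8 bytes of the path so the resulting link is pure ASCII, and also shows the failure mode mentioned above, where a client decodes the bytes as something other than UTF-8.

```python
from urllib.parse import quote, unquote


def confirmation_link(base_url: str, title: str, token: str) -> str:
    """Build an ASCII-only link by percent-encoding the UTF-8 bytes of the path.

    Hypothetical helper for illustration; ":" and "/" are left unescaped.
    """
    path = quote(f"{title}/{token}", safe=":/")
    return f"{base_url}/{path}"


# Hypothetical wiki URL, localized special page title, and confirmation token.
link = confirmation_link("https://wiki.example.org/wiki", "Spécial:Confirmer", "abc123")
print(link)            # https://wiki.example.org/wiki/Sp%C3%A9cial:Confirmer/abc123
print(link.isascii())  # True

# A client that decodes the escapes as Latin-1 instead of UTF-8 gets mojibake,
# which is the risk described for the urlencoding approach.
print(unquote("Sp%C3%A9cial", encoding="latin-1"))  # SpÃ©cial
```

Using the unlocalized page name avoids the decoding question entirely, since the path then contains no bytes outside ASCII to begin with.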


Version: unspecified
Severity: trivial

Details

Reference
bz11547

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 9:59 PM
bzimport set Reference to bz11547.
bzimport added a subscriber: Unknown Object (MLST).

Got another mail from a user unable to register his email just now. Please fix this; it should be trivial.

henry wrote:

Repost from OLPC rt bug 1632:

I tried webmaster, but that didn't work. The confirmation message came from
you so I'll try that.

I gave the wiki signup screen an email address of:

hgm+olpc@ip-64-139-1-69.sjc.megapath.net

It sent the confirmation to:

olpc@ip-64-139-1-69.sjc.megapath.net

That name might have been too long, but I expect some parser chopped things
off after the "+" rather than understanding that it's a valid character in
email names.

I also tried "-" rather than "+". Same results.

A warning message might avoid some confusion. Most people probably won't
know enough (or have access to) their mail server's log files.

I assume you are familiar with using "+" to make tagged addresses. If not,
I'll say more.

These are my opinions, not necessarily my employer's. I hate spam.

URL encoding is definitely NOT the correct way to make the "user@" part of an email address valid. Read the RFCs:
URL encoding applies only to the hierarchical page name within a domain space (under a hierarchical protocol such as "http(s):" or "ftp(s):"),
as well as to query parameters (when those protocols support them).

Valid user names in email addresses also use a "safe" alphabet different from that for domain names (which also DO NOT use URL encoding, but rather the encodings supported in IDNA if they are internationalized, and those of the DNS specifications otherwise).
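
To make the distinction concrete, here is a small Python sketch contrasting the two encodings. The label "bücher" and the "example" domain are illustrative choices, and Python's built-in "idna" codec implements the older IDNA 2003 rules, so treat this as a demonstration of the idea rather than a complete IDNA implementation.

```python
from urllib.parse import quote

# Hypothetical internationalized host name, used only for illustration.
host = "bücher.example"

# Domain names are converted with IDNA (Punycode labels prefixed "xn--").
ascii_host = host.encode("idna").decode("ascii")
print(ascii_host)

# Percent-encoding, by contrast, is what URL paths and query strings use.
print(quote("bücher"))  # b%C3%BCcher

# The IDNA form converts back to the original Unicode name.
assert ascii_host.encode("ascii").decode("idna") == host
```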

For example, the underscore character "_" (which is part of my own email address and cannot be replaced by a "+" or "-", and not even by "%7E") or the exclamation mark "!" is perfectly safe (and standard) in the "user@" part. That part is in fact not really described as a user name, but as an identity specifier whose internal syntax may contain a user name and some other authorization data that cannot be safely stripped out or separated (some sites will use the colon ":" instead of the exclamation mark).

Mapping any Unicode characters with UTF-8 or other representations into a valid "user@" part of an email address is completely unspecified (there's absolutely no reliable algorithm to do this, as the mapping is completely domain-dependent and may even be different from the mapping used for encoding usernames in URI schemes other than "mailto:"). All that can be done is to check that the "user@" part provided uses the valid ASCII subset which is specific to the "mailto:" URI scheme (and distinct from the ASCII subsets used either in the DNS protocol for domain names or in the server-local address part of HTTP/FTP URLs).

Note also that "user@" parts in email addresses are normally CASE-SIGNIFICANT (even if most target SMTP servers will accept emails using any case, and even if some RFCs require that users provide an email address containing a user name that can be used as a valid label in a DNS subdomain, in order to activate some functionality); SMTP relay agents (as well as senders) MUST NOT change the letter case in a pseudo-canonicalization, because they can't reliably know whether the recipient server makes the case distinction: this could simply break the authorization data which is part of the "user@" part (for example, it could contain Base64-encoded binary data, in addition to representing the user identity on the target server, where it will be delivered to the target POP3/IMAP/WebMail user's mailbox).
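
A minimal Python sketch of such a check follows. It assumes the unquoted "dot-atom" form of the local part from RFC 5322 (quoted-string local parts are deliberately not handled), and the helper names `is_plain_local_part` and `split_address` are hypothetical. It accepts "+", "_" and "!" as ordinary characters, leaves the case of the local part untouched, and lowercases only the domain, which is case-insensitive.

```python
import re

# "atext" characters allowed in an unquoted local part (RFC 5322, section 3.2.3).
_ATEXT = r"A-Za-z0-9!#$%&'*+/=?^_`{|}~-"
# One or more atext runs separated by single dots (the dot-atom form).
_DOT_ATOM = re.compile(rf"[{_ATEXT}]+(?:\.[{_ATEXT}]+)*")


def is_plain_local_part(local: str) -> bool:
    """Return True if `local` is an ASCII dot-atom; nothing is stripped or rewritten."""
    return _DOT_ATOM.fullmatch(local) is not None


def split_address(address: str):
    """Split at the last "@"; keep the local part verbatim, lowercase only the domain."""
    local, _, domain = address.rpartition("@")
    return local, domain.lower()


print(is_plain_local_part("hgm+olpc"))   # True: "+" is a valid character
print(is_plain_local_part("user_name"))  # True
print(is_plain_local_part("na\u00efve")) # False: non-ASCII is rejected, not remapped
print(split_address("Hgm+OLPC@IP-64.Example.NET"))
```

The point of the design is that validation only accepts or rejects the address; it never truncates at "+" or changes case in the local part.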

(In reply to comment #2)
(In reply to comment #3)

You should probably open a new bug to discuss that; this one is about the lack of urlencoding in the confirmation link, which is a wholly unrelated issue.

karun.84 wrote:

I do not think we should be using ASCII only. Rather, we should use UTF-8, because MediaWiki needs to support more than just English.

Please do not abuse the bug tracking system by changing the summary to subvert the entire point of a bug.

karun.84 wrote:

This looks like an upstream problem, if browsers and email clients cannot support these characters.
What browsers and email clients does this occur with?

AFAICT, the actual issue behind this bug was fixed way back in r35505, and this is actually a dupe of bug 6957.

*** This bug has been marked as a duplicate of bug 6957 ***