Page MenuHomePhabricator

mediawiki.Uri crashes in pages with '@' in their url
Closed, ResolvedPublic

Description

It isn't just that this would throw an exception:

new mw.Uri('http://krinkle.dev/mediawiki/alpha/index.php?title=VisualEditor:S@box&action=edit');

Although that is also bad (when done on non-@ pages), it is especially bad that on pages containing an @ in the url, the module won't even finish loading because of this line:

defaultUri = new Uri( documentLocation );

So instead, mw.Uri stays undefined.


Version: 1.19
Severity: critical

Details

Reference
bz42513

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 1:03 AM
bzimport set Reference to bz42513.

Krinkle: As dependency bug 42306 is assigned to, do you work on this?

Not yet, but if nobody does I will eventually.

I googled a little for a reasonable JS URI parsing library that could replace those half-baked regexes, and found this: https://gist.github.com/2428561

It seems genius and is 100% correct. Is it a good idea to use something like this? If it is, I can implement it.

It should be noted that those are HTML5-specific properties.

Found the problem. Expect a patch in the near future.

(In reply to comment #3)

I googled a little for a reasonable JS URI parsing library that could replace
those half-baked regexes, and found this: https://gist.github.com/2428561

It seems genius and is 100% correct. Is it a good idea to use something like
this? If it is, I can implement it.

That's not a URI parsing library, that's an unreliable hack using <a href=""> parsing. See the comments there.

On further investigation and following links from that gist, it looks like the regexes in mw.Uri aren't so half-baked.

They actually come from here:
http://blog.stevenlevithan.com/archives/parseuri

(In reply to comment #6)

(In reply to comment #3)

I googled a little for a reasonable JS URI parsing library that could replace
those half-baked regexes, and found this: https://gist.github.com/2428561

It seems genius and is 100% correct. Is it a good idea to use something like
this? If it is, I can implement it.

That's not a URI parsing library, that's an unreliable hack using <a href="">
parsing. See the comments there.

On further investigation and following links from that gist, it looks like
the
regexes in mw.Uri aren't so half-baked.

They actually come from here:
http://blog.stevenlevithan.com/archives/parseuri

Indeed. You'll also notice that the user and password aren't parsed at all in that implementation.

Also, https://gerrit.wikimedia.org/r/44665

(In reply to comment #6)

On further investigation and following links from that gist, it looks like
the
regexes in mw.Uri aren't so half-baked.

They actually come from here:
http://blog.stevenlevithan.com/archives/parseuri

If they weren't half-baked, mw.Uri wouldn't crash on '@' in the URL.

These regexes are already awful (177 characters? Seriously?), and are bound to get worse as we discover more edge-cases.

What about password with a '@' in it? (Yeah, this is allowed as far as I know, and works.) What about whitespace in the authority part? (This isn't allowed, but those regexes match it just fine.)

While on second though the <a>-abuse I linked isn't a good idea, we really need a serious parsing library.

(In reply to comment #8)

(In reply to comment #6)

On further investigation and following links from that gist, it looks like
the
regexes in mw.Uri aren't so half-baked.

They actually come from here:
http://blog.stevenlevithan.com/archives/parseuri

If they weren't half-baked, mw.Uri wouldn't crash on '@' in the URL.

These regexes are already awful (177 characters? Seriously?), and are bound
to
get worse as we discover more edge-cases.

What about password with a '@' in it? (Yeah, this is allowed as far as I
know,

No, those need to be url escaped afaik. At least in Chrome and in the nodejs modules I used it the password (especially the @ symbol) needs to be url escaped or it won't work.

While on second though the <a>-abuse I linked isn't a good idea, we really
need a serious parsing library.

I'm open to suggestions, though it'll need to be flexible and be replaceable in-place to allow smooth transition (same API and and unit tests, different internal implementation).

The module is generic enough to easily allow a different parser.

But there is some ambiguity on the fields to be extracted and their meaning[1], so even a "perfect" library may need modification to work for us and our interpretation of those terms.

[1] http://tantek.com/2011/238/b1/many-ways-slice-url-name-pieces

I wouldn't attempt to make a comprehensive URI parsing library in JavaScript. The first reason is because it's unnecessary. With the fix I just uploaded to Gerrit, there should not be any more bugs in this function. Secondly, the regular expression would be absurdly complicated. URIs have specific sets of characters that can be included only in certain sections. I wrote a regex for an official URI once when I was working on the PHP version of this class (still in Gerrit), and it was so confusing it took days to get it working properly.

(In reply to comment #10)

I wouldn't attempt to make a comprehensive URI parsing library in JavaScript.

The first reason is because it's unnecessary. With the fix I just uploaded to
Gerrit, there should not be any more bugs in this function.

I just pointed out a glaring one above.

(In reply to comment #8)

What about whitespace in the authority part? (This isn't allowed,
but those regexes match it just fine.)

I'm sure there are more, waiting to be uncovered.

Secondly, the
regular expression would be absurdly complicated. URIs have specific sets of
characters that can be included only in certain sections. I wrote a regex for
an official URI once when I was working on the PHP version of this class
(still
in Gerrit), and it was so confusing it took days to get it working properly.

Of course, I know this. That's why I'm suggesting using the work of someone who has already spent time pulling his hair out to get this right; or at least just taking those crazy regexes in from some other library.

Yes, but is it truly necessary to validate that much? I mean, for example, take the (invalid) URI "https://www.google.com/search?q=test test". The URI is invalid because it has a space character in the query, but if you were to do something like "location.href = <that uri>", it would work perfectly fine, because the browser automatically percent-encodes the space.

Basically, I'm wondering for what the mw.Uri class is being used for that requires such strict validation.

Change-Id: I4954bfd3750d5c990e91cc0dc8a175225ccbad1e

BTW, for whoever's curious, here is what an official URI regular expression would look like (it does't work, needs debugging; it's probably missing a parentheses or something):

/^[A-Za-z][A-Za-z0-9\+\-\.]*:(\/\/(([:A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2})@)?(\[(|v[A-Fa-f]+\.[:A-Z0-9-\._~!\$&'\(\)\*\+,;=]+)\]|([0-9]|[1-9][0-9]|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|2[0-4][0-9]|25[0-5])|([A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2}))(:[0-9]*)?(\/(([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2}))*)*|\/(([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2}))+(\/(([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2}))*)*|(([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2}))+(\/(([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2}))*)*|)\?((([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2})|[\/\?])*)?(#(([:@A-Z0-9-\._~!\$&'\(\)\*\+,;=]|%[A-Fa-f]{2})|[\/\?])*)?$/

Tyler, this looks way too short on the first glance and doesn't work. It also explicitly doesn't support protocol-relative URIs.

Here's one that works (this is used by Ruby's URI parsing library). Of course it's not hardcoded in the code like this, but constructed on the fly from regexes for specific URI parts. I could port this to JS in a reasonable way, if anybody except for me actually wants this.

/^(?:[a-zA-Z][\-+.a-zA-Z\d]*:(?:(?:\/\/(?:(?:(?:[\-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*@)?(?:(?:[a-zA-Z0-9\-.]|%\h\h)+|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\])(?::\d*)?|(?:[\-_.!~*'()a-zA-Z\d$,;:@&=+]|%[a-fA-F\d]{2})+)(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)?|\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)(?:\?(?:(?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))?|(?:[\-_.!~*'()a-zA-Z\d;?:@&=+$,]|%[a-fA-F\d]{2})(?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*)|(?:\/\/(?:(?:(?:[\-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*@)?(?:(?:[a-zA-Z0-9\-.]|%\h\h)+|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\])(?::\d*)?|(?:[\-_.!~*'()a-zA-Z\d$,;:@&=+]|%[a-fA-F\d]{2})+)(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)?|\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*|(?:[\-_.!~*'()a-zA-Z\d;@&=+$,]|%[a-fA-F\d]{2})+(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)?)(?:\?(?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*)?)?(?:#(?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*)?$/

I know mine doesn't work, because I put it together in five minutes, but it's definitely proper. Here's the code I used to construct it: http://pastebin.com/ELJ4LdKy