Page MenuHomePhabricator

MediaWiki cannot recognize urls pointing to ipv6 hosts
Closed, ResolvedPublic

Details

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 10:53 PM
bzimport set Reference to bz21261.
bzimport added a subscriber: Unknown Object (MLST).

jon wrote:

As described in RFC 2732: http://www.ietf.org/rfc/rfc2732.txt

IPv6 addresses in URLs are surrounded in [ ] characters. For instance, in the wiki:

http://[2404:130:0:1000::187:2]/index.php

should produce a hyperlink, but currently does not. Surrounding it in brackets doesn't work either, e.g.:

Click [http://[2404:130:0:1000::187:2]/index.php here]

This should be a simple patch to the parser somewhere - I'll have a go some time in the next day or two and attach a patch.

Cheers, J

jon wrote:

This does not fix it, as it leaves the path off the IPv6 link. Needs a regex genius.

Okay so this bug boils down to tweaking the mExtLinkBracketedRegex in includes/parser/Parser.php

I'm not sure how to alter this to include either passive groups for the IPv6 address brackets, or use a pipe to supply an almost identical alternative.

I'm continuing to fiddle, but there might be a regex genius out there who's seen this before.

Cheers, J

attachment not-a-fix-but-almost.patch ignored as obsolete

peter017 wrote:

This fixes the problem but might not be safe

I have a very simple solution which consists in removing the brackets ('[' and ']') from the list of forbidden entities in a URL.

This modifies 3 regular expressions: 2 in Parser.php and 1 in Sanitizer.php

However, it should be checked by someone competent because I don't know the possible implications in terms of security (there might be a good reason why the brackets were forbidden).

I also note that currently, we do not check whether a URL is valid before adding a link to it (eg. http://thisisan\invalidtest is transformed to a link). But I found no explanation on mediawiki.org about what should be a valid (or invalid) URL.

Attached:

It is a bit of a kludge but the colons and brackets used for IPv6 addresses can be replaced with their full-width forms. The following code assumes that $hostName
has been set.

$host = ( $hostName == '['.substr( $hostName, 1, strlen( $hostName ) - 2 ).']' ) : substr( $hostName, 1, strlen( $hostName ) - 2 ) ? $hostName;
if ( filter_var( $host, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6 ) ) {
str_replace( '[', '[', $hostName );
str_replace( ':', ':', $hostName );
str_replace( ']', ']', $hostName );
}

I the meantime this wiki template code may be useful.

[{{{1|}}}:{{{2|}}}:{{{3|1}}}{{#if:

{{{8|}}}|:{{{4|}}}:{{{5|}}}:{{{6|}}}:{{{7|}}}:{{{8}}}|{{#if:
{{{7|}}}|:{{{4|}}}:{{{5|}}}:{{{6|}}}:{{{7}}}|{{#if:
{{{6|}}}|:{{{4|}}}:{{{5|}}}:{{{6}}}|{{#if:
{{{5|}}}|:{{{4|}}}:{{{5}}}|{{#if:
{{{4|}}}|:{{{4}}} }} }} }} }} }}]

Created attachment 9543
PHP script to demonstrate how an IPv6 hostname can be modified.

The PHP code I posted in my previous comment was untested and defective. The attachment should show how an IPv6 address enclosed in brackets can be modified before being parsed for linking. The reverse can be done after the HTML markup has been generated. I hope someone can come up with a better way to do this as this seems to be a bit of a kludge.

Attached:

That doesn't help, since you would still need to detect IPv6 addresses written with normal brackets.

The patch by Peter looks the way to go, but I'm afraid adding brackets to EXT_LINK_URL_CLASS would allow too much recursion. And that wouldn't stop at the ] of a url.
The change to the sanitizer is needed to avoid converting http://[2404:130:0:1000::187:2]/index.php to http://%5B2404:130:0:1000::187:2%5D/index.php but although that doesn't break any parser test, I suspect there was a reason they were added as control characters in r14930, but I don't know which one (avoid treating the content like? that would no longer hold...).

Note that the change to mExtLinkBracketedRegex won't apply, but it's now constructed from EXT_LINK_URL_CLASS so no problem there.

I added a parser test in r104314.

(In reply to comment #7)

That doesn't help, since you would still need to detect IPv6 addresses written
with normal brackets.

I am not sure what you mean. Are you referring to my php script? (Corrected lines 11-16 below.)

$host = ( $hostName == '['.substr( $hostName, 1, strlen( $hostName ) - 2 ).']' ) ? substr( $hostName, 1, strlen( $hostName ) - 2 ) : $hostName;
if ( filter_var( $host, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6 ) ) {
$hostName = str_replace( '[', '[', $hostName );
$hostName = str_replace( ':', ':', $hostName );
$hostName = str_replace( ']', ']', $hostName );
}

Or are you referring to the template code. (See http://www.mediawiki.org/wiki/User:Allen4names)

(In reply to comment #8)

I added a parser test in r104314.

You may want to check line 1211 of revision 104314 for missing brackets.

(In reply to comment #9)

I am not sure what you mean. Are you referring to my php script? (Corrected
lines 11-16 below.)

Wikitext is written in text. Replacing [ and ] with something else which doesn't conflict could work but... How do you choose which [ ] to replace?

(In reply to comment #8)

I added a parser test in r104314.

You may want to check line 1211 of revision 104314 for missing brackets.

Thanks, they were missing in the <a> contents. Fixed in r105356.

(In reply to comment #10)

(In reply to comment #9)
Wikitext is written in text. Replacing [ and ] with something else which
doesn't conflict could work but... How do you choose which [ ] to replace?

Let me start by rephrasing your question. How do you find the host name in a URL? In the examples below I have omitted the port number, path name, and query string.

ftp://$hostname => ftp://[::1]
http://$hostname => http://[::1]
https://$hostname => https://[::1]
mailto:$username@$hostname => mailto:$username@[::1]

In each case the IPv6 address (enclosed in brackets) comes after two slashes "//" or an at sign "@". The filter_var() function is used in my script to test for an IPv6 address. No other brackets or colons should be replaced with their full width equivalents and these full width characters should be replaced by their ascii equivalents after the HTML links have been created.

As I said before this is something of a kludge and I would prefer some better way of fixing this.

So you are proposing to replace [] when following // or @.

We could add something like (\[(?:[0-9a-fA-F])|:){2,10}\])* to mExtLinkBracketedRegex after wfUrlProtocols()
(that would still fail for usernames and passwords in the protocol, and matches too much, but it's a start).

(In reply to comment #12)

So you are proposing to replace [] when following // or @.

It may not be enough to replace the brackets surrounding the IPv6 address because a colon is used to separate the host name from the port number so the colons in the IPv6 addresses (in host names) should be replaced as well. The reason I want the full width brackets and colons replaced by their ascii equivalents after the wiki text is converted to HTML is that Internet Explorer will not follow the resulting links if this is not done.

sumanah wrote:

Marking patch reviewed because the author will need to revise patch and resubmit to take care of suggestions from the last several comments. Thanks for the patch!

A related side effect of this bug is that setting $wgServer in LocalSettings.php to an RFC2732 address breaks some auto-generated urls.

The unknown page message has broken links for search and edit:

"There is currently no text in this page. You can search for this page title in other pages, [http://[::1]/wiki/index.php?title=Special:Log&page=DOESNOTEXIST search the related logs], or [http://[::1]/wiki/index.php?title=DOESNOTEXIST&action=edit edit this page]."

(In reply to comment #15)

A related side effect of this bug is that setting $wgServer in
LocalSettings.php to an RFC2732 address breaks some auto-generated urls.

The unknown page message has broken links for search and edit:

"There is currently no text in this page. You can search for this page title in
other pages, [http://[::1]/wiki/index.php?title=Special:Log&page=DOESNOTEXIST
search the related logs], or
[http://[::1]/wiki/index.php?title=DOESNOTEXIST&action=edit edit this page]."

So it is more important than a missing parser feature?

This will have to be fixed in parsoid as well as PHP; they share the same URL character classes.

It would probably be better to special case a match at the front of the string against http://vernon.mauery.com/content/projects/linux/ipv6_regex (or something similar, that regexp looks way too complicated) rather than just add '[]' to the permitted character class, since brackets are rather special during parsing.

Change 179231 had a related patch set uploaded (by Cscott):
WIP: support IPv6 addresses in URLs.

https://gerrit.wikimedia.org/r/179231

Patch-For-Review

Core patch is in: https://gerrit.wikimedia.org/r/183705 (gerritbot is lazy about adding links).

Patch is merged in core, whoo. Not long before this bug turned 6 years old.
Now I just need to fix it in Parsoid as well.

Change 179231 merged by jenkins-bot:
Support IPv6 addresses in URLs

https://gerrit.wikimedia.org/r/179231

Krenair claimed this task.
Krenair removed Krenair as the assignee of this task.
Krenair set Security to None.

Maybe cherry pick to REL1_25? (There is at least one conflict, if gerrit doesn't lies :P)

The bug has been open since Oct 2009, surely after 6 years there's no rush that necessitates bypassing the normal train?