Page MenuHomePhabricator

When a string such as foobar//barfoo is found, //barfoo should not be linked to http://barfoo
Closed, ResolvedPublic

Description

Author: headbomb_v

Description:
Basically, when a string such as foobarbarfoo is found, barfoo is linked to http://barfoo

I can't think of any reason for this behaviour.


Version: 1.18.x
Severity: normal

Details

Reference
bz30269

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 11:56 PM
bzimport set Reference to bz30269.
bzimport added a subscriber: Unknown Object (MLST).

This is most likely caused by the recent changes to enable protocol relative links.

It thinks this is a protocol-relative link.

Caused by r91663. Related to bug 29497
I don't think it should be converted to a link unless the double slash appears after a [
The case where people use {{SERVER}}/foo in the source is a bit harder. I think that -for a protocol-relative wiki- it should give a protocol-relative link but still prefix it with a http: protocol

This is very strange, because //barfoo doesn't trigger it, and I would expect the opposite behavior.

I'll investigate this today or tomorrow.

"easy, parser" is almost an oxymoron :D . Removing "easy"

Created attachment 8913
Option 1: somehow remove '//' from the URL protocols list used by doMagicLinks()

Attached:

Created attachment 8914
Option 2: add a lookbehind for ^|\W\b

Attached:

I've added two patches that fix this bug in different ways, neither of which I'm particularly happy with.

Option 1 is to create another version of $this->mUrlProtocols that somehow excludes ''. The way I've done it in my patch is particularly hacky; it could be done a bit more cleanly in wfUrlProtocols() or something, but I didn't feel like making that effort today. I think this option, changed to exclude '' in a more elegant way, is probably best.

Option 2 adds a lookbehind for ^|\W\b to the regex, which insures that free URLs are only picked up if they are either at the very beginning of the page, or are preceded by a non-word character and a word boundary. In practice, this ensures the URL starts with a word character (because it's preceded by \W\b), which happens to exclude all protocols except '//'. Not only is this hackier than option 1 in principle, it also lets protocol-relative URLs at the very start of the page through (oops!).

Thoughts?

(In reply to comment #8)

Option 1 is to create another version of $this->mUrlProtocols that somehow
excludes ''. The way I've done it in my patch is particularly hacky; it could
be done a bit more cleanly in wfUrlProtocols() or something, but I didn't feel
like making that effort today. I think this option, changed to exclude '
' in
a more elegant way, is probably best.

Yeah, it kinda sucks that wfUrlProtocols returns a string rather than an array. Otherwise you could use unset or array_diff here pretty cleanly.

Option 2 adds a lookbehind for ^|\W\b to the regex, which insures that free
URLs are only picked up if they are either at the very beginning of the page,
or are preceded by a non-word character and a word boundary. In practice, this
ensures the URL starts with a word character (because it's preceded by \W\b),
which happens to exclude all protocols except '//'. Not only is this hackier
than option 1 in principle, it also lets protocol-relative URLs at the very
start of the page through (oops!).

To me, this option is a complete non-starter. It'll present all sorts of edge cases and weirdness. Imagine a template that contains only a protocol-relative URL being transcluded elsewhere. Will it be linked? Unlinked? Who knows.

(In reply to comment #9)

(In reply to comment #8)

Option 1 is to create another version of $this->mUrlProtocols that somehow
excludes ''. The way I've done it in my patch is particularly hacky; it could
be done a bit more cleanly in wfUrlProtocols() or something, but I didn't feel
like making that effort today. I think this option, changed to exclude '
' in
a more elegant way, is probably best.

Yeah, it kinda sucks that wfUrlProtocols returns a string rather than an array.
Otherwise you could use unset or array_diff here pretty cleanly.

You could use array_diff() inside wfUrlProtocols() to massage $wgUrlProtocols before implode()ing it. That requires a bit more code, though (wfUrlProtocols() caches its return value, so you need to account for that too) and would've modified a second file, so I took the quick&dirty route for this so I could get the point across more quickly.

Option 2 adds a lookbehind for ^|\W\b to the regex, which insures that free
URLs are only picked up if they are either at the very beginning of the page,
or are preceded by a non-word character and a word boundary. In practice, this
ensures the URL starts with a word character (because it's preceded by \W\b),
which happens to exclude all protocols except '//'. Not only is this hackier
than option 1 in principle, it also lets protocol-relative URLs at the very
start of the page through (oops!).

To me, this option is a complete non-starter. It'll present all sorts of edge
cases and weirdness. Imagine a template that contains only a protocol-relative
URL being transcluded elsewhere. Will it be linked? Unlinked? Who knows.

Yeah, I didn't really realize how bad this was until I wrote the Bugzilla comment, and I didn't think about the template case.

So… option 1 is going to be applied?

(In reply to comment #11)

So… option 1 is going to be applied?

Not in its current form (it needs to be made nicer, and I know how to do that; I was just lazy) and not before I get Tim to look at it.

Fixed in r94502 using a variant of option 1 after discussing with Tim on IRC. Will add a parser test shortly.