Parser doesn't support protocol relative external links in single-bracketed syntax
Closed, Resolved (Public)

Description

The parser doesn't support protocol-relative external links of the form [//example.com]. Filing a new bug because bug 20342 is too vague about the specific issue it tries to address.


Version: 1.20.x
Severity: normal

Details

Reference
bz29497

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:27 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz29497.
bzimport added a subscriber: Unknown Object (MLST).

Updated summary to clarify that this is about inline links in wiki text documents.

Bug 20342 is primarily about URLs generated by/for user interface components and whatnot, where currently we would tend to select either 'http' or 'https' forms but are looking for ways to avoid splitting caches so the same HTML output can be stored and used for both.

For wikitext-formatted messages, it _might_ be useful to be able to pass in such URLs directly into things using '[$1 blah]'.

For documents in general, that's a bit more fragile; when we use protocol-relative links we're saying "we know FOR SURE that both this site and the other site we're talking about are available on both http and https, and that the correct thing to do is to send you to the same protocol on the other site".

IMO that's a bit flaky -- folks are probably more likely to accidentally put in a link that doesn't actually work in one mode or the other without testing it correctly -- but it might be a necessary evil.
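
To make that assumption concrete, here is a minimal Python sketch of how a protocol-relative reference resolves against the page it appears on. It relies only on the standard library's `urllib.parse.urljoin`; the page URLs and the example.com target are made up for illustration, and none of this is MediaWiki code.

```python
from urllib.parse import urljoin

# Hypothetical page URLs: the same article viewed over HTTP and over HTTPS.
PAGE_HTTP = "http://en.wikipedia.org/wiki/Sandbox"
PAGE_HTTPS = "https://en.wikipedia.org/wiki/Sandbox"

# A protocol-relative link target, as it would appear in [//example.com/path label].
TARGET = "//example.com/path"


def resolve(page_url: str, href: str) -> str:
    """Resolve href relative to page_url, following standard URL resolution."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    # The scheme is inherited from the page; host and path come from the link.
    print(resolve(PAGE_HTTP, TARGET))   # http://example.com/path
    print(resolve(PAGE_HTTPS, TARGET))  # https://example.com/path
```

If example.com only answered on one of the two schemes, one of these resolved URLs would be a dead link, which is the fragility described above.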

Many bits of the software pass external links to interface messages, which are then parsed.

There's also stuff like {{fullurl:}} which returns absolute URLs including a protocol prefix, pointing to stuff that's at the local wiki. Now 1) it should be possible to configure this to spit out protocol-relative URLs (is it already?) and 2) the parser should be able to handle them when using the [] with fullurl combo as in:

[{{fullurl:{{FULLPAGENAME}}|action=edit}} Edit this page]

a.d.bergi wrote:

(In reply to comment #1)

For documents in general, that's a bit more fragile; when we use
protocol-relative links we're saying "we know FOR SURE that both this site and
the other site we're talking about are available on both http and https, and
that the correct thing to do is to send you to the same protocol on the other
site".

There are sites we know for sure: our own. In the situation as it is, you will have to use [//wikipedia.org //wikipedia.org] instead of just //wikipedia.org. Example: http://test2.wikipedia.org/wiki/Special:ExpandTemplates?input=%7B%7BSERVER%7D%7D%0A%0A%5B%7B%7BSERVER%7D%7D%5D%0A%0A%5B%7B%7BSERVER%7D%7D%20%7B%7BSERVER%7D%7D%5D%0A%0A%7B%7Bfullurl%3A%7B%7BPAGENAME%7D%7D%7D%7D%0A%0A%5B%7B%7Bfullurl%3A%7B%7BPAGENAME%7D%7D%7D%7D%5D%0A%0A%5B%7B%7Bfullurl%3A%7B%7BPAGENAME%7D%7D%7D%7D%20%7B%7Bfullurl%3A%7B%7BPAGENAME%7D%7D%7D%7D%5D%0A%0A%7B%7Bfullurl%3A%7B%7BPAGENAME%7D%7D%7Caction%3Dedit%7D%7D%0A%0A%5B%7B%7Bfullurl%3A%7B%7BPAGENAME%7D%7D%7Caction%3Dedit%7D%7D%5D%0A%0A%5B%7B%7Bfullurl%3A%7B%7BPAGENAME%7D%7D%7Caction%3Dedit%7D%7D%20%7B%7Bfullurl%3A%7B%7BPAGENAME%7D%7D%7Caction%3Dedit%7D%7D%5D
This is worse, and the fullurl: case in particular will break a lot. So I think that, at least for our own domain(s), we have to enable un-bracketed links.

IMO that's a bit flaky -- folks are probably more likely to accidentally put in
a link that doesn't actually work in one mode or the other without testing it
correctly -- but it might be a necessary evil.

And I'd say it is necessary. Allowing it only for specific domains (settable in config?) would make it more complex than it needs to be. wgUrlProtocols (the JS variable) would need to provide the domains for which protocol and which link syntax will work. Urghh. I think it is much cleaner to allow every site, even if accidents may happen.

(In reply to comment #5)

(In reply to comment #1)

For documents in general, that's a bit more fragile; when we use
protocol-relative links we're saying "we know FOR SURE that both this site and
the other site we're talking about are available on both http and https, and
that the correct thing to do is to send you to the same protocol on the other
site".

There are sites we know for sure: our own. In the situation as it is, you will
have to use [//wikipedia.org //wikipedia.org] instead of just //wikipedia.org.

[snip]

This is worse, and the fullurl: case in particular will break a lot. So I think
that, at least for our own domain(s), we have to enable un-bracketed links.

Yes, using {{fullurl:}} to produce a clean link doesn't work any more. This is known and deliberate.

IMO that's a bit flaky -- folks are probably more likely to accidentally put in
a link that doesn't actually work in one mode or the other without testing it
correctly -- but it might be a necessary evil.

And I'd say it is necessary. Allowing it only for specific domains (settable in
config?) would make it more complex than it needs to be. wgUrlProtocols (the JS
variable) would need to provide the domains for which protocol and which link
syntax will work. Urghh. I think it is much cleaner to allow every site, even
if accidents may happen.

We do allow every site, where are you getting this idea that we're not?

a.d.bergi wrote:

(In reply to comment #6)

(In reply to comment #5)

(In reply to comment #1)
In the situation as it is, you will
have to use [//wikipedia.org //wikipedia.org] instead of just //wikipedia.org.
This is worse, and the fullurl: case in particular will break a lot. So I think
that, at least for our own domain(s), we have to enable un-bracketed links.

Yes, using {{fullurl:}} to produce a clean link doesn't work any more. This is
known and deliberate.

Deliberate? I don't think this is good practice. Apart from breaking existing links, it will make linking less user-friendly. Who would use [{{fullurl:xyz|abc}} http(s)://xyz?abc]? As a fullurl-only link doesn't work any more, users will copy-paste a protocol-absolute link that is parsed correctly (or rather, as intended). Is that user-friendly?

Allowing it only for specific domains (settable in
config?) would make it more complex than it needs to be. wgUrlProtocols (the JS
variable) would need to provide the domains for which protocol and which link
syntax will work. Urghh. I think it is much cleaner to allow every site, even
if accidents may happen.

We do allow every site, where are you getting this idea that we're not?

Yes, but not both link formats. I said that enabling the bracketless format for just a configurable set of sites (the proposed known-for-sure domains) wouldn't be better. Why don't we just allow everything?

(In reply to comment #7)

(In reply to comment #6)

(In reply to comment #5)

(In reply to comment #1)
In the situation as it is, you will
have to use [//wikipedia.org //wikipedia.org] instead of just //wikipedia.org.
This is worse, and the fullurl: case in particular will break a lot. So I think
that, at least for our own domain(s), we have to enable un-bracketed links.

Yes, using {{fullurl:}} to produce a clean link doesn't work any more. This is
known and deliberate.

Deliberate? I don't think this is good practice. Apart from breaking existing
links, it will make linking less user-friendly. Who would use
[{{fullurl:xyz|abc}} http(s)://xyz?abc]? As a fullurl-only link doesn't work
any more, users will copy-paste a protocol-absolute link that is parsed
correctly (or rather, as intended). Is that user-friendly?

Yeah, you're right that it's confusing. I figured that it would also be confusing if //anyWordThatStartsWithTwoSlashes would be linkified automatically, so I disabled that behavior deliberately. But I didn't consider the fullurl use case you brought up.

Yes, but not both link formats. I said that enabling the bracketless format for
just a configurable set of sites (the proposed known-for-sure domains)
wouldn't be better. Why don't we just allow everything?

I think you may be misinterpreting Brion's words, and I think those words themselves were unclear to begin with. I never suggested limiting protocol-relative URLs to select domains, although it may be a way out of the we-don't-want-every-word-beginning-with-slash-slash-to-be-linked problem.

An instance where you're generating a complete URL to embed as (potentially clickable) full text in email or a web page probably should be in canonical form.

(In reply to comment #9)

An instance where you're generating a complete URL to embed as (potentially
clickable) full text in email or a web page probably should be in canonical
form.

Yes. To expand on that: we now have {{canonicalurl:}} that always outputs a fully-qualified HTTP URL, even when saved or viewed using HTTPS. Earlier this week, Sam and I went through [[MediaWiki:Enotif body]] on all wikis that had overridden it and changed all instances of {{fullurl:}} and {{SERVER}}{{localurl:foo}} to {{canonicalurl:}} because e-mail clients also don't automatically link protocol-relative URLs in text.

*** Bug 31284 has been marked as a duplicate of this bug. ***

Is there the ability to update {{canonicalurl}} to generate an absolute url that is protocol relative to the login type? Alternatively if that is problematic, can there be (yet) another parser function that undertakes the task to generate the url relative to the user? Having to do those sorts of hacks ''ad infinitum'' is surely just courting disaster against a simple ability especially as that {{canonicalurl}} will get used by the lazy, or someone will write templates to get around the issue. Thanks.

(In reply to comment #12)

Is there the ability to update {{canonicalurl}} to generate an absolute url
that is protocol relative to the login type? Alternatively if that is
problematic, can there be (yet) another parser function that undertakes the
task to generate the url relative to the user? Having to do those sorts of
hacks ''ad infinitum'' is surely just courting disaster against a simple
ability especially as that {{canonicalurl}} will get used by the lazy, or
someone will write templates to get around the issue. Thanks.

What do you mean, exactly? If you want a protocol-relative URL, use {{fullurl:}}. This seems to be what you mean with "protocol relative to the login type". There isn't a parser function that outputs http:// URLs for people viewing over http and https:// URLs for people viewing over HTTPS because that would mean we'd have to split the parser cache, but this is exactly what protocol-relative URLs are for.

What do you mean with a "URL relative to the user"? Does that mean generating fully-qualified URLs, e.g. for e-mails that use https if the user logs in over https and http otherwise? Would this be based on the not-yet-existing "Always use HTTPS when I'm logged in" preference?
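
For anyone wondering why a per-protocol parser function would split the cache, here is a rough Python sketch. It is a toy model only: `render`, `cache_entries`, and the host name are hypothetical and have nothing to do with the real parser cache implementation. It just counts how many cached copies are needed when the rendered HTML does or doesn't depend on the viewer's protocol.

```python
from typing import Dict, Iterable, Tuple


def render(page: str, scheme: str = "") -> str:
    """Pretend-render a page that contains one link to itself.

    With an empty scheme the link is protocol-relative; otherwise it is absolute.
    """
    prefix = f"{scheme}:" if scheme else ""
    return f'<a href="{prefix}//en.wikisource.org/wiki/{page}">{page}</a>'


def cache_entries(
    pages: Iterable[str], schemes: Iterable[str], protocol_relative: bool
) -> Dict[Tuple[str, str], str]:
    """Build a toy cache for every (page, viewer scheme) combination."""
    schemes = list(schemes)
    cache: Dict[Tuple[str, str], str] = {}
    for page in pages:
        for scheme in schemes:
            if protocol_relative:
                # Output is identical for all viewers, so one key is enough.
                cache[(page, "")] = render(page)
            else:
                # Output differs per scheme, so the key has to include it.
                cache[(page, scheme)] = render(page, scheme)
    return cache


if __name__ == "__main__":
    pages = ["Sandbox", "Main_Page"]
    schemes = ["http", "https"]
    print(len(cache_entries(pages, schemes, protocol_relative=True)))   # 2
    print(len(cache_entries(pages, schemes, protocol_relative=False)))  # 4
```

With protocol-relative output the number of entries stays at one per page; with scheme-specific output it doubles.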

(In reply to comment #13)

(In reply to comment #12)

Is there the ability to update {{canonicalurl}} to generate an absolute url
that is protocol relative to the login type? Alternatively if that is
problematic, can there be (yet) another parser function that undertakes the
task to generate the url relative to the user? Having to do those sorts of
hacks ''ad infinitum'' is surely just courting disaster against a simple
ability especially as that {{canonicalurl}} will get used by the lazy, or
someone will write templates to get around the issue. Thanks.

What do you mean, exactly? If you want a protocol-relative URL, use
{{fullurl:}}. This seems to be what you mean with "protocol relative to the
login type". There isn't a parser function that outputs http:// URLs for people
viewing over http and https:// URLs for people viewing over HTTPS because that
would mean we'd have to split the parser cache, but this is exactly what
protocol-relative URLs are for.

What do you mean with a "URL relative to the user"? Does that mean generating
fully-qualified URLs, e.g. for e-mails that use https if the user logs in over
https and http otherwise? Would this be based on the not-yet-existing "Always
use HTTPS when I'm logged in" preference?

I meant what you explained in the first paragraph.

Sometimes it is easiest and more appropriate to display a url. To display a full protocol relative url is problematic, so for
https://en.wikisource.org/w/index.php?title=Wikisource:Sandbox&oldid=3501464

I cannot code

  • {{fullurl:Wikisource:Sandbox|oldid=3501464}} as it doesn't give a protocol
  • {{canonicalurl:Wikisource:Sandbox|oldid=3501464}} as it takes me out of the secure protocol

I have to code it as

  • [{{fullurl:Wikisource:Sandbox|oldid=3501464}} {{fullurl:Wikisource:Sandbox|oldid=3501464}}] ;
  • [//en.wikisource.org/w/index.php?title=Wikisource:Sandbox&oldid=3501464 //en.wikisource.org/w/index.php?title=Wikisource:Sandbox&oldid=3501464]

It would be fantastic if I could code either

  • {{canonicalurl:Wikisource:Sandbox|oldid=3501464}}; or
  • {{protocolrelativeurl:Wikisource:Sandbox|oldid=3501464}}

and they could exhibit the protocol relative urls
http://en.wikisource.org/w/index.php?title=Wikisource:Sandbox&oldid=3501464
https://en.wikisource.org/w/index.php?title=Wikisource:Sandbox&oldid=3501464
depending on how I login.

If it relates to the second paragraph, then maybe I don't comprehend what happens with http:// after the change. I just see that I keep getting forced into http:// in so many places and don't see an easy (lazy) solution.

(In reply to comment #14)

I have to code it as

  • [{{fullurl:Wikisource:Sandbox|oldid=3501464}} {{fullurl:Wikisource:Sandbox|oldid=3501464}}] ;
  • [//en.wikisource.org/w/index.php?title=Wikisource:Sandbox&oldid=3501464 //en.wikisource.org/w/index.php?title=Wikisource:Sandbox&oldid=3501464]

Yes, unfortunately you do have to. Protocol-relative URLs aren't linked magically because I thought that would be too error-prone, see comment 8.

It would be fantastic if I could code either

  • {{canonicalurl:Wikisource:Sandbox|oldid=3501464}}; or
  • {{protocolrelativeurl:Wikisource:Sandbox|oldid=3501464}}

and they could exhibit the protocol relative urls
http://en.wikisource.org/w/index.php?title=Wikisource:Sandbox&oldid=3501464
https://en.wikisource.org/w/index.php?title=Wikisource:Sandbox&oldid=3501464
depending on how I login.

That would be fantastic, yes. It would also be fantastically cache-breaking :(

If it relates to the second paragraph, then maybe I don't comprehend what
happens with http:// after the change. I just see that I keep getting forced
into http:// in so many places and don't see an easy (lazy) solution.

canonicalurl will continue to output http:// URLs, unless and until HTTPS actually becomes the preferred protocol (but my understanding is we don't have the infrastructure for HTTPS-by-default right now). At some point soonish, we will want to introduce a preference with which users can indicate that they want to log in over HTTPS only. Enabling this preference will have the following consequences:

  • Login attempts over HTTP for that username will be refused and the user will be redirected to the HTTPS login form
  • An insecure (i.e. shared between HTTPS and HTTP) cookie will be set upon login that indicates that the user prefers HTTPS. This cookie should persist for a long time, even after logout or session expiry. If a user makes an HTTP request and this cookie is present, they will immediately be redirected to HTTPS, even if they're not logged in any more
  • Of course login cookies created through an HTTPS login would always be so-called secure cookies (meaning HTTP doesn't have access to them) regardless of this preference

The one that's of particular interest for this bug is:

  • URLs in e-mails sent to users with the HTTPS preference enabled would point to https:// instead of http://

*** Bug 40369 has been marked as a duplicate of this bug. ***
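
The proposed HTTPS preference described above could be sketched roughly as follows. This is only an illustration in Python of the redirect and e-mail rules as stated in the comment; the cookie name `preferHttps` and both function names are invented here, and the feature did not exist at the time of writing.

```python
from typing import Mapping, Optional

# Hypothetical cookie name; the comment above does not name the actual cookie.
PREFER_HTTPS_COOKIE = "preferHttps"


def redirect_target(
    scheme: str, host: str, path: str, cookies: Mapping[str, str]
) -> Optional[str]:
    """Return the HTTPS URL to redirect to, or None if no redirect is needed.

    The cookie is deliberately not a secure cookie, so it is visible on plain
    HTTP requests and can trigger the redirect even after logout.
    """
    if scheme == "http" and cookies.get(PREFER_HTTPS_COOKIE) == "1":
        return f"https://{host}{path}"
    return None


def email_url(host: str, path: str, prefers_https: bool) -> str:
    """Build the fully-qualified URL to put in a notification e-mail."""
    scheme = "https" if prefers_https else "http"
    return f"{scheme}://{host}{path}"


if __name__ == "__main__":
    print(redirect_target("http", "en.wikisource.org", "/wiki/Sandbox",
                          {PREFER_HTTPS_COOKIE: "1"}))
    print(email_url("en.wikisource.org", "/wiki/Sandbox", prefers_https=True))
```

Because e-mail URLs are built per recipient rather than served from cached page HTML, choosing the scheme there does not run into the cache-splitting problem.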

[//example.com] works now. If something else needs doing, that is a topic for another bug.

The "free link syntax" mentioned in the bug title does not work: "//example.com" generates plain text. See comment 5.

(In reply to comment #18)

The "free link syntax" mentioned in the bug title does not work: "//example.com"
generates plain text. See comment 5.

Yes, but:

(In reply to comment #6)

Yes, using {{fullurl:}} to produce a clean link doesn't work any more. This is
known and deliberate.

A plain //foo anywhere will break pages. I'm sure there are countless mentions of it out there where it is intended not as a link but simply as two slashes (e.g. on talk pages about a programming-related subject).

Aside from breaking things, it is also arguably bad from a user-facing perspective. I'm not sure the average user is ready to look at //example.org and know that it is a link.
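
To illustrate the distinction this bug ended up with (bracketed protocol-relative links are linked, bare ones are not), here is a small Python sketch. The regular expression is an assumption written for this example and is far simpler than what the real parser does; it also skips HTML escaping.

```python
import re

# Assumed pattern: a single-bracketed external link whose target starts with
# http://, https:// or a bare // (protocol-relative), plus an optional label.
BRACKETED = re.compile(r"\[((?:https?:)?//[^\s\]]+)(?:\s+([^\]]+))?\]")


def linkify_bracketed(text: str) -> str:
    """Turn [target label] into an <a> tag; leave everything else untouched."""

    def repl(match):
        target = match.group(1)
        # Fall back to the target itself when no label is given (a
        # simplification chosen for this sketch).
        label = match.group(2) or target
        return f'<a href="{target}">{label}</a>'

    return BRACKETED.sub(repl, text)


if __name__ == "__main__":
    # Bracketed protocol-relative link: becomes a link.
    print(linkify_bracketed("See [//example.com Example]."))
    # Bare double slash, e.g. a code comment on a talk page: stays plain text.
    print(linkify_bracketed("int x = 1; // set x to one"))
```

Requiring the brackets means text such as a `//` code comment is never mistaken for a link, at the cost of the more verbose [//host //host] form when the URL itself should be displayed.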