
Protocol-relative URLs are poorly supported or unsupported by a number of HTTP clients
Open, Low, Public

Description

Protocol-relative URLs might be various kinds of awesome, and they work with every major browser. The problem with them is, they break many things that are not browsers. It's easy to write an app that fetches some HTML with HTTP, but less easy to correctly interpret that HTML. Using obscure features like protocol-relative URLs causes the less carefully-written HTTP clients to break.
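To make the failure mode concrete, here is a minimal Python sketch (the example URL is made up) contrasting correct RFC 3986 resolution with the naive string handling many of these scripts appear to do:

from urllib.parse import urljoin

base = "http://en.wikipedia.org/wiki/Foo"
href = "//upload.wikimedia.org/wikipedia/commons/thumb/Example.jpg"

# Correct resolution: a scheme-relative reference inherits the scheme of the base URL.
print(urljoin(base, href))
# http://upload.wikimedia.org/wikipedia/commons/thumb/Example.jpg

# Naive resolution seen in many scripts: anything starting with "/" is glued onto the host.
print("http://en.wikipedia.org" + href)
# http://en.wikipedia.org//upload.wikimedia.org/wikipedia/commons/thumb/Example.jpg (broken)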

Access logs demonstrate that there are many broken clients:
http://paste.tstarling.com/p/HonYcW.html

Note that the browser-like UA strings might not be fake -- Flash and Java apps running under the browser send the UA string of their host. Of the UA strings without "Mozilla" in them, we have:

  • Three versions of perl, two Java libraries
  • Microsoft BITS (a background download helper, probably used by an offline reader app)
  • KongshareVpn (website defunct)
  • Instapaper (an offline downloader/reader for Android)
  • Googlebot-Image (IP confirmed to be Google)
  • A couple of phone models (Dorado, Symbian)

The length of this list is limited by the sample size, not by the actual number of broken scripts. It's a long, long tail.

This is just a gripe bug; I don't have any concrete plan for replacing protocol-relative URLs in our infrastructure. Krinkle asked me about it at I4155c740, so I thought I may as well document the problem.


Version: 1.22.0
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=44647 T46647
https://bugzilla.wikimedia.org/show_bug.cgi?id=20342 T22342

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 2:02 AM
bzimport set Reference to bz52253.
bzimport added a subscriber: Unknown Object (MLST).

I'm tempted to say this is an issue with said badly written clients, not MediaWiki, and therefore this is invalid...

(In reply to comment #1)

I'm tempted to say this is an issue with said badly written clients, not
MediaWiki, and therefore this is invalid...

It's not about technical correctness, it's about courtesy. Yes, there are hundreds of clients that are broken in this way, and it is the fault of the hundreds of developers who individually wrote those clients, but the easiest place to fix the problem is in MediaWiki and the WMF frontend.

In any case, client compatibility is an essential part of developing web server software. If the site was completely broken in IE or Firefox, you wouldn't say the bug was invalid. The only difference is scale -- whether we should care about the long tail of badly-written HTML parsers. I am saying that we should care.

(In reply to comment #0)

Protocol-relative URLs might be various kinds of awesome, and they work with
every major browser.

This bug somewhat annoyingly fails to mention why protocol-relative URLs are being used. Bug 20342 comment 0 gives a decent overview of why protocol-relative URLs were implemented on Wikimedia wikis.

In reality, most (see rough stats below) URLs in articles are simply relative ("/wiki/Foo" or "/w/index.php?title=Foo"), not protocol-relative ("//wikiproject.org/w/index.php?title=Foo"). Resources (bits.wikimedia.org and upload.wikimedia.org) rely on protocol-relativity more than anything else (sending mixed content causes ugly warnings in browsers).

Switching bits and upload to always be HTTPS would resolve a good portion of the issue being discussed here. A few minor edits to interface messages would also go a long way in resolving most of this bug.

(In reply to comment #1)

I'm tempted to say this is an issue with said badly written clients, not
MediaWiki, and therefore this is invalid...

We sometimes accommodate stupid clients (as well as stupid users, for that matter). I was somewhat inclined to say that this bug is a duplicate of bug 20342 ("Support for protocol-relative URLs (tracking)") or bug 47832 ("Force all Wikimedia cluster traffic to be over SSL for all users (logged-in and anon)"), but assuming the numbers I ran below are correct, this bug could simply be a tracking bug with a few (relatively easy) dependencies.


Recently featured articles on the English Wikipedia [[Main Page]]:

  • [[Barber coinage]]
  • [[Harold Davidson]]
  • [[War of the Bavarian Succession]]

For [[Barber coinage]]:

$ curl -s --compressed "https://en.wikipedia.org/wiki/Barber_coinage" | grep -o '="//' | wc -l

76

$ curl -s --compressed "https://en.wikipedia.org/wiki/Barber_coinage" | egrep -o '="/[^/]' | wc -l

330

For [[Harold Davidson]]:

$ curl -s --compressed "https://en.wikipedia.org/wiki/Harold_Davidson" | grep -o '="//' | wc -l

46

$ curl -s --compressed "https://en.wikipedia.org/wiki/Harold_Davidson" | egrep -o '="/[^/]' | wc -l

233

For [[War of the Bavarian Succession]]:

$ curl -s --compressed "https://en.wikipedia.org/wiki/War_of_the_Bavarian_Succession" | grep -o '="//' | wc -l

91

$ curl -s --compressed "https://en.wikipedia.org/wiki/War_of_the_Bavarian_Succession" | egrep -o '="/[^/]' | wc -l

362

Summary:

  • [[Barber coinage]]: 81.28% of links are simply relative; of the protocol-relative links (76), 71.05% are from bits (18) or upload (36)
  • [[Harold Davidson]]: 83.51% of links are simply relative; of the protocol-relative links (46), 63.04% are from bits (15) or upload (14)
  • [[War of the Bavarian Succession]]: 79.91% of links are simply relative; of the protocol-relative links (91), 56.04% are from bits (15) or upload (36)
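For reference, the percentages follow directly from the counts above; a quick Python sketch of the arithmetic (the per-page bits/upload figures are taken from a separate breakdown of the matches, not from the wc output shown):

# Counts per page: (simply relative, protocol-relative, bits, upload)
pages = {
    "Barber coinage": (330, 76, 18, 36),
    "Harold Davidson": (233, 46, 15, 14),
    "War of the Bavarian Succession": (362, 91, 15, 36),
}

for title, (rel, proto, bits, upload) in pages.items():
    print(title,
          "%.2f%% simply relative," % (100.0 * rel / (rel + proto)),
          "%.2f%% of protocol-relative links from bits/upload" % (100.0 * (bits + upload) / proto))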

Of the protocol-relative links, there is a lot of very low-hanging fruit that could easily be made to always use HTTPS, and these links are what is deflating the numbers directly above. For example:

  • <a href="//wikimediafoundation.org/wiki/Privacy_policy" title="wikimedia:Privacy policy">Privacy policy</a> [twice]
  • <a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us">Contact Wikipedia</a>
  • <a href="wikimediafoundation.org/"><img src="bits.wikimedia.org/images/wikimedia-button.png" width="88" height="31" alt="Wikimedia Foundation"/></a>
  • <a href="www.mediawiki.org/"><img src="bits.wikimedia.org/static-1.22wmf11/skins/common/images/poweredby_mediawiki_88x31.png" alt="Powered by MediaWiki" width="88" height="31" /></a>
  • <a rel="license" href="//en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License">Creative Commons Attribution-ShareAlike License</a>
  • <a rel="license" href="//creativecommons.org/licenses/by-sa/3.0/" style="display:none;"></a>
  • <a href="//donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&amp;utm_medium=sidebar&amp;utm_campaign=C13_en.wikipedia.org&amp;uselang=en" title="Support us">Donate to Wikipedia</a>
  • <link rel="copyright" href="//creativecommons.org/licenses/by-sa/3.0/" />

These amount to 9 links. Re-running the numbers, we find:

  • [[Barber coinage]]: protocol-relative links: 76; bits.wikimedia.org: 18; upload.wikimedia.org: 36; interface: 9; easily eliminated: 82.89%
  • [[Harold Davidson]]: protocol-relative links: 46; bits.wikimedia.org: 15; upload.wikimedia.org: 14; interface: 9; easily eliminated: 82.61%
  • [[War of the Bavarian Succession]]: protocol-relative links: 91; bits.wikimedia.org: 15; upload.wikimedia.org: 36; interface: 9; easily eliminated: 65.93%

Basically, as I see it, if instead of griping, you simply provisioned a few more resource servers (upload and bits) and submitted a few edits to MediaWiki messages or changesets to Gerrit, you could resolve somewhere between two-thirds and four-fifths of the problem you're describing in comment 0 without breaking a sweat.

That number gets even higher (probably somewhere around 90–95%) with a single additional edit to [[m:Interwiki map]] and an adjustment to the interlanguage links output, which make up most of the remaining protocol-relative URLs.

For what it's worth.

(In reply to comment #3)

Switching bits and upload to always be HTTPS would resolve a good portion of
the issue being discussed here. A few minor edits to interface messages would
also go a long way in resolving most of this bug.

My estimate on bug 51002 was that sending all traffic through HTTPS would require the HTTPS cluster to be expanded by a factor of 10. Since the relevant metric is connection rate, not object rate, bits and upload would probably be most of that, because browsers open multiple concurrent connections to those servers during a normal request. So you'd be looking at maybe 80 additional servers. Maybe it would be worthwhile, but it wouldn't be cheap, either in terms of capital cost or staff time.

Writing an nginx module to rewrite the URLs would probably be simpler than setting up a cluster of 80 servers.
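For illustration only, a rough Python sketch of the substitution such a rewriter would perform (an actual implementation would be an nginx output filter, not this code):

import re

ATTR_RE = re.compile(r'((?:href|src)=")//')

def absolutize(html, scheme):
    # Turn href="//host/..." and src="//host/..." into scheme-absolute URLs.
    return ATTR_RE.sub(lambda m: m.group(1) + scheme + "://", html)

print(absolutize('<a href="//wikimediafoundation.org/wiki/Privacy_policy">Privacy policy</a>', "https"))
# <a href="https://wikimediafoundation.org/wiki/Privacy_policy">Privacy policy</a>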

Maybe worth mentioning: due to protocol-relative URLs, our feeds do not validate ("id must be a full and valid URL"), e.g. http://validator.w3.org/feed/check.cgi?url=https%3A%2F%2Fwww.mediawiki.org%2Fw%2Findex.php%3Ftitle%3DSpecial%3ARecentChanges%26feed%3Datom%26limit%3D50

This seems to have caused problems with some version of Thunderbird (bug 44647) and at some point I was unable to use our feeds in Facebook (or maybe FriendFeed; I've not tested recently).

(In reply to comment #4)

My estimate on bug 51002 was that sending all traffic through HTTPS would
require the HTTPS cluster to be expanded by a factor of 10. Since the
relevant metric is connection rate, not object rate, bits and upload would
probably be most of that, since browsers open multiple concurrent connections
to those servers during a normal request. So you'd be looking at maybe 80
additional servers.

This may be a stupid question, but I got asked today and I didn't know the answer: if Wikimedia currently has a fairly large number of Web servers providing HTTP access, couldn't most of those servers be re-provisioned to serve HTTPS instead? I'm not sure why you would need 80 additional servers (not that the Wikimedia Foundation couldn't easily afford them, in any case).

(In reply to comment #6)

This may be a stupid question, but I got asked today and I didn't know the
answer: if Wikimedia currently has a fairly large number of Web servers
providing HTTP access, couldn't most of those servers be re-provisioned to
serve HTTPS instead? I'm not sure why you would need 80 additional servers
(not that the Wikimedia Foundation couldn't easily afford them, in any case).

The reason you need more servers is that serving HTTPS is more expensive than serving HTTP, because of the encryption overhead. You could have the same servers doing both HTTP and HTTPS, and I believe that is indeed the plan, but you would still need more servers because the CPU cost of serving an HTTPS connection is higher, so you need more cores to handle the same connection rate.

Note that the 80 figure is just the current host count (9) multiplied by 9 and rounded off. Ryan pointed out to me that 4 of those 9 servers are old and have only 8 cores, whereas the newer servers have 24 cores. Since CPU is the limiting factor, it's really the core count that should be multiplied by 9. We have 152 cores currently doing HTTPS, so we would need an extra 1368, implying we need 57 additional 24-core servers.
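Spelling out that arithmetic (assuming the other 5 of the 9 hosts are the 24-core machines, which is what makes the total come to 152 cores):

current_cores = 4 * 8 + 5 * 24      # 4 old 8-core hosts + 5 newer 24-core hosts = 152
extra_cores = current_cores * 9     # 10x total capacity means 9x more cores = 1368
extra_servers = extra_cores // 24   # 1368 / 24 = 57 additional 24-core servers
print(current_cores, extra_cores, extra_servers)   # 152 1368 57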

(In reply to comment #7)

The reason you need more servers is because serving HTTPS is more expensive
than serving HTTP, because of the encryption overhead. You could have the
same
servers doing both HTTP and HTTPS, and I believe that is indeed the plan, but
you would still need more servers because the CPU cost of serving an HTTPS
connection is higher, so you need more cores to do the same connection rate.

Didn't Google produce stats claiming that the CPU cost was basically negligible?

I just found out that there is an intermittent bug in IE with protocol-relative stylesheet inclusion that causes IE to sometimes download the URL twice if it is not yet cached.

http://www.stevesouders.com/blog/2010/02/10/5a-missing-schema-double-download/

A Microsoft employee responds in that thread:
"Internal to Trident, the download queue has “de-duplication” logic to help ensure that we don’t download a single resource multiple times in parallel.
Until recently, that logic had a bug for certain resources (like CSS) wherein the schema-less URI would not be matched to the schema-specified URI and hence you’d end up with two parallel requests.
Once the resource is cached locally, the bug no longer repros because both requests just pull the cached local file."

I wonder how the deprecation of HTTP on Wikimedia wikis affects this report. Do we now care more, as we can switch everything to https? Or less, because some clients are being broken anyway when they only support http?

Leaving this open about MediaWiki in the general case, but obviously it's not really as much an issue for WMF now. We could eventually just go to absolutes again and always use 'https://' too.

Switching to https:// links would probably be a net benefit at this point, based on the information we have here. Can anyone (dis)confirm?

Well, I said "eventually" because I fear there will be internal corner-cases we'd break if we universally flipped the URLs to absolutes with https:// all over the output. For instance, there's no HTTPS listener at all on the internal service endpoints like http://appservers.svc.eqiad.wmnet/ that other internal code might hit. Eventually we'll get things fixed up to the point that, even internally in our own networks, all HTTP is HTTPS, but we're not there yet.

Change 332707 had a related patch set uploaded (by Chad):
Swap from protocol-relative urls to https everywhere

https://gerrit.wikimedia.org/r/332707

Change 332707 merged by Jcrespo:
Swap from protocol-relative urls to https everywhere

https://gerrit.wikimedia.org/r/332707

[…]

chrome-save-page.png (1×2 px, 708 KB)
firefox-save-page.png (1×2 px, 733 KB)

Note that the images are broken as well. […] The browsers (both Firefox and Chrome) store styles and images (regardless of domain) in the saved directory. But for images, only the file referenced in <img src> is saved. […] The files referenced in <img srcset> are ignored and left as-is. […] This too is an upstream browser bug, considering that […] they already correctly do this for URLs in <img src>. […] On the other hand, this is also something we could relatively easily fix on our end.