
Percent-escaped slashes and colons should not be alternative page URLs
Open, Low, Public

Description

Author: wikipedia

Description:
The Apache or MediaWiki configuration for Wikipedia appears to decode percent-encoded "/"s and ":"s in URLs.

This means that, for example, https://en.wikipedia.org/wiki/Wikipedia%3AAbout shows the same page as https://en.wikipedia.org/wiki/Wikipedia:About.

Similarly, https://en.wikipedia.org/wiki/Wikipedia_talk%3AArbitration%2FRequests%2FCase%2FShakespeare_authorship_question%2FProposed_decision is a valid URL for https://en.wikipedia.org/wiki/Wikipedia_talk:Arbitration/Requests/Case/Shakespeare_authorship_question/Proposed_decision.
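
To illustrate the equivalence (a minimal sketch in plain Python, not how MediaWiki itself resolves titles):

```python
# A minimal sketch of the behaviour described above: once the server
# percent-decodes the path, both spellings collapse to the same title.
from urllib.parse import unquote

encoded = "/wiki/Wikipedia%3AAbout"
plain = "/wiki/Wikipedia:About"

# unquote() turns %3A back into ":" and %2F back into "/", so the two
# paths decode to the identical string and resolve to the same page.
assert unquote(encoded) == unquote(plain)
print(unquote(encoded))  # /wiki/Wikipedia:About
```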

The escaping results in incredibly verbose robots.txt rules (see https://en.wikipedia.org/robots.txt), but even our existing rules don't account for %2Fs in place of "/"s.

We should either redirect or reject these URLs.
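
A rough sketch of the "redirect" half of that proposal, using a hypothetical canonicalize() helper that is not actual Apache or MediaWiki configuration:

```python
# Hypothetical sketch: if the raw request path still contains an encoded
# ":" or "/", answer with a 301 to the canonical spelling instead of
# serving the page under the escaped URL as well.
def canonicalize(raw_path: str):
    decoded = (raw_path.replace("%3A", ":").replace("%3a", ":")
                       .replace("%2F", "/").replace("%2f", "/"))
    if decoded != raw_path:
        return 301, decoded      # redirect to the canonical URL
    return 200, raw_path         # already canonical, serve as usual

print(canonicalize("/wiki/Wikipedia%3AAbout"))  # (301, '/wiki/Wikipedia:About')
print(canonicalize("/wiki/Wikipedia:About"))    # (200, '/wiki/Wikipedia:About')
```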


Version: wmf-deployment
Severity: normal

Details

Reference
bz61553

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 3:06 AM
bzimport set Reference to bz61553.
bzimport added a subscriber: Unknown Object (MLST).

Andre Klapper wrote:

(In reply to LFaraone from comment #0)

> The escaping results in incredibly verbose robots.txt rules

...which is not a problem per se.

> but even our existing rules don't account for %2Fs in place of "/"s.

So these rules could be fixed to support that too?

> We should either redirect or reject these URLs.

I cannot yet follow the advantage of this proposal.

What is the problem you would like to solve in this bug report?

LFaraone: Can you please answer comment 1?

wikipedia wrote:

(In reply to Andre Klapper from comment #1)

> > The escaping results in incredibly verbose robots.txt rules
>
> ...which is not a problem per se.
>
> > but even our existing rules don't account for %2Fs in place of "/"s.
>
> So these rules could be fixed to support that too?
>
> > We should either redirect or reject these URLs.
>
> I cannot yet follow the advantage of this proposal.
>
> What is the problem you would like to solve in this bug report?

Yes, we could try to guess *every* *single* *possible* encoding of a URL and include it in robots.txt.

So, for WT:BLP/N, that means we'll need to have these entries:

Disallow: /wiki/Wikipedia_talk:Biographies_of_living_persons/Noticeboard/
Disallow: /wiki/Wikipedia_talk:Biographies_of_living_persons/Noticeboard%2F*
Disallow: /wiki/Wikipedia_talk:Biographies_of_living_persons%2FNoticeboard/
Disallow: /wiki/Wikipedia_talk:Biographies_of_living_persons%2FNoticeboard%2F*
Disallow: /wiki/Wikipedia_talk%3ABiographies_of_living_persons/Noticeboard/
Disallow: /wiki/Wikipedia_talk%3ABiographies_of_living_persons/Noticeboard%2F*
Disallow: /wiki/Wikipedia_talk%3ABiographies_of_living_persons%2FNoticeboard/
Disallow: /wiki/Wikipedia_talk%3ABiographies_of_living_persons%2FNoticeboard%2F*

This way lies madness.

We want to express one specific thing: disallowing access to pages below an article-name path. To accomplish that, we need 8 (!!) rules. This makes the list hard to manage, especially since it is edited by hand. We'll almost certainly miss things.
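
For illustration, a small Python sketch (hypothetical, not how robots.txt is actually generated) of why the rule count explodes: every ":" or "/" in the title can appear either literally or percent-encoded, so n such characters mean 2**n Disallow lines.

```python
from itertools import product

ENC = {":": "%3A", "/": "%2F"}

def robots_variants(title, prefix="/wiki/"):
    # Each ":" or "/" can appear either literally or percent-encoded;
    # every combination would need its own Disallow path.
    options = [(c, ENC[c]) if c in ENC else (c,) for c in title]
    return [prefix + "".join(combo) for combo in product(*options)]

rules = robots_variants("Wikipedia_talk:Biographies_of_living_persons/Noticeboard/")
print(len(rules))  # 3 encodable characters -> 2**3 = 8 rules, as in the list above
```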

Is there any reason to believe that more aggressive URL canonicalization would affect robots.txt entries? I'm not sure there's a valid use case here.

In reply to comment 3, I'd suggest that you could turn each of those underscores into " " or "%20" or "__" and come up with thousands more permutations. :-)

Given that Squid caching is prefix-based, more aggressive URL canonicalization would have been (or would be) helpful in that context. That is, as I understand it, Squid viewed "/wiki/Wikipedia_talk%3AB" and "/wiki/Wikipedia_talk:B" as distinct URLs and would cache both separately.

I'm not sure the same is true of Varnish (which is what Wikimedia wikis now use), though improving Squid behavior alone might make this a valid request.
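
A toy illustration (plain Python, not Squid or Varnish configuration) of that caching point: keying the cache on the raw path stores the same page twice, while keying on a normalized path does not.

```python
from urllib.parse import unquote

cache = {}

def fetch(path, normalize=None):
    key = normalize(path) if normalize else path
    if key not in cache:
        cache[key] = f"rendered page for {key}"  # stand-in for a backend fetch
    return cache[key]

fetch("/wiki/Wikipedia_talk%3AB")
fetch("/wiki/Wikipedia_talk:B")
print(len(cache))  # 2: the same page occupies two cache entries

cache.clear()
fetch("/wiki/Wikipedia_talk%3AB", normalize=unquote)
fetch("/wiki/Wikipedia_talk:B", normalize=unquote)
print(len(cache))  # 1: a single entry once the key is normalized
```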

Since the question was raised whether (our) Varnish can handle this: yes, it can; there is normalize_path (which essentially does the same as MediaWiki's wfUrlencode) in modules/varnish/templates/vcl/wikimedia.vcl.erb.
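
For readers who want to see roughly what such a normalization step does, here is an approximation in Python; the exact safe-character set used by normalize_path/wfUrlencode is an assumption here and may differ.

```python
from urllib.parse import quote, unquote

# Assumed set of characters left unescaped; the real wfUrlencode /
# normalize_path list may differ from this guess.
SAFE = ";@$!*(),/~:"

def normalize_path(path: str) -> str:
    # Decode whatever the client sent, then re-encode with one fixed
    # policy so every spelling maps to a single canonical form.
    return quote(unquote(path), safe=SAFE)

print(normalize_path("/wiki/Wikipedia_talk%3AArbitration%2FRequests"))
# /wiki/Wikipedia_talk:Arbitration/Requests
print(normalize_path("/wiki/Wikipedia_talk:Arbitration/Requests"))
# /wiki/Wikipedia_talk:Arbitration/Requests  (same canonical form)
```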