
Javascript escapes in URLs ("\x" rather than "%") are not decoded
Closed, Declined (Public)

Description

Author: rybec

Description:
requests logged on 2012-06-09 for hour 19:00

Instead of HTML percent encodings, pages are sometimes requested through Javascript-encoded URLs. The difference is that "\x", rather than the "%" symbol, is used to indicate the start of an escape sequence. These requests are not decoded by the MediaWiki software. For example, a request for

https://en.wikipedia.org/w/index.php?title=Robinson_Can%C3%B3

is correctly decoded (the "%C3%B3" is transformed to an accented "o"), whereas a request for

https://en.wikipedia.org/w/index.php?title=Robinson_Can\xC3\xB3

is not decoded and we're told the page doesn't exist.

As I noted at https://en.wikipedia.org/wiki/Wikipedia:Redirects_for_discussion/Log/2013_December_9#.5Cx22Weird_Al.5Cx22_Yankovic there's been a tremendous increase in the amount of this traffic reaching the WMF projects, from about one request per hour in September 2011 to millions of requests per day in November 2013.

Perhaps it would be desirable to transform "\x" to "%" before passing URLs to rawurldecode() so that these requests will reach the intended pages.
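
To make the proposal concrete, here is a minimal sketch of the suggested transformation, written in Python rather than PHP. It assumes that urllib.parse.unquote is an acceptable stand-in for rawurldecode(); the name decode_title is hypothetical and is not taken from MediaWiki.

```python
import re
from urllib.parse import unquote

# Matches a JavaScript-style hex escape such as \xC3 (backslash, "x", two hex digits).
_JS_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def decode_title(raw_title: str) -> str:
    """Rewrite \\xHH escapes as %HH, then percent-decode the result as UTF-8.

    Hypothetical helper for illustration only; it is not MediaWiki code.
    """
    percent_encoded = _JS_HEX_ESCAPE.sub(r"%\1", raw_title)
    return unquote(percent_encoded)
```

With this sketch, both Robinson_Can\xC3\xB3 and Robinson_Can%C3%B3 would decode to the same title. Note that a title legitimately containing a literal backslash followed by "x" and two hex digits would be rewritten as well, which is one reason to treat the transformation with caution.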


Version: unspecified
Severity: normal


Details

Reference
bz58316

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 2:41 AM)
bzimport set Reference to bz58316.
bzimport added a subscriber: Unknown Object (MLST).

Are you sure the requests are not being handled? Isn't it just that the log is written differently for those requests?

I mean, I see people reasoning that https://en.wikipedia.org/w/index.php?title=Robinson_Can\xC3\xB3 should be reachable through their browser. But I don't think that is correct.

It is the technical representation of the input https://en.wikipedia.org/w/index.php?title=Robinson_Canó (a Unicode URL that is NOT percent-encoded).

This technical representation is however not a valid input method in browser URL fields if I remember correctly. I suspect people are making assumptions based on an incorrect interpretation of the logs.

In summary:

  • Entries in the Apache log that look like Robinson_Can\xC3\xB3 are UTF-8 encoded (likely a representation of the non-percent-encoded request containing Robinson_Canó, possibly even an IRI request?).
  • Log entries are NOT canonical on this front. A request for Robinson_Canó is logged differently than a request for Robinson_Can%C3%B3.
  • The statistics of stats.grok.se might not handle these properly (collating them, ignoring them, or just not making them accessible?).
  • Someone else made a tool to detect red links, which does make the \x entries accessible/visible.
  • Someone is making mass redirects from \x entries to what they consider to be 'proper' entries. This seems to affect the statistics, but I would say that if the statistics/tools are broken, you are most likely only influencing the statistics, not actually fixing anything.
  • There seems to have been a large increase in these kinds of requests (newer browsers or google/bing.com changing their defaults can easily account for this).
  • You cannot input a UTF-8 sequence in the URL field of a browser (because there is no need for this; you would just input ó).
  • People can't figure out who is wrong and who is right.

Does that sum it up a bit?
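
As an illustration of the point that the \x form may be a logging artifact rather than what the client typed, the following sketch shows how a log writer could turn the raw UTF-8 bytes of a request into \xHH escapes. The escaping rule used here (escape the double quote, the backslash, and every byte outside printable ASCII) is an assumption about the logging layer, and log_escape is a hypothetical name.

```python
def log_escape(raw: bytes) -> str:
    """Render request bytes the way an escaping access-log writer might.

    Assumed rule (not confirmed for the servers discussed here): printable
    ASCII is kept as-is, except '"' and '\\'; every other byte becomes \\xHH.
    """
    out = []
    for byte in raw:
        if 0x20 <= byte < 0x7F and byte not in (0x22, 0x5C):
            out.append(chr(byte))
        else:
            out.append(f"\\x{byte:02X}")
    return "".join(out)
```

Under that assumption, a request containing the unencoded title Robinson_Canó would appear in the log as Robinson_Can\xC3\xB3, even though no client ever sent a backslash.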

(In reply to comment #0)

Created attachment 14056 [details]
requests logged on 2012-06-09 for hour 19:00

If you logged something 18 months ago, why do you file a bug report now?


rybec wrote:

logged requests for titles containing "Robinson_Can" (case-insensitive), from 18 November 2013 and the first hour of 19 November 2013 (from "zcat pagecounts-2013111*z | grep -i Robinson_Can")


rybec wrote:

The first attachment is an extract from http://dumps.wikimedia.org/other/pagecounts-raw/2012/2012-06/pagecounts-20120609-190000.gz , a log provided by the WMF of incoming requests for that hour. I've uploaded another attachment, which shows how requests for Robinson_Can\xC3\xB3, Robinson_Can%C3%B3 and Robinson_Canó appear as separate entries in the logs.
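
For anyone consuming these logs, one way to fold the three variants into a single entry is sketched below. It assumes each pagecounts line has four space-separated fields (project, title, request count, bytes transferred); the sample lines and their counts are made up for illustration, and canonical_title and collate are hypothetical names.

```python
import re
from collections import Counter
from urllib.parse import unquote


def canonical_title(title: str) -> str:
    """Map \\xHH and %HH escapes to the same decoded UTF-8 title."""
    return unquote(re.sub(r"\\x([0-9A-Fa-f]{2})", r"%\1", title))


def collate(lines):
    """Sum request counts per (project, canonical title).

    Lines that do not have exactly four whitespace-separated fields are skipped.
    """
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        project, title, count, _bytes = parts
        totals[(project, canonical_title(title))] += int(count)
    return totals


# Hypothetical sample lines; the counts are invented, not taken from the attachments.
sample = [
    r"en Robinson_Can\xC3\xB3 5 0",
    "en Robinson_Can%C3%B3 7 0",
    "en Robinson_Canó 3 0",
]
```

Calling collate(sample) yields a single key, ("en", "Robinson_Canó"), whose value is the sum of the three counts. This only changes how the statistics are reported; it says nothing about whether the \x requests were served successfully.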

rybec wrote:

Someone has put a redirect at my Robinson_Can\xC3\xB3 example page, but this bug can be confirmed by noting the "redirected from" or by comparing the responses to these two URLs:

https://commons.wikipedia.org/w/index.php?title=File:\x22Holy_Sheykh_Cotton\x22_\x281890\x29_-_TIMEA.jpg

https://commons.wikipedia.org/w/index.php?title=File:%22Holy_Sheykh_Cotton%22_%281890%29_-_TIMEA.jpg

The first brings up an error page, whereas the second gets decoded and brings up a content page.

There's a magical \x syntax hidden in Parser.php, I believe, for Esperanto and such. It was a workaround (a hack) for browsers that used to handle Unicode poorly, as I recall. I'm reminded of it in this bug report.

I'm not sure this is a valid bug.

Change 103241 had a related patch set uploaded by QChris:
Add test to guard against encoding mangling of filter

https://gerrit.wikimedia.org/r/103241

(In reply to comment #8)

There's a magical \x syntax hidden in Parser.php, I believe, for Esperanto

That's true, but not related here.

(In reply to comment #0)

Instead of HTML percent encodings, pages are sometimes requested through
Javascript-encoded URLs.

There are indeed some requests to \x-encoded URLs.
But they are mostly confused bots/clients. They are far from being
page views, and they are really few.
For example, in October 2013 we had 20 such requests in total in the
sampled-1000 logs.

However, you are correct that we see a lot of \x encoded URLs in
webstatscollector output. Webstatscollector processes udp2log data
unaltered (see comment #9). It seems \x-encoded URLs all stem from
SSL endpoints, and it looks as if those SSL endpoints throw
misencoded URL requests into the udp2log stream. Since that is a
sufficiently different issue, I filed bug 58876 about it.

A solution of bug 58876 will not address the current call for
MediaWiki to decode \x-encoded URLs. But it will make \x-encoded URLs
disappear from the webstatscollector output (thereby also disappear
from stats.grok.se, and other consumers).

(In reply to comment #10)

(In reply to comment #8)

There's a magical \x syntax hidden in Parser.php, I believe, for Esperanto

That's true, but not related here.

True, I wasn't really replying to anyone in particular. I was just reminded of it here. :-)

This particular bug falls into the category of "should we try to catch various URL munging?" I think. For example, we probably get _a lot_ of requests that inappropriately omit a trailing ) or inappropriately include a trailing > or ,. Should we try to auto-correct those requests as well? Dunno.

(In reply to comment #11)

A solution of bug 58876 will not address the current call for
MediaWiki to decode \x-encoded URLs. But it will make \x-encoded URLs
disappear from the webstatscollector output (thereby also disappear
from stats.grok.se, and other consumers).

The fix for bug 58876 just went live, so \x-encoded URLs should soon
mostly disappear.

Change 103241 merged by Ottomata:
Add test to guard against encoding mangling of filter

https://gerrit.wikimedia.org/r/103241

brion claimed this task.
brion subscribed.

Per above comments, such URLs are incorrect and should not be decoded.