
Special characters in target of mmv url are hex encoded in url bar
Closed, Declined, Public

Description

Author: Gerard.meijssen

Description:
The MediaViewer gives us https://en.wikipedia.org/wiki/Milutin_Dostani%C4%87#mediaviewer/File:MilutinDostani%C4%87.jpg while Commons has it as https://commons.wikimedia.org/wiki/File:MilutinDostani%C4%87.jpg

Thanks,

GerardM

Version: unspecified
Severity: normal

Details

Reference
bz68372

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:38 AM
bzimport added a project: MediaViewer.
bzimport set Reference to bz68372.
bzimport added a subscriber: Unknown Object (MLST).

Can you clarify what result you would expect? The MediaViewer URL hash contains the exact same string as the file page URL, so I'm not sure how it is not "what we usually do".

Gerard.meijssen wrote:

Try the URL, you will find that what Bugzilla also shows does NOT show like shit from the Commons URL; it has a special c..
Thanks,

GerardM

I'm afraid I have no idea what you are saying. Can you describe how expected and actual behavior differ? See [[mw:How to report a bug]].

No clue either. Please provide clear, exact output, screenshots, and browser information. For future reference, https://bugzilla.wikimedia.org/enter_bug.cgi?format=guided might also help by providing more useful hints.

per discussion on irc:

*Go to https://en.wikipedia.org/wiki/Milutin_Dostani%C4%87#mediaviewer/File:MilutinDostani%C4%87.jpg

Expected behaviour:

Actual behaviour

Behaviour varies with browser:

  • [on chrome] This only happens in path part of url, not fragment, giving a url like https://en.wikipedia.org/wiki/Milutin_Dostanić#mediaviewer/File:MilutinDostani%C4%87.jpg

  • [on firefox 3.5] ć is in both places like expected.

To be clear, chrome simply does not adjust the fragment part of the url but leaves it as is. If you type in https://en.wikipedia.org/wiki/Milutin_Dostani%C4%87#mediaviewer/File:MilutinDostani%C4%87.jpg it will display as https://en.wikipedia.org/wiki/Milutin_Dostanić#mediaviewer/File:MilutinDostani%C4%87.jpg. If you type in https://en.wikipedia.org/wiki/Milutin_Dostanić#mediaviewer/File:MilutinDostanić.jpg it will display as https://en.wikipedia.org/wiki/Milutin_Dostanić#mediaviewer/File:MilutinDostanić.jpg . Firefox (3.5) will always convert the url to https://en.wikipedia.org/wiki/Milutin_Dostanić#mediaviewer/File:MilutinDostanić.jpg
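The difference described above can be modelled in a few lines of Python. This is only a sketch of how the address bar text is derived, not of any browser's actual code; the function names `display_like_chrome` and `display_like_firefox` are made up for illustration.

```python
from urllib.parse import urlsplit, unquote

def display_like_chrome(url: str) -> str:
    """Decode percent-escapes in the path only; leave the fragment as typed."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{unquote(parts.path)}#{parts.fragment}"

def display_like_firefox(url: str) -> str:
    """Decode percent-escapes in both the path and the fragment."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{unquote(parts.path)}#{unquote(parts.fragment)}"

url = ("https://en.wikipedia.org/wiki/Milutin_Dostani%C4%87"
       "#mediaviewer/File:MilutinDostani%C4%87.jpg")

print(display_like_chrome(url))   # fragment still shows %C4%87
print(display_like_firefox(url))  # ć shown in both places
```

In both cases the underlying URL is the same string; only the rendering in the address bar differs.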

Thanks for the explanation! This should be a bug/feature request for non-Firefox browsers. Closing as worksforme since what MediaViewer is doing is, as far as I can see, the correct way of representing non-ASCII characters in a URI. See my mail some time ago about standards and other considerations: http://lists.wikimedia.org/pipermail/wikitech-l/2014-April/076069.html

I reported this for Chromium at the time: https://code.google.com/p/chromium/issues/detail?id=367505
Tried to report for Safari/IE as well but lost motivation before getting halfway through the crap that's needed to report bugs for those browsers. (Opera partially got this right at the time; since then they switched engines, so that might have changed.) If someone is more persistent or already has the right kind of account, more upstream bug reports would be helpful.

Gerard.meijssen wrote:

expected URL

Attached:

expected_behaviour.png (32×597 px, 8 KB)

Gerard.meijssen wrote:

The URL with a malformed string

This is not human readable as you would expect it.

Attached:

rubbish.png (35×853 px, 10 KB)

Again, you should report this to your browser vendor. The URL follows the standard method of encoding non-ASCII bytes, and including them unencoded would result in more serious issues, as I outlined in the mail.

Gerard.meijssen wrote:

Why then is there a difference between how Commons does things and how the Multi Media Viewer does it... Consistent behaviour may be expected and has nothing to do with browser "vendors".
Thanks,

GerardM

Because some browsers treat percent-encoded characters differently in the path/query part and the fragment part of the URL. As can be seen from the URLs you posted in comment 0, the actual representation is consistent.

Gerard.meijssen wrote:

As can be seen by the screenshots (from the same browser & the same session) this is not the case.
Thanks,

GerardM

(In reply to Gerard Meijssen from comment #14)

> As can be seen by the screenshots (from the same browser & the same session) this is not the case.
> Thanks,
>
> GerardM

That doesn't make sense. Tgr said that some browsers (chrome) treat the part of the url after the '#' different from the part before the '#', and that you should complain to your browser maker. Your screenshots seem to agree with what tgr said.

Could media viewer just not urlencode the fragment? I suspect that's allowed in html5, and chrome seems to handle it fine.

Could, but should not, IMO. I'll quote the relevant part from the mail linked in comment 8:

  1. Just put the file name as-is (with spaces replaced by underscores) in the URL fragment part. Pro: readable file names in URLs, easy to generate. Con: technically not a valid URI. [2] (It would be a valid IRI, probably, but browser support for that is not so great, so non-ASCII bytes might get encoded in unexpected ways.) Creates nasty usability and security issues (injection vulnerabilities, RTL characters, characters which break autolinking). Would make it very hard to introduce more complex URL formats later, as file names can contain pretty much any character.
  2. Use percent encoding (with underscores for spaces). Pro: this is the standard way of encoding fragments. [2][3] Always results in a valid URI. Readable file names in Firefox. Easy to generate on-wiki (e.g. with {{urlencode}}). Con: Non-Latin filenames look horrible in any browser that's not Firefox.

[2] http://tools.ietf.org/html/rfc3986#section-3.5
[3] https://tools.ietf.org/html/rfc3987#section-3.1
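As a rough illustration of the percent-encoding option quoted above, the fragment could be built and read back as follows. This is a minimal sketch and not MediaViewer's actual code: the helper names are hypothetical, and the `#mediaviewer/File:` prefix is simply copied from the URLs in this report.

```python
from urllib.parse import quote, unquote

PREFIX = "#mediaviewer/File:"  # taken from the URLs in this report

def mediaviewer_fragment(file_name: str) -> str:
    """Hypothetical helper: underscores for spaces, then percent-encode."""
    title = file_name.replace(" ", "_")
    # quote() percent-encodes the UTF-8 bytes of every character that is
    # not an ASCII letter, digit, or one of "_.-~"
    return PREFIX + quote(title, safe="")

def file_name_from_fragment(fragment: str) -> str:
    """Hypothetical inverse: decode and turn underscores back into spaces."""
    if not fragment.startswith(PREFIX):
        raise ValueError("not a MediaViewer fragment")
    return unquote(fragment[len(PREFIX):]).replace("_", " ")

print(mediaviewer_fragment("MilutinDostanić.jpg"))
print(mediaviewer_fragment('Foo "bar" (baz).png'))
```

The result contains only ASCII characters, so it stays a valid URI fragment whatever the file name contains.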

> 1. Just put the file name as-is (with spaces replaced by underscores) in the URL fragment part. Pro: readable file names in URLs, easy to generate. Con: technically not a valid URI. [2] (It would be a valid IRI, probably, but browser support for that is not so great, so non-ASCII bytes might get encoded in unexpected ways.)

Yeah, that sounds like the making of some not so fun bugs.

> Creates nasty usability and security issues (injection vulnerabilities, RTL characters, characters which break autolinking).

What sort of injection vulnerabilities do you mean? (< and > are disallowed in titles. Things should be escaped before injecting into HTML anyways.) I doubt RTL characters would cause major problems. The annoying characters (bidi override, RTL mark, etc.) are banned from file names anyways.

> Would make it very hard to introduce more complex URL formats later, as file names can contain pretty much any character.

true.

(In reply to Bawolff (Brian Wolff) from comment #18)

> What sort of injection vulnerabilities do you mean ( < and > are disallowed in titles. Things should be escaped before injecting into html anyways).

Quotes are allowed and can be used to break out from HTML attributes. The goal of having a custom URL in the first place is that people can copy-paste it, so escaping would be up to the reuser. People don't escape URLs they paste into blog posts.
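A small sketch of the quote problem, under stated assumptions: the file name and the `example.org` URL are invented, and `href_seen_by_parser` is a hypothetical stand-in for the rule that a double-quoted attribute value ends at the next double quote.

```python
import re
from urllib.parse import quote

def href_seen_by_parser(html: str) -> str:
    """Return the href value as delimited by the first pair of double quotes."""
    match = re.search(r'href="([^"]*)"', html)
    return match.group(1) if match else ""

base = "https://example.org/wiki/Page#mediaviewer/File:"  # made-up URL
title = 'Say_"cheese".jpg'                                 # made-up file name

raw_url = base + title                     # file name pasted as-is
encoded_url = base + quote(title, safe="")  # percent-encoded file name

# A reuser pastes the URL into hand-written HTML without escaping it
raw_html = f'<a href="{raw_url}">photo</a>'
encoded_html = f'<a href="{encoded_url}">photo</a>'

print(href_seen_by_parser(raw_html))      # cut off at the quote in the file name
print(href_seen_by_parser(encoded_html))  # the whole URL survives
```

With the unencoded form, everything after the first quote in the file name lands outside the attribute value, which is where the break-out happens.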

> I doubt RTL characters would cause major problems. The annoying characters (bidi override, rtl mark, etc) are banned from file names anyways.

Here is an example: https://he.wikipedia.org/wiki/קובץ:תוכנית הפדרציה.png
Press "reply" and try to interact with it in the edit box (like deleting some character, adding ASCII characters). Not a major problem but an annoyance.

Plus, tofu in the editbox for more exotic scripts.

Autolinking is a bigger concern though. MediaWiki (and Gmail, Facebook, pretty much anything else) tends to end links at characters like ")", which are pretty frequent in file names.
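To illustrate the autolinking concern, here is a deliberately naive autolinker that ends a link at whitespace or at a closing parenthesis. It is an assumption made for the sketch, not the rule MediaWiki, Gmail or Facebook actually use (real implementations differ in detail); the URL and file name are invented.

```python
import re
from urllib.parse import quote

# Naive rule (an assumption for this sketch): a link ends at whitespace,
# a closing parenthesis, a double quote, or an angle bracket.
URL_RE = re.compile(r'https?://[^\s)"<>]+')

def autolink_target(text: str) -> str:
    """Return the part of the text that the naive autolinker would link."""
    match = URL_RE.search(text)
    return match.group(0) if match else ""

base = "https://example.org/wiki/Page#mediaviewer/File:"  # made-up URL
title = "Church_(interior).jpg"                            # made-up file name

raw_url = base + title
encoded_url = base + quote(title, safe="")

print(autolink_target(f"See {raw_url} for details"))      # link ends before ")"
print(autolink_target(f"See {encoded_url} for details"))  # full URL is linked
```

Percent-encoding turns "(" and ")" into %28 and %29, so the pasted URL no longer contains characters that such a linker treats as the end of the link.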

I am closing this report again - see comment 8 and comment 17 for reasons to not change current behavior.