Anchors to section names for non-ASCII letters are encoded in the URL
Closed, Resolved, Public

Authored By
Sunpriat
Nov 6 2014, 8:25 PM
Description

RuWiki uses the Cyrillic alphabet.
In the browser's URL bar, links are only partially displayed in Cyrillic.

for
[[:ru:Википедия:Форум/Новости]] (Википедия:Форум/Новости = Wikipedia:Forum/News)
displayed
https://ru.wikipedia.org/wiki/Википедия:Форум/Новости

for
[[:ru:Википедия:Форум/Новости#New Wikipedia Library Accounts Now Available (November 2014)]]
displayed
https://ru.wikipedia.org/wiki/Википедия:Форум/Новости#New_Wikipedia_Library_Accounts_Now_Available_.28November_2014.29

for
[[:ru:Википедия:Форум/Новости#Новый инструмент для исправления интервики-связей]]
displayed
https://ru.wikipedia.org/wiki/Википедия:Форум/Новости#.D0.9D.D0.BE.D0.B2.D1.8B.D0.B9_.D0.B8.D0.BD.D1.81.D1.82.D1.80.D1.83.D0.BC.D0.B5.D0.BD.D1.82_.D0.B4.D0.BB.D1.8F_.D0.B8.D1.81.D0.BF.D1.80.D0.B0.D0.B2.D0.BB.D0.B5.D0.BD.D0.B8.D1.8F_.D0.B8.D0.BD.D1.82.D0.B5.D1.80.D0.B2.D0.B8.D0.BA.D0.B8-.D1.81.D0.B2.D1.8F.D0.B7.D0.B5.D0.B9

Is it possible to display the section part of the link in Cyrillic as well? The encoded form is uncomfortable and hard to read in the browser's URL bar.


Version: unspecified
Severity: minor

Details

Reference
bz73092

Event Timeline

bzimport raised the priority of this task to Low. (Nov 22 2014, 3:54 AM)
bzimport set Reference to bz73092.
bzimport added a subscriber: Unknown Object (MLST).

I'm not sure if there is some technical reason why non-ASCII anchor names in URLs end up as #.D0.9D.D0.BE in the browser's URL bar while the rest of the URL is displayed "non-encoded".
I assume that's how MediaWiki encodes anchor names. Or the browser. Or some RFC.

I hope somebody who is more tech-savvy can explain that here.

<cscott> andre__: yeah, that's how it's been and I don't think we can change it now without breaking lots of existing links to our content
<cscott> presumably once upon a time some browser didn't like #%D0%9D%D0%Be

It's HTML4 baggage: ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".") (HTML 4.01 spec). That restriction is not relevant anymore, and it's a crappy format for multiple reasons (this email about choosing anchor encoding formats for MediaViewer has details). We would have to keep it as an alternative anchor for B/C, of course, but that does not seem hard.
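To make the format concrete, here is a minimal sketch that reproduces the dot-encoded anchors shown in the task description. The function name `legacyAnchor` is hypothetical, and the rules are inferred from the example URLs above (spaces become underscores, every other byte outside the HTML4-safe set becomes `.XX`); it is not MediaWiki's actual implementation, which may handle more cases.

```javascript
// Hypothetical helper approximating the legacy anchor format seen in this task.
// Assumption: rules are inferred from the example URLs, not taken from MediaWiki.
function legacyAnchor(sectionTitle) {
    // Spaces become underscores before encoding.
    const bytes = new TextEncoder().encode(sectionTitle.replace(/ /g, '_'));
    let out = '';
    for (const b of bytes) {
        const ch = String.fromCharCode(b);
        // Keep characters that HTML4 ID tokens allow; dot-encode every other byte.
        out += /[A-Za-z0-9_.:-]/.test(ch)
            ? ch
            : '.' + b.toString(16).toUpperCase().padStart(2, '0');
    }
    return out;
}

console.log(legacyAnchor('New Wikipedia Library Accounts Now Available (November 2014)'));
console.log(legacyAnchor('Новый инструмент'));
```

Because a literal period is also a valid character, the output cannot be decoded unambiguously, which is probably one of the reasons the format is awkward.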

OTOH the properly encoded Unicode URL would be https://ru.wikipedia.org/wiki/Википедия:Форум/Новости#%D0%9D%D0%BE%D0%B2%D1%8B%D0%B9_%D0%B8%D0%BD%D1%81%D1%82%D1%80%D1%83%D0%BC%D0%B5%D0%BD%D1%82_%D0%B4%D0%BB%D1%8F_%D0%B8%D1%81%D0%BF%D1%80%D0%B0%D0%B2%D0%BB%D0%B5%D0%BD%D0%B8%D1%8F_%D0%B8%D0%BD%D1%82%D0%B5%D1%80%D0%B2%D0%B8%D0%BA%D0%B8-%D1%81%D0%B2%D1%8F%D0%B7%D0%B5%D0%B9 which is not all that much nicer (last I checked, Firefox was the only browser that transformed it to human-readable form for display; it follows the standard, though, so maybe other browser vendors could be convinced).
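For comparison, a short sketch of the standard percent-encoded form and of the decoding a browser would have to do to show it in readable form. The helper names `percentAnchor` and `readableAnchor` are made up for illustration; only the built-in `encodeURIComponent` and `decodeURIComponent` are real APIs.

```javascript
// Hypothetical helpers; only encodeURIComponent/decodeURIComponent are built in.

// Percent-encode a section title the standard way (spaces become underscores first).
function percentAnchor(sectionTitle) {
    return encodeURIComponent(sectionTitle.replace(/ /g, '_'));
}

// What a browser could display: the decoded fragment, or the raw one if decoding fails.
function readableAnchor(fragment) {
    try {
        return decodeURIComponent(fragment);
    } catch (e) {
        // Malformed escape sequences throw URIError; fall back to the raw text.
        return fragment;
    }
}

const encoded = percentAnchor('Новый инструмент');
console.log(encoded);
console.log(readableAnchor(encoded));
```

Unlike the dot-encoded form, this one round-trips: decoding the fragment gives back the original text with underscores.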

To clarify, is this just asking for $wgExperimentalHtmlIds to be set to true?

Parsoid has supported "HTML5" ids, which are much more permissive than HTML4 ids. Ironically enough, we just reverted this to have Parsoid use "compatible" HTML4 ids in f051b2620a276c4c9d1e43444f28d9a88a56ba6e since T152540: Migrate to HTML5 section ids has been stalled for a long time.

We would love for PHP to migrate to HTML5 section ids.

I think @kaldari said yesterday that there was a browser compatibility issue, but if so, it is not mentioned here or on T152540. Does anyone know more about browser compatibility?

According to @Tgr, only Firefox automatically converts percent-encoded fragments into human-readable characters on display. Without this feature, it doesn't seem like migrating to HTML5 section ids really makes much difference (as it's just switching from one form of encoding to another).

OK, but that's not a blocker. We can lead the browsers by switching to percent encoding first, and then users will see the benefits as soon as the browsers roll out the feature. The browser developers will have more of a reason to implement the feature if it is already deployed on WMF websites.

@tstarling it would be nice to get some sort of commitment from browser vendors first. If they never do percent decoding of fragments, we would probably have to use a different URL-encoding scheme (only encode characters which are likely to break autolinking), and any time we change the encoding scheme, we need to support B/C forever.

Hi. I noticed a small problem with this feature. Is it tracked somewhere, or should I open a new task? The section address does not work with spaces between the words, only with underscores. Thank you.

@IKhitron In order to make fragments with spaces work as underscores at the first page load, the following JS code could be used in personal common.js or elsewhere:

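// Decode the fragment in Firefox only; in other browsers location.hash is used as is.
var hash = navigator.userAgent.indexOf( 'Firefox' ) === -1 ? location.hash :
    decodeURIComponent( location.hash );
// Only act when the fragment contains a space.
if ( hash.indexOf( ' ' ) !== -1 ) {
    // Drop the leading '#' and turn spaces into underscores to match the section id.
    var restoredFragment = hash.substring( 1 ).replace( / /g, '_' );
    var targetElement = document.getElementById( restoredFragment );
    if ( targetElement ) {
        if ( typeof targetElement.scrollIntoView !== 'undefined' ) {
            targetElement.scrollIntoView();
            // Fix the URL in place without adding a history entry.
            if ( typeof history !== 'undefined' && history.replaceState ) {
                history.replaceState( {}, '', '#' + restoredFragment );
            }
        } else {
            // Fallback: let the browser jump to the corrected fragment.
            location.hash = '#' + restoredFragment;
        }
    }
}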

Sure, you can do a lot in javascript. I'm talking about adding this to the engine, for everyone.

Why would there be a space to begin with?

TheDJ, I'm talking about spaces between the words, not at the beginning.

That's what I meant. Why are there spaces? Spaces are not valid characters in the anchor; we use underscores. There is no ToC entry with a space, as far as I can tell.

I'm talking about section names. They can include spaces, just like page names, and underscores in page names can be replaced by spaces in URLs.

Perhaps an example link will help.

Sure. The link

https://ru.wikipedia.org/wiki/История_Википедии#Формулирование_концепции

works. The link

https://ru.wikipedia.org/wiki/История Википедии#Формулирование_концепции

works. The link

https://ru.wikipedia.org/wiki/История_Википедии#Формулирование концепции

does not work. The link

https://ru.wikipedia.org/wiki/История Википедии#Формулирование концепции

does not work. These last links open just the page, not the section. Thank you.

How would you ever reach such a link? A link by definition does not have any spaces.

The link

https://ru.wikipedia.org/wiki/История_Википедии#Формулирование концепции

does not work.

As far as I know this has always been the case in any language. E.g. https://en.wikipedia.org/wiki/Bob_Dylan#Life_and_career vs https://en.wikipedia.org/wiki/Bob_Dylan#Life and career. The server will normalize the page title (e.g. https://en.wikipedia.org/wiki/Bob Dylan becomes https://en.wikipedia.org/wiki/Bob_Dylan) but I think everything after the fragment identifier (#) is interpreted only by the browser.
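A small sketch of why the server cannot normalize the fragment the way it normalizes the page title: the fragment is not part of the request target a browser sends. The helper `requestTarget` is hypothetical; it uses the standard `URL` class to show which part of an address would reach the server.

```javascript
// Hypothetical helper: the part of an address a browser would send to the server.
// The fragment (everything from '#') is kept client-side and is not included.
function requestTarget(href) {
    const url = new URL(href);
    return url.pathname + url.search;
}

const withUnderscores = requestTarget('https://en.wikipedia.org/wiki/Bob_Dylan#Life_and_career');
const withSpaces = requestTarget('https://en.wikipedia.org/wiki/Bob_Dylan#Life and career');

// Both print the same path, so the server sees no difference between the two links.
console.log(withUnderscores);
console.log(withSpaces);
```

Since both addresses produce the same request, any fix for spaces in the fragment has to run in the browser, for example as client-side JavaScript.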

Just as @TheDJ said:

Spaces are not valid characters in the anchor, we use underscores

How would you ever reach such a link? A link by definition does not have any spaces.

Manually in the address field of the browser.

The server will normalize the page title (e.g. https://en.wikipedia.org/wiki/Bob Dylan becomes https://en.wikipedia.org/wiki/Bob_Dylan) but I think everything after the fragment identifier (#) is interpreted only by the browser.

So, you mean it's not possible in the same way as page name conversion, did I understand right?

So, you mean it's not possible in the same way as page name conversion, did I understand right?

Correct. In other words, I don't think this is a bug. If this is a big concern, consider adding the JavaScript in T75092#3753064 to your wiki.

Thank you all for the help.

Tried the script on Android, does not work. )-: