
inconsistent treatment of character entities and invalid characters in titles/links
Closed, Invalid (Public)

Description

Author: timwi

Description:
BUG MIGRATED FROM SOURCEFORGE
http://sourceforge.net/tracker/index.php?func=detail&aid=830206&group_id=34373&atid=411192
Originally submitted by Luc Van Oostenryck (looxix) 2003-10-25 20:39

On fr: there is an article (a stub in fact) with the
name [[Fonction δ de Dirac]].
It's impossible to rename it and, worse, the software
doesn't detect that the renaming failed, so
the redirection page is still created with a bad name
[[Fonction %CE%B4 de Dirac]].

-- Looxix

------------------------ Additional comments ------------------------

Date: 2003-12-10 11:36
Sender: SF user vibber

Copying text of #856267, marked as duplicate of this:

There are several ways to write a wikilink with a
superscript-2 in the destination article text:

[[User:Finlay McWalter:sandbox:m²]]

[[User:Finlay_McWalter:sandbox:m%26sup2]]

[[User:Finlay_McWalter:sandbox:m%26sup2;]]

[[User:Finlay_McWalter:sandbox:m%26sup2%3b]]

Of these, the top two resolve to the same page, and
each of the latter two resolves to a brand new page.
All three have the same article title, despite being
different articles as far as the database is concerned.

So creating the two latter pages in the above list
produced the following watchlist fragment:

NM 15:09 User:Finlay McWalter:sandbox:m² (cur; hist) . . Finlay McWalter (Talk) (another tmp page)
M 15:08 Current events (cur; hist) . . Menchi (Talk) (typo)
NM 15:08 User:Finlay McWalter:sandbox:m² (cur; hist) . . Finlay McWalter (Talk) (created (superscript in URLs thing))

So it sure looks like the "new article" code should
resolve the escaping of characters to produce the
canonical article name.
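
The four link forms listed above can be compared with a small Python sketch. The function name `canonical_title` and the exact steps (percent-decode, then entity-decode, then treat underscores as spaces) are assumptions made for illustration; this is not MediaWiki's actual title code.

```python
import html
from urllib.parse import unquote


def canonical_title(link_target: str) -> str:
    """Reduce a wikilink target to one canonical form (illustrative only)."""
    text = unquote(link_target)       # %26 -> &, %3b -> ;
    text = html.unescape(text)        # &sup2; (or legacy &sup2) -> ²
    text = text.replace("_", " ")     # underscore and space are equivalent
    return " ".join(text.split())     # trim and collapse runs of spaces


forms = [
    "User:Finlay McWalter:sandbox:m²",
    "User:Finlay_McWalter:sandbox:m%26sup2",
    "User:Finlay_McWalter:sandbox:m%26sup2;",
    "User:Finlay_McWalter:sandbox:m%26sup2%3b",
]
# If escaping is resolved before the title is stored, only one title remains.
titles = {canonical_title(form) for form in forms}
```

Under these assumptions all four forms collapse to a single title, which is the behaviour the report asks of the "new article" code.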

I'm [[User:Finlay McWalter]] on the English Wikipedia.


Version: unspecified
Severity: normal

Details

Reference
bz337

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 6:49 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz337.
bzimport added a subscriber: Unknown Object (MLST).
*** Bug 462 has been marked as a duplicate of this bug. ***

apb wrote:

See test cases at [[:test:Bug462]]

*** Bug 631 has been marked as a duplicate of this bug. ***

wmahan_04 wrote:

(In reply to comment #2)

See test cases at [[:test:Bug462]]

I fixed your self links example in HEAD. It looks like all your other
examples either have been fixed, or are arguably expected behavior. I
think I disagree that "Foo bar" and "Foo_bar" should ever refer to different
articles.

apb wrote:

(In reply to comment #4)

(In reply to comment #2)

See test cases at [[:test:Bug462]]

I fixed your self links example in HEAD. It looks like all your other
examples either have been fixed, or are arguably expected behavior. I
think I disagree that "Foo bar" and "Foo_bar" should ever refer to different
articles.

Most of the bugs described at [[:test:Bug462]] are still present. I have
updated the page in an attempt to make it clearer.

I don't want "[[Foo bar]]" and "[[Foo_bar]]" to be different. I do want
http://server/wiki/Foo_bar and http://server/wiki/Foo%20bar to be different,
with the latter page being accessible via "[[Foo_bar]]".

(In reply to comment #5)

I don't want "[[Foo bar]]" and "[[Foo_bar]]" to be different. I do want
http://server/wiki/Foo_bar and http://server/wiki/Foo%20bar to be different,
with the latter page being accessible via "[[Foo_bar]]".

Would you mind explaining the logic behind this? I'm quite boggled.

apb wrote:

Suggested fix

  1. The parser should first examine the raw wikitext, looking for links in square brackets.
  2. For each link, the canonicalisation algorithm should be performed (ignore leading and trailing spaces, treat space and underline as the same, etc.).
  3. After that canonicalisation step, HTML entities (&amp;, &#123;, etc.) should be mapped to the corresponding unicode characters.

The existing observed behaviour is consistent with step 3 being done first instead of last.

Entity to unicode conversion must come before canonicalization on internal links in order to perform whitespace matching
and case conversion.
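
The disagreement about ordering can be made concrete with a short Python sketch. The helpers `canonicalise` and `decode_entities` are hypothetical stand-ins for the two steps under discussion, and real title handling does more than this.

```python
import html


def canonicalise(title: str) -> str:
    """Trim, treat underscores as spaces, collapse whitespace, capitalise."""
    text = " ".join(title.replace("_", " ").split())
    return text[:1].upper() + text[1:]


def decode_entities(title: str) -> str:
    """Map HTML character references to the characters they name."""
    return html.unescape(title)


raw = "&#32;foo&#95;bar"  # entity-encoded leading space and underscore

# Order proposed in the suggested fix: canonicalise, then decode.
canonical_first = decode_entities(canonicalise(raw))
# Order argued for in the reply: decode, then canonicalise.
decode_first = canonicalise(decode_entities(raw))
```

With canonicalisation first, the decoded space and underscore escape whitespace matching and the first letter is never capitalised. Decoding first gives the same `Foo bar` that a plain `[[foo bar]]` link would.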

apb wrote:

(In reply to comment #6)

(In reply to comment #5)

I don't want "[[Foo bar]]" and "[[Foo_bar]]" to be different. I do want
http://server/wiki/Foo_bar and http://server/wiki/Foo%20bar to be different,
with the latter page being accessible via "[[Foo_bar]]".

Would you mind explaining the logic behind this? I'm quite boggled.

Major premise: All characters should be allowed in page names,
even if it is difficult to use some characters.

Minor premise: Numeric entity refs are a good way of referring to
characters that are otherwise difficult to include in a page name.

Almost all my other arguments follow from that.

rowan.collins wrote:

One of the major arguments for "%20" being treated the same as "_" (and this may
apply for some other examples, too) is that typing "en.wikipedia.org/wiki/Name
of a page" into the address bar of a web browser will be converted, by the
browser, to "en.wikipedia.org/wiki/Name%20of%20a%20page". Now we can be
99.9999999% sure that what the user was after was the page
"en.wikipedia.org/wiki/Name_of_a_page", since that is what they would get if
they typed [[Name of a page]] in the text of an article; thus, it's pretty clear
to me that we should never have an article whose literal title is
"Name%20of%20a%20page".

As far as I can see, the treatment of spaces and underscores is currently a)
completely consistent; and b) consistent in a very useful manner: it is
impossible to create an article whose title looks different from the only way of
actually linking to it. Such an article would be an absolute nightmare to
maintain (page moves, deletion, just plain trying to link there and not
happening to use the same escape sequence as the original author). In my
opinion, this goes for the other "problem" characters too: if they're illegal in
titles, they should be illegal in titles; but I grant that some, like leading
'/' or '#' could conceivably be useful. It seems to me, though, that having to
use some unnatural escape sequence whenever you need to refer to an article is
going to create more head-aches than it will solve (think newbies...).
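
The address-bar behaviour described in this comment can be sketched in Python. The helper `url_path_to_title_key` is a hypothetical name, and the mapping shown is only the space/underscore part of the argument, not a full URL router.

```python
from urllib.parse import quote, unquote


def url_path_to_title_key(path_segment: str) -> str:
    """Map the page part of a URL to an underscore-separated title key."""
    return unquote(path_segment).replace(" ", "_")


typed = "Name of a page"
sent_by_browser = quote(typed)                 # spaces become %20
key = url_path_to_title_key(sent_by_browser)   # same key as the "_" form
```

Because both `Name%20of%20a%20page` and `Name_of_a_page` map to one key here, no article with a literal `%20` in its title can exist under this scheme.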

Re-casting the problem, I wonder if a mechanism to display the page's title (in
the HTML output) as something different from its name (in the database) could be
created, which showed the real name (as needed for linking to the article)
underneath:
<h1>C#</h1>
<p><small>[Article title: C_sharp]</small></p>
Except I'm not sure how to label the second line so that it would make sense to
inexperienced users. My thought is that this could be a magic word at the
beginning of the article: '#TITLE C#'; similarly, one could use '#TITLE h2g2' to
display the lower-case leading letter on a wiki where this was otherwise not
possible.
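
A minimal Python sketch of the proposal, assuming the '#TITLE' line must be the very first line of the article text. The function `render_heading` and this parsing rule are hypothetical; no such magic word is confirmed to exist in this form.

```python
import html


def render_heading(db_name: str, wikitext: str) -> str:
    """Build heading HTML, honouring a leading '#TITLE ...' line if present."""
    prefix = "#TITLE "
    first_line, _, _ = wikitext.partition("\n")
    default = db_name.replace("_", " ")
    display = first_line[len(prefix):].strip() if first_line.startswith(prefix) else default
    heading = f"<h1>{html.escape(display)}</h1>"
    if display != default:
        # Show the real name underneath, since links must use it.
        heading += f"\n<p><small>[Article title: {html.escape(db_name)}]</small></p>"
    return heading


c_sharp = render_heading("C_sharp", "#TITLE C#\nC# is a programming language.")
plain = render_heading("Foo_bar", "Some article text.")
```

Pages without the magic word keep their database name as the heading, so the second line only appears where the two differ.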

apb wrote:

(In reply to comment #10)

One of the major arguments for "%20" being treated the same as "_" (and this may
apply for some other examples, too) is that typing "en.wikipedia.org/wiki/Name
of a page" into the address bar of a web browser will be converted, by the
browser, to "en.wikipedia.org/wiki/Name%20of%20a%20page". Now we can be
99.9999999% sure that what the user was after was the page
"en.wikipedia.org/wiki/Name_of_a_page", since that is what they would get if
they typed [[Name of a page]] in the text of an article; thus, it's pretty clear
to me that we should never have an article whose literal title is
"Name%20of%20a%20page".

OK, I see your point, but I would expect to get an error if I attempted
to browse to the wrong URL by using %20 instead of underline as a word
separator.

Re-casting the problem, I wonder if a mechanism to display the page's title (in
the HTML output) as something different from its name (in the database) could be
created, which showed the real name (as needed for linking to the article)
underneath:

Yes, that would be fine. If, in the wikitext for http://en.wikipedia.org/wiki/C_plus_plus
and http://en.wikipedia.org/wiki/H2gh,
I could say "#TITLE C++" and "#TITLE h2gh", and if that modified the
<TITLE> and <H1> elements of the HTML output, then I wouldn't mind that
the articles are filed in the database under slightly incorrect names.

<h1>C#</h1>
<p><small>[Article title: C_sharp]</small></p>
Except I'm not sure how to label the second line so that it would make sense to
inexperienced users.

Perhaps "To link to this article, use [[C sharp]]." Put it as close to the H1
heading as possible, and use a stylesheet to hide it in print media.

See [[:en:Template:Wrongtitle]] and [[:en:Wikipedia:Naming conventions (technical restrictions)]]
(and the corresponding talk pages) for relevant discussion.

Please continue the alternate title display discussion at bug 496, where it is on-topic.

rowan.collins wrote:

In the discussion for bug 707, someone spotted that (in 1.3.x) one can use links
such as [[foo<nowiki>+</nowiki>bar]], and they will be treated as valid links,
with the characters in question not being escaped in any way. This is rather
handy for interwiki-links (as discussed there) but it hints at something rather
odd going on, and creates strange behaviour for an internal link:
[[foo<nowiki>+</nowiki>bar]] produces an edit link to [[Foo_bar]], for instance.
What's more, the version running on the test server doesn't deal at all well
with this markup, leaving un-replaced placeholders: see
http://test.wikipedia.org/wiki/Bug707

I know this isn't exactly the same as what we've been talking about so far, but
it's certainly a related issue: how *should* such markup be treated?

(In reply to comment #11)

OK, I see your point, but I would expect to get an error if I attempted
to browse to the wrong URL by using %20 instead of underline as a word
separator.

But that's a developer's way of seeing it, not a user's: as far as the user is
concerned, words are separated by spaces in links, and so they will type them
separated by spaces in the URL. They may never notice that in one " " becomes
"_" and in the other " " becomes "%20", and certainly don't care; they have no
conception that they are "using %20 instead of underline as a word separator."

(In reply to comment #12)

Please continue the alternate title display discussion at bug 496, where it is
on-topic.

My apologies: I should have thought to search for existing bugs relating to this
suggestion; I've copied those comments there.

wmahan_04 wrote:

(In reply to comment #13)

[[foo<nowiki>+</nowiki>bar]] produces an edit link to [[Foo_bar]], for instance.
What's more, the version running on the test server doesn't deal at all well
with this markup, leaving un-replaced placeholders: see
http://test.wikipedia.org/wiki/Bug707

This should be fixed in HEAD; thanks for pointing that out.

bugzillas+padREMOVETHISdu wrote:

http://test.wikipedia.org/wiki/Bug707 currently produces this HTML:

<ul>
<li>[[foo+bar]]</li>
<li>[[C++]]</li>
<li><!--IWLINK 0--></li>
<li>[[meta:foo+bar]]</li>
</ul>

The third line is obviously a bug irrespective of how the others are treated.

bugzillas+padREMOVETHISdu wrote:

[[en:User:SirJective/Parenthesis]] has another example of a problematic
link/title. I've described the workarounds in the talk page. IMHO the user
shouldn't've been allowed to create the page:
613 commandments ( ''mitzvot'' )
in the first place. Having a page with such a title which must be linked only as:
613 commandments %28 %27%27mitzvot%27%27 %29
or similar is undesirable.

PS: probably my previous comment adds nothing more to what was already said
(though I couldn't understand what "unreplaced placeholder" meant). Sorry about
that.

bugzillas+padREMOVETHISdu wrote:

Oops! Another goof up & another spam from me :(. The link is
[[en:User:SirJective/Parenthesis/other]]. Bugzilla should also have a preview
feature like mediawiki :).

*** Bug 2096 has been marked as a duplicate of this bug. ***

gangleri wrote:

This bug is still open:

See [[en:User:Gangleri/tests/bugzilla:00337]] about [[&rlm;]] (this is
[[&amp;rlm;]]) and generates http://en.wikipedia.org/wiki/%E2%80%8F .

gangleri wrote:

&lrm; &rlm; &#8234; &#8235; &#8236; &#8237; &#8238; alone do not make much
sense for titles. I would say these are more or less "whitespace".

Regards Reinhardt [[user:gangleri]]
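
Treating these directional marks as whitespace could look like the following Python sketch. The set `BIDI_CONTROLS` covers only the seven characters named in the comment, and `strip_bidi_controls` is a hypothetical helper, not existing MediaWiki behaviour.

```python
import html
from urllib.parse import quote

# LRM, RLM, and the embedding/override controls U+202A..U+202E
# (&#8234; .. &#8238; in decimal).
BIDI_CONTROLS = {0x200E, 0x200F, *range(0x202A, 0x202F)}


def strip_bidi_controls(title: str) -> str:
    """Drop directional control characters, treating them like whitespace."""
    return "".join(ch for ch in title if ord(ch) not in BIDI_CONTROLS).strip()


rlm = html.unescape("&rlm;")    # U+200F RIGHT-TO-LEFT MARK
rlm_in_url = quote(rlm)         # percent-encoded UTF-8 bytes of U+200F
stripped = strip_bidi_controls(html.unescape("&lrm;Foo&#8238;"))
```

A title consisting only of such marks becomes empty after stripping, so it could be rejected the same way an all-whitespace title is.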

gangleri wrote:

changed Component to "Page rendering"
bug 462: numeric entity references for problematic characters
is no longer a duplicate of this bug

opened an unsolved issue at
bug 4250: Escaped generation of [[foo|bar]] does not render properly
Please read comments about it at bug 462 comment 2.

best regards reinhardt [[user:gangleri]]

gangleri wrote:

(In reply to comment #10)

... Such an article would be an absolute nightmare to
maintain (page moves, deletion, just plain trying to link there and not
happening to use the same escape sequence as the original author). In my
opinion, this goes for the other "problem" characters too: if they're illegal in
titles, they should be illegal in titles; but I grant that some, like leading
'/' or '#' could conceivably be useful. It seems to me, though, that having to
use some unnatural escape sequence whenever you need to refer to an article is
going to create more head-aches than it will solve (think newbies...).

I agree: "would be an *absolute* *nightmare* to *maintain* (page moves,
deletion, just plain trying to link there and not happening to use the same
escape sequence as the original author)."
Regarding "illegal characters", see below.

I agree: "if they're illegal in titles, they should be illegal in titles; but I
grant that some, like leading '/' or '#' could conceivably be useful."

Re-casting the problem, I wonder if a mechanism to display the page's title (in
the HTML output) as something different from its name (in the database) could be
created, which showed the real name (as needed for linking to the article)
underneath:

I cannot (I do not like to) provide / propose a "markup" here but with the
examples from below it should be possible to solve this with <charinsert>.


I made some test cases for the original links from comment 0 at
http://yi.wiktionary.org/w/index.php?title=project:bugzilla/00337&oldid=6893#m.C2.B2
.

During the various previews I made in order to generate the testcase I realised
that it *is* possible to generate titles containing characters which are *not*
allowed in titles. I know also the method to create them as first character (and
also to generate titles starting with lowercase letters).

Please do not misunderstand me. I do not like to *hack* MediaWiki - I only
want to report what I have seen. I also want to refer to various requests (cannot
find all bug numbers now)

  • allow titles starting with lowercase letters
    • bug 496: Override title text and formatting from page markup
    • bug 2118: patch to let mediawiki display the title lowercase in wgCapitalLinks mode
  • allow titles containing the characters which are *not* allowed in titles

These are requests made by others, not by me.

Before describing the method I want to point at two issues:

  1. Would the "normalisation function" be stable enough to be applied multiple times, given how the code / implementation of the whole package is *now*? Otherwise changing and maintaining the code would be a *nightmare*, as Rowan said.
  2. What benefit would the users have if there is a tricky way to generate the titles that they want (all using %nn coding) but they do not have the keyboard / knowledge / skills to generate these easily and / or to refer / link to them easily?

The *new* issue for me was that %nn is a method to generate the characters which
are not allowed in titles. &nn alone would not work as "first characters" but you
/ we could use for example *one* and *only* one leading Unicode Character ZERO
WIDTH SPACE - U+200B
http://www.fileformat.info/info/unicode/char/200b/index.htm
HTML Entity (decimal) &#8203; (hex) &#x200b;
UTF-8 (hex) 0xE2 0x80 0x8B (e2808b) %E2%80%8B %e2%80%8b

There are requirements (bug reports) to disallow certain characters. If ZERO
WIDTH SPACE were disallowed as well, it might be wise to allow it *only*
a) before the characters which are *not* allowed in titles
b) before a lower case letter
These are simple rules.
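
The two rules can be sketched in Python as follows. The set `ASSUMED_INVALID` is an assumption chosen for illustration (the real list of disallowed title characters depends on the wiki's configuration), and `zwsp_allowed` is a hypothetical helper.

```python
from urllib.parse import quote

ZWSP = "\u200b"  # ZERO WIDTH SPACE

utf8_hex = ZWSP.encode("utf-8").hex()   # UTF-8 bytes as hex
percent_form = quote(ZWSP)              # the %nn form usable in a URL

# Assumed set of characters that are not allowed in (or at the start of)
# titles; illustrative only.
ASSUMED_INVALID = set("#<>[]|{}/?:")


def zwsp_allowed(title: str) -> bool:
    """Accept at most one ZWSP, only as the first character, and only when it
    precedes an otherwise-invalid character or a lower-case letter."""
    if ZWSP not in title:
        return True
    if title.count(ZWSP) != 1 or not title.startswith(ZWSP) or len(title) < 2:
        return False
    following = title[1]
    return following in ASSUMED_INVALID or following.islower()
```

For example, `zwsp_allowed(ZWSP + "h2g2")` and `zwsp_allowed(ZWSP + "#")` are accepted under these rules, while a ZWSP before an upper-case letter or in the middle of a title is rejected.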

Made some tests at http://test.leuksman.com/view/Category:Bugzilla/00337 .
The titles there "look" like "/", starting "?", starting ":" etc.
Was not able to find a way to generate a title that "looks" like "/".

best regards reinhardt [[user:gangleri]]

avarab wrote:

changed summary: "illegal" => "invalid", the characters in question are invalid,
they are not a violation of the law.

gangleri wrote:

(In reply to comment #22)

I made some testcase for the original links from comment 0 at

http://yi.wiktionary.org/w/index.php?title=project:bugzilla/00337&oldid=6893#m.C2.B2

http://yi.wiktionary.org/w/index.php?title=project:bugzilla/00337&oldid=6904#m.C2.B2

Was not able to find a way to generate a title that "looks" like "/".

Was not able to find a way to generate a title that "looks" like "#".
There is an example which should *not* break apache's using "&#8203;/" [[&#8203;/]].

*** Bug 5731 has been marked as a duplicate of this bug. ***
*** Bug 6932 has been marked as a duplicate of this bug. ***

I think all the relevant bits got separated out to other bugs (and most if not all fixed) over the years. The core premise of this bug seems to have been a request to do things in the *opposite* order from what we want to be doing (comment 7, 9).

Resolving INVALID.