Page MenuHomePhabricator

Order of title recognition and bold / italic rendering should be reversed
Closed, ResolvedPublic

Description

Author: gangleri

Description:
Hallo!

This bug is about titles as
[[de:G''t]], [[foo'''bar]], [[usage of ''''']] etc.

test cases are available at
http://de.wikipedia.org/wiki/user:Gangleri/tests/G%27%27t
http://de.wikipedia.org/wiki/user:Gangleri/tests/foo%27%27%27bar
http://de.wikipedia.org/wiki/user:Gangleri/tests/usage_of_%27%27%27%27%27

I assume that bold / italic rendering should be done *after* title recognition;
that title recognition should *not* be influenced (should not depend on) by bold
/ italic rendering wiki syntax.

Please create also corresponding parser tests.

best regards reinhardt [[user:gangleri]]


Version: unspecified
Severity: minor
URL: http://de.wikipedia.org/wiki/Benutzer:Gangleri/tests/G%27%27t

Details

Reference
bz4598

Event Timeline

bzimport raised the priority of this task from to High.Nov 21 2014, 9:04 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz4598.
bzimport added a subscriber: Unknown Object (MLST).

gangleri wrote:

Bug 4601: Exact search for titles using php meta / escape characters
is about searching these titles.

MediaZilla allows wiki links to the testcases:
[[de:user:Gangleri/tests/G''t]]
[[de:user:Gangleri/tests/foo'''bar]]
[[de:user:Gangleri/tests/usage of ''''']]

would be great to get a parsertest and backport the fix in 1.5

Created attachment 1638
patch swapping quote and links renderers

A simple patch that swap the renderers. The parser tests looks ok.

Attached:

  • Bug 5454 has been marked as a duplicate of this bug. ***

commited in branches/hashar@14022

Parser test added on trunk in r14425.
Patch fails, breaking existing valid markup.

  • Bug 6783 has been marked as a duplicate of this bug. ***

wmahan_04 wrote:

intrusive fix using new Placeholders class

To summarize the problem: we currently parse quotes before links, so something
like [[foo ''bar'']] becomes [[foo <i>bar</i>]] after parsing quotes, which is
wrong.

But if we parsed links before quotes, things like ''[[foo bar|formatting
''here'']]'' would break, because parsing links removes the whole link and
replaces it by a placeholder; we get ''<!--LINK 0-->'' with no chance to parse
all the quotes together.

One idea for fixing this is to use more flexible link placeholders that can
contain text. So ''[[foo bar|formatting ''here'']]'' becomes ''<placeholder
i="0">formatting ''here''</placeholder>'', and we can then parse the quotes.
Text is only put inside the placeholders for piped links.

This patch creates a new class called Placeholders implementing what I
described above. Parser.php currently implements several different placeholder
mechanisms, so the new class is intended to be a step toward organizing them.
With the patch, internal links are parsed before quotes.

The patch fixes the test case for this bug without breaking any of the other
parser tests. It appears to be a little slower than without the patch on pages
with lots of links, but I haven't attempted to optimize it.

attachment 4598.diff ignored as obsolete

wmahan_04 wrote:

intrusive fix using new Placeholders class

To summarize the problem: we currently parse quotes before links, so something
like [[foo ''bar'']] becomes [[foo <i>bar</i>]] after parsing quotes, which is
wrong.

But if we parsed links before quotes, things like ''[[foo bar|formatting
''here'']]'' would break, because parsing links removes the whole link and
replaces it by a placeholder; we get ''<!--LINK 0-->'' with no chance to parse
all the quotes together.

One idea for fixing this is to use more flexible link placeholders that can
contain text. So ''[[foo bar|formatting ''here'']]'' becomes ''<placeholder
i="0">formatting ''here''</placeholder>'', and we can then parse the quotes.
Text is only put inside the placeholders for piped links.

This patch creates a new class called Placeholders implementing what I
described above. Parser.php currently implements several different placeholder
mechanisms, so the new class is intended to be a step toward organizing them.
With the patch, internal links are parsed before quotes.

The patch fixes the test case for this bug without breaking any of the other
parser tests. It appears to be a little slower than without the patch on pages
with lots of links, but I haven't attempted to optimize it.

attachment 4598.diff ignored as obsolete

ayg wrote:

(In reply to comment #0)

I assume that bold / italic rendering should be done *after* title recognition;
that title recognition should *not* be influenced (should not depend on) by bold
/ italic rendering wiki syntax.

I'm not sure about that. In particular, if one word in a title would be
italicized in normal prose, and another wouldn't, then something like
[[Criticism of ''Harry Potter'']] would likely be intended to be interpreted as
[[Criticism of Harry Potter|Criticism of ''Harry Potter'']], not [[Criticism of
''Harry Potter''|Criticism of ''Harry Potter'']]. Do we want to assume the
latter? Why? A good patch should allow either to be typed out in the full
[[link|display]] form regardless, so this should be a relatively minor point.

(I know Neapolitan wants to link to '' a lot, but those problems extend beyond
links.)

(In reply to comment #9)

One idea for fixing this is to use more flexible link placeholders that can
contain text. So ''[[foo bar|formatting ''here'']]'' becomes ''<placeholder
i="0">formatting ''here''</placeholder>'', and we can then parse the quotes.
Text is only put inside the placeholders for piped links.

Clever solution. If we want to parse the display text, don't hide it! I'm
surprised no bugs crop up when you shift around the parsing order like that,
though; the patch will need to be tested on a wide range of pages.

wmahan_04 wrote:

(Sorry for the double post and poor formatting above.)

(In reply to comment #10)

[[Criticism of ''Harry Potter'']] would likely be intended to be interpreted as
[[Criticism of Harry Potter|Criticism of ''Harry Potter'']], not [[Criticism of
''Harry Potter''|Criticism of ''Harry Potter'']]. Do we want to assume the
latter? Why? A good patch should allow either to be typed out in the full
[[link|display]] form regardless, so this should be a relatively minor point.

With my approach it's possible to have either interpretation, but after
thinking about it I chose the latter because 1) it matched the existing parser
test case and 2) I think the former wouldn't allow straightforward linking
to an article that happened to have four quotes in the title, without putting
unwanted formatting in the link. I think it's logical and in line with
existing practice to insist that if formatting is desired within a link, it must
be a piped link. I don't feel strongly about this, though.

surprised no bugs crop up when you shift around the parsing order like that,
though; the patch will need to be tested on a wide range of pages.

Indeed. As I mentioned, the patch passes the existing tests. The switch in style
from <!--LINK 0--> to <placeholder i="0"></placeholder> was to resolve one issue
that came up with definition lists. We have to be careful to ignore colons within
links in cases like

;[[foo|bar: baz]]: definition

Fortunately, there was already a mechanism to skip colons within matching pairs
of tags.

wmahan_04 wrote:

Updated fix that corrects undefined variable notices

Attached:

  • Bug 3749 has been marked as a duplicate of this bug. ***
  • Bug 11707 has been marked as a duplicate of this bug. ***