Page MenuHomePhabricator

Linebreaks are mishandled in <blockquote> and <li>
Closed, ResolvedPublic

Description

Author: ayg

Description:
See the wikitext of the URL. I asked about it on [[WP:VPT]], and Rob implied
this had been discussed before, but I couldn't find anything either here or
there, and he didn't clarify when asked, so I'm filing this as a bug.


Version: unspecified
Severity: minor
URL: http://en.wikipedia.org/wiki/User:Simetrical/6200
See Also:
T53086
T7718

Details

Reference
bz6200

Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 9:22 PM
bzimport set Reference to bz6200.

michael wrote:

Definitely a bug. Blockquotes shouldn't interfere with
wikitext parsing, just as divs don't.

ayg wrote:

Clarification of incorrect behavior: linebreaks are completely ignored when
placed directly in any of <blockquote>, <table>, <ul>, <ol>, <li>, <p>, <tr>,
<h1/2/3/4/5/6>. They're treated correctly when placed in <div>, <td>, <th>, or
any inline element. The problem appears to be somewhere in Parser.php lines
2076 to 2096; commenting these (and respective closing braces, etc.) out fixes
the problem in my test case, although God only knows what else it does.

ayg wrote:

One-line proposed patch againt r16063

This patch appears to fix the problem: it basically just removes the "not in a
block" condition from the code that outputs paragraphs. The only question is
if that could break something. I notice nothing odd with the patched version
in a local copy of [[Paris]], which is a fairly typical example of a long
enwiki article in terms of formatting.

attachment 6200a.patch ignored as obsolete

robchur wrote:

*** Bug 7060 has been marked as a duplicate of this bug. ***

michael wrote:

Line breaks *must* be ignored in <table>, <ul>, <ol>, and <tr> tags, because paragraphs or plain text aren't supposed
to appear directly within them. <H>eadings should only contain inline content, so line breaks ought to be ignored in them
too.

<Blockquote> elements should definitely behave just like <div>s. <Li> tags can have block or inline content too, but
changing their current behaviour might break existing pages -- some research would be required.

ayg wrote:

(In reply to comment #5)

Line breaks *must* be ignored in <table>, <ul>, <ol>, and <tr> tags, because

paragraphs or plain text aren't supposed

to appear directly within them. <H>eadings should only contain inline

content, so line breaks ought to be ignored in them

too.

Ah, right. That's probably the reason for the current behavior.

<Blockquote> elements should definitely behave just like <div>s. <Li> tags

can have block or inline content too, but

changing their current behaviour might break existing pages -- some research

would be required.

Why do you suspect that for <li> and not for <blockquote>?

michael wrote:

The main difference is that list items are almost always entered using wikitext * or # bullets. Blockquotes can only
be generated by typing HTML code.

Come to think of it, line breaks can't really be entered in lists generated using # or *, since they signal the end of a
list item.

Real HTML <li>[...]</li> tags are very rarely used, and then only by HTML-savvy editors who want to apply a class
or style attribute, or by editors who cut-and-paste some existing HTML code. Since block or inline content is
allowed in list items, this would be the way to allow correct and more complex formatting of the contents of list
items, but at the cost of complex and unusual markup in the edit field, incompatible with wikitext lists.

ayg wrote:

Alternative patch (against 16144), changes only <blockquote>

It doesn't seem like <li>s are parsed this way for lists, but it seems they
*are* parsed for the TOC, so <li> will have to remain on the list for the time
being. As you say, it's not a terrible loss.

attachment 6200b.patch ignored as obsolete

ayg wrote:

*** Bug 7262 has been marked as a duplicate of this bug. ***

morten wrote:

Thank you for the patch. It works like a charm. I hope this is generally adopted into
MediaWiki.

ayg wrote:

That's reviewed, then, I suppose.

rotemliss wrote:

(In reply to comment #11)

That's reviewed, then, I suppose.

The fact it works doesn't mean it doesn't break other things. (I don't know if
it does.)

michael wrote:

Is this a duplicate of Bug 1857?

ayg wrote:

*** Bug 1857 has been marked as a duplicate of this bug. ***

ui2t5v002 wrote:

When will the patch go live on the English Wikipedia?

ayg wrote:

Well, first of all it has to be applied, which it hasn't been (otherwise this would be RESOLVED FIXED with a
revision number noted in the last comment). I did try applying it, but it gave me XHTML validity errors and
on inspection it totally screwed up tag nesting. I'll probably get around to poking it again at some point in
the indefinite future, if that helps.

admin wrote:

็How about fixing by using function nl2br? or using ereg function to get to word
from <blockquote> then use explode function to turn all word
into array by \n and then echo it out with <p> and </p> per array index.

ui2t5v002 wrote:

Is Bug 6419 related?

ayg wrote:

I'm quite sure it's not.

ui2t5v002 wrote:

Is there a way to work around this in quotation templates? Putting a table or div or something inside the blockquote?

ayg wrote:

Yes, <blockquote><div> seems to do it. Good idea, now we can use semantic markup without the requirement of littering <p>'s everywhere manually.

ui2t5v002 wrote:

(In reply to comment #21)

Yes, <blockquote><div> seems to do it.

Hmmm.. Doesn't work for me.

http://en.wikipedia.org/wiki/User:Omegatron/Sandbox

What, specifically, did you find to work?

ui2t5v002 wrote:

Aha.

This works:

<blockquote><div>
paragraph 1

paragraph 2</div></blockquote>

but this doesn't:

<blockquote><div>paragraph 1

paragraph 2</div></blockquote>

ayg wrote:

It looks to be sensitive to newlines before/after </?div>, but this works:

<blockquote><div>
Foo

bar
</div></blockquote>

This doesn't:

<blockquote><div>Foo

bar</div></blockquote>

ui2t5v002 wrote:

Yeah. For the record, it apparently needs a newline in both places. This:

<blockquote><div>
paragraph 1

paragraph 2

paragraph 3</div></blockquote>

results in:

<blockquote>
<div>
<p>paragraph 1</p>
<p>paragraph 2</p>
paragraph 3</div>
</blockquote>

while this:

<blockquote><div>
paragraph 1

paragraph 2

paragraph 3
</div></blockquote>

results in:

<blockquote>
<div>
<p>paragraph 1</p>
<p>paragraph 2</p>
<p>paragraph 3</p>
</div>
</blockquote>

  • Bug 11282 has been marked as a duplicate of this bug. ***

michael wrote:

Another bug in the default blockquote behaviour: an HTML ins or del inside a blockquote causes a paragraph break to be placed in front of it. The workaround is to nest a div inside the blockquote.

ayg wrote:

Added parser tests in r38624. I have no idea where the special handling is that makes blockquotes different from divs: even if I convert everything in Parser.php that contains the string "blockquote" or "div" to treat them identically, they're still treated differently. Maybe something in Sanitizer.php or something.

Technically, blockquote can only contain block elements (unlike div which can also contain inline elements and text), so the bug happens with invalid markup only. No one takes seriously the block-inline part of the (X)HTML spec, though.

The effect of the newlines is a more general bug: bug 9207.

ayg wrote:

(In reply to comment #29)

Technically, blockquote can only contain block elements (unlike div which can
also contain inline elements and text), so the bug happens with invalid markup
only.

Aha, I had missed that. So what's correct behavior in this case anyway, then? Wrapping inline elements and text in paragraphs, probably?

No one takes seriously the block-inline part of the (X)HTML spec, though.

Well, maybe no one *should*, but we do claim to produce valid XHTML (barring bugs). So I guess we have to.

The effect of the newlines is a more general bug: bug 9207.

That seems more like an opposite bug, doesn't it? Why do you think that bug is more general?

ui2t5v002 wrote:

(In reply to comment #29)

Technically, blockquote can only contain block elements (unlike div which can
also contain inline elements and text), so the bug happens with invalid markup
only.

When we say "newlines" we mean "paragraphs", since a newline in wikicode is converted to paragraph tags when rendered to HTML. Paragraph tags are certainly valid inside blockquotes.

smccandlish wrote:

See Bug #15491; the ins/del issue has been broken out into a separate bug, since it does not appear to be related to this one.

Another classic bug. :) Bumping for next bug day...

blog wrote:

So is anyone actually working on these classic bugs?

ayg wrote:

Nope. Feel free to submit a patch.

smccandlish wrote:

Anyone interested in this bug should probably also take a look at Bug #15491.

Patch fixing this bug (6200) and also bug 15491

Hi! We (CustIS, custis.ru) have also discovered this problem, so I've written a simple patch for <blockquote> parser handling (paragraph end regexp in Parser::doBlockLevels). This patch also fixes bug 15491 (if I understood its point correctly: "<blockquote>a <ins>b</ins> c</blockquote>" does not insert a newline before b anymore), and it also fixes the behaviour of <center> (paragraphs are allowed within <center>).

attachment Y-000-parser-debug-doblocklevels.diff ignored as obsolete

smccandlish wrote:

When's the next "Bug Day"?

(In reply to comment #38)

When's the next "Bug Day"?

Never. The idea never got going.

ayg wrote:

Comment on attachment 7603
Patch fixing this bug (6200) and also bug 15491

Your patch causes XML well-formedness errors -- specifically, it causes the tests "Regression with preformatted in <center>" and "Bug 8293: Use of center tag ruins paragraph formatting" to fail. Specifically, the HTML output of

"""
<center>
Blah
</center>
"""

contains "<p></center>", and the output of

"""
<center>
foo
</center>

bar

baz
"""

contains

"""
<p>foo
</center>
"""

So this patch isn't good, it causes regressions. To run parser tests on your installation so you can check future patches, run this command from the location where you installed MediaWiki:

php maintenance/parserTests.php--quiet

(I've added two new parser tests in r70167 to improve checks for <blockquote>.)

Completely rewritten doBlockLevels for MediaWiki 1.16

Hmm... You're right.

But there are so many problems with doBlockLevels that I think the correct way to fix all them is to rewrite it, and I've even done it now! :) (for 1.16, but is there any difference to trunk?) Patch is attached. It even passes all parser tests with some modifications similar to "</p>\n<table>" --> "</p><table>", which, I suppose, isn't any real difference.

The patch not only fixes newline issues with <blockquote> etc, but also replaces <p>...</p> to <div class="paragraph">...</div> when ... contains <div>'s, which, I think, could be very controversial, but I see no other solution that fixes the following problem:

(example)
"""
A [[Image:B.jpg|right]] C

A [[Image:B.jpg|right]] C
"""
is not split into two paragraphs as each of them contains <div class="floatright">.

(other possible solution is to not use <div> in thumbnail code and replace it with some inline element with display:block style, but this is even uglier and breaks HTML semantics)

attachment Y-000-parser-debug-doblocklevels.diff ignored as obsolete

Completely rewritten doBlockLevels for MediaWiki 1.16

+ minor fix

attachment Y-000-parser-debug-doblocklevels.diff ignored as obsolete

Completely rewritten doBlockLevels for MediaWiki 1.16

Oops, there is another little fix. :)

attachment Y-000-parser-debug-doblocklevels.diff ignored as obsolete

Punting this to the new parser Brion has under development.

*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*

New version of doBlockLevels (something fixed...)

Updated doBlockLevels

attachment Y-000-parser-debug-doblocklevels.diff ignored as obsolete

sumanah wrote:

+need-review to signal to developers that this patch needs reviewing

wicke wrote:

Vitalyi, thank you for your patch! As you probably know, the current parser is very limited in what it can do, so we are trying to develop a new parser that fixes these kinds of issues more systematically.

We try to incorporate your ideas in the new parser, but don't currently have the resources to thoroughly test your patch with real content. Did you already test your patch against the parser tests?

Yes, I did it when I've developed it. All tests passed after some fixing.
We use this parser on all our wikis - some of them are corporate (intranet) ones, some of them are public.
Although parser tests pass, my implementation is still different from current MW parser, and it still can sometimes lead to an unwanted behaviour. For example, in my implementation <div>s (and, for example, <center> also) inside raw <html> blocks may interfere with <div>s outside it, because <html> may be wrapped into a "paragraph", implemented using a <div>. This is easily fixable by extracting all block elements (div, center) from <html>, so doBlockLevels can see them and wrap accordingly.
Also, we had a funny bug with an unclosed <span> element in a pictogram template - some of the markup was mishandled and the image was duplicated on the page :) I didn't had time to test this deeply, just fixed it by adding a closing </span>. I can try to test this if it will be useful :)

Your patch does not appear to be made against SVN head. Please update it so that I can attempt to apply it.

Rewritten doBlockLevels for r103314

Yeah, it was made against 1.16.
Updated it against r103314. The new patch doesn't include the part for cite extension, I'll post it separately.
There are some changes in parser tests. Most of them are related to the space between closing and opening tags (</p>\n<p> --> </p><p>). Some are related to the subject of this bug.
Plus there is one test that I've intentionally didn't change - <html><script>...</script></html> which will now wrap into <div class=paragraph> not <p>.

attachment Parser-new-doblocklevels-r103314.diff ignored as obsolete

Rewritten doBlockLevels for r103314

Oops, fixed one more parser test :)

attachment Parser-new-doblocklevels-r103314.diff ignored as obsolete

Patch for Cite parser tests

Mostly related to space between closing and opening tags, but there's also one test with gallery in which real code was changed (with File:Foobar).

Attached:

smccandlish wrote:

In the code, that's actually '<div class="paragraph">', with a quoted value, right?

Yes, but unquoted in the case of true wgRawHtml.

Updated patch (code-style + one fix), for r103914

New version of my patch, with correct code style, and one bug fixed - preformatting with a space at the line start didn't work right after a list element (if there was no blank line).

+ Discovered another difference to stock parser: a hack in the stock parser allows it to preformat lines in table cells. But because of this hack, for example, the following text:

pre

Parses to the non-expected:

<pre>pre
</pre>

After applying my patch, preformatting inside table cells is now totally disabled... There are no tests about it - is it considered to be a feature?

+ Where can I read about the new parser which is now in development? How will it work? Will it continue to support these hacks? :-)

attachment Parser-new-doblocklevels-r103914.diff ignored as obsolete

smccandlish wrote:

Failure to support explicitly marked-up <pre>...</pre> content as preformatted simply because it's inside a table cell would clearly be a bug. That exact layout situation is common in template documentation.

wicke wrote:

Hi Vitalyi,

thank you for your update- I hope to find time to review your code and test it against real-world wiki source this week. The parser tests do not cover everything, so doing A-B testing on a wiki dump is still the most reliable test.

After applying my patch, preformatting inside table cells is now totally
disabled... There are no tests about it - is it considered to be a feature?

Preformatting in table cells should be supported, and ignoring the indentation for markup-only lines appears to be a sensible choice to me.

+ Where can I read about the new parser which is now in development? How will
it work? Will it continue to support these hacks? :-)

I am currently working on a parser based on a PEG-based tokenizer for a standard HTML5 parser backend. The idea is to map both html tags and wiki syntax to the same tokens, and let the HTML parser construct the DOM tree from the resulting token soup. Some wiki-specific rules are enforced by post-processing the token stream before feeding those to the HTML parser.

Documentation for this effort can be found at http://www.mediawiki.org/wiki/Future/Parser_plan and my user page (http://www.mediawiki.org/wiki/User:GWicke). The (currently Javascript) source is available in the VisualEditor extension in SVN trunk.

(In reply to comment #57)

Failure to support explicitly marked-up <pre>...</pre> content as preformatted
simply because it's inside a table cell would clearly be a bug. That exact
layout situation is common in template documentation.

Could you point us to examples? I'd like to get this added to the parser tests if possible.

wicke wrote:

Vitalyi, I added the test case you described in r103933. If you came across more differences or issues not covered by parser tests (perhaps the one mentioned in comment 49?), then it would be nice to add parser tests for those as well. This benefits both the old and new parser(s).

smccandlish wrote:

Could you point us to examples? I'd like to get this added to the parser tests
if possible.

No, I was responding to Vitalyi here:

After applying my patch, preformatting inside table cells is now totally
disabled... There are no tests about it - is it considered to be a feature?

It couldn't possibly be a feature, since pre inside td is entirely routine markup.

sumanah wrote:

(In reply to comment #58)

+ Where can I read about the new parser which is now in development? How will
it work? Will it continue to support these hacks? :-)

I am currently working on a parser based on a PEG-based tokenizer for a
standard HTML5 parser backend. The idea is to map both html tags and wiki
syntax to the same tokens, and let the HTML parser construct the DOM tree from
the resulting token soup. Some wiki-specific rules are enforced by
post-processing the token stream before feeding those to the HTML parser.

Documentation for this effort can be found at
http://www.mediawiki.org/wiki/Future/Parser_plan and my user page
(http://www.mediawiki.org/wiki/User:GWicke). The (currently Javascript) source
is available in the VisualEditor extension in SVN trunk.

Vitaliy, I also suggest you subscribe to the parser/visual editor mailing list:

http://lists.wikimedia.org/pipermail/wikitext-l/

New version of patch, for r105865

Updated the patch.
There is one fix: it does not wrap <html> contents into paragraphs (<div class=paragraph>) anymore, like the original parser.
But, it still has the issue with preformatting inside table cells.

Attached:

sumanah wrote:

Vitaliy: Thanks for your patch. After you fix the issue with preformatting inside table cells, might you be interested in using developer access to directly suggest it into our Git source control system?

https://www.mediawiki.org/wiki/Developer_access

https://www.mediawiki.org/wiki/Git/Workflow#How_to_submit_a_patch

sumanah wrote:

Mark, do you have a moment to check on whether this patch is still potentially useful? Thanks.

Change 77961 had a related patch set uploaded by Cscott:
Make linebreaks in <blockquote> behave like <div> (bug 6200).

https://gerrit.wikimedia.org/r/77961

  • Bug 51086 has been marked as a duplicate of this bug. ***

Change 77961 merged by jenkins-bot:
Make line breaks in <blockquote> behave like <div> (bug 6200).

https://gerrit.wikimedia.org/r/77961

Note that indented text inside <blockquote> should *not* be treated like <div>; see bug 52763.

Change 78967 had a related patch set uploaded by Subramanya Sastry:
Revert "Make line breaks in <blockquote> behave like <div> (bug 6200)." See bug 52763

https://gerrit.wikimedia.org/r/78967

Change 78967 merged by jenkins-bot:
Revert "Make line breaks in <blockquote> behave like <div> (bug 6200)." See bug 52763

https://gerrit.wikimedia.org/r/78967

From bug 52763.

Change Iecab69460c6aac36acfe2d9440dc5d3590de8b57 for bug 6200 was somewhat
prematurely +2'ed and merged. Subbu has pointed out that another difference
between <blockquote> and <div> is that indented content inside <blockquote> is
*not* considered to start a <pre>.

This bug seeks to revert that part of the <blockquote> behavior change, since a
quick grep through enwiki-20130708 finds that indented text inside <blockquote>

is quite common.

So, this is trickier than what it seems to be. It is not enough to make <blockquote> behave like <div>. Indent-pre in <blockquote> should be turned off (especially because a lot of existing pages that use indents in block-quote will break without it). Or, a convincing case has to be made why supporting indent-pre in <blockquote> is okay to do and then fix existing pages that use indents in blockquote.

Change 78972 had a related patch set uploaded by Cscott:
Make line breaks in <blockquote> behave like <div> (bug 6200).

https://gerrit.wikimedia.org/r/78972

I'm merging bug 52763 and this one; also merging the previous patch for 6200 and the patch I wrote for 52763. Please look at the parser tests carefully to observe the behavior change. Primarily, we are now creating <p> elements inside <blockquote> -- which I believe to be correct (and necessary to fix bug 15491), but caution is warranted.

  • Bug 52763 has been marked as a duplicate of this bug. ***

Change 78972 merged by jenkins-bot:
Make line breaks in <blockquote> behave like <div> (bug 6200).

https://gerrit.wikimedia.org/r/78972

(In reply to comment #76)

Change 78972 merged by jenkins-bot:
Make line breaks in <blockquote> behave like <div> (bug 6200).

https://gerrit.wikimedia.org/r/78972

Patch got merged, can this bug report be closed as RESOVLED FIXED?

I've removed the workarounds from the Quote and Bq templates on en.wiki.

FWIW, the new behavior does seem to match the behavior of div tags. However the behavior is somewhat wonky. Specifically, whether or not there is a newline preceding the closing div or blockquote tag determines whether or not the final chunk of text is put in a paragraph tag. There may be a good reason why this is the case, but I don't know what it would be.

There's also a second issue: If the blockquote only contains 2 chunks of text separated by 2 linebreaks, the linebreaks are completely ignored. This also matches the behavior of div tags, but in neither case does it make sense.

(In reply to comment #79)

Looks like there are still some issues:
https://en.wikipedia.org/wiki/User:Kaldari/Bug6200

Isn’t this just the generic Bug 5718 (Block element written inline splits multiline paragraphs)?

(In reply to comment #79)

Looks like there are still some issues:
https://en.wikipedia.org/wiki/User:Kaldari/Bug6200

This is existing paragraph wrapping behavior when block tags (opening/closing) are present on the same line as plain text.

https://git.wikimedia.org/blob/mediawiki%2Fcore.git/bc97a03f06bfb4bfd33cd2ec2a127c2f607e34e2/tests%2Fparser%2FparserTests.txt#L1200

This is somewhat odd behavior, of course.

Yeah, it's a side effect of the old parser handling tags on a line-by-line basis. But I think it's too established a behavior to change now, and Parsoid is backwards-compatible with this behavior.

If someone wants to spend some quality time with a wikitext dump we could probably determine exactly how much wikitext depends on the current "odd" behavior.

In any case, since the behavior is now the same as for <div>, I think we can resolve this bug as fixed.