Page MenuHomePhabricator

Avoid nested definition lists
Open, LowPublic

Description

Author: Shtriter

Description:
The nesting of definition lists like ";; x :: y" produces awful html.
Moreover, the parser outputs different html for 2 dls with the common structre.
The only difference between these lists is thet one of them is single-line and the other
is not.
The simple example:
;; x :: y

;; x
:: y

Output:
<dl><dt> x&nbsp;</dt><dd><dl><dt></dt><dd> y
</dd></dl>
</dd></dl>
<p><br>
</p>
<dl><dt></dt><dl><dt> x
</dt><dd> y
</dd></dl>

IMHO, single-line dl parcing is not quite right. The emply <dt><dt> should stay before
'<dt> x&nbsp;</dt>', like in multi-line variant.

I've discussed the problem on MediaWiki-General. TimStarling suggested to treat the second
semicolon as literal semicolon. It can be archived by adding new line:

$oLine = preg_replace( '/;(;)+/', ';<nowiki>$1</nowiki>', $oLine );

after

$preOpenMatch = preg_match('/<pre/i', $oLine );

PS. If there are more the one colon on the line (like in the first example), all colons,
starting from 2nd will be also treaten as literals. Cause "; x :: y" acts in the same
way.


Version: unspecified
Severity: normal

Details

Reference
bz6569

Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 9:19 PM
bzimport set Reference to bz6569.

Shtriter wrote:

Improved Parser.php - treats 2nd semicolon as literal

Solves the problem of nested definition lists as described in the bug #6569.

attachment Parser.php ignored as obsolete

ayg wrote:

Please include patches as diffs, not as entirely new files. To apply your
changes, the devs would have to guess at what version you were working from,
diff them themselves, and only then would they be able to apply the diff to the
current version.

EN.WP.ST47 wrote:

Patch that applies the above change

patch for r25328

Attached:

wicke wrote:

Single-line definition lists handling currently appears to be wildly inconsistent:

http://www.mediawiki.org/wiki/User:GWicke/Definitionlists

My personal preference would be to treat a '; x : y' pair as a syntactic unit, so that

*; bla : blub

produces

<ul>
<li>
<dl>
<dt>bla&#160;</dt>
<dd>blub</dd>
</dl>
</li>
</ul>

and

*; bla :: blub

results in

<ul>
<li>
<dl>
<dt>bla&#160;</dt>
<dd>: blub</dd>
</dl>
</li>
</ul>

This would make it different from

*; bla
:: blub

which imo should result in

<ul>
<li>
<dl><dt>bla&#160;</dt></dl>
</li>
</ul>
<dl>
<dd><dl><dd>blub</dd></dl>
</dl>

to stay consistent with general nested-list handling. This is also how lists are currently interpreted in the prototype PEG parser and HTML serializer we are currently working on: http://www.mediawiki.org/wiki/Future/Parser_plan.

sumanah wrote:

From IRC conversation with Gabriel just now -- the patch might be technically fine, but it appears to be inconsistent with general nested list behaviour, and Gabriel it makes more sense to treat ; bla : blub as a unit. So the patch needs more discussion on https://lists.wikimedia.org/mailman/listinfo/wikitext-l . It could be that this patch is obviated by the new parser being developed ( https://www.mediawiki.org/wiki/Future ).

wicke wrote:

Adding the newparser keyword so we keep this issue in mind for it.

wicke wrote:

Additional information from http://lists.wikimedia.org/pipermail/wikitext-l/2011-November/000483.html. Nested definition lists are rare enough to allow us to decide on a new standard without breaking too many pages:

Can we deconstruct the current parser's processing steps and build a set
of rules that must be followed?

I think the commonly-used structures are quite clearly defined, but the
behaviour of these strange permutations is quite unspecified. The parser
output for the case reported in the bug already changed in the meantime..

I think we need to get a dump of English Wikipedia and start using a
simple PEG parser to scan through it looking for patterns and figuring
out how often certain things are used - if ever.

I just ran an en-wiki article dump through a zcat/tee/grep pipeline:

pattern count example

^ 548498738 (total number of lines)
^; 681495
^;[^:]+: 153997 ; bla : blub
^[;:*#]+;[^:]+: 3817 *; bla : blub
^;; 2332
^[:;*#]*;[^:]*:: 41 most probably ;::
^[;:*#]*;[^:]+:: 17 ;; bla :: blub

Nested definition lists are not exactly common. Lines starting with ';;'
often appear as comments in code listings. The most common other
application appears to be indentation and emphasis. Any change in the
produced structure that keeps indentation and bolding should thus avoid
breaking pages.

sumanah wrote:

(In reply to comment #7)
Dan, I'm marking this patch reviewed per Gabriel's comments; it would be great if you could reply, revise, and resubmit. Thanks!

  • Bug 11894 has been marked as a duplicate of this bug. ***

We added several parser tests documenting Parsoid's behavior in parserTests.txt, but disabled them for the PHP parser for now. Please test the patch against those. The expected output might need whitespace adjustment to match the PHP parser output. The Parsoid parser test runner renormalizes whitespace, so should still pass after those changes.

Can it be that this old task is fixed nowadays?

The example posted in the description today renders as

<dl>
<dt>x&#160;</dt>
<dd>
<dl>
<dd>y</dd>
</dl>
</dd>
</dl>
<dl>
<dd>
<dl>
<dt>x</dt>
<dd>y</dd>
</dl>
</dd>
</dl>

My personal preference would be to treat a '; x : y' pair as a syntactic unit, so that

*; bla : blub

produces

<ul>
<li>
<dl>
<dt>bla&#160;</dt>
<dd>blub</dd>
</dl>
</li>
</ul>

and

*; bla :: blub

results in

<ul>
<li>
<dl>
<dt>bla&#160;</dt>
<dd>: blub</dd>
</dl>
</li>
</ul>

This is exactly what happens today.

This would make it different from

*; bla
:: blub

which imo should result in

<ul>
<li>
<dl><dt>bla&#160;</dt></dl> <---- today it renders as <dl><dt>bla</dt></dl>
</li>
</ul>
<dl>
<dd><dl><dd>blub</dd></dl></dd> <---- last </dd> added by me now, it was missing in the example
</dl>

Almost like today, the &#160; is missing, but maybe this is fine, or maybe I didn't copy / paste correctly? See the examples at https://www.mediawiki.org/wiki/User:Qgil-WMF/Sandbox#phab:T8569

PS: commenting here because this is the oldest task assigned to @gwicke. :)

GWicke added a project: Parsoid.
GWicke set Security to None.
GWicke added a subscriber: Arlolra.