Add non-breaking spaces in additional places automatically
Open, Low, Public

Description

Author: ui2t5v002

Description:
As an alternative solution to T5461, non-breaking spaces should be added automatically by MediaWiki on page render in appropriate places:

Don't worry too much about false positives, since an extra non-breaking space won't cause any serious problems unless many of them occur on the same line.

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 10:04 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz13619.
bzimport added a subscriber: Unknown Object (MLST).

Clarifying that this requests an addition to the existing automatic &nbsp; rules, rather than creating a new feature.

ui2t5v002 wrote:

(In reply to comment #1)

Clarifying that this requests an addition to the existing automatic &nbsp;
rules, rather than creating a new feature.

Is there any documentation for the existing rules?

Documentation? Don't be silly, this is MediaWiki! ;)

You can find the current rules in Parser::parse(), though:

# Clean up special characters, only run once, next-to-last before doBlockLevels
$fixtags = array(
    # french spaces, last one Guillemet-left
    # only if there is something before the space
    '/(.) (?=\\?|:|;|!|%|\\302\\273)/' => '\\1&nbsp;\\2',
    # french spaces, Guillemet-right
    '/(\\302\\253) /' => '\\1&nbsp;',
    '/&nbsp;(!\s*important)/' => ' \\1', # Beware of CSS magic word !important, bug #11874.
);

ui2t5v002 wrote:

(In reply to comment #3)

Documentation? Don't be silly, this is MediaWiki! ;)

I wasn't expecting a book. :) Just a link to a mailing list or prior bug report.

You can find the current rules in Parser::parse(), though:

Ok, so currently all it does is:

  • Changes "some : word" into "some&nbsp;: word" and likewise for ? : ; ! % »
  • Changes "« " into "«&nbsp;"
  • Breaks things inside HTML tags :)
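The three behaviours listed above can be reproduced with a small sketch. It is written in Python rather than MediaWiki's PHP, so it only approximates the quoted $fixtags rules; the function name fix_french_spaces is made up for illustration, and the non-breaking space is emitted as the character U+00A0 instead of the &nbsp; entity.

```python
import re

NBSP = "\u00a0"  # stands in for the &nbsp; entity used by the parser


def fix_french_spaces(text: str) -> str:
    """Approximate the quoted $fixtags rules (illustrative, not the real parser)."""
    # A space before ? : ; ! % or » becomes non-breaking,
    # but only if some character precedes the space.
    text = re.sub(r"(.) (?=[?:;!%»])", r"\1" + NBSP, text)
    # A space after « becomes non-breaking.
    text = re.sub(r"(«) ", r"\1" + NBSP, text)
    # Restore the ordinary space in front of the CSS keyword "!important".
    text = re.sub(NBSP + r"(!\s*important)", r" \1", text)
    return text


print(repr(fix_french_spaces("some : word")))
print(repr(fix_french_spaces("« word »")))
# The rule has no notion of markup, so attribute values are rewritten too:
print(repr(fix_french_spaces('<span title="a : b">x</span>')))
```

The last call shows the "breaks things inside HTML tags" problem: the space inside the title attribute is replaced just like a space in running text.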

So adding one before dashes is easy enough. Just add a hyphen and the codes for en and em dashes to the ?|:|;|!|% regexp.

I'd like it to also add a &nbsp; for anything like "10 kiloohm" or "100 MW". We could either write a huge regular expression for every unit and prefix that exists (http://en.wikipedia.org/wiki/User:Bobblewik/monobook.js/unitformatter.js), or we could just apply the rule any time a number is followed by a space that is followed by a letter. The Manual of Style actually recommends as much:

http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style#Non-breaking_spaces
http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style_%28dates_and_numbers%29#Non-breaking_spaces
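A minimal sketch of the broad "number space letter" rule proposed here, in Python rather than the parser's PHP. The function name nbsp_number_letter is hypothetical, and "letter" is assumed to mean any Unicode letter.

```python
import re

NBSP = "\u00a0"  # stands in for &nbsp;


def nbsp_number_letter(text: str) -> str:
    """Replace a single space between a digit and a letter with a no-break space."""
    # (?<=\d)      the space must directly follow a digit
    # (?=[^\W\d_]) and directly precede a letter (word char that is not digit or _)
    return re.sub(r"(?<=\d) (?=[^\W\d_])", NBSP, text)


print(repr(nbsp_number_letter("10 kiloohm")))
print(repr(nbsp_number_letter("100 MW")))
print(repr(nbsp_number_letter("page 10")))  # letter before number: left alone
```

The rule needs no unit list, which is its appeal, but it also fires on any ordinary word that happens to follow a number.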

Active MoS editors generally believe that something along these lines would be great. If you want to simplify the rule to "number space letter gets replaced by no-break space", then the MoS editors believe that additional markup would be useful for the no-break space, probably a double-comma (,,) (that is, the double-comma would be typed and show in the edit window, and would be rendered as a hard space in the text). The reason is that we don't want automatically-inserted invisible characters to start multiplying in the text, as additions and deletions are made; we want to be able to see them, and easily insert and delete them. On the other hand, if you use very specific rules to insert no-break spaces exactly where most style manuals want them inserted (and I like http://en.wikipedia.org/wiki/User:Bobblewik/monobook.js/unitformatter.js as a good start), then perhaps the double-comma markup is not necessary; we'll be happy to take anything you can give us and try it out.

I should add: I'm talking about en.wikipedia.org. It's my sense that GA and FA article reviewers are included in the long list of people who have approved the idea; if it makes a difference, I'll be happy to survey their opinions.

ui2t5v002 wrote:

(In reply to comment #5)

If you want to simplify the rule to "number space letter gets replaced
by no-break space", then the MoS editors believe that additional markup would
be useful for the no-break space

No. This is about adding a non-breaking space automatically when the page is rendered. Please don't add even more markup to the already cluttered and confusing syntax. Wiki markup is not like HTML, where you have to specify formatting and detail every little thing. The whole point of a wiki is that you enter semantic information, and it takes care of all the formatting and other little details for you.

The reason is that we don't want
automatically-inserted invisible characters to start multiplying in the text,
as additions and deletions are made

They won't be multiplying over time and they won't be visible in the edit box. This wouldn't affect the code in the edit box at all. It would only affect the HTML of the final rendered article.

Thanks for the explanation; I agree that's more elegant if the wizards can do it. Would anyone like me to survey among article reviewers and MoS editors to see if they see potential problems from a broad rule such as "number space letter never wraps"?

ui2t5v002 wrote:

(In reply to comment #8)

Would anyone like me to survey among article reviewers and MoS editors to
see if they see potential problems from a broad rule such as "number space
letter never wraps"?

Absolutely. It's recommended in the manual of style to add a non-breaking space for this case (not just units), but there are certainly a few cases that shouldn't be. False positives won't cause much of a problem, though, since it will just prevent things from line wrapping, and it can't happen multiple times in a row to create a page-widening attack. ("1 a 1 a 1 a" --> "1&nbsp;a 1&nbsp;a 1&nbsp;a")

ui2t5v002 wrote:

Oh wait. :) "a1 a1 a1 a1" --> "a1&nbsp;a1&nbsp;a1&nbsp;a1"

Maybe we need to worry about that in some rare case? Or make it only for numbers with no letters inside? javascript would be something like: \s[,.0-9]+
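The chaining problem and the suggested restriction can be compared side by side. This is a Python sketch, not the JavaScript or PHP that would actually be used; both function names are made up, and the stricter pattern is one possible reading of the \s[,.0-9]+ idea (a number token that starts after whitespace, contains only digits, commas and periods, and ends in a digit).

```python
import re

NBSP = "\u00a0"  # stands in for &nbsp;


def naive_rule(text: str) -> str:
    # Any digit, space, letter sequence is glued together.
    return re.sub(r"(?<=\d) (?=[^\W\d_])", NBSP, text)


def pure_number_rule(text: str) -> str:
    # (?<![^\s])  the number starts the text or follows whitespace
    # [0-9](?:[,.0-9]*[0-9])?  digits with optional , or . inside, ending in a digit
    pattern = r"(?<![^\s])([0-9](?:[,.0-9]*[0-9])?) (?=[^\W\d_])"
    return re.sub(pattern, r"\1" + NBSP, text)


print(repr(naive_rule("1 a 1 a 1 a")))       # only every other space is glued
print(repr(naive_rule("a1 a1 a1 a1")))       # every space is glued: one unbreakable run
print(repr(pure_number_rule("a1 a1 a1 a1")))  # unchanged
print(repr(pure_number_rule("about 1,000 km away")))
```

With the restriction, a token such as "a1" no longer counts as a number, so the rule cannot chain across a whole line.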

Why worry about spacing here? You can just write aaaaaaaaaaaaa... and widen to your heart's content. :)

I'm surveying the WP:MOSNUM people now and I gave them the http://en.wikipedia.org/wiki/User:Bobblewik/monobook.js/unitformatter.js list to tweak. Not wrapping at "number space letter" is a non-starter. More than 90% of the time, that will be something we want to wrap, such as "the 1969 Mets World Series" or "9999 bottles of beer".

ui2t5v002 wrote:

(In reply to comment #12)

More than
90% of the time, that will be something we want to wrap, such as "the 1969 Mets
World Series" or "9999 bottles of beer".

Why would we want those to wrap? MOSNUM currently recommends that they don't.

gnygaard wrote:

Why would we want to keep them from wrapping? MOSNUM is nonsense, recommending non-breaking spaces in places where they are not needed, and not recommending them in places where they are needed. It is also vague and ambiguous, arguably recommending a nonbreaking space at the star in "Ninety-nine*bottles of beer", and in the first space but not saying anything about the second space in a paper weight of "75 g m<sup>−2</sup>"; if that breaks, it should be between the 5 and the g, NOT between the g and the m, which is not only what the MoS rule says, but it is ALSO what we would get if this bug/feature request were implemented.

ui2t5v002 wrote:

(In reply to comment #14)

Why would we want to keep them from wrapping?

Why wouldn't we? See:

http://en.wikipedia.org/wiki/Wikipedia_talk:Manual_of_Style#No-break_spaces_discussion_continues_at_bugzilla

pmanderson wrote:

There is no consensus for not wrapping "9999 bottles of beer". If the letter and number are long, it may well produce clumsy final text; the key question is whether "9999<nowiki> </nowiki>bottles" will disable this feature, so it can be turned off when it does cause trouble.

ui2t5v002 wrote:

(In reply to comment #16)

There is no consensus for not wrapping "9999 bottles of beer".

Please discuss at http://en.wikipedia.org/wiki/Wikipedia_talk:Manual_of_Style#No-break_spaces_discussion_continues_at_bugzilla

then we can come back here and tell the devs what we want

ui2t5v002 wrote:

(In reply to comment #16)

the key question is
whether "9999<nowiki> </nowiki>bottles" will disable this feature, so it can be
turned off when it does cause trouble.

It does. Try a long string of « word »« word »« word »« word » vs <nowiki>« word »« word »« word »« word »</nowiki>

Why not create a MediaWiki: message holding a list of units, separated by spaces, commas, or whatever you want?

Then take that message, regex-quote each entry, and convert the separators into |'s, so that it becomes a proper, escaped regex alternation.

Then just add a &nbsp; with the regex [/(\d+) (<Quoted | list here>)/S, "\\1&nbsp;\\2"]

That way only real units have the nbsp added, and additionally wikis may localize the units, and also add any newer or custom units such as fake units which apply only to their wiki. Or instead they can just replace the message with a - and have the whole thing disabled if they don't want it.
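A rough sketch of that pipeline, in Python instead of PHP. The message text, the helper names build_unit_rule and apply_unit_rule, and the choice of comma or whitespace as separators are all assumptions for illustration; no such MediaWiki: message exists in the thread.

```python
import re

NBSP = "\u00a0"  # stands in for &nbsp;


def build_unit_rule(message: str):
    """Compile a unit list into a regex; return None if the rule is disabled."""
    if message.strip() == "-":
        return None  # a lone "-" switches the feature off
    units = [u for u in re.split(r"[,\s]+", message.strip()) if u]
    if not units:
        return None
    # Escape every unit so characters like "." or "/" are taken literally.
    alternation = "|".join(re.escape(u) for u in units)
    # (?!\w) keeps "km" from matching the start of a longer word.
    return re.compile(r"(\d+) (" + alternation + r")(?!\w)")


def apply_unit_rule(text: str, rule) -> str:
    if rule is None:
        return text
    return rule.sub(lambda m: m.group(1) + NBSP + m.group(2), text)


rule = build_unit_rule("km, MW, kiloohm mph")
print(repr(apply_unit_rule("100 MW and 10 kiloohm at 60 mph", rule)))
print(repr(apply_unit_rule("9999 bottles of beer", rule)))  # not a unit: unchanged
```

Because the list is data rather than code, a wiki could localize or extend it without touching the rule itself, which is the point of the suggestion.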

ayg wrote:

Please don't discuss the merits of various ideas here, discuss them on-wiki and report on consensus. Bugzilla is an even worse discussion forum than talk pages. :)

We've had localizable regexes before that were part of the parser, like linktrail, but AFAIK those have been disabled as too scary. They can still be localized per-language, but only in the PHP files, not in the MW-namespace messages.

(In reply to comment #19)

Why not create a MediaWiki: message with a space, comma, or whatever you want,
separated list of units.

Why not create a MediaWiki: message where one could add regular expressions and their replacements? Then every language (this discussion here is very en-focused) could add its rules, could test them and so on...

For German and many related languages, the "digit space letter" rule would be wrong too often, I believe.
A few examples translated to English, using "_" to represent the non-breaking space:

  1. word space digit rules:
       the year 1960 and ==> year_1960 and
       a class 23354 consumer good ==> a class_23354 consumer good
       laid down in ISO 4711 and not in ==> in ISO_4711 and
       an ASA 22 film ==> an ASA_22 film
       this is in paragraph 16 of the law on ==> in paragraph_16 of
       but article 3 in the constitution ==> but article_3 in
       king Henry 8 did ==> king Henry_8 did
  2. more complex:
       the years 1970 and 71 ==> years 1970_and_71
       is 17 and a half miles from home ==> is 17_and_a_half_miles from home
       was 18 miles and three eighth until ==> was 18_miles and three_eighth until
       my 22 years old sister ==> my 22_years_old sister
       took 23 years until ==> took 23_years until

I doubt that this can be done in a language-independent way. We would still have quite a few false positives, such as:

found the article 19 feet behind the 
went in that year 1999 soldiers to
according to ISO 1234 people in Spain

(Note that English word order and comma rules make English much less prone to some of those.)

Currencies, and their abbreviations, can appear both in front of and after the figures they relate to, so we should have both a " currency space [+-] digit " and a " digit space currency " rule and probably tolerate " In week 17 € 1500 were spent " unless we can make a " 'week' space digit " rule eat the 17 on its own, hiding it from the currency rules.

Also, there are style rules like these:

we saw 1 young man ==> saw a young man / saw one young man
...
not even 7 sailors ==> not even seven sailors
...
when 12 candles ==> when twelve candles
with 13 grumps ==> with 13_grumps

So I suggest a language specific, or language group specific, kind of treatment.
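One way to picture a language-specific treatment is a rule table keyed by language code. The sketch below is Python, not MediaWiki code; the RULES table, the function name and the trigger words (taken from the English translations of the examples in this comment) are purely illustrative.

```python
import re

NBSP = "\u00a0"  # stands in for &nbsp;

# Hypothetical per-language rule tables: (pattern, replacement) pairs.
RULES = {
    "de": [
        # "word space digit": keep a designator and its number together.
        (re.compile(r"\b(ISO|ASA|paragraph|article) (?=\d)"), r"\1" + NBSP),
    ],
    "en": [],  # a language may simply opt out
}


def apply_language_rules(text: str, lang: str) -> str:
    for pattern, replacement in RULES.get(lang, []):
        text = pattern.sub(replacement, text)
    return text


print(repr(apply_language_rules("laid down in ISO 4711 and not in", "de")))
print(repr(apply_language_rules("an ASA 22 film", "de")))
# One of the false positives mentioned in this comment still slips through:
print(repr(apply_language_rules("found the article 19 feet behind the", "de")))
```

The last call glues "article" to "19" even though the number belongs to "feet", which illustrates why purely pattern-based rules will keep producing some false positives.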

*** Bug 18443 has been marked as a duplicate of this bug. ***

(In reply to comment #3)

Documentation? Don't be silly, this is MediaWiki! ;)

Heh. I've created https://meta.wikimedia.org/wiki/Help:Newlines_and_spaces#Non-breaking_spaces

(In reply to comment #20)

Please don't discuss the merits of various ideas here, discuss them on-wiki and
report on consensus. Bugzilla is an even worse discussion forum than talk
pages. :)

Perhaps we can summarize on that Meta page (and even discuss in its talk)?

We've had localizable regexes before that were part of the parser, like
linktrail, but AFAIK those have been disabled as too scary. They can still be
localized per-language, but only in the PHP files, not in the MW-namespace
messages.

This still holds true, so I suppose this is the way here too, and I've written it in the above page. I'm not going to summarize anything else from these two bugs because they're too long, but feel free if you find something consensual. :-)

fyi: Because of bug #18443 I already started a discussion at w:de concerning German typography.

At https://de.wikipedia.org/wiki/WD:TYP#automatische_leerzeichen there's an unfinished table called 'regexps' which will resolve bug #18443 and this bug at least for w:de.
That table is still under construction. If it's finished I'll inform you here.

sowerk wrote:

I’d like to point out one approach, which was discussed in w:de some years ago (the discussion fell asleep back then):

Use of underscores for thin- and non-breaking-spaces within the wiki-code:

One underscore for thin-space: _ ⇒ “ ”
Two underscores for n-b-space: __ ⇒ “ ”

Underscores are hardly ever used, except for links (there a filter can easily be implemented). In those rare remaining cases, the nowiki-tag should be used.

This would allow every user with minimal experience to use the correct typography, avoid long lists of common abbreviations as started on the German project site, and ensure that copy-paste errors of spaces are easily detectable.

(In reply to comment #27)

Use of underscores for thin- and non-breaking-spaces within the wiki-code:

This is bug 3461, please continue there.

matthiasbecker1967 wrote:

It would be helpful to fix this bug at least for numbers and SI units and perhaps some widely used non-SI units (such as ft, kn/kt, mph, sm/nm).

Thanks Matthias. Would really be nice to see movement on this after all these years ... it would make VE so much prettier too if we didn't have to deal with some nbsp-equivalent in VE.

Dan

we made a few regexps for the German part of the problem:
see https://de.wikipedia.org/wiki/Wikipedia:Typografie/Automatische_Leerzeichen#Regexps

Is it possible to test those regexps somehow in an easy way?

In T15619#819739, @seth wrote:

we made a few regexps for the German part of the problem:
see https://de.wikipedia.org/wiki/Wikipedia:Typografie/Automatische_Leerzeichen#Regexps

Is it possible to test those regexps somehow in an easy way?

Depends on the definition of easy. You can set up a test wiki with [[MediaWiki-Vagrant]] and patch LanguageDe.php or (eek) Parser.php after the lines mentioned in https://phabricator.wikimedia.org/T20443#227957

matmarex set Security to None.
matmarex edited subscribers, added: matmarex; removed: Unknown Object (MLST).

I think this would actually be a pretty great thing to do. However, the way the &nbsp; insertion currently works is less than wonderful; implementing more rules could make T5158: Parser inserts invalid &nbsp; in the middle of style attribute (French spaces) worse.

I used the Parser.php changes mentioned at https://de.wikipedia.org/wiki/Wikipedia:Typografie/Automatische_Leerzeichen#Regexps and it seems to work. I had to replace the order of a few entries (I've already done that at the wiki page).

What could be the next step?

You need to submit a patch. Then you need somebody willing to ruin their reputation with various Wikimedia wikis' communities to merge it. Then you both need to respond to the inevitable reports of broken pages and soothe the people angry about them. Wikimedia Foundation has a Parsing team now, perhaps you can find a willing volunteer among them, as they've done it before.

Change 328847 had a related patch set uploaded (by Seth):
fix #T15619

https://gerrit.wikimedia.org/r/328847

I think such a content processing should be done on edit time not on parser time.

I think such a content processing should be done on edit time not on parser time.

The insertion of a nbsp before a '%' character is already done in Parser.php for some years now. It's done after preparsing of wikitext and not on edit time. Thus "5 %" will be modified, but "[[5 %]]" and "<math>5 %</math>" won't.
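The observable effect described here can be imitated with a short Python sketch. It is not how Parser.php works internally; the thread only says the replacement runs after preparsing. The PROTECTED pattern and the function name are assumptions, and only [[...]] links and <math> tags are treated as protected.

```python
import re

# Spans that the replacement must not touch (illustrative, not exhaustive).
PROTECTED = re.compile(r"\[\[.*?\]\]|<math>.*?</math>", re.DOTALL)


def _fix(segment: str) -> str:
    # Space before % becomes &nbsp; if some character precedes the space.
    return re.sub(r"(.) (?=%)", r"\1&nbsp;", segment)


def nbsp_before_percent(wikitext: str) -> str:
    parts = []
    last = 0
    for match in PROTECTED.finditer(wikitext):
        parts.append(_fix(wikitext[last:match.start()]))  # plain text: rewritten
        parts.append(match.group(0))                      # protected: copied as is
        last = match.end()
    parts.append(_fix(wikitext[last:]))
    return "".join(parts)


print(nbsp_before_percent("5 % and [[5 %]] and <math>5 %</math>"))
```

The stored wikitext stays "5 %" in all three places; only the rendered output of the unprotected occurrence changes.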

How should the nbsp before the '%'-char be inserted on edit time in your opinion? We don't want the wikitext itself to get more complicated.

The insertion of a nbsp before a '%' character is already done in Parser.php for some years now.

I think this was already the wrong way several years ago.

How should the nbsp before the '%'-char be inserted on edit time in your opinion? We don't want the wikitext itself to get more complicated.

The parser should not get more complicated. When a (narrow) no-break space is needed somewhere, then it should be stored in the database as a Unicode character. The editor should be able to handle (narrow) no-break spaces. They should be invisible in normal WYSIWYG editing and shown in ¶ mode. And the editor should have rules for automatically creating such spaces.

I'm pretty sure the rest of the parsing team will agree that we'd like to migrate *away* from the auto-insertion of &nbsp; rather than towards it. I'm fairly certain that Parsoid doesn't insert all of the &nbsp; that the PHP parser does, for instance. Adding language-specific regular expressions (even «) is kind of horrifying. I've C-2'ed the gerrit patch because *at the very least* language-specific fixups should be done by some code underneath language/, not directly in the Parser.php. I personally greatly prefer the solution in T5461: Syntax extensions: special character, e.g. underscore, for non-breaking space (&nbsp;), since the ultimate goal is to make it easier to enter non-breaking spaces and nicer to look at them in wikitext. For example, VE has autocompletion helpers that can do the &nbsp; substitution automatically when the language is German and certain prefix strings are typed -- but then it serializes to &nbsp; in wikitext and power editors who look at the wikitext will complain. If the wikitext looks "nice" (either ~ or _ to represent the &nbsp;) then autocorrection via bots or VE or whatever would be much more acceptable to editors.

<rant>And see T119463: Automatically convert spaces after section markers (§) into non-breaking spaces which @matmarex linked -- obviously once we start down the nbsp slippery slope, folks are going to clamor for them everywhere. And we haven't even touched curly quotes yet... Really, this would all be better done in a "nice typography" skin, where we can do curly quotes and drop caps and hyphenation and full justification and french spacing and whatever nice things are needed to make beautiful typography. I do want nice typography! But these should really be made accessible to users and editors to insert, not hard-coded into Parser.php in ways that will break our less-visible languages.</rant>

This bug is #7 on the WMDE technical wishlist which is why it is suddenly being discussed and patches proposed.

I don't think this is a good idea. The whole point of the proposal is to take workload off wiki editors and do this automatically. Only this can guarantee consistent behavior; as you might notice, not every newbie would add this because it really isn't an intuitive thing to do. Bots can be workarounds for problems in wiki software, but they are no alternative to the requested parser behavior.

To be honest, I would even support keeping &nbsp; instead of introducing ~ or _ …

Unfortunately, the world's languages are not consistent. Adding "automatic" rules to the parser in a place where they do not vary per-language and editors can not easily change or disable them for special cases is a recipe for trouble, IMO. Implementing automatic rules in Visual Editor is more editor-friendly, in that you can simply hit backspace if you do not like what VE just did for you. It's much harder to fight against the Parser when it does the wrong thing.

Change 328847 had a related patch set uploaded (by Zoranzoki21; owner: seth):
[mediawiki/core@master] Parser: Correct white space in German abbreviation

https://gerrit.wikimedia.org/r/328847