auto-insert of non-breaking whitespace where appropriate
Closed, Duplicate (Public)

Description

Problem:
There are several places where (thin) non-breaking spaces should be inserted, e.g., between numbers and units. Of course, spacing rules differ from country to country.
Up to now, inserting thin spaces still causes problems: "nbsp" is generally too wide, while "thinsp" and "U+202f" are not displayed as intended in Opera and so on; see e.g. [http://de.wikipedia.org/wiki/Wikipedia:Meinungsbilder/Typographie_(Zwischenr%C3%A4ume)#Browser-Unterst.C3.BCtzung] (German).
Thin non-breaking spaces can be displayed with some HTML/CSS tricks, see [http://de.wikipedia.org/wiki/Schmales_Leerzeichen#.C3.9Cbergangsl.C3.B6sungen]. But that would complicate the article source too much.

Solution:
At http://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia/Archiv/2009/Woche_01#.26nbsp.3B (German, but with PHP source code) the idea was raised to use (localized) regexps to insert whitespace automatically in some cases. With this modification we could easily auto-insert even sophisticated things like [http://de.wikipedia.org/wiki/Schmales_Leerzeichen#.C3.9Cbergangsl.C3.B6sungen] without obfuscating the article source code.

But:
Such a change might slow down the parsing of wikitext, so I guess it would be best to implement the idea on the test wiki first. Somebody should profile the parsing after those changes. I could help write some fast regexps.


Version: unspecified
Severity: enhancement

Details

Reference
bz18443

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 10:35 PM
bzimport set Reference to bz18443.
bzimport added a subscriber: Unknown Object (MLST).

matthiasbecker1967 wrote:

Why not add a non-breaking space after a number in every case? That won't do any harm, I think.

(In reply to comment #1)

Why not add a non-breaking space after a number in every case? That won't do any
harm, I think.

This would not solve the problem, because not all cases contain numbers (like the German abbreviation "z.(thin non-breaking space)B.")

Apart from that, there would be false positives such as "In the year 2525 and ...", where there shouldn't be a non-breaking space after the "2525".

So I don't think this would be a good solution. I still believe that the already mentioned discussion at w:de (http://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia/Archiv/2009/Woche_01#.26nbsp.3B) gives a possible and good solution.
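Both objections can be reproduced with a few lines of code. The following is only a Python sketch of the "space after every number" idea, not MediaWiki code; the function name `nbsp_after_numbers` and the sample strings are made up for illustration.

```python
import re

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE


def nbsp_after_numbers(text: str) -> str:
    """Naive rule: turn the space after any digit into a no-break space."""
    return re.sub(r"(\d) ", r"\1" + NBSP, text)


# "100 m" is handled as intended, "In the year 2525 and ..." is a false
# positive, and "z. B." is missed because it contains no digit.
for sample in ["100 m", "In the year 2525 and ...", "z. B."]:
    print(repr(nbsp_after_numbers(sample)))
```

The second sample gets a no-break space after "2525", and the third one is left unchanged.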

The proposed solution on de.wiki (if I understand it correctly; Google Translate does very badly with German) is to add wfMsgForContent( 'nbsp-before-word' ) => '\\1 \\2' to the $fixtags array at around line 302 of Parser::parse in includes/parser/Parser.php. In other words, have a system message containing a regex that tells MediaWiki where to put the non-breaking spaces.

This seems like a bad idea. First, someone is bound to put an invalid regex in there (that could probably be worked around by checking for validity). Second, allowing users to add an arbitrary regex that gets executed on all text during parsing seems like begging for someone to put something evil in there. Regexes are powerful; you can do quite computationally intensive things with them, sometimes without meaning to.

Additionally, mistakes could cause a lot of confusion. If someone, for example, set nbsp-before-word to /./ (or anything where they forgot the capturing parentheses), the parser would output only non-breaking spaces and break the entire site, which would be quite disruptive.
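The failure mode with /./ and the kind of validity check mentioned above can be sketched as follows. This is a Python illustration, not MediaWiki code: `apply_rule` and `check_rule` are hypothetical names, Python patterns are written without the PHP delimiters, and missing groups are mapped to empty strings by hand to mimic how preg_replace treats unmatched backreferences.

```python
import re

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE


def _group(match, n):
    # Mimic PHP: a backreference to a missing or unmatched group is empty.
    if n > match.re.groups:
        return ""
    return match.group(n) or ""


def apply_rule(pattern: str, text: str) -> str:
    """Apply a user-supplied rule with the fixed replacement '\\1<nbsp>\\2'."""
    return re.sub(pattern, lambda m: _group(m, 1) + NBSP + _group(m, 2), text)


def check_rule(pattern: str) -> list:
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        return ["does not compile: %s" % exc]
    if compiled.groups != 2:
        return ["expected 2 capture groups, found %d" % compiled.groups]
    return []


print(repr(apply_rule(".", "Hello world")))  # every character is replaced
print(check_rule("."))                       # flagged: no capture groups
print(check_rule(r"(\d) (kg|m)\b"))          # passes the basic checks
```

A check like this only catches syntax errors and missing groups. It does not detect patterns that compile fine but match far too much or backtrack excessively.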

(In reply to comment #3)

The proposed solution on de.wiki [...] is to add: wfMsgForContent(
'nbsp-before-word' ) => '\\1 \\2' to the $fixtags array in ~ line 302 of
Parser::parse in includes/parser/Parser.php.

Right. Or even better:

wfMsgForContent('auto-thinspace') => '\\1<span style="margin-left:0.167em"><span style="display:none">&nbsp;</span></span>\\2'

This leads to thin spaces which are compatible with all common browsers, see http://de.wikipedia.org/wiki/user:Raphael_Frey/Labor#Browser-Unterst.C3.BCtzung (the span-solution is the column called "Übergangslösung")
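To make the effect of that replacement concrete, here is a small Python sketch. The span markup is copied from the comment above; the pattern `UNIT_RULE`, its short unit list, and the function name `auto_thinspace` are assumptions made up for this example and are not part of the proposal.

```python
import re

# Markup from the comment above: a 0.167em margin plus a hidden &nbsp;.
THIN_SPACE_MARKUP = (
    '<span style="margin-left:0.167em">'
    '<span style="display:none">&nbsp;</span></span>'
)

# Hypothetical rule: a digit, a space, then one of a few units.
UNIT_RULE = re.compile(r"(\d) (km|kg|m|g)\b")


def auto_thinspace(text: str) -> str:
    """Replace the space between a number and a unit with the span markup."""
    return UNIT_RULE.sub(
        lambda m: m.group(1) + THIN_SPACE_MARKUP + m.group(2), text
    )


print(auto_thinspace("100 m"))
print(auto_thinspace("In the year 2525 and ..."))  # left unchanged
```

The article source keeps a plain space; only the parser output carries the extra markup.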

Regarding the problems Bawolff mentioned, this is very similar to other regexp-based extensions like the spam-blacklist, the title-blacklist and the abuse filter (aka edit filter).
Of course only admins should be allowed to edit the regexps. And they have to be very careful, that's true; at least as careful as if they were editing the sbl, tbl or af.

(In reply to comment #4)
....

Regarding the problems Bawolff mentioned, this is very similar to other
regexp-based extensions like the spam-blacklist, the title-blacklist and the
abuse filter (aka edit filter).
Of course only admins should be allowed to edit the regexps. And they have to be
very careful, that's true; at least as careful as if they were editing the sbl,
tbl or af.

Abuse filter/spam blacklist mistakes are easier to fix than what is proposed here. I think collecting a list of wanted rules and hardcoding those into MW is much more likely to succeed than letting admins dynamically add such rules. (As it stands, of course, one could already do this on the JS side, but that is really icky.)

(Note MW does have some rules for adding nbsp in certain contexts. The rules just aren't all that complex)

(In reply to comment #5)

(Note MW does have some rules for adding nbsp in certain contexts. The rules
just aren't all that complex)

What are they, by the way? I think this is not documented anywhere, but it would be important to keep things consistent if we add such a new rule.
Right now I can remember only the separators for digits, used by formatnum, which are defined in the MessagesXx files and can be modified only there.

Moreover, some such rules are defined by the [[International System of Units]] itself IIRC, and are not that easy to find, but may be included in some library already? The reporter/voters should probably do some investigation.

(In reply to comment #6)

They're run towards the end of the parsing process (the original proposal linked in comment 0 actually refers to them).

Specifically, they are:

# Clean up special characters, only run once, next-to-last before doBlockLevels
$fixtags = array(
	# french spaces, last one Guillemet-left
	# only if there is something before the space
	'/(.) (?=\\?|:|;|!|%|\\302\\273)/' => '\\1&#160;',
	# french spaces, Guillemet-right
	'/(\\302\\253) /' => '\\1&#160;',
	'/&#160;(!\s*important)/' => ' \\1', # Beware of CSS magic word !important, bug #11874.
);
$text = preg_replace( array_keys( $fixtags ), array_values( $fixtags ), $text );

In English, they say:

* If a character (any character except a newline, including a space) is followed by a space and then by one of ? : ; ! % or » (U+BB), that space is replaced with a non-breaking space.
* If a « (U+AB) is followed by a space, that space is replaced with a non-breaking space.
* As an exception to these rules, if a non-breaking space is followed by "!important", the non-breaking space is turned back into a normal breaking space. This is to prevent messing up CSS style attributes. (This isn't perfect; there's an open bug somewhere about CSS styles being messed up by this in edge cases.)

Based on the guillemet characters, I imagine this is meant for French typographic rules.
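The three rules described above can be tried out with a small Python port. This is a sketch for experimentation, not the parser's actual code: the name `fix_tags` is made up, and the patterns work on Unicode strings, so the UTF-8 byte escapes \302\273 and \302\253 from the PHP source are written as U+00BB and U+00AB.

```python
import re

# (pattern, replacement) pairs, applied in order like the PHP $fixtags array.
FIXTAGS = [
    # space before ? : ; ! % or a right guillemet, if something precedes it
    (re.compile("(.) (?=\\?|:|;|!|%|\u00bb)"), r"\1&#160;"),
    # space after a left guillemet
    (re.compile("(\u00ab) "), r"\1&#160;"),
    # undo the first rule in front of the CSS keyword !important
    (re.compile(r"&#160;(!\s*important)"), r" \1"),
]


def fix_tags(text: str) -> str:
    """Apply each rule to the result of the previous one."""
    for pattern, replacement in FIXTAGS:
        text = pattern.sub(replacement, text)
    return text


print(fix_tags("Comment ça va ?"))
print(fix_tags("« Bonjour »"))
print(fix_tags("color: red !important"))
print(fix_tags("50 %"))
```

In the third sample the first rule inserts &#160; before "!important" and the last rule reverts it, which is the exception described in the third bullet.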

(In reply to comment #5)

Yes, a hardcoded solution would be OK. But at least in the beginning there should be an easy channel of communication (between admins and devs) for changes to those hardcoded rules.

The typographic rules[1] in Germany are quite complicated:
there should be a _narrow_ _non-breaking_ space

  • inside abbreviations (like 'z. B.', 'i. d. R.', 'u. a.')
  • inside abbreviations with numbers (like '§ 315', 'Abs. 3', 'S. 78 ff')
  • inside dates like '1. Mai'
  • between numbers and units (like '100 m', '5 kg')

If I got an "OK" here, so that some dev would insert those hardcoded rules for w:de (and probably for all other de projects, too), then I could create some regexps.

[1] actually "rule" is not the right word here. "typographic sugar" would be a better description.
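As a rough idea of what such regexps might look like, here is a Python sketch covering only the examples listed above. Everything in it is an assumption for illustration: the abbreviation list, the unit list, the rule order, and the function name `de_narrow_spaces`. It uses U+202F (NARROW NO-BREAK SPACE) directly, not the span workaround, and a real rule set would need much longer lists.

```python
import re

NNBSP = "\u202f"  # U+202F NARROW NO-BREAK SPACE

# Known abbreviations; the spaces inside them are replaced as a whole.
ABBREVIATIONS = ["z. B.", "i. d. R.", "u. a."]
ABBR_RE = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(a) for a in ABBREVIATIONS) + ")"
)

MONTHS = ("Januar|Februar|März|April|Mai|Juni|Juli|August|"
          "September|Oktober|November|Dezember")

RULES = [
    re.compile(r"(§|\bAbs\.|\bS\.) (?=\d)"),       # '§ 315', 'Abs. 3', 'S. 78'
    re.compile(r"(\d) (?=ff?\b)"),                 # '78 ff'
    re.compile(r"(\d\.) (?=(?:" + MONTHS + r")\b)"),  # '1. Mai'
    re.compile(r"(\d) (?=(?:km|kg|m|g)\b)"),       # '100 m', '5 kg'
]


def de_narrow_spaces(text: str) -> str:
    """Insert narrow no-break spaces for the German cases sketched above."""
    text = ABBR_RE.sub(lambda m: m.group(0).replace(" ", NNBSP), text)
    for rule in RULES:
        text = rule.sub(r"\1" + NNBSP, text)
    return text


for sample in ["z. B.", "S. 78 ff", "1. Mai", "100 m", "Im Jahr 2525 und"]:
    print(repr(de_narrow_spaces(sample)))
```

Using explicit word lists keeps "Im Jahr 2525 und" untouched, at the price of missing any unit or abbreviation that is not listed.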

I imagine we'd want to change these rules so they're handled in the i18n files instead of in the parser itself (since we'd want them to vary per language). CC'ing Niklas to see if he has any thoughts on the i18n aspects.
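One possible shape for per-language rules, sketched in Python and not tied to MediaWiki's actual i18n files: a table keyed by language code, where a language without an entry gets no automatic spacing. The table name `RULES_BY_LANGUAGE`, the function `apply_language_rules`, and the sample rules are all hypothetical.

```python
import re

NBSP = "\u00a0"   # U+00A0 NO-BREAK SPACE
NNBSP = "\u202f"  # U+202F NARROW NO-BREAK SPACE

# Hypothetical per-language rule table: (pattern, replacement) pairs.
RULES_BY_LANGUAGE = {
    "fr": [
        (re.compile("(.) (?=[?:;!%\u00bb])"), r"\1" + NBSP),
        (re.compile("(\u00ab) "), r"\1" + NBSP),
    ],
    "de": [
        (re.compile(r"(\d) (?=(?:km|kg|m|g)\b)"), r"\1" + NNBSP),
    ],
}


def apply_language_rules(text: str, lang: str) -> str:
    """Apply only the rules registered for the given content language."""
    for pattern, replacement in RULES_BY_LANGUAGE.get(lang, []):
        text = pattern.sub(replacement, text)
    return text


print(repr(apply_language_rules("Ça va ?", "fr")))
print(repr(apply_language_rules("Ça va ?", "de")))  # unchanged
print(repr(apply_language_rules("100 m", "de")))
```

With this layout the French rules would no longer run on wikis in other languages, and each language's rules could be reviewed separately.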

The typographic rules[1] in Germany are quite complicated:

One of the scary things about this type of scheme is that it's invisible to the user. If there are exceptions to the rules, the user cannot override them (well, maybe they could do things like insert &#32;, but how to do that is not obvious to the user and would be very difficult for them). Hence we'd want the rules to have effectively no false positives.

(In reply to comment #9)

Hence we'd want to make
the rules have effectively no false positives.

I fully agree with that.
(And actually that was one of the reasons, why I asked for a management system where admins can quickly change regexps. Because it's quite easy to overlook such false positive cases a priori.)

However, cases like "123 %" have shown that we don't have to fear false positives too much.

Yes. Since that bug is older, let's continue the discussion over there.

*** This bug has been marked as a duplicate of bug 13619 ***