
Unicode combining characters are difficult to edit in some browsers
Closed, ResolvedPublic

Description

Author: Gerard.meijssen

Description:
People on the Lingala Wikipedia complain about a lack of support for characters like the ɔ́ that should show as one character and do not. They also do not show properly in bugzilla (certainly in edit mode).

They will not collaborate in Betawiki as a consequence. I posted on my blog about this, I posted on the Afrophone mailing list, and I got this additional comment from Renaud Gaudin:

"Also, I've been reporting for ages that the interface should use the
same font as text so that "buttons" are rendered properly.

On a French Windows with Internet Explorer (which is what's used in a
large part of west Africa), the "Edit" button (but also a significant
part of Interface texts) displays unknown (square) characters...

It makes it hard to convince people to first use Wikipedia, then to
contribute to it."

Thanks,

GerardM

Version: 1.13.x
Severity: major

Details

Reference
bz16697

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 21 2014, 10:24 PM
bzimport set Reference to bz16697.
bzimport added a subscriber: Unknown Object (MLST).

I think it would be great if we had some URLs here with examples, and a few screenshots with 'observed' and 'expected' behaviour.

Gerard.meijssen wrote:

Character misrepresented in edit mode on Meta

Attached:

Lingala_edit.jpg (800×1 px, 142 KB)

Gerard.meijssen wrote:

Character properly represented once saved on Meta

Attached:

Lingala_final_form.jpg (800×1 px, 132 KB)

Gerard.meijssen wrote:

Comment on attachment 5591
Character properly represented once saved on Meta

The behaviour is for me the same on the Lingala Wikipedia. This is in FF3.04. For other browsers the behaviour is said to be different. GM

Gerard.meijssen wrote:

Lingala characters in edit mode using Chrome

This is what the edit page looks like on the ln.wikipedia. Chrome is clearly inferior to Firefox in supporting the special characters in edit mode. In final form it is OK.

Attached:

Chrome_ln.jpg (800×1 px, 122 KB)

This "ɔ́" is a Unicode compound character, consisting of a base character ("ɔ") followed by a combining accent (" ́").

The bad news is that plenty of software is a little spotty about handling such characters cleanly. In this case, that means the browsers and the fonts.

Pasting "ɔ́" into Firefox 3 on my Mac seems to work fine. If it's not functioning in other current browsers, bug reports should be filed in the appropriate locations so it can be fixed for future versions. I'm not sure there's much else to be done on our end... working with the characters relies on them actually being supported by the browsers!
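As an illustration of the base-plus-combining structure described above, the following is a minimal Python sketch that lists the code points in the sequence. The helper name `describe` and the constant `OPEN_O_ACUTE` are made up for this example; only the standard `unicodedata` module is used.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, combining class) for each code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.combining(ch))
        for ch in text
    ]

# The sequence discussed in this report, written with escapes so the
# example does not depend on how this page itself is rendered.
OPEN_O_ACUTE = "\u0254\u0301"

for row in describe(OPEN_O_ACUTE):
    print(row)
```

A combining class of 0 marks a base character, while a non-zero class marks a combining mark that the renderer and font are expected to position on the preceding base.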

Gerard.meijssen wrote:

(In reply to comment #6)

Thanks Brion, both FF and Chrome show "Mbɔ́tɛ!" properly in final form on the Wiki. http://meta.wikimedia.org/wiki/User:GerardM/Lingala IE does not. In edit mode, FF shows the diacritic separately while Chrome does not know how to handle it.

So there is a difference between final form and edit. Is there a difference in the font support indicated by MediaWiki?
Thanks,

GerardM

This is entirely dependent on the text rendering and font support of the browser and the operating system it's running on.

Some quick tests on my boxes:

Safari 3 / Mac 10.5 -- edit good, page good
Safari 3 / Win XP -- edit good, page good

Firefox 2 / Ubuntu 7.10 -- edit good, page good
Firefox 3 / Mac 10.5 -- edit good, page good
Firefox 3 / Win XP -- edit and page both show base and combining character ok, but incorrectly spaced (not composited into a single visible glyph)

Chrome / Win XP -- edit and page both shown with correct composition but a big box instead of the base character

IE 7 / Win XP -- edit and page both show good base character followed by a totally unrelated character (looks like Hebrew or something, not the expected acute accent at all!)

lang.support wrote:

Assuming

  • Windows XP SP2
  • Assuming complex script support enabled
  • Assuming OpenType fonts with appropriate mark positioning support
  • Assuming that DejaVu Sans isn't installed

There will still be a problem

The font fallback in the CSS rules is inappropriate and could be considered broken in this instance. The CSS rules should reflect language-specific styling needs.

It isn't a browser issue.

It is:

  1. an end user issue: fonts and font rendering support are needed; and
  2. a Wikipedia issue: assuming the end user has things set up correctly, the CSS rules are inadequate

Andrew

In what way are they inadequate, specifically, and what would you recommend as a change?

(Consider downloadable web fonts to be a potential option, though that brings difficulties with it.)

lang.support wrote:

For ln.wikipedia.org current css rules controlling font display would be

#content, #bodyContent {
font-family:'DejaVu Sans','Segoe UI','Lucida Sans Unicode','Lucida Grande',Tahoma,'Arial Unicode MS','Lucida Sans',Verdana,sans-serif;
}

Taking each font in turn:

DejaVu Sans - OK for Lingala
Segoe UI - I'd need to test, should be ok, but Vista font
Lucida Sans Unicode - cannot correctly render all Lingala characters, no mark, mkmk OpenType features
Lucida Grande - Mac OS font, don't know if this supports Lingala or not, would need to test.
Tahoma - version 3.0.6 (on WinXP) does not support Lingala, Version 5.0 may support Lingala, would need to test.
Arial Unicode MS - cannot correctly render all Lingala characters, no mark, mkmk OpenType features
Lucida Sans - support for Lingala unknown
Verdana - version 3.0.6 (on WinXP) does not support Lingala, Version 5.0 may support Lingala, would need to test.

General rule of thumb for CSS font family fallback: choose the most appropriate non-core fonts first, then fall back to core OS fonts.

So a rule like

#content, #bodyContent {
font-family:'DejaVu Sans','Segoe UI','Lucida Grande',Tahoma,Verdana,sans-serif;
}

would be better

Although best would be to add other downloadable fonts suitable for African languages:

#content, #bodyContent {
font-family:'DejaVu Sans','Charis SIL','Gentium Book Basic','Liberation Sans','Doulos SIL','African Sans serif','African Sans','Segoe UI','Lucida Grande',Tahoma,Verdana,sans-serif;
}

Depending on Tahoma and Verdana v. 5.0 support for Lingala, I'd be tempted to strip these from the CSS rules. Keeping them may or may not help Vista users, but could cause problems for users on older Windows versions and for Mac users who have an older version of MS Office installed.

Segoe UI and Lucida Grande: I would need to test these when I'm back in the office on Monday.

Also, I like using monospaced fonts for textareas, and I find it useful to explicitly state font rules for the textarea element, so

#content, #bodyContent, textarea {
font-family:'DejaVu Sans','Charis SIL','Gentium Book Basic','Liberation Sans','Doulos SIL','African Sans serif','African Sans','Segoe UI','Lucida Grande',Tahoma,Verdana,sans-serif;
}

might work better.

ayg wrote:

(In reply to comment #11)

For ln.wikipedia.org current css rules controlling font display would be

#content, #bodyContent {
font-family:'DejaVu Sans','Segoe UI','Lucida Sans Unicode','Lucida Grande',Tahoma,'Arial Unicode MS','Lucida Sans',Verdana,sans-serif;
}

If those are wrong, bring it up at [[ln:MediaWiki talk:Monobook.css]], not here. We (developers/sysadmins) don't have control over what CSS rules sysops choose to add for their own wikis. The MediaWiki default is not to specify fonts at all for any language.

Gerard.meijssen wrote:

A solution should not be confined to the ln.wikipedia. It should also work on Commons or Meta. When CSS rules are supposed to be language specific, then the support of the CSS should be based on what language is selected in the user preferences.

MediaWiki is software that should work for any language. It only needs to specify fonts that work.
Thanks,

GerardM

The right place for such a rule would probably be the default Lingala style sheet ([[betawiki:MediaWiki:Common.css/ln]]).

But this is partially a browser problem: browsers (except IE, of course) override the font set in the style sheet for characters which cannot be displayed in that font, and should do the same for combined characters. (At least in normal text; it's less obvious what would be the right thing to do in an edit box where displaying multiple characters as one has its own usability problems.)

A short-term solution might be a bot replacing the combination with a single character, for which there is better browser support.
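To make the edit-box concern concrete, here is a rough Python sketch that counts code points versus visually combined units in the greeting quoted earlier in this thread. The function name `clusters` is hypothetical, and the grouping is a simplification (it only attaches combining marks to the preceding character), not a full implementation of Unicode grapheme cluster segmentation.

```python
import unicodedata

def clusters(text):
    """Group each combining mark with the character that precedes it."""
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch) != 0:
            # Non-zero combining class: attach the mark to the previous group.
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups

# The greeting from earlier in the thread, written with escapes:
# M, b, open o + combining acute, t, open e, !
greeting = "Mb\u0254\u0301t\u025b!"

print(len(greeting), len(clusters(greeting)))
```

The two counts differ by one, which is the mismatch an editor has to handle when the cursor or the backspace key moves across the combined character.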

ayg wrote:

(In reply to comment #14)

The right place for such a rule would probably be the default Lingala style sheet
([[betawiki:MediaWiki:Common.css/ln]]).

Languages should *not* use *.css for default stylesheets (or *.js for default JS). They are meant *only* for user customizations. Any language-specific code put there will not be maintainable: changes made in MediaWiki will not stack with user customizations, and so will not take effect on upgrade. If default fonts are necessary for some languages, these should be added through some separate, specially-designed mechanism. The only language that uses its own CSS file right now is German, for bug 1553, and that's really not ideal (although it's a single rule that's not likely to change or become obsolete, much better than a list of fonts).

A short-term solution might be a bot replacing the combination with a single
character, for which there is better browser support.

This could be done by the software, if the character combinations are supposed to be canonically identical. Actually, I thought Unicode normalization was supposed to do that anyway, but maybe I'm wrong on that.

(In reply to comment #15)

(In reply to comment #14)

A short-term solution might be a bot replacing the combination with a single
character, for which there is better browser support.

This could be done by the software, if the character combinations are supposed
to be canonically identical. Actually, I thought Unicode normalization was
supposed to do that anyway, but maybe I'm wrong on that.

There is no single character for ɔ́, afaik.

lang.support wrote:

(In reply to comment #12)

The MediaWiki default is not to specify
fonts at all for any language.

Which is actually a good approach, stylesheets should be language neutral as much as possible.

But there are two scenarios for content:

  1. all content in a single page is monolingual - in which case all is fine
  2. content is predominantly in one language, but contains words, phrases, or quotes from other languages - this is a more problematic scenario, since it would require different fonts to be used to display different languages. For most of the major languages that have full OS support, the current approach works fine: OS font-linking/switching and browser-based approaches work. But for lesser used languages where there is no official OS support, things become more problematic, since different fonts may need to be specified for that language as distinct from the text of the surrounding page.

The easiest and simplest approach is the use of language tagging in the markup and then users can tie their own css rules to the language markup.

Essentially the nature of Wikipedia content, means that the first fall back is OS and web browser fallback mechanisms, the second fall back is end user CSS overrides.

For monolingual content in a language-specific wiki, it's possible to have some sensible CSS rules in language-specific customisations to monobook.css.

For content that includes words and phrases in other languages, the most sensible approach is language markup, this allows CSS rules to be created for wiki specific monobook.css or user specific CSS rules.

lang.support wrote:

(In reply to comment #13)

A solution should not be confined to the ln.wikipedia. It should also work on
Commons or Meta. When CSS rules are supposed to be language specific, then the
support of the CSS should be based on what language is selected in the user
preferences.

But a user may have their preferences set to one language and may also work in other languages. So that approach will work in many cases but not in all.

MediaWiki is software that should work for any language. It does only need to
specify fonts that work.

It does work for any language. But for languages not supported officially by major OS vendors, things have always been more problematic; the advent of Unicode doesn't change that.

There are limitations to web browsers, operating system support, and even HTML and CSS specifications.

In an ideal world there would be comprehensive Latin script OpenType fonts available by default within an OS. But even if there are, web browsers and CSS provide no way to control which OpenType features are used for specific HTML documents. So it's impossible to use a single Latin script font for all Latin script languages, even if it has language-specific features and alternative glyphs for various languages.

The best approach I've found for working with multiple scripts and languages (some projects I've worked with cover up to 100 languages) is to have the main CSS rules be language neutral, tag the primary language of a document, provide mechanisms for authors to mark up changes of language, allow language-specific styling independent of the main styling for the theme/skin, and allow users to override or control aspects of the language-specific styling.

lang.support wrote:

(In reply to comment #15)

(In reply to comment #14)

This could be done by the software, if the character combinations are supposed
to be canonically identical. Actually, I thought Unicode normalization was
supposed to do that anyway, but maybe I'm wrong on that.

But there are many base character + combining character combinations in the Latin and Cyrillic scripts that do not and will never have precomposed forms

The character sequence open O + combining acute <U+0254 U+0301> is both the NFC and NFD form
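This can be checked with a short Python sketch using the standard `unicodedata` module. The helper name `normal_forms` and the contrasting example with a plain "o" are additions for illustration, not part of the original comment.

```python
import unicodedata

def normal_forms(text):
    """Return the NFC and NFD forms of text as lists of code point labels."""
    def labels(s):
        return [f"U+{ord(ch):04X}" for ch in s]
    return {form: labels(unicodedata.normalize(form, text)) for form in ("NFC", "NFD")}

open_o_acute = "\u0254\u0301"   # open o + combining acute: no precomposed form
plain_o_acute = "o\u0301"       # o + combining acute: composes to U+00F3 under NFC

print(normal_forms(open_o_acute))
print(normal_forms(plain_o_acute))
```

For the open o sequence both forms keep the two code points, so normalization alone cannot turn it into a single character; for the plain "o" the NFC form is the single code point U+00F3.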

Gerard.meijssen wrote:

MediaWiki experience with a change for Bug 1941 showed that a change FROM monospace removed the ability for Safari to work properly. It is likely that the change TO monospace for the edit screen is what keeps Firefox and IE from working as well.
Thanks,

GerardM

lang.support wrote:

(In reply to comment #20)

MediaWiki experience with a change for Bug 1941 showed that a change FROM
monospace removed the ability for Safari to work properly. It is likely that
the change TO monospace for the edit screen for FireFox and IE will make these
browsers work as well for these browsers.
Thanks,

Although, Gerard, the languages you are interested in are unlikely to be well supported by monospaced fonts. Catch-22.

Gerard.meijssen wrote:

Betawiki has a gadget that allows you to cycle from monospace, to sans, to serif. This shows that monospace breaks the usability of some of the languages we support. The only logical conclusion is to change from monospace to another style of fonts. No catch-22 for me. If something is not usable, we use something else.

This WILL work for Firefox, Opera and Safari. Internet Explorer and Chrome are both currently broken: IE shows the wrong character, and Chrome shows no character.

lang.support wrote:

(In reply to comment #22)

Betawiki has a gadget that allows you to cycle from monospace, to sans, to
serif. This shows that monospace breaks the usability of some of the languages
we support. The only logical conclusion is to change from monospace to an other
style of fonts. No catch-22 for me. If something is not usable, we use
something else.

This WILL work for Firefox, Opera and Safari. Internet Explorer and Chrome are
both currently broken; IE shows the wrong character Chrome shows no character.

For lesser used languages, this approach assumes:

  1. End users have installed all language support available within the OS. Currently the only OS that I know of that installs all available language support by default is Windows Vista (maybe Mac OS too, I don't know enough about Mac OS to say). Windows XP and older versions of Windows, as well as most if not all Linux distros, only install a minimal set of language support. Full language support has to be specified during the install or installed afterwards.
  2. End users have downloaded and installed appropriate fonts to cover all the languages covered by Betawiki, since lesser used languages may not be supported.
  3. Generic font families assume that the end user has modified the default browser generic fonts per writing script to use appropriate fonts.
  4. Appropriate (monospaced, serif, sans-serif) fonts are available for the language in question.

Gerard.meijssen wrote:

That is all well and good. I do have all the appropriate fonts installed. It is monospaced for the edit window that fails me for most browsers. When problems are eliminated, there is at least a fighting chance of getting it right.
Thanks, GerardM

lang.support wrote:

(In reply to comment #23)

  1. That appropriate (monospaces, serif, sans-serif) fonts are available for the language in question

A clarification on my comment, with respect to monospaced fonts. It is important to note that there are few monospaced fonts (if any) that support various lesser used languages. And for many writing scripts monospaced fonts are inappropriate.

Looking at the Windows environment, for instance, you'll find that for most scripts there isn't a monospaced font available.

For various dubious reasons certain browsers default to monospaced fonts for displaying text in certain html elements. This is poor internationalisation. It doesn't scale in a truly multilingual environment.

Using generic font families can be a problem when you are developing or maintaining a truly multilingual environment. They are useful in themes and skins, so that the themes or skins can be made language neutral, but the themes or skins then need to be overlaid with language specific styling to ensure that all text has a chance of displaying.

To have something concrete, I propose adding a CSS override for languages with no good monospace font(s). The CSS would use a serif or sans-serif font style for textareas. There should also be a way for users to switch back to monospaced.

Implemented solution I proposed above in r53874.