
WebFonts converts some unicode sequences to older deprecated forms
Closed, ResolvedPublic

Description

WebFonts unnecessarily converts some Malayalam characters to old Unicode representations (http://unicode.org/versions/Unicode5.1.0/#Malayalam_Chillu_Characters). The old code points are not supported by much software, including Apple's Safari and Google Chrome (Chromium), so WebFonts causes a real problem for reading. It also breaks the user's ability to interlink articles by copy-pasting titles.


Version: unspecified
Severity: normal

Details

Reference
bz29005

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:33 PM
bzimport set Reference to bz29005.

The problem with Chromium is their bug. We have already reported it.
http://code.google.com/p/chromium/issues/detail?id=45840
If Apple Safari does not show Malayalam properly, that is their bug too. Consider filing a bug (if possible).

From http://unicode.org/versions/Unicode5.1.0/#Malayalam_Chillu_Characters
"Because older data will use different representation for chillus, implementations must be prepared to handle both kinds of data."

If Chromium or Apple Safari breaks this standard, please report bugs against them. Chrome is known to be buggy with Malayalam, and all Malayalam news portals and blogs are aware of this issue. From the above bug, you can see that it is not limited to Malayalam: Chrome cannot render "Srilanka" written in Sinhala because of that bug.

The normalization rules in WebFonts try to use the least common denominator for the encoding, so that users on both older and newer operating systems can read and use the content (e.g. copy it to their local machine).

Since whatever a user writes or edits on the Malayalam wiki is force-converted to atomic chillu characters, there is no problem with interlinking articles by copy-pasting. The normalization code automatically converts it to the correct link. If this is not the case, please show me an example.
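The forced conversion mentioned here can be sketched as a small codepoint mapping. This is an illustrative sketch only; the function name and table are mine, not MediaWiki's actual normalization code:

```javascript
// Pre-5.1 chillu sequences (consonant + virama U+0D4D + ZWJ U+200D)
// mapped to the atomic chillu codepoints defined in Unicode 5.1.
const ATOMIC_CHILLUS = {
  '\u0D23\u0D4D\u200D': '\u0D7A', // -> CHILLU NN
  '\u0D28\u0D4D\u200D': '\u0D7B', // -> CHILLU N
  '\u0D30\u0D4D\u200D': '\u0D7C', // -> CHILLU RR
  '\u0D32\u0D4D\u200D': '\u0D7D', // -> CHILLU L
  '\u0D33\u0D4D\u200D': '\u0D7E', // -> CHILLU LL
};

// Normalize text to the atomic form before saving, so that titles and
// links compare equal no matter how the chillu was typed.
function toAtomicChillu(text) {
  return text.replace(
    /[\u0D23\u0D28\u0D30\u0D32\u0D33]\u0D4D\u200D/g,
    seq => ATOMIC_CHILLUS[seq]
  );
}
```

With both the stored title and the pasted link passed through such a function, the two end up byte-identical, which is why copy-paste interlinking keeps working.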

As long as WebFonts does not break any MediaWiki functionality, the normalization code is going to stay.

As an additional choice, I am going to add the AnjaliOldLipi font as the default font for Malayalam. It has dual encoding implemented. To avoid any issues with font preference, I am going to list the fonts in alphabetical order: Anjali, Meera, Rachana, Raghu Malayalam.

Chillu characters from Unicode 5.1 are not affected by those Chromium and Safari bugs. The current script converts characters that are not affected by the bug into those that are. I think Malayalam Wikipedia is not using WebFonts yet, and besides Malayalam Wikipedia there may be many other MediaWiki installations; I wonder how this can help interlinking by copy-pasting titles there. You may use these fonts created by Junaid P V (https://github.com/junaidpv/Malayalam-Fonts/archives/master) with the new chillu characters; the repository contains all the fonts listed here.

I am not sure how WebFonts behaves on mobile devices, but none of the Apple devices, including iOS, support the old encoding. This is important because traffic from them is increasing day by day.

MediaWiki itself has problems with joiner-based characters.

MediaWiki itself has problems with joiner-based characters.

Really, like what? (I'm just curious). Is there a bug about it?

(In reply to comment #4)

MediaWiki itself has problems with joiner-based characters.

Really, like what? (I'm just curious.) Is there a bug about it?

Before the MediaWiki 1.16 deployment to Wikimedia wikis, search errors were common because MediaWiki ignored the joiner when searching and considered both forms (characters with and without the joiner) to be the same. After the 1.16 deployment, the chillu characters in the database were switched to the 5.1 representation, so I am not sure whether the problem persists in MediaWiki now ;-) But I haven't heard that anyone fixed it.
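The old search behaviour described here can be illustrated with a hypothetical sketch (not MediaWiki's actual search code): stripping the joiner makes two different Malayalam words collide.

```javascript
// Hypothetical pre-1.16 behaviour: the search key ignored joiners,
// so words differing only by ZWJ (U+200D) were treated as identical.
function searchKey(text) {
  return text.replace(/[\u200C\u200D]/g, ''); // strip ZWNJ and ZWJ
}

const avan  = '\u0D05\u0D35\u0D28\u0D4D\u200D'; // with chillu (old encoding)
const avanu = '\u0D05\u0D35\u0D28\u0D4D';       // bare consonant + virama

console.log(avan === avanu);                       // false: distinct words
console.log(searchKey(avan) === searchKey(avanu)); // true: they collide
```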

Please see an old screenshot here http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/46413/

All results in the screenshot are displayed incorrectly.

(In reply to comment #3)

Chillu characters from Unicode 5.1 are not affected by those Chromium and Safari bugs. The current script converts characters that are not affected by the bug into those that are.

Let us not mix up the Chromium bug and an optional feature of WebFonts.

I think Malayalam Wikipedia is not using WebFonts yet, and besides Malayalam Wikipedia there may be many other MediaWiki installations; I wonder how this can help interlinking by copy-pasting titles there.

WebFonts is an _extension_. For MediaWiki instances outside Wikipedia, one can decide whether it should be installed and enabled or not. If installed, a user can disable it completely using the user preferences screen, or temporarily using the menu. The extension is configurable: one can completely remove the normalization rules and use one's own fonts. And it is well documented; see http://www.mediawiki.org/wiki/Extension:WebFonts. So whether the rules are necessary for an instance is up to the admin of the wiki.

You may use these fonts created by Junaid P V (https://github.com/junaidpv/Malayalam-Fonts/archives/master) with the new chillu characters; the repository contains all the fonts listed here.

We cannot and should not use unofficial fonts from a random location. Those fonts are not even a proper fork, nor maintained by typography experts. If somebody reports a bug in a font, I need to contact the typographers, and for that I should use official, upstream fonts. The AnjaliOldLipi font has dual encoding implemented; the upstream version from the Varamozhi project is already added to WebFonts, and that is the default font.

I am not sure how WebFonts behaves on mobile devices, but none of the Apple devices, including iOS, support the old encoding. This is important because traffic from them is increasing day by day.

That is a separate topic altogether. If mobile devices have broken rendering, it is a bug in them. Sri Lanka is not going to change its country name just because mobile devices do not render it properly when written in Sinhala, or because traffic to the Sinhala Wikipedia is low. The mobile phones will fix their bug.

Also note that there is nothing called an old encoding. "Old encoding" really means old data, and there is no concept of old data versus new data; it is just data. The dual encoding of Malayalam is a very complicated and serious issue and cannot be solved by WebFonts. We discussed this during the language committee meeting and are trying to find a solution. And please don't bring the chillu discussion here; we have had enough of it :)

(In reply to comment #6)
I am sorry, but none of those reasons is good enough for converting readable "data" to non-readable "data". Junaid P.V. is not a random person; as you know, he contributes his time to Malayalam computing, and he is the developer of the Narayam extension, which adds input methods for various languages to the text input fields in MediaWiki. AGF :)

(In reply to comment #6)

(In reply to comment #3)

Chillu characters from Unicode 5.1 are not affected by those Chromium and Safari bugs. The current script converts characters that are not affected by the bug into those that are.

Let us not mix the chromium bug and an optional feature of webfonts.

I'm confused by this and then praveenp's response:

(In reply to comment #7)

I am sorry, but none of those reasons is good enough for converting readable "data" to non-readable "data".

If WebFonts is converting "chillu characters from Unicode 5.1", that would be a bug, right?

Or am I reading it wrong?

(In reply to comment #8)

If WebFonts is converting "chillu characters from Unicode 5.1", that would be a bug, right?

I think he created it as a feature, but ultimately, to an end user, it is a bug. It could be an option for users (I don't know what for), but serving converted chillu characters by default is a usability failure.

Gerard.meijssen wrote:

The presentation is determined by how the characters are composed. Many people prefer to read the language in a particular way, and this is not necessarily the latest way Unicode has it.

By allowing for these differences in presentation, our audience is increased, while at the same time we retain the latest Unicode version in the backend.

Many people prefer to read the language in a particular way

Please, please, please, no one make that into a preference. We're talking about Unicode code points, not background colours.

The two ways of encoding the character *should* be identical to the user; they aren't, mostly due to poor software support, but they should be [in the ideal world; well, in the ideal world there wouldn't be two ways to encode a single character...].

While it's a little weird for MediaWiki on one end to force-convert everything to the 5.1 encoding and spit it out, then on the JS side run a regex through the entire page converting it back to the older encoding, it doesn't seem horrible if it makes things work for everyone.
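That round trip might be sketched as follows. This is an assumption-laden illustration (the map and function name are invented), not the extension's actual code:

```javascript
// Reverse map: atomic Unicode 5.1 chillus back to the older
// consonant + virama + ZWJ sequences that pre-5.1 fonts expect.
const OLD_SEQUENCES = {
  '\u0D7A': '\u0D23\u0D4D\u200D', // CHILLU NN
  '\u0D7B': '\u0D28\u0D4D\u200D', // CHILLU N
  '\u0D7C': '\u0D30\u0D4D\u200D', // CHILLU RR
  '\u0D7D': '\u0D32\u0D4D\u200D', // CHILLU L
  '\u0D7E': '\u0D33\u0D4D\u200D', // CHILLU LL
};

// Applied client-side, only when the selected webfont has no glyphs
// for the atomic chillus.
function toOldEncoding(text) {
  return text.replace(/[\u0D7A-\u0D7E]/g, ch => OLD_SEQUENCES[ch]);
}
```

Since this runs purely on the displayed page, the stored wikitext stays in the 5.1 form throughout.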

Anyways, going back to the original bug:

*For the issue of Safari being stupid and stripping ZWJs: since we're doing this on the client side anyway, might I suggest that WebFonts detect which browser is in use and only normalize like that when the browser isn't broken in that way (or disable those fonts in the choice menu if they require such normalization)?

*Per comment 1, I'm also unclear how this could break interlinking, since titles are all normalized to one form on the MediaWiki side. (I guess it could if the content language is not ml, since we only do the normalization for ml, but that seems like an edge case.)

The rendering issue in Google Chrome for ZWJ/ZWNJ got fixed in Chrome 12, i.e. http://code.google.com/p/chromium/issues/detail?id=45840 is fixed now.

Google Chrome also did not support webfonts with complex scripts (http://code.google.com/p/chromium/issues/detail?id=78155); this too got fixed in Chrome 12.

(In reply to comment #11)

*Per comment 1, I'm also unclear how this could break interlinking, since titles are all normalized to one form on the MediaWiki side. (I guess it could if the content language is not ml, since we only do the normalization for ml, but that seems like an edge case.)

:) http://wiki.smc.org.in is a Malayalam site which uses the English interface.

And people are still terribly addicted to English while using the Internet.

Is it possible for a site admin to set $wgFixMalayalamUnicode to false in DefaultSettings.php to keep users' contributions untouched? Popular Windows and Mac tools for typing Malayalam use the 5.1 encoding, so copy-pasting a title for linking will surely fail. I know the default implementation, AnjaliOldLipi, gives exactly what is in the database, but somehow it is buggy (lots of spelling mistakes) and people will eventually switch to some other available font for better display.

(In reply to comment #11)

Many people prefer to read the language in a particular way

Please, please, please, no one make that into a preference. We're talking about Unicode code points, not background colours.

The two ways of encoding the character *should* be identical to the user; they aren't, mostly due to poor software support, but they should be [in the ideal world; well, in the ideal world there wouldn't be two ways to encode a single character...].

In an ideal world, we would expect that. But in the real world, dual encoding exists, at least for Malayalam. To make things simple:

A letter L was written in the L1 way. In 2009, Unicode said it can be written in the L2 way too, and asked applications to support both. Obviously many applications failed to do this. Unicode did not define that L1 and L2 are equal, so there are big issues with search, sorting, and what not. ml Wikipedia decided to keep the data in L2 using a forced conversion. Many websites decided to stick with L1 for stability and backward-compatibility reasons until there is a Unicode definition stating L1 == L2, because that is the minimum version they (not limited to websites; operating systems and applications too) can support. At the same time, L1 is well supported in the majority of applications (Google Chrome used to support it; from Chrome 6.0 to Chrome 11 it was broken, and it is now fixed). There are fonts which do not show the same glyph for L1 and L2, because the typographers care about the language and are aware of the dual encoding issues. So, to make everybody happy, just for these extreme cases, I added a feature to do the L2 -> L1 conversion, so that users can view and use L1, which has been working on their systems for many years. It is not meant for all languages or all fonts, and it is a configuration entry.
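Concretely, for chillu N the L1/L2 pair looks like this. Since the two forms were never declared canonically equivalent, even Unicode normalization does not unify them (a quick check, not tied to any particular application):

```javascript
const l1 = '\u0D28\u0D4D\u200D'; // NA + virama + ZWJ (pre-5.1 form, "L1")
const l2 = '\u0D7B';             // atomic CHILLU N (Unicode 5.1, "L2")

// Different strings, and NFC does not merge them: no canonical
// equivalence was ever defined between the two representations,
// so plain string search and sorting see them as different text.
console.log(l1 === l2);                                   // false
console.log(l1.normalize('NFC') === l2.normalize('NFC')); // false
```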

L1 vs L2 is a very controversial issue, and it becomes more complex when I say that there is more than one L with this issue.

*Per comment 1, I'm also unclear how this could break interlinking, since titles are all normalized to one form on the MediaWiki side. (I guess it could if the content language is not ml, since we only do the normalization for ml, but that seems like an edge case.)

You are correct. It does not break any interlinking.

Since there is no reproducible case of this option breaking anything, and I have explained as best I can why it was added, shall we close this?

Detecting broken versions of the browser (in our case Chrome 6 to 11) and changing the extension's behavior based on that... do we really need to do that? Considering that Chrome did not support webfonts for complex scripts (Malayalam is an example) at all until Chrome 11, it seems unnecessary. Let us just declare that "Chrome was broken, and did not support Malayalam rendering or Malayalam webfonts until version 12". I hope that helps. The proof is Chrome bugs 45840 and 78155.

(In reply to comment #14)

Please do not mix other issues with the current problem. Even if all the other bugs, including those in Safari and mobile devices, get fixed like Chromium's, why would we really want to convert the encoding in the database to the old encoding without the reader's direct request?

Why can't this conversion be an opt-in option rather than the default?

Sticking with a Unicode 5.0 font is not a good idea for Malayalam. Unicode corrected errors like the representation of "zero", and new code points will be included for new symbols (e.g. the Rupee sign in Unicode 6.0).

The implementation of the default AnjaliOldLipi is buggy.

So reopening.

The Chromium bug is still open! (Chromium 12.0.742.91 (87961), Ubuntu 11.04.)

(In reply to comment #15)
why would we really want to convert the encoding in the database to the old encoding without the reader's direct request?

This is not what happens. WebFonts only converts what is displayed (and even that happens only for some particular fonts). MediaWiki itself normalizes all data to a specific format (which is the newest format in Unicode, as far as I know).

junu.pv+public wrote:

(In reply to comment #17)

This is not what happens. WebFonts only converts what is displayed (and even that happens only for some particular fonts). MediaWiki itself normalizes all data to a specific format (which is the newest format in Unicode, as far as I know).

But WebFonts converts data in text fields too. That will cause problems on wikis that do not have normalisation enabled, for example Wikimedia Commons. If we open and save pages on such wikis, data will be converted unintentionally. I think this is a critical bug.

(In reply to comment #18)

But WebFonts converts data in text fields too. That will cause problems on wikis that do not have normalisation enabled, for example Wikimedia Commons. If we open and save pages on such wikis, data will be converted unintentionally. I think this is a critical bug.

Normalization is enabled in wikis as a fix for a reported bug. Without it, Firefox and Chrome extensions like fix-ml, people using InScript keyboards, and keyboards other than Narayam will surely enter the chillus, the AU vowel sign, and NTA in the 5.0 Unicode way. This was considered a bug, and that is why normalization is enabled in the other wikis. If it is not enabled in Commons, please file a bug for that.

Please understand that dual encoding is an issue, and the Malayalam normalization in the wiki is a workaround, not a solution. The solution should come from the UTC; I am trying for that. Can we just hold on till we get a reply from TDIL or the UTC on that? Or I can keep only the AnjaliOldLipi font for Malayalam: if I get confirmation from 2-3 people from the Malayalam wiki, I will remove the Meera, Rachana, and RaghuMalayalam fonts, thereby avoiding those fonts' normalization rules. Let me know.

If it is not enabled in commons, please file a bug for that.

Currently, MediaWiki's chillu normalizations (which I believe are what comment 18 is referring to) are only enabled on wikis with a content language of ml (see the docs on $wgFixMalayalamUnicode). It would probably make sense to have those normalizations on multilingual wikis as well (for that matter, it's weird that different normalizations are in use for different language wikis, but I guess there are performance concerns), but anyway, that is a separate bug.

junu.pv+public wrote:

What about removing normalisation within this extension and using hacked fonts that can show all characters for Malayalam?

(In reply to comment #21)

What about removing normalisation within this extension and using hacked fonts that can show all characters for Malayalam?

Is there no font that can show all characters without hacking?

junu.pv+public wrote:

(In reply to comment #22)

Is there no font that can show all characters without hacking?

Only one among the popular fonts used by WebFonts: AnjaliOldLipi. That is what the second paragraph of comment #19 is referring to.

I wonder why this is prioritized low, even though it affects users directly!

(In reply to comment #23)

(In reply to comment #22)

Is there no font that can show all characters without hacking?

Only one among the popular fonts used by WebFonts: AnjaliOldLipi. That is what the second paragraph of comment #19 is referring to.

Now Aruna also displays well: http://sourceforge.net/projects/aruna/

cibucj wrote:

Unicode didn't add the Malayalam chillu characters on a whim. They were added after around two years of deliberation. The UTC finally concluded that the practice that existed before 5.1 was problematic and that standalone characters had to be defined for the Malayalam chillus.

It is a misreading of the standard to say it specifies two different encodings for chillus. There is only one encoding, and that is the standard chillus defined in 5.1. What the standard says is that rendering implementations should be prepared to handle the pre-existing data that was present before chillus were properly defined. So if you are converting code points at all, it should be from pre-existing sequences to standard chillus.

Also, keep in mind that these two sequences (standard chillus and their pre-existing sequence counterparts) will never be canonically equivalent. Characters have to be marked canonically equivalent when they are defined. That didn't happen, so per the rules it never will.

We don't need to play UTC here; rather, we should be thinking about what is best for Malayalam users. If you take stock of things today from the implementation point of view, it is like this:

Standard chillus(>=5.1):

  • All rendering systems support them because they are plain, simple characters without any special joining properties. If the font has the glyph, the rendering engine can display it.
  • Almost all Malayalam fonts support them. In the case of fonts like Rachana and Meera, even though the original versions do not have the chillu characters, versions with the standard chillus are available.

Pre-existing non-standard chillus(<5.1):

  • For rendering systems it is hit or miss. Some browsers on some systems can display them correctly, for example Firefox + Linux or Chrome + Windows; others cannot, for example Chrome + Linux.
  • All Malayalam fonts support them.

Since this is about WebFonts, the fonts are in Wikimedia's control, but the rendering systems are not, so you should go with the option that gets maximum support from rendering systems.

Also, I want to mention the original political positions of Santhosh and me. Santhosh was arguing against standalone chillus and I was arguing for them. However, the decision was made by the UTC years back. Now it is time for implementations to follow the standard so that it will benefit its users. Wikimedia should not get stuck on Unicode 5.0; it should progress to later versions as the Unicode standard progresses.

shijualex wrote:

I find that Santhosh Thottingal is trying to bring the dual encoding issue into the Wikimedia world using WebFonts as a platform. This is not at all acceptable to the Malayalam Wikimedia community. Dual encoding is not an issue inside the wiki projects.

This issue needs to be fixed immediately, considering its severity. I have changed its priority.

Also, some third-party developer needs to handle all the issues related to Malayalam. Santhosh is using his official role in the WMF to play around with Malayalam data to push his personal POV (and his free software organization's POV). This is not acceptable to the Malayalam Wikimedia community.

(In reply to comment #26)

Also, I want to mention the original political positions of Santhosh and me. Santhosh was arguing against standalone chillus and I was arguing for them. However, the decision was made by the UTC years back. Now it is time for implementations to follow the standard so that it will benefit its users. Wikimedia should not get stuck on Unicode 5.0; it should progress to later versions as the Unicode standard progresses.

I don't have any disagreement with this, and I am not for keeping MediaWiki or any software on older Unicode versions. Yes, I had a disagreement with the UTC's decision, but that is irrelevant now. I want to support the new version of Unicode everywhere. I have asked the designers of the fonts to update to new versions. They were not ready, as they disagreed with the UTC's decision. Recently they told me that they are not for sticking with 5.0 and want to move forward. New versions of the fonts will be released, not only with the characters in question but also supporting characters new in later versions of Unicode. I don't think the UTC will take any decision on equivalence. Until then, I wanted to keep the two fonts Meera and RaghuMalayalam as non-default fonts; but to add them, I have to use the character conversion. I asked Shiju many times whether I could remove them, but I did not get a clear answer. I am going to remove them now and will add them back when new versions of those fonts are ready.

Meera and RaghuMalayalam were removed from the Malayalam options in r106502.
They will be reintroduced when upstream releases new versions with the latest Unicode support.
Malayalam now has only AnjaliOldLipi as an option. The Malayalam community (which includes me) can file a new bug if any other fonts need to be added (they should be open source and well maintained, with an active upstream).

Please confirm and close the bug. Thanks

shijualex wrote:

Thanks for fixing this issue. I suggest that someone who is technically competent verify and close this bug.

I am really sorry for the statement "Also, some third-party developer needs to handle all the issues related to Malayalam".

I withdraw that statement and apologize for it. As long as there is no forced conversion of existing Unicode text to an old Unicode version just for displaying the text in a Unicode 5.0 font, I do not have any problem with Santhosh working on any issue related to Malayalam. Sorry once again for that statement.

cibucj wrote:

(In reply to comment #28)

(In reply to comment #26)
forward. New versions of the fonts will be released, not only with the characters in question but also supporting characters new in later versions of Unicode.

That is great news! Thanks, Santhosh. Along with that, I would love to see equal opportunity for users to choose between a modern and a traditional orthography font. From the 1970s onward, kids have been studying the new orthography. Whether we like it or not, that is a fact, and MediaWiki or any software should honor it. However, I don't have a font to suggest; just something to keep in mind for future font selections.

I don't think the UTC will take any decision on equivalence. Until then, I wanted to keep the two fonts Meera and RaghuMalayalam as non-default fonts.

There is no 'until then'. As I mentioned before, that is not going to happen, and no development plans should wait for anything like that.

But to add them, I have to use the character conversion. I asked Shiju many times whether I could remove them, but I did not get a clear answer. I am going to remove them now and will add them back when new versions of those fonts are ready.

What about using the forks of those fonts that have the standard chillus? I know that when additional characters are defined in later Unicode versions, they will not get propagated to those forks when the original fonts add them. However, those characters are really archaic, while chillus are very common, so chillu support should trump support for new archaic characters.

Setting normal priority since it seems like all the urgent issues here are taken care of. Leaving this to Santhosh or someone else to close.

The Meera font was updated to the latest version from upstream in r113808.