
Unicode normalization "sorts" Hebrew/Arabic/Myanmar vowels wrongly
Closed, Declined · Public

Assigned To
None
Authored By
Hippietrail
Jun 13 2005, 1:08 PM
Referenced Files
F2076: Pre_5.1_Normalization_Data.png
Nov 21 2014, 8:32 PM
F2077: utf8
Nov 21 2014, 8:32 PM
F2075: Unicode_5.png
Nov 21 2014, 8:32 PM
F2074: Bibi_-_incorrect_rendering.bmp
Nov 21 2014, 8:32 PM
F2073: Bibi_-_correct_rendering.bmp
Nov 21 2014, 8:32 PM
F2072: bug2399-safari.png
Nov 21 2014, 8:32 PM
F2071: bug2399-windows-same.png
Nov 21 2014, 8:32 PM
F2070: bug2399
Nov 21 2014, 8:32 PM

Description

I can't find a bug report but there is discussion of the Hebrew case here:
http://en.wikipedia.org/wiki/Wikipedia:Niqqud
The Hebrew case seems to have been known for some time.

Now we are noticing a similar problem with Arabic on Wiktionary. There is some
discussion here: http://en.wiktionary.org/wiki/Talk:%D8%AC%D8%AF%D8%A7


Version: unspecified
Severity: major
URL: http://www.mediawiki.org/wiki/Unicode_normalization_considerations

Details

Reference
bz2399

Event Timeline


The bug as I noticed it is caused by the special characters used for vowels,
dagesh, right & left shin, etc. not being sorted properly by the wiki, probably
because they are not recognized as RTL.

Lots of free texts in Hebrew are quite ancient and depend on Niqqud to be read
properly, so fixing this bug should take a high priority IMHO.

Input text is checked for valid UTF-8 and normalized to Unicode Normalization Form C (canonical composed form).
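
For readers who want to see what that check does, here is a minimal Python 3 sketch of the same two steps (an editor's illustration, not MediaWiki's actual PHP code):

import unicodedata

def sanitize(raw: bytes) -> str:
    # Step 1: reject anything that is not well-formed UTF-8.
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid bytes
    # Step 2: convert to Normalization Form C (canonical composed form).
    return unicodedata.normalize("NFC", text)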

Someone needs to provide:

  • Short and exact before and after examples
  • If possible, a comparison against other Unicode normalization implementations to show if we're performing normalization incorrectly

If there is an error in my normalization implementation, and it can be narrowed down, I'd be happy to fix it.
If this is the result of the correct normalization algorithm, I'm not really sure what to do.

dovijacobs wrote:

For a typical before and after example, see the following comparison of versions:

http://he.wikisource.org/w/index.php?title=%D7%90%D7%92%D7%A8%D7%AA_%D7%94%D7%A8%D7%9E%D7%91%22%D7%9F&diff=2794&oldid=1503

In that example, the only change actually made by the user was adding a category
at the end, but when the text was saved, the order of vowels was altered in most
of the words in the text.

If what Brion means is an example of a single word or something like that, it
will be hard to provide examples because only texts contributed until December
show "before" examples.

However, maybe this will help: When vowelized texts from word processors like
Word and Open Office are pasted into Wiki edit boxes, the vowels are
automatically changed to the wrong positions in the wiki coding.

jeluf wrote:

Dovi, what browser are you using, and which version of it? Which operating system?

Looking at the diff that you provided, checking the first few lines, those look
OK to me.
All the letters are identical on the right and on the left.

jeluf wrote:

Comparing with Brion's laptop (he uses MacOS 10.4, I use 10.3.9) the letters
differ between mine and his. There are dots in some of Brion's letters where I
don't see any.

(I was testing in Safari and JeLuF in Firefox. They may render differently, or have been using different fonts...)

Yes, I would very much like to get individual words. You can copy them out of the Wikipedia pages if you like.

Very very helpful for each of these would be:

  • The 'before' formatting, saved in a UTF-8 text file (notepad on Windows XP is ok for this)
  • The 'after' formatting, saved in a UTF-8 text file
  • A detailed, close-up rendering of what it's supposed to look like (screen shot of 'before' correctly rendered, using a large enough font size I can tell the difference)
  • A detailed, close-up rendering of what it ends up looking like

If possible, a description of which bits have moved or changed and how this affects the reading of the text.

eran_roz wrote:

a txt file in utf-8

Attached:

sgb wrote:

I’m using IE 6 in Win2K Professional, and
I’ve been seeing this problem as well. Text
that I created a year or so ago in Arabic
are fine, but if I now open and re-save them
(using all of the same software as before),
Arabic vowel pairs become reversed. I can
provide you here with some examples, one
with the vowels together, and another
separating the vowels with a tashdid
(baseline) ... then you can remove the
tashdid and bring the vowels together to see
what happens. (Tahoma would be a good font
to see this.)

  1. This pair is supposed to look like a little superscript w with an '''over'''line: سسّـَس سسَّس (if you get an '''under'''lined w, it’s reversed).

  2. This pair is supposed to look like a little superscript w with an '''under'''line: سسّـِس سسِّس (if the underline is below the entire '''word''' rather than below the little '''w''', it’s reversed).

  3. This pair is supposed to look like a little superscript w with a '''double over'''line: سسّـًا سسًّا (if you get a w with double '''under'''line, it’s reversed).

  4. This pair is supposed to look like a little superscript w with a '''double under'''line: سسّـٍا سسٍّا (if the double underline is below the entire word rather than below the little w, it’s reversed).

  5. This pair is supposed to look like a little superscript w with a comma above it: سسّـُس سسُّس (if the comma is '''in''' the w rather than above it, it’s reversed).

  6. This pair is supposed to look like a little superscript w with a '''fancy''' comma above it: سسّـٌا سسٌّا (if the fancy comma is '''in''' the w rather than above it, it’s reversed).

As I am looking at this note '''before''' I
save it, everything on my screen appears
correct. After I save it, all six examples
will be reversed. You can insert spaces in
the examples to separate the vowels, and you
should find that they have become the
reverse order from the control examples with
tashdids (baselines) in them.

sgb wrote:

I just now sent the above message (# 8)
concerning Arabic vowel pairs, and I see
that all of the vowel pairs are correct.
Clearly, the "bugzilla" software is
different from the "en.wiktionary.org"
software.

If you will copy my examples from the above
message into a Wiktionary page, you will see
how they become reversed.

Here's the given string broken into groups of base and combining characters:

d7 91 U+05D1 HEBREW LETTER BET
d6 bc U+05BC HEBREW POINT DAGESH OR MAPIQ < in normalized string, this
d6 b7 U+05B7 HEBREW POINT PATAH < sequence is swapped

d7 99 U+05D9 HEBREW LETTER YOD
d6 b0 U+05B0 HEBREW POINT SHEVA

d7 91 U+05D1 HEBREW LETTER BET
d6 bc U+05BC HEBREW POINT DAGESH OR MAPIQ < in normalized string, this
d6 b7 U+05B7 HEBREW POINT PATAH < sequence is swapped

d7 a8 U+05E8 HEBREW LETTER RESH
d6 b0 U+05B0 HEBREW POINT SHEVA

d7 a1 U+05E1 HEBREW LETTER SAMEKH

The only change here in the normalized string is that the dagesh+patah
combining sequence is re-ordered into patah+dagesh.

I've tried displaying the before and after texts in Internet Explorer 6.0
(Windows XP), in Firefox Deer Park Alpha 2 (Mac OS X 10.4.2), and Safari 2.0
(Mac OS X 10.4.2). The two strings appear the same, even zoomed in, on IE/Win
and Firefox/Mac. In Safari the dots are slightly differently positioned.
I do not know if this slight difference is relevant or 'real'.

Python program to confirm that another implementation gives the same results:

from unicodedata import normalize

before = u"\u05d1\u05bc\u05b7\u05d9\u05b0\u05d1\u05bc\u05b7\u05e8\u05b0\u05e1"
after = u"\u05d1\u05b7\u05bc\u05d9\u05b0\u05d1\u05b7\u05bc\u05e8\u05b0\u05e1"
coded = normalize("NFC", before)
if (coded == before) or (coded != after):
    print "something is broken"
else:
    print "as expected"

Created attachment 754
Strings from attachment 1 displaying identically in IE 6.0 on Windows XP Professional SP2

Attached:

bug2399-windows-same.png (52×66 px, 469 B)

Created attachment 755
Highlighted display difference in Safari on Mac OS X 10.4.2

The dots show slightly displaced in Safari 2.0 on Mac OS X 10.4.2 in the
normalized text.
Is that movement (from the black dot location to the red dot location)
significant?

They *do not* display differently in Firefox DeerPark alpha 2 on the same
machine.
Both string forms display identically on that browser and OS.

They *do not* display differently in Internet Explorer 6.0 on Windows XP
Professional SP2.
Both string forms display identically on that browser and OS.

Attached:

bug2399-safari.png (117×149 px, 3 KB)

The problem is only (I think) on win 98 and XP prior to SP2.

sgb wrote:

I’ve been requesting a fix for the incorrect
Arabic normalization (compound vowels) for
months, but Arabic still cannot be entered
and saved properly in en.wiktionary
articles, and I have never received a reply
to my requests. I don’t know if I haven’t
made myself clear, if no one has had the
time, or if no one thinks I know what I’m
talking about.

I use Firefox 1.0.7 and also IE 6 in Win2K
Pro. It makes no difference which browser I
use, I cannot save Arabic files correctly in
en.wiktionary...nor can anyone else,
apparently, because whenever somebody opens
an old Arabic article to make some small
change, the vowels become incorrectly
reversed upon saving.

I’ve been typesetting Arabic professionally
since the 1970’s and I know how it’s
supposed to be written. If you need
examples, either here or on en.wiktionary, I
can easily provide them.

In short, the current normalization produces
the wrong results with all compound vowels:
shadda+fatha, shadda+kasra, shadda+damma,
and shadda+fathatan, shadda+kasratan,
shadda+dammatan. In the following examples,
(A) = correct, and (X) = wrong:
(A) عصَّا ; (X) عصَّا
(A) عصِّا ; (X) عصِّا
(A) عصُّا ; (X) عصُّا
(A) عصًّا ; (X) عصًّا
(A) عصٍّا ; (X) عصٍّا
(A) عصٌّا ; (X) عصٌّا

Under the current normalization, if anyone
opens a page containing (A), it will become
(X) when he saves it (even if he makes no
changes). One example is
http://en.wiktionary.org/wiki/حسن , which
was written with all the correct vowels
prior to implementation of normalization
(and which appeared correctly), but has
since had to have some of its vowels removed
because of this serious problem.

I will be happy to explain further if anyone
needs clarification.

What I need is a demonstration of incorrect normalization. This
is a Unicode standard and, as far as I have been able to test,
everything is running according to the standard.

Pretty much every current XML-based recommendation, file format
standard and protocol these days is recommending use of Unicode
normalization form C, which is what we're using. If this breaks
Arabic and Hebrew, then a lot of things are going to break it
the same way.

If there's a difference in rendering, is it:

  • A bug in the renderer?
  • Is this an operating system bug? (old versions of Windows)
  • Is this an application bug? (browser etc)
  • A bug in the normalization implementation?
  • A bug in the normalization rules that Unicode defines?
  • A bug in the Unicode data files?
  • A corrupt copy of the Unicode data files?

The impression I've been given is that it's a bug in old versions
of Windows and that things render correctly on Windows XP. Can
you confirm or contradict this?

Can you make a clear, supportable claim that a particular normalized
character sequence is incorrectly formed? If so, how should it be
formed? Is the correct formation normalized or not? If not why not?
If so why isn't it what we get from normalizing the input?

Is there an automatic transformation we can do on output? If so what?
If there is, should we do so? What are the complications that can arise?

Or perhaps the error is in the arrangement of the original input?
Where does the input come from and what arranges it? Is it arranged
correctly? If not how should it be arranged? How can it be arranged?

Is there an automatic transformation we can do on input? If so what?
If there is, should we do so? What are the complications that can arise?

On these questions I've gotten a lot of nothing. The closest has been
an example of a string in 'before' and 'after' state, which appears to
render identically in Windows... so what's the problem?

I can confirm that the bug has been fixed in Hebrew in the Service Pack 2 of Win
XP but not in earlier versions. If this is the case in Arabic as well, which our
Arabic-reading members can check, then we probably should add in the main he.wiki
pages and the equivalent Arabic ones an explanation of the problem with a
recommendation to upgrade to said OS & Service Pack.

iwasinnam wrote:

Correct rendering of the string "Bibi" with fixed-width font

Attached:

iwasinnam wrote:

Incorrect rendering of the string "Bibi" with fixed-width font

screenshot taken in wiki editor box after pressing 'Show Preview'

Attached:

iwasinnam wrote:

If indeed the Unicode normalization rules imply the switching of the DAGESH and
the PATAH (as demonstrated in comment #10), then I suppose it's a bug in the
renderer.
As for the way things _should_ be, it is completely insignificant for a user which
way the symbols are stored. In Hebrew (manual) writing it is completely
insignificant whether the DAGESH is written down before the PATAH or vice-versa.
When typing text on a computer (at least in Windows), the text is displayed and
stored correctly only if the DAGESH is entered first. I don't have the tools here
to examine the way it is stored internally, but it is nevertheless rendered
correctly every time. This is not the case in Wiki. Once the procedure switches
the two symbols, the DAGESH is displayed _outside_ of the BET. An obvious
misrendering (see attachments id=978, id=979).
I have experienced this bug in Windows 2000 as well as Windows XP with IE 6.0.x.
I believe this should be considered a significant bug as these are highly popular
environments. Moreover, Hebrew (and Arabic) are used mostly in scriptures, poetry
& transliteration of foreign words and names. Many Wiki pages (especially in
Wikitext) contain such texts. The bug renders such text hard to read and is
_very_ apparent to any user that tries to read these texts (and very annoying for
me, as I am currently writing about China and constantly need to transliterate
Chinese names).

(In reply to comment #19, by Ariel Steiner)
Ariel, did you experience this bug in Win XP with Service Pack 2? I use that and
see Hebrew with nikkud on wiki perfectly. Others have reported this bug to exist
in Win XP with SP1 but without SP2, so I assume it has been fixed in the latter
service pack.

iwasinnam wrote:

I experienced the bug on both WinXP (no SP2) & Win2K, both with IE6 and Firefox
1.0.7. I don't see why a user should upgrade from Win2K (or Me) to WinXP SP2
just because of a nikkud problem.

dovijacobs wrote:

I'd like to add to Ariel's comments that nikkud works perfectly fine in various
fonts and on all platforms for word processors: Word for Windows and Open Office.
Why should Mediawiki be any different? Don't the word processors also use Unicode?

Dovi

Dovi, typical word processors probably aren't applying canonical normalization to
text.

Ok, spent some time googling around trying to find more background on this. Basically
there seem to be two distinct issues:

  1. The normalization rules order some nikkud combinations differently from what the font renderer in old versions of Windows expects. This is a bug in either Windows or the font. From all indications that have been given to me, this is fixed in the current version of Windows (XP Service Pack 2).

  2. In some rarer cases, appearing in at least Biblical Hebrew, actual semantic information may be lost by application of normalization. This is a bug in the Unicode standard, but it's already established. Some day they may figure out a proper workaround.

As for 1), my inclination is to recommend that you upgrade if it's bothering you.
Turning off normalization in general would open us up to various weird data
corruption, confusing hard-to-reach duplicate pages, easier malicious name spoofing,
etc. If Microsoft has already fixed the bug in their product, great. Use the fixed
version or try a competing OS.

It might be possible to add a postprocessing step to re-order output to what old
buggy versions of Windows expect, but this sounds error-prone.

As for 2), it's not clear to me whether this is just a phantom problem that _might_
break something or if it's actually breaking text. (Most stuff is probably affected
by problem 1.) There's not much we can do about this if it happens other than turning
off normalization (and all that entails).

Background links:
http://www.unicode.org/faq/normalization.html#8
http://www.qaya.org/academic/hebrew/Issues-Hebrew-Unicode.html
http://lists.ibiblio.org/pipermail/biblical-languages/2003-July/000763.html

Does anybody know if the Windows bugs were in the fonts, in
Uniscribe, or in both? Can the new Uniscribe handle the old
fonts for instance?

If all or part of the problem was with the fonts, then what
about 3rd party fonts not under Microsoft's control?

Also, has Microsoft issued any kind of fix for OSes other than
XP?

Has anybody tested this on any Unix or Linux platforms? How does
Pango handle this?

Without knowing the answers to all these questions, I would lean
to a user option to perform a post-normalization compatibility
re-ordering.

gangleri wrote:

Hallo!

[[en:Wikipedia_talk:Niqqud#Precombined_characters_-_NON-precombined_characters]]
relates some notes received from
http://mysite.verizon.net/jialpert/YidText/YiddishOnWeb.htm : Recommendations
for Displaying Yiddish text on Web Pages.

Depending on platform, browser, characters (and fonts?) one may experience
some of the mentioned problems.

http://mysite.verizon.net/jialpert/YidText/YiddishOnWeb.htm suggests as "output"
preference to use "precombined characters" and to "postpone" "NON-precombined
characters" for later days.

Consequences: Wikimedia projects should provide at least some notes about the
problem (affected platforms / browsers / what to do / how to configure / upgrade
to ...)

Regards Reinhardt [[user:gangleri]]

gangleri wrote:

Please see also
bug 3885: title normalisation

rotemliss wrote:

I've tried to check what caused the problem, and I've found it.

The problem is in UtfNormal::fastCombiningSort, in the file
phase3/includes/normal/UtfNormal.php. It reorders the Nikud according to the
numbers in $utfCombiningClass (defined in
phase3/includes/normal/UtfNormalData.inc). This array, unserialized, is shown in
[[he:Project:ניקוד#איפיון הבאג]], in the <pre>. You can see Dagesh is 21, and
Patah is 18, so they are re-ordered: instead of Dagesh+Patah, we get
Patah+Dagesh. But it SHOULD be first Dagesh, then Patah, because that's their
order - so it's a bug in MediaWiki that we re-order them. In WinXP SP2, they are
shown correctly because of a *workaround* (it's not a bugfix there - only a
workaround for mistakes), but their order is still wrong. Maybe in Vista
they won't use this workaround.
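
(The reordering described here can be reproduced outside MediaWiki with Python's standard unicodedata module; a minimal sketch, added for illustration, not the PHP code in question:

import unicodedata

BET, DAGESH, PATAH = "\u05d1", "\u05bc", "\u05b7"

# NFC's canonical ordering sorts runs of combining marks by combining
# class (unicodedata.combining()); the dagesh's class is higher than
# the patah's, so the two marks are swapped.
typed = BET + DAGESH + PATAH             # order as typed
nfc = unicodedata.normalize("NFC", typed)
print(["U+%04X" % ord(c) for c in nfc])  # bet, patah, dagesh
)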

The question is, what does this function (UtfNormal::fastCombiningSort) do?
What's its purpose? Why should it sort the Nikud, or anything else? It's already
sorted well. How is it related to the normalization? Is there any documentation
about it?

You can just delete the Nikud from the array $utfCombiningClass, if you want to
operate the function.

Changing the summary, as that's exactly the bug. Also changing the OS and
Hardware fields, because the bug is not only there - the final display problem is
there, but the problem exists everywhere.

Thank you very much, and please answer my questions in the third paragraph, so
we will be able to fix that bug.

rotemliss wrote:

(In reply to comment #27)

This array, unserialized, is shown in [[he:Project:ניקוד#איפיון הבאג]],
in the <pre>.

Now it's shown in [[User:Rotemliss/Nikud]].

Rotem, this function implements a Unicode standard. The bug is in the standard.
Until some future version of Unicode "fixes" this, I'm just going to mark this
bug as LATER.

iwasinnam wrote:

I for one totally support the suggested solution, namely "Remove the
normalization check" etc.
That would be ideal for the Hebrew Wikipedia since its guidelines strictly
forbid the use of nikkud (vowel markers) in its titles, i.e., there are no
composed letters in document titles. Separating the title and display title
would also be very convenient because it would allow easy searching on one hand
and the use of nikkud in the display title where appropriate.

kenw wrote:

Incidentally, this is not a "bug" in the Unicode Standard, and won't be fixed
later in that standard. The entire issue of canonical ordering of "fixed
position" class combining marks for Hebrew has been debated extensively on the
Unicode forums, but the outcome isn't about to change, because of requirements
for stability of normalization.

The problem is in people's interpretation of the *intent* of canonical
ordering in the Unicode Standard. (See The Unicode Standard, 5.0. p.
115.) "The canonical order of character sequences does *not* imply any kind of
linguistic correctness or linguistic preference for ordering of combining
marks in sequences." In effect, the Unicode Standard is agnostic about the
input order or linguistically preferred order of dagesh+patah (or
patah+dagesh). What normalization (and canonical ordering) *do* imply,
however, is that the two sequences are to be interpreted as equivalent.

It sounds to me like Mediawiki is implementing Unicode normalization correctly.

The bug, if anything, is in the *rendering* of the sequences, as implied by
some of the earlier comments on this. dagesh+patah or patah+dagesh should
render identically -- there is no intent that they stack in some different way
dependent on their ordering when rendered. The original intent of the fixed
position combining classes in the standard was that they applied to combining
marks whose *positions were fixed* -- in other words, the dagesh goes where
the dagesh is supposed to go, and the patah goes where the patah is supposed
to go, regardless of which order they were entered or stored.

Also, it should be noted that the Unicode Standard does not impose any
requirement that Unicode text be stored in normalized form. Wikimedia is free
to normalize or not, depending on its needs and contexts. Normalization to NFC
in most contexts is probably a good idea, however, as it simplifies
comparisons, sorts, and searches. But as in this particular case for Hebrew,
you can run into issues in the display of normalized text, if your rendering
system and/or fonts are not quite up to snuff regarding the placement of
sequences of marks for pointed Hebrew text.

--Ken Whistler, Unicode 5.0 editor
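
(The equivalence Ken describes is easy to check mechanically; a minimal Python 3 sketch, added for illustration:

from unicodedata import normalize

bet_dagesh_patah = "\u05d1\u05bc\u05b7"
bet_patah_dagesh = "\u05d1\u05b7\u05bc"

# Canonically equivalent sequences normalize to the same string,
# so a conforming renderer should display both identically.
assert normalize("NFC", bet_dagesh_patah) == normalize("NFC", bet_patah_dagesh)
assert normalize("NFD", bet_dagesh_patah) == normalize("NFD", bet_patah_dagesh)
)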

dovijacobs wrote:

Hebrew vowelization seems much improved in Firefox 3. It would be nice to know exactly what changed and how, and have these things documented in case there are future problems.

Firefox 3 seems to correctly represent the vowel order for webpages in general and Wikimedia pages in particular.

The only anomaly I nevertheless found is that pasting vowelized text into the edit page only shows partial vowelization. On the "saved" wiki page it appears correctly.

rotemliss wrote:

(In reply to comment #33)

Hebrew vowelization seems much improved in Firefox 3. It would be nice to know
exactly what changed and how, and have these things documented in case there
are future problems.

Firefox 3 seems to correctly represent the vowel order for webpages in general
and Wikimedia pages in particular.

The only anomaly I nevertheless found is that pasting vowelized text into the
edit page only shows partial vowelization. On the "saved" wiki page it appears
correctly.

The bug of showing the Dagesh and other vowels in the wrong order usually depends on operating system. For example, Windows XP (possibly only with Service Pack 2) displays it well, while older Windows systems don't.

However, Firefox 3.0 did fix some Hebrew vowels bugs, like the problem with Nikud with justified text (see https://bugzilla.mozilla.org/show_bug.cgi?id=60546 ).

*** Bug 14834 has been marked as a duplicate of this bug. ***

ravi.chhabra wrote:

Since this bug also affects Myanmar in exactly the same way, could the title be appended with Myanmar as well? Normalization is not taking place the way it should. Here is the sort sequence as it should be, as specified in Unicode Technical Note #11.

Name Specification
Consonant [U+1000 .. U+102A, U+103F, U+104E]
Asat3 U+103A
Stacked U+1039 [U+1000 .. U+1019, U+101C, U+101E, U+1020, U+1021]
Medial Y U+103B
Medial R U+103C
Medial W U+103D
Medial H U+103E
E vowel U+1031
Upper Vowel [U+102D, U+102E, U+1032]
Lower Vowel [U+102F, U+1030]
A Vowel [U+102B, U+102C]
Anusvara U+1036
Visible virama U+103A
Lower Dot U+1037
Visarga U+1038

I can provide more technical detail if needed. Hence U+1037 should always come after U+103A (even though U+103A is 'higher'). And U+1032 should come _before_ U+102F, U+1030, U+102B, U+102C and so on. I noticed that this bug is related more to Unicode Normalization than it is to MediaWiki itself. But an important question I have is: *can* Unicode Normalization Check be disabled for Myanmar Wikipedia while we try to resolve it? Thanks, because that would be very helpful.
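
(The U+103A/U+1037 case can be reproduced the same way as the Hebrew one; a minimal Python 3 sketch using the standard unicodedata module, added for illustration, not MediaWiki's code:

import unicodedata

# UCD combining classes: U+1037 MYANMAR SIGN DOT BELOW is 7 and
# U+103A MYANMAR SIGN ASAT is 9, so NFC moves the dot below in front
# of the asat, the opposite of the visual order UTN #11 asks for.
seq = "\u1000\u103a\u1037"  # KA + asat + dot below, as typed
nfc = unicodedata.normalize("NFC", seq)
print(["U+%04X" % ord(c) for c in nfc])  # U+1000, U+1037, U+103A
)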

ayg wrote:

(In reply to comment #36)

Since this bug also effects Myanmar exactly in the same way, could the title be
appended with Myanmar as well?

You can do things like that yourself here.

But an important question I have is: *can* Unicode Normalization Check be
disabled for Myanmar Wikipedia while we try to resolve it? Thanks, because that
would be very helpful.

See [[mw:Unicode normalization concerns]]. This is feasible. We could turn off normalization for article text and leave it for titles, which would allow DISPLAYTITLE to be used to work around ugly display in titles. However, it would require some work.

ravi.chhabra wrote:

I would prefer normalization, as there are benefits from it, since it enforces a particular sequence. My question now is: what kind of data should I provide to Brion Vibber so that he can implement the normalization for Myanmar? Our case is quite different from Hebrew and is more straightforward. I believe UTN#11 V2 would be sufficient? It was updated recently for Unicode 5.1.

I would like to wait a while before actually thinking of disabling it for article text and using the workaround for titles. If it can be implemented we won't need to turn off normalization, and we would benefit from it. Thanks.

ayg wrote:

It would almost certainly be a bad idea to use different normalization for a single wiki. This would create complications when trying to, for instance, import pages. If this is genuinely an issue for Myanmar, we should fix it in the core software for all MediaWiki wikis that contain any Myanmar text. Same for Hebrew and Arabic.

What exactly is the issue here? Some user agents render theoretically equivalent sequences of code points differently, so normalization changes display? Which user agents are these?

ravi.chhabra wrote:

Relative Order (Normalization?) for Unicode 5.1 Myanmar

Attached:

Unicode_5.png (509×790 px, 39 KB)

ravi.chhabra wrote:

Relative Order (Normalization?) for pre-Unicode 5.1/Myanmar

Attached:

Pre_5.1_Normalization_Data.png (509×792 px, 110 KB)

ravi.chhabra wrote:

I have attached two images. The first one shows the normalization sequence for 5.1, and the 2nd one shows the normalization sequence for pre-Unicode 5.1. They are drastically different. Copies of those two can be found here:
http://unicode.org/notes/tn11/myanmar_uni-v2.pdf
Page 4 for the latest, and page 9 for the deprecated one.

The normalization done in MediaWiki seems to be for pre-5.1. I am adding the pre-5.1 table here.
Name Specification
kinzi U+1004 U+1039
Consonant [U+1000 .. U+102A]
Stacked U+1039 [U+1000 .. U+1019, U+101C, U+101E, U+1020, U+1021]
Medial Y U+1039 U+101A
Medial R U+1039 U+101B
Medial W U+1039 U+101D
Medial H U+1039 U+101F
E vowel U+1031
Lower Vowel [U+102F, U+1030]
Upper Vowel [U+102D, U+102E, U+1032]
A Vowel U+102C
Anusvara U+1036
Visible virama U+1039 U+200C
Lower Dot U+1037
Visarga U+1038

Yes, normalization changes display. I have attached a jpeg file showing the error caused here https://bugzilla.wikimedia.org/show_bug.cgi?id=14834

ayg wrote:

Contents of includes/normal/UtfNormalData.inc

As far as I can tell, MediaWiki is indeed using the 5.1 tables. I've attached the data used for normalization, which is generated by a script that downloads the appropriate files from http://www.unicode.org/Public/5.1.0/ucd/. If you can spot an error, please say what it is.

You might want to talk to Tim Starling, since as far as I can tell he's the one who wrote this.

Attached:

ravi.chhabra wrote:

U+1037 is int(7) and U+103A is int(9), this means that U+1037 should always be first? This seems so similar to the patah-dagesh issue. :(

This is the relevant section of $utfCombiningClass:

["့"]=>
int(7)
["္"]=>
int(9)
["်"]=>
int(9)

The order given here does not seem to be the same as the order given in UTN#11. I guess this would be a lesson not to take UTNs too seriously. I do like the sort order as it is in Wikipedia, just that it's having problems with fonts. And I am a bit surprised that the data in the UCS does not match what was authored in the UTN. So as far as MediaWiki is concerned, it's just like the way it is with Hebrew. We will now need to move over to the Unicode mailing list and ask what's going on. Simetrical, many thanks for clearing this one up for me. :)

As a side note, the developer of the Parabaik font gave me this link: http://ngwestar.googlepages.com/padaukvsmyanmar3
I noticed that the sequence there was recently changed to the one mentioned.

ravi.chhabra wrote:

Found something which should not have been re-sequenced.

Input: U+101E U+1004 U+103A U+1039 U+1001 U+103B U+102C
Output: U+101E U+1001 U+103B U+102C U+1004 U+103A U+1039

The output is wrong because U+1004 is a consonant, and U+1001 is also a consonant. Hence MediaWiki should not have swapped them, that is, if my understanding of Unicode Normalization is correct. My understanding is that the sorting starts over whenever a new consonant starts, because this is the beginning of a new syllable cluster. No fonts will be able to render the output from MediaWiki.

ayg wrote:

I suggest you e-mail Tim Starling.

ravi.chhabra wrote:

I am adding it here that the issue with Myanmar Unicode (Lower Dot and Visible Virama) is an issue that will be covered in the revision to UTN#11, as an oversight in the standards review process. Due to the stability criteria of UnicodeData.txt there is nothing we can do about this. This is not a MediaWiki bug; since many people are now referencing this to point it out as a bug, I need to clarify that here. This sadly does mean that fonts and IMEs will need to update, and meanwhile MediaWiki 1.4 will have the problem mentioned here; the way to resolve this is to simply wait for updated fonts and IMEs. The advantages of turning off normalization far outweigh the disadvantages. If there are plans to adopt a less invasive normalization process, as mentioned in Normalization Concerns, then the issue can be resolved. The developers of fonts and IMEs have agreed to update, so those running MediaWiki installs might want to keep normalization on.

The 2nd issue with Kinzi (comment #45) seems to be resolved now. Was MediaWiki updated between July and now?

gangleri wrote:

FYI: https://bugzilla.wikimedia.org/show_activity.cgi?id=2399
I did not change priorities; I only added myself as CC.
It seems that the Priority field is gone.

Marking REOPENED. The standard was updated since 2006. We discussed this in the Berlin Hackathon.

Assigning to me so we can look over the current state and see about fixing it up.

Apparently, you have not implemented the contractions and expansions of UCA.

Note that there has been NO change in Unicode 5.1 (or later) to the normalization, which has been stabilized since at least Unicode 4.0.1.
The bugs above are most probably not related to normalization, if it is implemented correctly (and normalization is an easy problem that can be implemented very efficiently).

And the changes in the DUCET (or now the CLDR DUCET) do not affect how Hebrew, Arabic or Myanmar is sorted, within the same script.

Then you should learn to separate the Unicode Normalization Algorithm (UNA), the Unicode Collation Algorithm (UCA), and the Unicode Bidi Algorithm (UBA), because the Bidi algorithm only affects the display, but definitely NOT the other two.

And the order produced by normalization is orthogonal to the order of collation weights generated by UCA, even if normalization is assumed to be performed first, before computing collations (but this is not a requirement; it just helps reduce the problem, by making sure that canonically equivalent strings will collate the same).

Many posters above seem to be completely mixing the problems!

Note: for Thai, Lao, Tai Viet, the normalization does not reorder the prepended vowels (neither do the Bidi algorithm).

But such reordering is *required* when implementing the UCA, and this takes the form of contractions and expansions, that are present in the DUCET for these scripts.

Final note: it is highly recommended NOT to save texts with an implicit normalization, even if normalization is implemented correctly.

There are known defects (yes, bugs) in the renderers of browsers, which frequently do not implement normalization and are not able to sort, combine and position the diacritics correctly if they are not in a specific order, which is not the same as the normalized order.

There are also defects caused by incorrect assumptions made by writers (who have not understood when and where to insert CGJ to prevent normalization from reordering some pairs of diacritics), who have written their texts in such a way that they "seem" to render correctly, but only on a bogus browser that does not perform normalization correctly and/or has strong limitations in its text renderer (unable to recognize strings that are canonically equivalent, because it expects only one order for successive diacritics in order to position them correctly).

This type of defect is typical of the "bug" described above about the normalized order of the DAGESH (a central point in the middle of a consonant letter, modifying it) or the SIN/SHIN DOTS (above the letter, on the left or right, also modifying the consonant), and the other Hebrew vowel diacritics: yes, normalization reorders the vowel diacritics before the diacritics that modify the consonant (this is the effect of an old assignment of their relative "combining classes", in a completely illogical order of values, but this will NEVER be changed as it would affect the normalizations).

But many renderers are not able to display correctly strings that are encoded in normalized order (base consonant, vowel diacritic, sin dot or shin dot or dagesh). Instead they expect the string to be encoded as (base consonant, dagesh or sin dot or shin dot, vowel diacritic), even though it is completely canonically equivalent to the previous one and should display exactly the same! (Such rendering bugs were found in old versions of Windows with IE6 or before.)

For this reason, you should not, on MediaWiki, apply any implicit renormalization to any edited text. If one wants to enter (base consonant, dagesh or sin dot or shin dot, vowel diacritic) in the wiki text, keep it unchanged, do not normalize it, as it will display correctly both on the old bogus renderers and on newer ones.

All my remarks in the previous message also apply to the Arabic diacritics.

For example, the assumptions made by Brion Vibber in his message #23 are completely wrong. He has not understood what normalization is, and the fact that, with conforming renderers, normalization *must not* affect the rendering (if it does, this is due to bugs in the renderers, not bugs in the normalizer used on MediaWiki).

merelogic wrote:

*** Bug 31183 has been marked as a duplicate of this bug. ***

This should probably be reassigned to one of our localization engineers.

Reassigned to Amir as he is part of the localization engineers. This bug is still present, as can be seen in: https://en.wikisource.org/wiki/User:Amire80/Havrakha

dovijacobs wrote:

For an extremely clear description of the problem in Hebrew, see here (pp. 8 ff.):
http://www.sbl-site.org/Fonts/SBLHebrewUserManual1.5x.pdf

Amir: Do you (or the L10N team) plan to take a look at this at some point?
This ticket is in 14th place on the list of open tickets with the most votes...

... and one of the oldest open and assigned tasks.

Qgil removed Amire80 as the assignee of this task.Jan 9 2015, 10:30 PM
Qgil added a subscriber: Language-Team.

Reassigned to Amir as he is part of the localization engineers. This bug is still present, as can be seen in: https://en.wikisource.org/wiki/User:Amire80/Havrakha

@Amire80 didn't take this task himself, so I placed it up for grabs. CCing Language-Engineering instead.

This is not really specific to RTL. In fact LTR or RTL layout plays
absolutely no role here for sorting. In addition, the Myanmar script is not
even RTL. What is really needed is a correct collation order, i.e.
integrating a correct internationalization library like ICU within
MediaWiki and also exposing it in the API (even if it requires an optional
plugin) for generating sort keys or comparing strings in a locale-sensitive
way, or at least with the DUCET for the neutral root locale. All this is
already written in lots of bindings for various programming languages. The
integration for client-side sorting is more difficult, but we could offer a
server-side helper query to sort some JSON array. Many projects want
correct collation, and sometimes several collations for the same human
language (e.g. Chinese).

Ebraminio moved this task to MediaWiki-core on the RTL workboard.
Herald added a subscriber: Aklapper.


Using ICU for this (which in fact we already do in production) won't help, as the bug is in the Unicode standard itself -- any compliant implementation will "fail" equally.

If there's too much inertia to fix the spec and change texts to use the updated codepoints, or whatever it would take, my recommendation is to change the point where we normalize: moving it from all input to comparisons only.
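
(Sketched concretely, the idea is to store text byte-for-byte as submitted and normalize only at comparison points; a minimal Python 3 illustration of the approach, not an actual MediaWiki patch:

from unicodedata import normalize

def store(text: str) -> str:
    return text  # keep the author's combining-mark order untouched

def same_title(a: str, b: str) -> bool:
    # Normalize only when comparing, so canonically equivalent titles
    # still collide without rewriting anyone's text.
    return normalize("NFC", a) == normalize("NFC", b)
)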

Wrong. The Unicode standard does not specify any collation tailoring for
any language, nor even any script.

My point was that the classification of this bug report under RTL is
inaccurate; it is completely unrelated.

What is in the Unicode standard is a normative reference to the collation
algorithm detailing the necessary steps, including standard
normalisation and case folding for case-insensitive search, and
describing the collation strength with multiple levels. Then there's an
informative reference to the DUCET, which is more or less the base for
defining the collation for the neutral root locale, which will then be
tailored for specific scripts, languages, orthographies or
transcriptions.

If you want accurate sorts for a language you need tailorings, and those are
NOT in TUS, but MAY be found in CLDR for some locales. CLDR data is not
part of TUS.

There's nothing wrong in TUS about the collation algorithm. If you say that
Unicode is wrong it is because you assume (incorrectly) that code point
values are sorted according to some specific locale, but they are not for
any locale, not even for English!

As long as Wikimedia only sorts by code point values (in fact by the
binary encoding in UTF-8) we won't be able to sort correctly in any
language. So RTL vs LTR is not the issue here.

This bug does not depend on resolution of RTL-related bugs, which are
themselves never dependent on collation.

It only affects i18n for things like performing plain-text searches,
or embedding text fragments into others, or correctly handling the visual
layout for proper display of tables, alignment of paragraphs or floating
elements to the correct margin, using CSS classes, or sorting items listed
in categories and breaking these lists into pages of max 200 items, or
performing dynamic sorts of contents displayed in selected columns of
tables (difficult to solve with client-side JavaScript without some
server-side helpers with custom data queries).

Also, normalization is definitely not an issue. Compliant normalization is
already part of the standard collation algorithm, so collators that fail
depending on the normalization forms presented in their input are definitely
not compliant collators. The collator algorithm implementations are fully
compliant. Sorry to contradict you.

If you think that something is wrong, it is not in TUS but caused only by
missing data in localized tailorings of CLDR. We can help solve these
issues by providing our own tailorings and experimenting with them, but
these bugs will be better solved by submitting bug reports to the CLDR TC
via its contact form. (The CLDR project is now hosted by the Unicode
Consortium; some years ago it was part of the ICU project open-sourced and
hosted by IBM, but CLDR is now independent of ICU, which is only a reference
implementation, with a specific branch for the CLDR project not
immediately reflected in ICU, because it may sometimes have differences for
experimental data or updates needed for some newer or older versions of
TUS, and it will not include many optional features of ICU.) But everything
that is standardized (not informative) in TUS or stabilized (not
provisional) in the CLDR root locale is fully implemented in ICU, including
collation data.

I think you misunderstand the nature of this bug due to the use of the word "sort". Collation is not at issue here; the issue is that our unconditional application of normalization form C on all input causes certain combining characters to unexpectedly change order from the input data.

@brion: normalization (whatever form we use) is ALSO a requirement for collation. Normalization is definitely not the problem. The problem is that we don't use the standard collation but only (UTF-8) binary ordering.

The issue with Hebrew Niqqud is that it is incorrectly encoded if its interpretation depends on the normalization form (i.e. in this Niqqud text there are missing CGJs to block the normalized reordering!). Text renderers are supposed to apply normalization too.
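
(The CGJ mechanism mentioned here is easy to demonstrate; a minimal Python 3 sketch, added for illustration:

from unicodedata import normalize

BET, DAGESH, PATAH = "\u05d1", "\u05bc", "\u05b7"
CGJ = "\u034f"  # COMBINING GRAPHEME JOINER, combining class 0

# A class-0 character blocks canonical reordering across it, so with a
# CGJ in between, dagesh and patah keep the order they were typed in.
print(["U+%04X" % ord(c) for c in normalize("NFC", BET + DAGESH + PATAH)])        # patah before dagesh
print(["U+%04X" % ord(c) for c in normalize("NFC", BET + DAGESH + CGJ + PATAH)])  # order preserved
)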

For Arabic, the issue is with various extended Arabic letters that are not sorted correctly, and with the optional diacritics. But it is the same problem as well with Extended Latin (so "é" sorts after "z" and not between "e" and "f"...). Here also, normalization form C is definitely not the issue (it is correct in all cases); only a proper collation is missing.
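
(The difference shows up as soon as a real collator is used instead of code point order; a Python 3 sketch, assuming the PyICU bindings are installed, added for illustration:

import icu  # PyICU; assumed available

root = icu.Collator.createInstance(icu.Locale.getRoot())  # root locale, DUCET-based
words = ["f", "z", "e", "é"]
print(sorted(words))                       # code point order: é lands after z
print(sorted(words, key=root.getSortKey))  # collation order: e, é, f, z
)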

Sorting is an application of collation. Using true collation (instead of binary order) would cleanly solve the problem. But collation depends on the locale: we could use a default locale such as the root locale (which is more or less "language neutral"), based on CLDR data (the DUCET collation order); that would be fine for most purposes on multilingual and English wikis.

On monolingual wikis (such as Wikipedia or Wiktionary) whose default locale is another language, we could use another sorting default. But on a localized Wiktionary there are many categories that should be sorted using another language than the default language (e.g. sorting Chinese and English categories in the French Wikipedia should use a Chinese or English sort order, not the French order...).

Unfortunately, there's still no way to specify which default sort order (a locale code) to use in a specific category: we have {{DEFAULTSORT:key}} for use in pages to categorize, but no {{SORTIN:languagecode}} for use in a category (it would be extremely useful in Wiktionary).

Now we only have a single per-wiki collation order, but no way to specify another collation order for a specific category (e.g. sorting Chinese categories using a Chinese sort order, on an English or French wiki whose default per-wiki sort order is English or French).
Having per-category sort orders would require computing the final sort keys differently (not the same key as the key "prefix" specified in {{DEFAULTSORT:}} in pages to categorize, or in the explicit key prefix parameter of their "[[Category:Name|key]]").

One problem: setting or changing the per-category sort order would require reindexing all pages listed in that category (even if they specify their own sort key prefix). For categories that have small numbers of members this is not a huge problem, but for categories that have thousands of members the list of key prefixes may be long, and we should not need to parse all the member pages again to recompute the final sort key according to the sort order of the category.

Ideally pages should be indexed like they are today, using binary sort order, but we should still be able to list a category using an alternate sort order different from the default, and some categories should adopt another default sort order, different from the default sort order of the wiki (this default sort order is set according to the default language/locale of the wiki: on multilingual Commons or MediaWikiWiki, that default sort order is English, or could be the DUCET-based root order; but even there, categories listing long lists of Chinese city names should sort using a Chinese sort order by default; it will be difficult to implement for categories of people's names, as people's names are more international than expected, even if the people natively speak Chinese).

For now I've not seen any extension such as {{SORTIN:locale-code}} to set another default sort order per category, nor any UI (or user preference) to select an alternate sort order (e.g. sorting Chinese either in radical/stroke order, in Pinyin order, or using some other Chinese romanization scheme), or to change the sort order to be case-sensitive or not (i.e. adjusting the collation strength) depending on the current viewing user (independently of how it is sorted by default for everyone).

Oh, indeed, I missed that part of your post. That's T30397: Allow collation to be specified per category. And the other missing feature you want is T37378: Support multiple collations at the same time. But I don't think either of those is relevant here?

@Verdy_p: I believe the issues you are raising are already covered in other bugs: T47443, T136150, T30397

It sounds like all the issues encompassed in this bug are either upstream bugs with old fonts or operating systems (which probably isn't much of an issue any more now that it's 12 years later), or the fact that none of these wikis are using UCA collation.

MediaWiki now supports UCA collation for Hebrew and Arabic (support for Myanmar is T154510). All those wikis have to do is ask for it. We aren't going to switch them to UCA collation without consensus from their communities though.

I don't think there is anything currently actionable in this bug, so I would recommend closing it as Declined. New bugs can be set up for switching individual wikis to UCA collation once their communities have discussed it.

MediaWiki now supports UCA collation for Hebrew and Arabic (support for Myanmar is T154510). All those wikis have to do is ask for it. We aren't going to switch them to UCA collation without consensus from their communities though.

... Another bug for Burmese (Myanmar) I created recently is T187721, and it's probably somewhat related.

From the Hebrew side, I agree that it's OK to close this. The issues for Hebrew only affect old browsers, which have low usage these days, but I'm not sure about other languages.

In any case, it's not an RTL bug, so I'm removing that tag.

Amire80 moved this task from Font issues to Unicode support on the I18n board.