
Bangla letter Nukta Problem
Closed, Resolved, Public

Description

Author: omi

Description:
There are 4 Bangla (AKA Bengali) letters that come with a NUKTA (U+09BC) and
carry a different meaning or pronunciation. They are Ra (U+09B0), Rra
(U+09DC), Rha (U+09DD) and Yya (U+09DF).

Ra 09B0 came from 09AC + 09BC
Rra 09DC came from 09A1 + 09BC
Rha 09DD came from 09A2 + 09BC
Yya 09DF came from 09AF + 09BC

This is what the Unicode Consortium says, because they didn't do any research on
Bangla and followed ISCII. Anyway, let's come to the point.

Wikipedia pages behave strangely after correct input. If I write 09DC,
it automatically becomes 09A1 + 09BC after saving. Likewise, if I write
09DD, it becomes 09A2 + 09BC, and 09DF becomes 09AF + 09BC. Fortunately, 09B0 is
not affected by this problem.

Now you have to sort out the issue by reversing the conversion. If I type 09B0, it
should stay as it is, and if I type 09AC + 09BC, it should become 09B0
automatically. As I said, 09B0 has no problem, but you also need to
define 09AC + 09BC = 09B0. Likewise, 09DC should stay as it is, and if anyone enters
09A1 + 09BC, it should become 09DC after saving. 09DD should stay as it is, and
09A2 + 09BC should become 09DD after saving. 09DF should stay as it is, and
09AF + 09BC should become 09DF after saving.
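
The behaviour requested here can be sketched roughly as follows. This is only an illustration of the request, not anything MediaWiki does: the mapping table and the function name `compose_bengali_nukta` are hypothetical, and the output is deliberately not valid NFC. U+09B0 is left out because the report says it is unaffected.

```python
import unicodedata

# Hypothetical table for illustration: sequence -> precomposed letter.
PRECOMPOSED = {
    "\u09a1\u09bc": "\u09dc",  # DDA + NUKTA  -> RRA
    "\u09a2\u09bc": "\u09dd",  # DDHA + NUKTA -> RHA
    "\u09af\u09bc": "\u09df",  # YA + NUKTA   -> YYA
}


def compose_bengali_nukta(text: str) -> str:
    """Apply NFC, then put the three precomposed letters back.

    Assumption: this post-processing step is what the report asks for.
    The result no longer conforms to NFC.
    """
    text = unicodedata.normalize("NFC", text)
    for sequence, letter in PRECOMPOSED.items():
        text = text.replace(sequence, letter)
    return text


print(compose_bengali_nukta("\u09dc") == "\u09dc")        # precomposed input is kept
print(compose_bengali_nukta("\u09a1\u09bc") == "\u09dc")  # sequence is composed
```

Any text stored this way would differ from what standard NFC normalizers produce, so other software that normalizes its input would see a different string.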

So please sort the issue ASAP.

Omi Azad
Contributor
Bangla Computing and Localization Projects:
Ankur: http://www.ankurbangla.org
Ekushey: http://www.ekushey.org


Version: unspecified
Severity: critical
URL: http://bn.wikipedia.org

Details

Reference
bz5948

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 21 2014, 9:13 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz5948.
bzimport added a subscriber: Unknown Object (MLST).

ragibhasan wrote:

This is a serious issue and would affect searches for articles, as the articles
are automatically mistitled ... instead of one Unicode character, each of the
aforementioned characters is split into two characters. So, anyone searching
for an article title involving the above characters is not able to find it
... using either the bn-wiki's built-in search or Google search.

Ragib
Administrator, bn-wiki

Unicode normalization is applied to all input, including both
edits and search text, so this should work consistently in that
respect.

If there's a bug in the Unicode definitions, I'm afraid you'll
need to take it up with Unicode to get it fixed consistently...
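
A minimal sketch of why normalizing both the stored text and the search text keeps matching consistent. It uses Python's standard `unicodedata` module rather than MediaWiki's own code; the helper `matches` and the sample word (রয়েছে, taken from later in this thread) are assumptions made for illustration.

```python
import unicodedata


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def matches(stored_title: str, query: str) -> bool:
    """Compare a stored title and a query after normalizing both to NFC."""
    return nfc(stored_title) == nfc(query)


typed_precomposed = "\u09b0\u09df\u09c7\u099b\u09c7"       # uses U+09DF
typed_decomposed = "\u09b0\u09af\u09bc\u09c7\u099b\u09c7"  # uses U+09AF U+09BC

print(matches(typed_precomposed, typed_decomposed))  # True
print(typed_precomposed == typed_decomposed)         # False
```

The raw strings differ, so a tool that compares without normalizing will not find the stored text, which is the third-party search problem raised in this report.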

omi wrote:

Well, I told you that UTC is full of Indic illiterate people and
that is why they have too many problems. Still if I raise a
legal issue to them, they don't understand what to do. :)

Sir, it's absolutely your problem. We are using thousands of
software packages with UTF-8 encoding, both open and closed
source. None of them has this problem. If I write 09DD in Open
Office, it never becomes 09A2 + 09BC; the same goes for MS Office,
and even for Gedit or Notepad. So what should I think?

UTC made a mistake in writing the definition of these characters in
http://www.unicode.org/charts/PDF/U0980.pdf and you followed
that. Can you show me any reference from the UTC site that makes
you think the current rendering is okay?

ragibhasan wrote:

The rendering is a serious problem. Almost all other websites are rendering the
above two unicode characters correctly. For example, please check the following
page from BBC Bengali service's webpage (written using unicode Bangla).

http://www.bbc.co.uk/bengali/news/story/2005/08/050831_mknizami.shtml

Find the following word:

রয়েছে

Now, here is the same word when I write it in Wikipedia (English or Bangla)

রয়েছে

More specifically, look at the following character:

য় : from BBC Bengali's site
য় : from Wikipedia

Now, you can check that the 2nd example is not the intended letter yya; rather,
it is the juxtaposition of two letters, ja and nukta. (য + ় )
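
One way to check which form a given piece of text uses is to list its code points. The helper below is a small sketch with Python's standard `unicodedata` module; the name `describe` is made up for illustration. Note that the Unicode name of U+09AF is BENGALI LETTER YA.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per code point in the text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


print(describe("\u09df"))
# ['U+09DF BENGALI LETTER YYA']
print(describe("\u09af\u09bc"))
# ['U+09AF BENGALI LETTER YA', 'U+09BC BENGALI SIGN NUKTA']
```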

This normalization is totally incorrect, and is messing up searches for the
appropriate texts. Also, the correct Unicode is used by almost all Bangla
websites (I gave the example from BBC's Bengali service), so I don't see why
Wikipedia should render it incorrectly, and thus make the articles unreachable
from search engines. This bug is a serious one and needs to be fixed immediately.

Thanks

Ragib

Admin, Bangla wikipedia

ragibhasan wrote:

I'd also like to draw your attention to Google's Bangla language localized page
at http://www.google.com.bd . Look at the text:

ভাষা সম্পর্কিত হাতিয়ারসমূহ

Specifically: in the word হাতিয়ারসমূহ

you will find the character য়

This is correctly rendered. Google is using the correct unicode symbol for yya,
and NOT the incorrect juxtaposition of ja and nukta.

I can give a lot of other examples, but I guess you'd understand the issue now.
The Bangla typing systems, documents, and everything else have already corrected
this issue, and so has Firefox/Mozilla in their localized versions of the
Firefox/Mozilla browsers. I don't see any reason to continue the incorrect
rendering in MediaWiki. This would hurt the Bangla Wikipedia a lot, as the
articles will become unreachable from search engines ... because people looking
for a page will not type the incorrect code, nor will google or anything else do
the redundant mapping to the incorrect code pairs.

Thanks

Ragib

omi wrote:

I have been working with Unicode, Microsoft and other orgs on Bangla
issues since 2000, so I know what I'm saying. I asked Brion Vibber to show me any
reference he has. I bet he cannot, and this is indeed the wiki's problem.

There are exactly two possibilities:

  1. Our implementation of Unicode normalization is correct to specs.
  2. Our implementation of Unicode normalization is incorrect and does
     not follow spec.

If you can show that 2) is true, it's my problem and I'll be happy to
fix it.

However you indicate that 1) is the case. In this case you'll need to
take it up with the Unicode Consortium to either get the UCD
corrected or new characters added which have more appropriate
normalization characteristics. Similar breakage will occur in all
other applications that follow W3C recommendations to normalize input
to form C, making it very much Unicode's problem if it's wrong.

omi wrote:

Brion Brother,
You didn't understand me clearly. You said in #1 that "Our implementation of Unicode
normalization is correct to specs", but I asked you to show me any document
supporting that claim. If you cannot show that, then it
automatically goes to #2 and you have to fix it.

Bro, it's not a UTC problem. It's your problem. As Ragib provided some links of
Bangla texts above, you can check them out.

I can understand that you followed http://www.unicode.org/charts/PDF/U0980.pdf
's Additional Consonant section. They didn't tell you to make your normalization
like the given reference. The reference is there to show you how the thing is.
So please try to sort it asap.

omi wrote:

Bro,
Either you didn't get me or maybe I completely missed the track. You are
doing what UTC says in http://www.unicode.org/reports/tr15/ Section: "Table 2:
String Concatenation." That applies only if you type 09AC + 09BC, 09A1
+ 09BC, 09A2 + 09BC or 09AF + 09BC. They didn't tell you to follow the same
rule if you directly type 09B0, 09DC, 09DD or 09DF; you don't need to re-encode
them according to any rule. That is not a rule at all.

Let me try to explain the whole thing once again. If I type 09B0, 09DC, 09DD or
09DF, you don't need to apply any rule to them. But if I type 09AC + 09BC,
09A1 + 09BC, 09A2 + 09BC or 09AF + 09BC, you can apply any normalization rule
to them, and that is what UTC is saying. But in your case, when I type 09DC, it
becomes 09A1 + 09BC, which is very wrong. Please double check your reference
documents; they didn't ask you to do anything like that. I hope you understand
now...

gangleri wrote:

Marking as:
Bug 5948 blocks: Bug 3985: character conversion (tracking)

Let's see, the Unicode character database is:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

The entry for 09DC is:
09DC;BENGALI LETTER RRA;Lo;0;L;09A1 09BC;;;;N;;;;;

That shows a canonical decomposition to 09A1 (BENGALI LETTER DDA) followed
by 09BC (BENGALI SIGN NUKTA).

We then check the composition exclusion table:
http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt

Here we find an entry excluding it from being produced by canonical
composition:
09DC # BENGALI LETTER RRA

Thus the normalized canonical composition (NFC) will remain decomposed, as
09A1 09BC.

Further we can check the entry for this character in the normalization
test suite:
http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt

Here we can see that 09DC normalizes the same way in all four forms:
09DC;09A1 09BC;09A1 09BC;09A1 09BC;09A1 09BC; # (ড়; ড◌়; ড◌়; ড◌়;
ড◌়; ) BENGALI LETTER RRA

I can confirm also that Python's Unicode normalization implementation
produces the same output:

import unicodedata
unicodedata.normalize("NFC", u"\u09dc")

u'\u09a1\u09bc'
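
The same check can be extended to all three affected letters and all four normalization forms. This is a sketch in Python 3 syntax (the snippet above is Python 2 style); the helper name `normal_forms` is made up for illustration, and the results depend on the Unicode data shipped with the Python build in use.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normal_forms(text: str) -> dict[str, str]:
    """Return the text normalized to each of the four Unicode forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


for letter in ("\u09dc", "\u09dd", "\u09df"):
    # decomposition() returns the raw mapping field from UnicodeData.txt
    print(unicodedata.name(letter), "->", unicodedata.decomposition(letter))
    for form, result in normal_forms(letter).items():
        print("  ", form, " ".join(f"{ord(ch):04X}" for ch in result))
```

For each letter, all four forms should print the two-code-point sequence, matching the NormalizationTest.txt line quoted above.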

Case closed.

If you don't like the normalization rules, talk to Unicode.

If you find browsers with incorrect search systems, file a bug with them.

If you find search engines with incorrect search systems, file a bug with
them.

omi wrote:

Let's see, the Unicode character database is:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

The entry for 09DC is:
09DC;BENGALI LETTER RRA;Lo;0;L;09A1 09BC;;;;N;;;;;

So leave 09DC as it is. Why are you trying to normalize it?

That shows a canonical decomposition to 09A1 (BENGALI LETTER DDA) followed
by 09BC (BENGALI SIGN NUKTA).

We then check the composition exclusion table:
http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt

Here we find an entry excluding it from being produced by canonical
composition:
09DC # BENGALI LETTER RRA

Thus the normalized canonical composition (NFC) will remain decomposed, as
09A1 09BC.

Normalization is only required when I type ড followed by ়; if I type ড় it should
remain the same.

Further we can check the entry for this character in the normalization
test suite:
http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt

Here we can see that 09DC normalizes the same way in all four forms:
09DC;09A1 09BC;09A1 09BC;09A1 09BC;09A1 09BC; # (ড়; ড◌়; ড◌়; ড◌়;
ড◌়; ) BENGALI LETTER RRA

Again I say the same thing. You need to use the normalization thing only for some
sequences like ড ়, ঢ ়, য ় and ব ় (this one is not mentioned by them), but you
applied the rule to all cases.

I can confirm also that Python's Unicode normalization implementation
produces the same output:

import unicodedata
unicodedata.normalize("NFC", u"\u09dc")

u'\u09a1\u09bc'

Case closed.

UTC is not doing wrong by any means. Why are you changing an independent
character into a character sequence? Check the documents carefully and understand
the issue: UTC didn't tell you to *change it* anywhere.

If you don't like the normalization rules, talk to Unicode.

If you find browsers with incorrect search systems, file a bug with them.

If you find search engines with incorrect search systems, file a bug with
them.

Very silly answer.
So you think that *only* you are doing it perfectly and the whole world is wrong?
Unicode is wrong, browsers are wrong, search engines are wrong, Microsoft is wrong,
Sun is wrong, IBM is wrong, Mozilla is wrong? :)

You are arguing by raising unnecessary points and trying not to understand the
whole thing. Thousands of software packages are working fine except yours. If you behave
like this or don't try to understand the facts, WiKi will become Week-i to the Bangla
speaking community. If you are not satisfied with my points, try consulting with
UTC. In the meantime I'll show this bug to my UTC contacts, and I hope they'll
shed some light on this issue.

Finally, you are misunderstanding the whole point, along with UTC's documentation.

gerardm wrote:

When Brion's defence is based on "this is how it is done in Python", it is in
Python where this bug needs fixing. If this is so, this bug can be closed again.

It is similar to an issue in the Dutch language, where the ij is invariably
written as an "i" and a "j". However, the glyph kids learn in school is not this
combination. I know that MediaWiki does not have this behaviour: ij is not
changed into its two "parts"; it stays as it is.

Thanks,

GerardM

The Python example just shows that Python correctly
implements the Unicode recommendation.

Please reread comment 12, which explains why MediaWiki
respects the normalization rule.

Retagging as LATER. File a bug at unicode.org.

omi wrote:

[Quoting from http://www.mediawiki.org/wiki/Unicode_normalization_considerations ]

  • a surprising composition exclusion in Bangla
      o The result doesn't render right with some tools, probably again a
        platform-specific bug
      o Some third-party search tools apparently don't know how to normalize
        and fail to locate texts so normalized.

The rendering and third-party search problems are annoying, though if we stay on
our high horse we can try to ignore it and let the other parties fix their
broken software over time.

The canonical ordering problems are a harder issue; you simply can't get these
right by following the current specs. Unicode won't change the ordering
definitions because it would break their compatibility rules, so unless they
introduce *new* characters with the correct values... Well, it's not clear this
is going to happen. [/quote]

I think I have failed to make you understand the problem. Also, there is one thing
I don't get: why are you applying normalization rules in your software? There
are thousands of websites and millions of web pages currently in Bangla, and the
web page itself never applies any rule for rendering the character. The character
always remains as it is. The Wikimedia software is changing the character to a
sequence, calling it normalization.

Let me give you a short example, so that you can understand more clearly. If you
type Â, it remains like that. It never becomes A^. But in Bangla, when
I type য়, it becomes য়
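
The difference between the two cases can be reproduced with Python's standard `unicodedata` module. This is a small sketch, not MediaWiki's normalizer: A followed by a combining circumflex recomposes to Â under NFC, while U+09AF followed by U+09BC does not recompose to U+09DF because that letter is listed in CompositionExclusions.txt.

```python
import unicodedata


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


# Latin: A + COMBINING CIRCUMFLEX ACCENT recomposes to U+00C2.
print(nfc("A\u0302") == "\u00c2")       # True

# Bengali: YA + NUKTA stays as two code points.
print(nfc("\u09af\u09bc") == "\u09df")  # False

# The precomposed YYA is itself decomposed by NFC.
print(nfc("\u09df") == "\u09af\u09bc")  # True
```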

As I said before, it's not a problem on our end; it's your problem. Whenever I save
my text, it should remain as it is. If any rendering is needed, the rendering
engine should be responsible for it, like the Uniscribe engine on Windows or
Pango/Qt on Linux. So it would be better if you removed all the normalization
rules from your end and left it to the application end.

Omi, do you have difficulty reading the things I've written?

I ask this not to be rude, but because your responses don't appear to
display any comprehension of any of the following:

  • The reasons given for why normalization is done
  • The reasons given for why the result is 100% correct implementation
    of specs (though the specs might not be to your liking)
  • The fact that I understand the problems with third party software
    that this causes
  • The fact that I am willing to accommodate the issue and made some
    recommendations on how to do this

I'm not going to waste any more time discussing this issue with you
if you're this incapable of following the discussion. If you still
care about this issue, please ask someone who is able to follow an
argument, read and understand documentation, and reason with others
to continue instead of you.

omi wrote:

After doing a lot of R&D, we found that we just need to fix the fonts. Then everything
will be sorted. Microsoft has come up with their solution, and soon we'll fix the
same in other fonts for Linux and OS X. The issue is sorted. Update your fonts
and you'll find everything works perfectly.

Changing all WONTFIX high priority bugs to lowest priority (no mail should be generated since I turned it off for this.)

If I understand this report correctly, it turned out to be a font issue. So I am marking this as FIXED. If this is inaccurate then please REOPEN it.