
Arabic double diacritics presentation
Open, Low, Public, Feature Request

Description

Author: alfarq

Description:
In Arabic, there are presentation forms for double diacritics. For example, the sequence of "U+0651 ARABIC SHADDA" and "U+0650 ARABIC KASRA" can be presented as "U+FC62 ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM". There is no such presentation yet in MediaWiki, since the sequence is swapped after saving. In the previous example, the sequence is swapped to U+0650 followed by U+0651.


Version: 1.16.x
Severity: enhancement
URL: http://id.wikipedia.org/wiki/Pengguna:Alfarq/Arabic_double_diacritics
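
For illustration, a minimal Python sketch (using the standard unicodedata module; the characters are just the example from the description) that reproduces the swap:

```python
import unicodedata

# SHADDA (U+0651, combining class 33) followed by KASRA (U+0650, class 32),
# in the order a ligature-aware renderer would expect them to be typed
s = "\u0651\u0650"

# NFC's canonical reordering sorts combining marks by combining class,
# which swaps the two diacritics
print([hex(ord(c)) for c in unicodedata.normalize("NFC", s)])
# ['0x650', '0x651']
```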

Details

Reference
bz21429

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 10:50 PM
bzimport set Reference to bz21429.

What is the bug? All text is converted to some normalisation form.

alfarq wrote:

Oops, sorry. I meant in the edit box. The result is fine, since both sequences are converted to the correct characters, but not in the edit box. For example, I wrote: ARABIC LETTER ALIF, ARABIC LETTER LAM, ARABIC LETTER HAH, ARABIC LETTER REH, U+0651 ARABIC SHADDA, U+064F ARABIC DAMMA. In the edit box, the double diacritics are displayed as U+FC61 ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM. Whenever I click "Save page" or "Show preview", the source becomes: ARABIC LETTER ALIF, ARABIC LETTER LAM, ARABIC LETTER HAH, ARABIC LETTER REH, U+064F, U+0651. This time, the U+FC61 character I expected to see is gone.

Isn't U+FC61 a compatibility character, whose decomposition and recomposition are excluded from the NFD/NFC canonical equivalences?

If some Arabic fonts do not support two successive diacritics as recommended by Unicode, and only support the decomposable compatibility characters, these fonts are really bogus and should be avoided. But the problem is not there; see below.

If the character is not a canonical equivalent of the two diacritics, it must not be altered (even if its use is not recommended).
In other words, MediaWiki must apply only the NFC normalization, NOT the NFKC normalization.
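
A quick check with Python's unicodedata (a sketch; any conformant Unicode library should agree) confirms that NFC leaves U+FC61 alone while NFKC decomposes it:

```python
import unicodedata

lig = "\ufc61"  # ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM

# NFC applies only canonical mappings, so the compatibility ligature survives
print(unicodedata.normalize("NFC", lig) == lig)  # True

# NFKC applies the compatibility decomposition: SPACE, DAMMA, SHADDA
print([hex(ord(c)) for c in unicodedata.normalize("NFKC", lig)])
# ['0x20', '0x64f', '0x651']
```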

When I look at the UCD, it reveals that U+FC61 decomposes as "[isolated] U+0020 U+064F U+0651".

This means it is just a compatibility decomposition, not a canonical decomposition. (Note also that the decomposition adds an extra space, which in newer documents should rather be a non-breaking space instead of a regular space, to avoid side effects from whitespace compression in HTML and XML.) Note as well that the space, being a starter, still blocks canonical reordering across it.
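
The same mapping can be read straight out of the UCD data shipped with Python (a sketch; the output string format is unicodedata's):

```python
import unicodedata

# The "<isolated>" tag marks this as a compatibility (not canonical) mapping
print(unicodedata.decomposition("\ufc61"))
# <isolated> 0020 064F 0651
```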

I see no reason, then, why MediaWiki would choose to convert U+FC61 incorrectly to U+064F U+0651 (stripping the "[isolated]" compatibility specifier and the space).

And also no reason why it would recombine U+064F U+0651 (adding the leading space and a nonexistent [isolated] form) into U+FC61 in the editor.

The same reasoning should be applied to all the other Arabic compatibility characters (with implicit letter forms), which should be avoided in actual Arabic text unless there is a strong reason to display the character in isolation with a specific form distinct from the normal Arabic presentation rules.
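
As a sketch, the whole family of shadda ligatures (U+FC5E..U+FC63 in Arabic Presentation Forms-A) carries the same kind of compatibility-only mapping:

```python
import unicodedata

# Enumerate the shadda ligature run; each decomposition starts with
# "<isolated>" and an extra SPACE, so none of them is canonical
for cp in range(0xFC5E, 0xFC64):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)} -> {unicodedata.decomposition(ch)}")
```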

Normalisation of the Arabic presentation forms was requested by members of the Arabic Wikipedia community. I recorded the request at bug 9413 and later implemented it.

OK, but bug 9413 only spoke about the presentational forms of letters (i.e. the distinction of *letters* between initial, medial, final, and isolated forms). The Shadda is not a letter and may be inserted at any place within a word as a presentational feature. As it is presentational, changing it through the compatibility mapping changes exactly its presentational semantics.
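
The General_Category properties back this up (a small Python sketch):

```python
import unicodedata

print(unicodedata.category("\u0651"))  # Mn: SHADDA is a nonspacing mark, not a letter
print(unicodedata.category("\ufc61"))  # Lo: the compatibility ligature is classified as a letter
```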

If the purpose was to convey a single meaning, it should have been stripped completely. Where U+FC61 appears, it is used in isolation, and its expected width and appearance are important there. Changing it will alter its width, and the DAMMA may not fit very well.

But maybe font renderers are now capable of handling the decomposed sequence and generating exactly what U+FC61 displays when it is mapped in a font (such a mapping is not required in an Arabic font, even though most fonts do add those mappings).

I'm not sure this is a big issue. What is the problem if we cannot see the difference, except when editing, where you'll type BACKSPACE twice instead of once to delete it completely in insert mode (but there is no difference when you select it with the mouse)?

The only case where it could make a difference is when U+FC61 is followed by another Arabic diacritic (due to canonical reordering after the compatibility decomposition has been applied). This does not change the BiDi behavior or the joining behavior, even if there are spaces or punctuation on both sides.
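
A sketch of that case, again with Python's unicodedata: appending a KASRA after the ligature and applying NFKC lets canonical reordering move it before the SHADDA:

```python
import unicodedata

s = "\ufc61\u0650"  # ligature (SHADDA+DAMMA), then a trailing KASRA

out = unicodedata.normalize("NFKC", s)
print([hex(ord(c)) for c in out])
# ['0x20', '0x64f', '0x650', '0x651'] -- after the compatibility
# decomposition, KASRA (class 32) is reordered before SHADDA (class 33)
```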

If it ever appears in the middle of a word, however, this will change its appearance, because the decomposition and the joining type will alter its form. I doubt such cases exist in normal Arabic. This could be an issue in IDNA domain names, if this compatibility character were not mandatorily mapped to the normal shadda+diacritic sequence (just like the other Arabic compatibility presentational forms), but it would merit some investigation to check that this is effectively the case with the newer IDNA RFCs and the Unicode reports about IDNA (which have relaxed some rules to allow more characters that were restricted before).

But if this causes any problem in a URL inserted as the target of an external link, one could still use the "xn--" notation in the hidden URL. I also have serious doubts that such a URL with compatibility characters in the domain name would be harmless (most probably a cybersquatting domain), whereas the character could be valid and distinct within the query string, anchor, or path part of the URL, for example in a link to a site detailing the Unicode properties of this compatibility character; but maybe there's still a way to encode the URL specially.

Anyway, all those Arabic compatibility characters are really not recommended within any part of a stable URL, and have long since stopped being generated by Arabic keyboards in any decent browser. They are also most probably flagged as dangerous by browsers or security suites if ever found in a URL, notably if they appear in the domain-name part under some IDNA-enabled registry or private subregistry that does not restrict those characters in its DNS records; the browser or its security extension will then propose to follow the link with the normal characters, cancel the navigation, or confirm that the user really wants to go there after being warned.

sgb-wobeck wrote:

This bug was first reported in Bug 2399 - Unicode normalization "sorts" Hebrew/Arabic/Myanmar vowels wrongly.

Amire80 subscribed.
Aklapper changed the subtype of this task from "Task" to "Feature Request".Feb 4 2022, 11:01 AM