
Devanagari and Arabic combining character handling
Closed, Declined · Public

Description

User:Atitarev from Wiktionary has complained that the normalisation used by Lucene does not suit Hindi and Arabic. In the examples I have been given, combining characters such as U+093C are used to add diacritics to base letters, and the resulting combinations have no composed form in Unicode. It is requested that the combining marks be stripped before search indexing is done, so that titles which differ only in the combining marks they contain can be returned in "did you mean" and autocomplete results.

A list of affected characters will be given as a comment or attachment.
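For illustration, the requested stripping can be sketched with Python's standard unicodedata module. This is only a sketch of the idea, not the actual Lucene/CirrusSearch implementation: decompose to NFD, drop nonspacing marks (Unicode category Mn, which covers U+093C and the Arabic harakat), and recompose.

```python
import unicodedata

def strip_combining_marks(text):
    """Decompose to NFD, drop nonspacing marks (category Mn), recompose.

    This conflates strings that differ only by combining marks,
    e.g. the nuqta U+093C mentioned above.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)
```

For example, this maps क़ (U+0958) to क and أ (U+0623) to ا, since both decompose to a base letter plus a nonspacing mark.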


Version: unspecified
Severity: enhancement

Details

Reference
bz27055
TitleReferenceAuthorSource BranchDest Branch
dev: Stop printing success update notes twicerepos/releng/cli!10migrdontPrintUpdateNotesTwicemain
Customize query in GitLab

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 11:19 PM
bzimport added projects: CirrusSearch, I18n.
bzimport set Reference to bz27055.
bzimport added a subscriber: Unknown Object (MLST).

The discussion can be seen here, but here are the diacritics and characters provided to me:

Hindi:
First of all, the pairs with nuqta (a dot underneath) and without it should be searchable the same way Roman letters with diacritics and without are searchable.

  • क़/क ख़/ख ग़/ग ज़/ज फ़/फ ड़/ड ढ़/ढ

The letters are not identical, but they should be treated as equivalent, so that if a user typed खून, ख़ून would also be listed.

  • Words containing the diacritics ॉ (candra) and ् (virama) should be equal to those without them: चॉकलेट / चाकलेट, सन् / सन. This is similar to the way English entries with a space are equal to those having a hyphen (-) between them.
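The Hindi requests above are all strippings of specific Devanagari marks: each nuqta letter canonically decomposes to its base letter plus U+093C, and candra ॉ (U+0945) and virama ् (U+094D) are single combining marks. A minimal sketch in Python (again illustrative only, not the actual analyzer):

```python
import unicodedata

# Devanagari marks the request asks to ignore: nuqta, candra o, virama.
IGNORABLE = {"\u093C", "\u0945", "\u094D"}

def fold_hindi(text):
    """NFD-decompose, drop the three marks above, recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in IGNORABLE)
    return unicodedata.normalize("NFC", kept)
```

With this, ख़ (U+0959) folds to ख and सन् folds to सन. Note that dropping ॉ alone does not turn चॉकलेट into चाकलेट (the latter has the vowel sign ा), so that pair would need an explicit mapping rather than mark stripping.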

Arabic:

  • Different forms of alif: ا, أ‎, إ‎, ﺁ‎ and ٱ‎‎ should be searchable together, e.g. أمس and امس, etc.
  • Words containing any of these diacritics should be searchable as if the diacritics were absent, and vice versa:

ـَ fatHa, ـِ kasra, ـُ Damma, ـْ sukuun, ـّ shadda, ـٰ dagger 'alif.

  • ـٌ tanwiin al-Damm (تنوين الضم)
  • ـٍ tanwiin al-kasr (تنوين الكسر)
  • ـً tanwiin al-fatH (تنوين الفتح)
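The Arabic requests above can likewise be sketched as a small folding step: map the alif variants to bare alif and drop the harakat (all the vowel marks and tanwin listed are single code points). This is an illustrative sketch, not the actual analyzer; the presentation form ﺁ used above would additionally need compatibility normalization (NFKC) to reach آ first.

```python
ALIF = "\u0627"               # ا bare alif
ALIF_VARIANTS = {
    "\u0622",                 # آ alif with madda
    "\u0623",                 # أ alif with hamza above
    "\u0625",                 # إ alif with hamza below
    "\u0671",                 # ٱ alif wasla
}
# fatha, kasra, damma, sukun, shadda, dagger alif, and the three tanwin
HARAKAT = {"\u064B", "\u064C", "\u064D", "\u064E",
           "\u064F", "\u0650", "\u0651", "\u0652", "\u0670"}

def fold_arabic(text):
    out = []
    for ch in text:
        if ch in ALIF_VARIANTS:
            out.append(ALIF)        # unify alif variants
        elif ch not in HARAKAT:
            out.append(ch)          # drop vowel marks, keep the rest
    return "".join(out)
```

For example, أمس folds to امس, so the two spellings index identically.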

Persian often uses a zero-width non-joiner (U+200C) as in ویکی‌پدیا. People who don’t know how to type it tend to substitute a space: ویکی پدیا. It’s a misspelling, but lots of people can’t help it.

In languages like Khmer and Thai that do not use word spaces, there is often a zero-width space (U+200B), as in តើអ្នកនិយាយ​ភាសាអង់គ្លេស​ទេ. More often than not, it is simply left out (តើអ្នកនិយាយភាសាអង់គ្លេសទេ). Both spellings are correct.
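One plausible pre-tokenization step for these two cases (a sketch under the assumptions in the two comments above, not what the search stack actually does): drop the zero-width non-joiner so joined and unjoined Persian spellings index identically, and treat the zero-width space as an ordinary word break.

```python
ZWNJ = "\u200C"   # zero-width non-joiner (Persian)
ZWSP = "\u200B"   # zero-width space (Khmer, Thai)

def pre_tokenize(text):
    """Drop ZWNJ so joined/unjoined spellings index identically,
    and turn ZWSP into an ordinary space so it acts as a word break."""
    return text.replace(ZWNJ, "").replace(ZWSP, " ")
```

Note this does not help with the space-substituted misspelling ویکی پدیا, which still tokenizes as two words; that would need handling at query time.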

I think Anatoli neglected to mention the word-final Arabic pair ه/ة. The final letter ة may be typed as ه.

Just adding a note that stripping diacritics from Latin letters is not always the correct thing to do. It is obvious that we need to support different models for different languages.

(In reply to comment #1)


Actually, चॉकलेट can also be written as चौकलेट or चोकलेट. However, everything other than चॉकलेट is grammatically incorrect. But if equivalence is to be added, it should be between चॉकलेट and चौकलेट, not चाकलेट, because equating with चाकलेट would also introduce a lot of unwanted equivalences, like हॉल (hall) and हाल (the condition someone is in).

The handling for halant/viram is correctly stated as equivalence. However, there is more to it. Five characters in Hindi, when followed by a halant, can be replaced by an anusvara on the next character. All five represent nasal sounds, which the anusvara can stand for. For example: सङ्गीत/संगीत, सम्वत/संवत

The five characters are ङ ञ ण न म

But not every anusvara can be equated with each of the five, since each has a different sound.
A grammatical rule decides which one applies, depending on the character that follows the nasal. Case by case:

क ख ग घ are preceded by ङ
च छ ज झ are preceded by ञ
ट ठ ड ढ are preceded by ण
त थ द ध are preceded by न
प फ ब भ are preceded by म

Note that this matches the Unicode code point order: the four consonants come, in the stated order, immediately before the respective nasal consonant.

So, if I type in सन् , I would expect संतान to show up, but not संभव.

However, this restricted equivalence is the ideal case with perfect grammar. In actual usage, न् has been used in place of ङ्, ञ्, and ण्, but not म्, since that is an entirely different sound. So if I type सन्, I would also expect संगीत, संजय, and संडे to show up, but still not संभव. I hope I have made this clear enough.
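The strict (perfect-grammar) version of the rule above can be sketched as a replacement pass: nasal + virama becomes anusvara only when the following consonant belongs to that nasal's class. This is an illustrative sketch, not an existing analyzer component.

```python
# Varga nasal -> the stops it precedes, per the rule above.
VARGA = {
    "\u0919": "\u0915\u0916\u0917\u0918",  # ङ before क ख ग घ
    "\u091E": "\u091A\u091B\u091C\u091D",  # ञ before च छ ज झ
    "\u0923": "\u091F\u0920\u0921\u0922",  # ण before ट ठ ड ढ
    "\u0928": "\u0924\u0925\u0926\u0927",  # न before त थ द ध
    "\u092E": "\u092A\u092B\u092C\u092D",  # म before प फ ब भ
}
VIRAMA = "\u094D"    # ्
ANUSVARA = "\u0902"  # ं

def fold_anusvara(text):
    """Replace nasal+virama with anusvara when the next character
    belongs to that nasal's class; leave everything else alone."""
    out = []
    i = 0
    while i < len(text):
        if (i + 2 < len(text) and text[i] in VARGA
                and text[i + 1] == VIRAMA
                and text[i + 2] in VARGA[text[i]]):
            out.append(ANUSVARA)
            i += 2  # skip nasal and virama; keep the following stop
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

So सङ्गीत folds to संगीत and सन्तान to संतान, while सन्भव stays unchanged because भ is not in न's class. The looser real-world usage described above (न् standing in for ङ्, ञ्, ण्) would need the न entry widened accordingly.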

PS: The nuqta stuff is correct.

Bug 33548 is related to this. It's about the appearance of Devanagari diacritics in the "did you mean" results.

(In reply to Dave Ross from comment #1)

Arabic:

  • Different forms of alif: ا, أ‎, إ‎, ﺁ‎ and ٱ‎‎ should be searchable together, e.g. أمس and امس, etc.

امس : search=امس
https://ar.wikipedia.org/w/index.php?title=%D8%AE%D8%A7%D8%B5%3A%D8%A8%D8%AD%D8%AB&profile=default&search=امس&fulltext=Search&uselang=en
There is a page named "امس" on this wiki.

أمس :
https://ar.wikipedia.org/w/index.php?title=%D8%AE%D8%A7%D8%B5%3A%D8%A8%D8%AD%D8%AB&profile=default&search=أمس&fulltext=Search&uselang=en
Create the page "أمس" on this wiki!

أمس :
https://ar.wikipedia.org/w/index.php?title=%D8%AE%D8%A7%D8%B5%3A%D8%A8%D8%AD%D8%AB&profile=default&search=أمس&fulltext=Search&uselang=en&srbackend=CirrusSearch
Create the page "أمس" on this wiki!

Needs reassessment with Cirrus.

Restricted Application added a subscriber: Aklapper.

Needs reassessment with Cirrus.

Okay, let’s do that!

Since different wikis have different analysis chains, I’m going to evaluate these based on the native analysis chain and the English one. The native analysis chain is in place on the wikis in that language, and English has the most comprehensive (but uncustomized) folding/diacritic removal enabled—because English speakers care nothing for diacritics!
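Spot-checks like the ones below can be reproduced against an Elasticsearch node with the _analyze API, which returns the tokens an analyzer would index. An illustrative console request using the built-in hindi analyzer (the actual CirrusSearch analysis chains are customized beyond the built-ins):

```
POST /_analyze
{
  "analyzer": "hindi",
  "text": "ख़ून"
}
```

Comparing the token output for two spellings shows whether they are indexed the same.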

Hindi

  • Nuqta are removed by the current Hindi and English analysis chains. Searching for either क़ or क on Hindi Wikipedia or English Wiktionary returns the other. The exact match is ranked first, and the other a bit lower (second or third).
  • Virama/halant are removed by the current Hindi and English analysis chains. Searching for घ् finds घ on either. The specific example of सन् and सन works on English Wiktionary but not on Hindi-language wikis because they are not normalized/stemmed the same; the Hindi language analysis does additional Hindi-specific processing. This would be like the English language analysis treating resume and resumé differently because the first one is a verb and the second one is a noun (English doesn’t do that because English speakers care nothing for diacritics!). I don’t know any Hindi so I’m not digging into this case now, but in general virama are removed.
  • Candra are not removed, but they are converted by the Hindi language analyzer internally such that चॉकलेट, चौकलेट, and चोकलेट are all indexed the same, while चाकलेट is indexed separately. In the English language analyzer, all four are distinct.
  • I did not entirely follow the discussion about halant/viram and anuswara, but neither the Hindi nor the English language analyzer seems to consider them equivalent, as सङ्गीत/संगीत and सम्वत/संवत are distinct.

Arabic

  • Some alif variants are indexed the same: ا أ إ are all indexed as ا by both the Arabic and English language analyzers. ﺁ is not converted to ا by the Arabic analyzer, but is by the English analyzer. Neither converts ٱ. The specific examples أمس and امس do find each other.
  • Vowel marks (fathah, etc.) are stripped by both Arabic and English language analyzers.
  • Final ه and ة seem to be indexed the same. I tested مربوطة / مربوطه, مفتوحة / مفتوحه, and رسالة / رساله with both the Arabic and English analyzers.
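The ه/ة equivalence observed above amounts to a single-character fold, which can be sketched like this (illustrative only, not the analyzers' actual rule):

```python
def fold_teh_marbuta(text):
    """Index ة as ه so e.g. رسالة and رساله match.

    ة (teh marbuta) occurs only word-finally in Arabic, so a blanket
    replacement is a reasonable approximation.
    """
    return text.replace("\u0629", "\u0647")  # ة -> ه
```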

Space-like things

  • Persian and English both break on the zero-width nonjoiner and zero-width spaces; so that seems correct.
  • Thai breaks on neither the zero-width nonjoiner nor the zero-width spaces, and neither is removed from the input string! That’s probably a problem worth fixing. Given that we’re currently using the Elastic Thai language analyzer, I would have expected better.
  • Khmer uses the “default” analyzer and it drops the zero-width nonjoiner and joins things on either side. It treats the zero-width space like a space and breaks words. Not sure if that is desirable or not for a spaceless language like Khmer.

Recommendations

  • I suggest closing this ticket. A lot of the specifics have been dealt with, and it was filed against the pre-Cirrus, pre-Elastic search. It also covers way too many different topics.
  • New tickets should address one language at a time if possible. If there is a specific character that should be treated the same across many or most languages, that might make sense as one ticket, though implementation might still be on a language-by-language basis, and it may turn into a tiny Epic (hey, is that an oxymoron!?).
    • I’m not sure about the alif variants in Arabic. Two are not folded by the Arabic analyzer.
    • The Thai analyzer seems to be doing weird things with non-simple space characters, but it would require digging into it a lot more to figure it out, and tokenizing in Thai is a big topic—as with all the (semi-)spaceless languages.
debt subscribed.

I'm taking @TJones's recommendation to close this particular ticket, as it's a few years old and many of the issues brought up in it have already been resolved.

If there are issues not already addressed by updates to Cirrus, please file new tickets (one per subject/bug).