User:Atitarev from Wiktionary has complained that the normalisation used by Lucene does not suit Hindi and Arabic. In the examples I have been given, composing characters such as U+093C are used add diacritics to characters, and the resulting combinations have no composed form in Unicode. It is requested that the composing marks be stripped before search indexing is done, so that titles which differ only by the combining marks they contain can be returned in "did you mean" and autocomplete results.
A list of affected characters will be given as a comment or attachment.
Version: unspecified
Severity: enhancement