Lucene tokenizes the word in format control characters like ZWJ and ZWNJ causing words in Indic languages, Sinhala broken in unwanted places.
This is the log from the lucened when a string ශ්රීලංකා (Srilanka, written in Sinhala Language) is searched:
25959 [pool-2-thread-1] INFO org.wikimedia.lsearch.search.SearchEngine - search wikidb: query=[ශ්රීලංකා] parsed=[custom(+(+(contents:ශ්^0.2 contents:ශ^0.1) +(contents:රීලංකා^0.2 contents:රලක^0.1)) relevance ([((P contents:"(ශ් ශ) (රීලංකා රලක)"~100) (((P sections:"(ශ් ශ)") (P sections:"(රීලංකා රලක)") (P sections:"(ශ් ශ) (රීලංකා රලක)"))^0.25))^2.0], ((P alttitle:"(ශ් ශ)"^2.5) (P alttitle:"(රීලංකා රලක)"^2.5) (P alttitle:"(ශ් ශ) (රීලංකා රලක)"~20^2.5)) ((P related:"(ශ් ශ)"^12.0) (P related:"(රීලංකා රලක)"^12.0) (P related:"(ශ් ශ) (රීලංකා රලක)"^12.0))) (P alttitle:"ශ් රීලංකා"~20))] hit=[0] in 250ms using IndexSearcherMul:1316871160395
ශ්රීලංකා is 0DC1 + 0DCA + 200D + 0DBB + 0DD3 + 0DBD + 0D82 + 0D9A + 0DCF
or SHA + VIRAMA + ZWJ + RA + VOWEL SIGN II + LA + ANUSVARA + KA + VOWEL SIGN AA
The word is single one and cannot be tokenized further, but we can see that It is tokenized at the place of ZWJ.
The solution would be writing language specific tokenization rules in Lucene.
Version: unspecified
Severity: normal