Lucene tokenization is wrong for Indic languages
Closed, Resolved, Public

Description

Lucene splits words at format control characters such as ZWJ (zero-width joiner) and ZWNJ (zero-width non-joiner), which breaks words in Indic languages such as Sinhala in unwanted places.

This is the log from lucened when the string ශ්‍රීලංකා (Sri Lanka, written in Sinhala) is searched:

25959 [pool-2-thread-1] INFO org.wikimedia.lsearch.search.SearchEngine - search wikidb: query=[ශ්‍රීලංකා] parsed=[custom(+(+(contents:ශ්^0.2 contents:ශ^0.1) +(contents:රීලංකා^0.2 contents:රලක^0.1)) relevance ([((P contents:"(ශ් ශ) (රීලංකා රලක)"~100) (((P sections:"(ශ් ශ)") (P sections:"(රීලංකා රලක)") (P sections:"(ශ් ශ) (රීලංකා රලක)"))^0.25))^2.0], ((P alttitle:"(ශ් ශ)"^2.5) (P alttitle:"(රීලංකා රලක)"^2.5) (P alttitle:"(ශ් ශ) (රීලංකා රලක)"~20^2.5)) ((P related:"(ශ් ශ)"^12.0) (P related:"(රීලංකා රලක)"^12.0) (P related:"(ශ් ශ) (රීලංකා රලක)"^12.0))) (P alttitle:"ශ් රීලංකා"~20))] hit=[0] in 250ms using IndexSearcherMul:1316871160395

ශ්‍රීලංකා is 0DC1 + 0DCA + 200D + 0DBB + 0DD3 + 0DBD + 0D82 + 0D9A + 0DCF
or SHA + VIRAMA + ZWJ + RA + VOWEL SIGN II + LA + ANUSVARA + KA + VOWEL SIGN AA
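
The breakdown above can be checked with a few lines of Python. This is only an illustrative sketch using the standard `unicodedata` module; the helper name `describe` is made up for this example, and the word is written with escapes so the invisible ZWJ is not lost in copy and paste. The formal Unicode names it prints are the Sinhala-specific ones, so they differ from the informal names (SHA, VIRAMA, ...) used above.

```python
import unicodedata

# The search string from this report, written with escapes so that the
# invisible ZWJ (U+200D) survives copy and paste.
WORD = "\u0dc1\u0dca\u200d\u0dbb\u0dd3\u0dbd\u0d82\u0d9a\u0dcf"


def describe(text):
    """Return (hex code point, general category, Unicode name) per character."""
    rows = []
    for ch in text:
        rows.append(
            (
                f"{ord(ch):04X}",
                unicodedata.category(ch),
                unicodedata.name(ch, "<unnamed>"),
            )
        )
    return rows


rows = describe(WORD)
for code, category, name in rows:
    print(code, category, name)
```

The third entry is the one that matters here: U+200D has general category Cf (format), not a letter or mark category, which is why a tokenizer that only accepts letters and marks as word characters breaks the word at that position.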

The word is a single unit and cannot be tokenized further, but the log shows that it is split at the position of the ZWJ.

A solution would be to write language-specific tokenization rules in Lucene.
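
To make the problem and the proposed direction concrete, here is a minimal Python sketch. It is not Lucene code and does not reproduce the lucene-search2 tokenizer; `tokenize`, `is_word_char`, and the `keep_joiners` flag are hypothetical names for this illustration. It assumes a simple rule: characters in the Unicode letter, mark, and number categories are word characters, and everything else is a separator. With that rule the word splits at the ZWJ into the same two pieces seen in the log; treating ZWJ and ZWNJ as word-internal keeps it whole.

```python
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner
ZWJ = "\u200d"   # zero-width joiner


def is_word_char(ch, keep_joiners):
    """Assumed rule: letters, marks and numbers belong to a word."""
    if keep_joiners and ch in (ZWNJ, ZWJ):
        return True
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def tokenize(text, keep_joiners=False):
    """Split text into runs of word characters (illustrative only)."""
    tokens, current = [], []
    for ch in text:
        if is_word_char(ch, keep_joiners):
            current.append(ch)
        elif current:
            # A separator ends the token collected so far.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# The search string from this report, with the ZWJ written as an escape.
WORD = "\u0dc1\u0dca\u200d\u0dbb\u0dd3\u0dbd\u0d82\u0d9a\u0dcf"

naive = tokenize(WORD)                      # ZWJ (category Cf) acts as a separator
aware = tokenize(WORD, keep_joiners=True)   # ZWJ stays inside the token

print(len(naive), len(aware))
```

In this sketch `naive` contains two tokens (the part before the ZWJ and the part after it), while `aware` contains the whole word as one token. A real fix would also need to decide how to handle joiners at the start or end of a word, and whether the index and the query should normalize them in the same way; the sketch leaves those questions open.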


Version: unspecified
Severity: normal

Details

Reference
bz31135

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 21 2014, 11:50 PM
bzimport set Reference to bz31135.
bzimport added a subscriber: Unknown Object (MLST).

[Merging "MediaWiki extensions/Lucene Search" into "Wikimedia/lucene-search2", see bug 46542. You can filter bugmail for: search-component-merge-20130326 ]

Santhosh, have you tested the results with CirrusSearch ([[mw:Search]])?

TheDJ claimed this task.
TheDJ subscribed.

Provisionally closing, as this seems to be fixed upstream in Lucene 3.1, which I suspect we are now using indirectly through CirrusSearch.