Lucene tokenization is wrong for Indic languages
Closed, Resolved (Public)

Assigned To

Authored By

	• santhosh
	Sep 24 2011, 1:56 PM

Description

Lucene splits words at format control characters such as ZWJ and ZWNJ, so words in Indic languages and Sinhala are broken in unwanted places.

This is the log from lucened when the string ශ්‍රීලංකා (Sri Lanka, written in Sinhala) is searched:

25959 [pool-2-thread-1] INFO org.wikimedia.lsearch.search.SearchEngine - search wikidb: query=[ශ්‍රීලංකා] parsed=[custom(+(+(contents:ශ්^0.2 contents:ශ^0.1) +(contents:රීලංකා^0.2 contents:රලක^0.1)) relevance ([((P contents:"(ශ් ශ) (රීලංකා රලක)"~100) (((P sections:"(ශ් ශ)") (P sections:"(රීලංකා රලක)") (P sections:"(ශ් ශ) (රීලංකා රලක)"))^0.25))^2.0], ((P alttitle:"(ශ් ශ)"^2.5) (P alttitle:"(රීලංකා රලක)"^2.5) (P alttitle:"(ශ් ශ) (රීලංකා රලක)"~20^2.5)) ((P related:"(ශ් ශ)"^12.0) (P related:"(රීලංකා රලක)"^12.0) (P related:"(ශ් ශ) (රීලංකා රලක)"^12.0))) (P alttitle:"ශ් රීලංකා"~20))] hit=[0] in 250ms using IndexSearcherMul:1316871160395

ශ්‍රීලංකා is 0DC1 + 0DCA + 200D + 0DBB + 0DD3 + 0DBD + 0D82 + 0D9A + 0DCF
or SHA + VIRAMA + ZWJ + RA + VOWEL SIGN II + LA + ANUSVARA + KA + VOWEL SIGN AA
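The breakdown above can be checked with a few lines of Python, shown here only as an illustrative sketch. It lists each code point of the word with its Unicode general category. Note that the formal Unicode character names printed by `unicodedata` differ from the transliterated names used above (for example, U+0DCA is named SINHALA SIGN AL-LAKUNA rather than VIRAMA).

```python
import unicodedata

# The search string from this report, written with explicit escapes so the
# invisible ZWJ (U+200D) cannot be lost when copying.
WORD = "\u0dc1\u0dca\u200d\u0dbb\u0dd3\u0dbd\u0d82\u0d9a\u0dcf"

# Hex code points in order, for comparison with the list in the description.
codepoints = [f"{ord(ch):04X}" for ch in WORD]

for ch in WORD:
    # Category "Cf" marks a format control character such as ZWJ.
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")
```

ZWJ is the only character in the word with category `Cf`; the rest are letters (`Lo`) or combining marks (`Mn`, `Mc`).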

This is a single word and cannot be tokenized further, but the parsed query shows that it is split at the ZWJ: ශ් and රීලංකා are treated as separate terms.

The solution would be to write language-specific tokenization rules in Lucene.
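To make the failure mode concrete, here is a minimal sketch in Python. It is not Lucene's tokenizer and not the code used by lucene-search; `tokenize` and its `keep_joiners` flag are hypothetical names for a toy tokenizer that treats letters, marks, and digits as word characters. With the default setting it breaks the word at ZWJ, matching the two terms seen in the log; with `keep_joiners=True` it treats ZWJ and ZWNJ as word-internal, which is one possible shape for the rule proposed above.

```python
import unicodedata

ZWNJ, ZWJ = "\u200c", "\u200d"

# The search string from this report, with the ZWJ written explicitly.
WORD = "\u0dc1\u0dca\u200d\u0dbb\u0dd3\u0dbd\u0d82\u0d9a\u0dcf"


def is_word_char(ch, keep_joiners):
    """Return True if ch should stay inside a token."""
    if keep_joiners and ch in (ZWJ, ZWNJ):
        return True
    # Letters (L*), combining marks (M*) and numbers (N*) are word characters.
    # ZWJ and ZWNJ have category Cf, so they fall through as separators.
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def tokenize(text, keep_joiners=False):
    """Split text into runs of word characters (toy tokenizer, hypothetical)."""
    tokens, current = [], []
    for ch in text:
        if is_word_char(ch, keep_joiners):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


naive = tokenize(WORD)                      # two tokens, split at the ZWJ
fixed = tokenize(WORD, keep_joiners=True)   # one token, the whole word
print(len(naive), len(fixed))
```

A real fix also has to index and query with the same rule, otherwise documents indexed with the old tokenization will still not match.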

Version: unspecified
Severity: normal

Details

Reference: bz31135

Event Timeline

• bzimport raised the priority of this task from to Medium. Nov 21 2014, 11:50 PM

• bzimport added projects: Wikimedia-lucene-search-2, I18n, Upstream.

• bzimport set Reference to bz31135.

• bzimport added a subscriber: Unknown Object (MLST).

• santhosh created this task. Sep 24 2011, 1:56 PM

See also: https://issues.apache.org/jira/browse/LUCENE-2747

Actually, Lucene from 3.1 onwards has an Indic tokenizer: http://lucene.apache.org/java/3_4_0/api/all/org/apache/lucene/analysis/in/IndicTokenizer.html

[Merging "MediaWiki extensions/Lucene Search" into "Wikimedia/lucene-search2", see bug 46542. You can filter bugmail for: search-component-merge-20130326 ]

Santhosh, have you tested the results with CirrusSearch ([[mw:Search]])?

Provisionally closing, as this seems to be fixed upstream in Lucene 3.1, which I suspect we are now using indirectly through CirrusSearch.

Krenair moved this task from Backlog to Patch merged upstream on the Upstream board. Mar 29 2015, 10:18 PM
