
No search results at all when searching in Javanese script on jv.wikipedia
Closed, Declined (Public)

Description

Note: To display the font correctly, visit http://jv.wikipedia.org/wiki/Pitulung:Aksara_Jawa#English

I can't search using the Javanese script on sites like Javanese Wikipedia or Wiktionary. The words I'm using in this example are: ꦱꦸꦒꦼꦁ (transliterated: "sugeng" or "sugêng") and ꦱꦸꦒꦼꦁꦮꦂꦱꦲꦺꦁꦒꦭ꧀ (transliterated: "sugeng warsa enggal" or "sugêng warsa enggal"; the script form is written without spaces).

For example, in jv.wikt there is http://jv.wiktionary.org/wiki/sugêng_warsa_enggal and its script form http://jv.wiktionary.org/wiki/ꦱꦸꦒꦼꦁꦮꦂꦱꦲꦺꦁꦒꦭ꧀

I tried searching for "ꦱꦸꦒꦼꦁ" and "ꦱꦸꦒꦼꦁꦮꦂꦱꦲꦺꦁꦒꦭ꧀", but both searches return zero results (other than a title match for the second search term).

Expected result: the search returns pages that contain the terms, i.e. [[sugêng]], [[sugêng warsa enggal]]

Note: Javanese script is written in scriptio continua, i.e. without spaces between words. I don't know whether that affects the Lucene search (http://en.wikipedia.org/wiki/Scriptio_continua)

Another example in Wikipedia: http://jv.wikipedia.org/wiki/ꦠꦺꦃ Searching for the title ("ꦠꦺꦃ" - "tèh") or for any word in the content returns zero results.


Version: wmf-deployment
Severity: major
URL: https://jv.wikipedia.org/w/index.php?search=ꦠꦺꦃ&uselang=en&fulltext=Search

Details

Reference
bz44350

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 1:40 AM
bzimport set Reference to bz44350.
bzimport added a subscriber: Unknown Object (MLST).

Confirming.

I am logged in, I go to http://jv.wikipedia.org/wiki/Kaca_Utama and enter
ꦠꦺꦃ
in the Search field ("Golèk") and click the dropdown ("ngisi") that pops up.

I get zero results:
"Wonten kaca kanthi nama "ꦠꦺꦃ" ing wiki punika" (roughly: "There is a page named "ꦠꦺꦃ" on this wiki")
however
http://jv.wikipedia.org/wiki/%EA%A6%A0%EA%A6%BA%EA%A6%83 does exist.
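To double-check that the percent-encoded URL above points at the same title that was typed into the search field, the path can be decoded. This is only a small sketch using Python's standard library; the variable names are made up for illustration.

```python
from urllib.parse import quote, unquote

# Percent-encoded page title taken from the URL quoted in the comment above.
encoded_title = "%EA%A6%A0%EA%A6%BA%EA%A6%83"

# unquote() decodes the UTF-8 bytes back into Javanese script.
decoded_title = unquote(encoded_title)
print(decoded_title)

# Re-encoding the decoded title should give the original escape sequence back.
print(quote(decoded_title) == encoded_title)
```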

Maybe this requires fixing bug 39381 and bug 43359 first, but I'm likely wrong.

As the summary for this component says 'For issues with settings of the deployed version on Wikimedia servers see "Wikimedia → lucene-search-2"', I am moving this report.

The outcome of this problem has some similarities to bug 43663, hence CC'ing Ram who is investigating that other bug report too.

Nowadays the output is always

Kesalahan terjadi saat mencari: The search backend returned an error:

when searching for a string.

CC'ing Nik on this.

Updated URL. The error (presumably with the ElasticSearch backend) is now (no error details given):

An error has occurred while searching: The search backend returned an error:

jvwiki is still trying to use MWSearch/lucene-search, which doesn't support Javanese. CirrusSearch/Elasticsearch doesn't have any explicit support for Javanese either. I tried this morning on my sandbox, and Javanese script doesn't cause CirrusSearch to crash, which seems like an improvement over MWSearch.

Lack of explicit support in CirrusSearch's case means that it won't know how to segment the words in Javanese script, so it'll only be able to do things like match exact titles and whole sentences.
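A rough illustration of why segmentation matters. This is a simplified model that splits on whitespace only; it is not the analyzer CirrusSearch or lucene-search actually uses, and the function name is hypothetical.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Naive tokenizer: split on whitespace only (an assumption for illustration)."""
    return text.split()

# Latin transliteration: spaces separate the words.
latin_phrase = "sugeng warsa enggal"

# Javanese script form from the report: written without spaces.
word = "ꦱꦸꦒꦼꦁ"
script_phrase = word + "ꦮꦂꦱꦲꦺꦁꦒꦭ꧀"

latin_tokens = whitespace_tokens(latin_phrase)
script_tokens = whitespace_tokens(script_phrase)

# The Latin phrase yields one token per word, so "sugeng" can be matched alone.
print(latin_tokens)

# The script phrase stays one opaque token, so a query for the first word
# does not match any indexed token even though it is a substring of the text.
print(len(script_tokens), word in script_tokens, word in script_phrase)
```

Under this model, only a query for the whole unsegmented string would hit, which is consistent with the "exact titles and whole sentences" behaviour described above.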

Fortunately, CirrusSearch is in a much better position to get a word segmenter for Javanese script because it is using a modern version of Lucene. Unfortunately, I couldn't find one that already exists, and writing one is a project in itself.

Updated URL: mixing Wiktionary entry (ꦱꦸꦒꦼꦁ - welcome) and Wikipedia entry (ꦠꦺꦃ - tea)

[And here I am, having just stumbled upon this error again a couple of days ago and wondering whether I had submitted a bug yet or not :)]

I believe that worked because it found an exact page match and didn't actually dive into the full text search engine. I believe the fulltext=Search parameter forces a full text search even if there is a matching article title.
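For anyone reproducing this, here is a minimal sketch of building the two kinds of search URL. The parameter names (search, uselang, fulltext) are copied from the URL in the task description; the helper name is hypothetical, and the claim that fulltext=Search skips the exact-title shortcut is the belief stated above, not something this code verifies.

```python
from urllib.parse import urlencode

def search_url(host: str, term: str, fulltext: bool = True) -> str:
    """Build a MediaWiki search URL like the one in the task description."""
    params = {"search": term, "uselang": "en"}
    if fulltext:
        # Believed to force a full-text search even when a title matches.
        params["fulltext"] = "Search"
    # urlencode() percent-encodes the Javanese characters as UTF-8.
    return f"https://{host}/w/index.php?{urlencode(params)}"

# With fulltext=Search: should go through the search backend.
print(search_url("jv.wikipedia.org", "ꦠꦺꦃ"))

# Without it: may be answered by an exact page-title match instead.
print(search_url("jv.wikipedia.org", "ꦠꦺꦃ", fulltext=False))
```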

This should not be an issue for CirrusSearch as it has better support for non-English languages. Since we're in the process of migrating from Lucene to CirrusSearch, I'm marking this as RESOLVED WONTFIX.

If you continue to experience issues with searching in your language with CirrusSearch, feel free to open a bug under MediaWiki extensions -> CirrusSearch.

OK, thanks to everyone who has been working on this bug for the past year!