No normalization for ancient greek accents in searches
Closed, ResolvedPublic

Description

I am a PHP developer trying to use MediaWiki for an Ancient Greek dictionary.
One feature this wiktionary should have is the ability to search for a word without typing accents or other diacritics and still retrieve all the related words that contain diacritics in the search results.

For instance, if I input the Greek word αλφα (alpha), it should also retrieve ἄλφα (with diacritics), if it is present in the article database.
This works on the Modern Greek Wiktionary for words with accents, but it does not seem to work for Ancient Greek, since it uses different kinds of diacritics.

My question is whether this feature is available.
If it is not, I would like some guidance on the best way to implement it.

Paolo


Version: unspecified
Severity: normal

Details

Reference
bz73605

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:53 AM
bzimport set Reference to bz73605.
bzimport added a subscriber: Unknown Object (MLST).

Thanks for taking the time to report this!

I tried the search on https://el.wikipedia.org (which uses the CirrusSearch extension) and αλφα finds άλφα but ἄλφα only seems to find ἄλφα.
Which search backend/extension do you use? Which MediaWiki version is this?

Cirrus uses Elasticsearch for the analysis, which in turn uses Apache Lucene. I imagine the right place to implement this is there.

It looks like https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/el/GreekLowerCaseFilter.java implements the normalization. I'd file a bug over there. It doesn't _look_ like adding the extra normalization would be that hard. I suppose you'd have to decide with them whether they should be enabled by default (so you could just add them to that file) or optional. If optional you'd just make a new filter I believe.

After it's released in Lucene and Elasticsearch, we could enable it by default for Greek across the site, I think.
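
One way to picture the gap reported above: the ά used in modern Greek sits in the basic Greek block (U+03AC), whereas ἄ is a separate precomposed character in the Greek Extended block (U+1F04). The Python sketch below is only an illustration, not Lucene's code. The MONOTONIC table and the modern_only function are hypothetical names, and the sketch assumes a normalizer that knows only the monotonic accented vowels.

```python
# Hypothetical table covering only the monotonic (modern Greek) accented
# vowels and final sigma. This is an assumption for illustration, not a
# copy of any Lucene filter.
MONOTONIC = str.maketrans({
    "\u03ac": "\u03b1",  # ά -> α
    "\u03ad": "\u03b5",  # έ -> ε
    "\u03ae": "\u03b7",  # ή -> η
    "\u03af": "\u03b9",  # ί -> ι
    "\u03cc": "\u03bf",  # ό -> ο
    "\u03cd": "\u03c5",  # ύ -> υ
    "\u03ce": "\u03c9",  # ώ -> ω
    "\u03c2": "\u03c3",  # ς -> σ
})


def modern_only(term: str) -> str:
    """Lowercase and fold only the characters listed in MONOTONIC."""
    return term.lower().translate(MONOTONIC)


modern_alpha = "\u03ac\u03bb\u03c6\u03b1"   # άλφα, monotonic accent
ancient_alpha = "\u1f04\u03bb\u03c6\u03b1"  # ἄλφα, polytonic breathing + accent
plain_alpha = "\u03b1\u03bb\u03c6\u03b1"    # αλφα

print(modern_only(modern_alpha) == plain_alpha)   # the modern form folds
print(modern_only(ancient_alpha) == plain_alpha)  # the polytonic form does not
```

Under that assumption, the polytonic form passes through untouched, so it can only match itself in the index.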

(In reply to Andre Klapper from comment #1)

Thanks for taking the time to report this!

I tried the search on https://el.wikipedia.org (which uses the CirrusSearch
extension) and αλφα finds άλφα but ἄλφα only seems to find ἄλφα.
Which search backend/extension do you use? Which MediaWiki version is this?

Thank you, Andre, for the reply.
This is the same situation I found in my searches.

What I need is to be able to search for and retrieve Ancient Greek words both with the vowels' orthographic details specified (the search string άλφα retrieves άλφα, αλφα and άλφα) and without them (the search string αλφα retrieves άλφα, αλφα and άλφα).

The fact that it works for Modern Greek but not for Ancient Greek suggests to me that Ancient Greek is not supported, while Modern Greek, which has different orthographic details, is.

(In reply to Andre Klapper from comment #1)

Thanks for taking the time to report this!

I tried the search on https://el.wikipedia.org (which uses the CirrusSearch
extension) and αλφα finds άλφα but ἄλφα only seems to find ἄλφα.
Which search backend/extension do you use? Which MediaWiki version is this?

About the second part of the question: I am at a preliminary stage of this project and have not installed MediaWiki for it yet, so for the moment I have only tested on public MediaWiki instances, for instance el.wiktionary.org.

I will do local tests in the next few days. Do you have any suggestions about search backends or extensions?

Thanks again

Paolo

(In reply to Nik Everett from comment #2)

Thank you Nik, I had a look at that file.
I am not an experienced MediaWiki developer, but if the problem is really related to that file, maybe I can help by adding the extra normalization.

Thanks

Paolo

(In reply to paolo anghileri from comment #5)

If you want to propose a change to implement it in Lucene then link it here and I'll jump over there and help. I'm not a Lucene committer but I can certainly review it and prod a committer.

(In reply to paolo anghileri from comment #4)

I will do local tests in the next few days. Do you have any suggestions about
search backends or extensions?

Use CirrusSearch. It's the search backend that we use on all of our wikis. It's better than the built-in MySQL search in just about every way. It's also the only option for getting that normalization from Lucene to take effect.

(In reply to Nik Everett from comment #6)

Given that I am not a Wikimedia expert and have not yet explored the CirrusSearch code: as a CirrusSearch developer, do you think this normalization should go through Lucene, or is it possible to implement it directly in the CirrusSearch extension, or maybe in its dependency Elasticsearch?

Otherwise, if this can only be done through Lucene, I'll try adding the extra normalization in Lucene and propose a patch for it.

(In reply to paolo anghileri from comment #7)

(In reply to Nik Everett from comment #6)

Given that I am not a Wikimedia expert and have not yet explored the
CirrusSearch code: as a CirrusSearch developer, do you think this normalization
should go through Lucene, or is it possible to implement it directly in the
CirrusSearch extension, or maybe in its dependency Elasticsearch?

Otherwise, if this can only be done through Lucene, I'll try adding the extra
normalization in Lucene and propose a patch for it.

Try getting it in Lucene. Anything in Cirrus would be a nasty hack.

(In reply to Nik Everett from comment #8)

Thanks Nik, I'll try following this way.
As you suggested, I'll post a link to the Lucene patch here soon, so you can review it.

Thanks for your suggestions

Paolo

(In reply to Nik Everett from comment #8)

I have done some research into Lucene and Elasticsearch.

What I have found is that Lucene's ICUTransformFilterFactory can produce this result:

Μῆνιν ἄειδε, θεὰ, Πηληϊάδεω Ἀχιλλῆος ->

1. μηνιν
2. αειδε
3. θεα
4. πηληιαδεω
5. αχιλληοσ

So I guess Ancient Greek normalization is already implemented in Lucene.
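
For readers without a Lucene setup, the folding shown above can be approximated in plain Python. This is a minimal sketch, not the ICU transform itself: it assumes that canonical decomposition (NFD), removal of combining marks, lowercasing, and mapping final sigma to σ are enough to get the same tokens. The function name fold_tokens is made up for this example.

```python
import re
import unicodedata

LINE = "Μῆνιν ἄειδε, θεὰ, Πηληϊάδεω Ἀχιλλῆος"


def fold_tokens(text: str) -> list[str]:
    """Approximate diacritic folding for Greek (illustrative only)."""
    # Split each precomposed letter into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (accents, breathings, diaeresis, iota subscript).
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lowercase, then map final sigma to the regular sigma.
    folded = stripped.lower().replace("ς", "σ")
    # Split on anything that is not a word character.
    return re.findall(r"\w+", folded)


print(fold_tokens(LINE))
# ['μηνιν', 'αειδε', 'θεα', 'πηληιαδεω', 'αχιλληοσ']
```

The same function maps both άλφα and ἄλφα to αλφα, which is the behaviour asked for in the original report.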

In CirrusSearch sources I have seen that this extension makes some use of ICU.

I have installed MediaWiki, Elastica, Elasticsearch and the Elasticsearch ICU analysis plugin, but at the moment it does not normalize Ancient Greek characters with diacritics.

My question is about the CirrusSearch ICU implementation.
Does it use this transform filter? If not, is it possible to implement this functionality in the extension?

Thanks for your help

Paolo

Restricted Application added a subscriber: StudiesWorld.
Deskana triaged this task as Lowest priority. Dec 4 2015, 5:28 AM
Deskana moved this task from Needs triage to Search on the Discovery-ARCHIVED board.
Deskana subscribed.

In el.wiktionary we have a Lua module, https://el.wiktionary.org/wiki/Module:Kleida-el, which constructs the DEFAULTSORT for Greek (ancient and modern) lemmas. At the moment it works only for the first 32 letters, but the idea can be transferred here. Also, since we already use it, almost all lemmas have a DEFAULTSORT that could be used for such a search.

debt added subscribers: TJones, debt.

This might be best fixed in Lucene (rather than CirrusSearch); seeing if @TJones has any further insights on the issue.

It's been two and a half years, so I don't know if it is too late to help @Panghileri, but I do know what to do!

This can be accomplished in MediaWiki / Elasticsearch without making changes upstream to Lucene. The Elasticsearch documentation shows how to "unpack" language analyzers. So, instead of using the one-piece analyzer, you can break it into its parts and then customize it—by adding, changing, or removing parts. In this case, you could add ICU folding, which should cover normalizing Greek diacritics for both modern and ancient Greek. The info for the Greek analyzer is here; my guess is that you'd want to add ICU-folding after the Greek stemmer.
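
As a rough sketch of what such an unpacked analyzer could look like, here are index settings built as a Python dict and printed as JSON. The component list follows the Greek analyzer breakdown in the Elasticsearch documentation as I remember it (standard tokenizer, Greek lowercase, Greek stop words, Greek stemmer), so check it against the docs for your version. The names greek_folded, greek_lowercase, greek_stop and greek_stemmer are just labels chosen for this example, the optional keyword_marker filter is left out, and icu_folding assumes the ICU analysis plugin is installed.

```python
import json

# Illustrative index settings: the Greek analyzer rebuilt from its parts,
# with icu_folding appended after the stemmer (an assumption, per the guess
# in the comment above).
settings = {
    "analysis": {
        "filter": {
            "greek_lowercase": {"type": "lowercase", "language": "greek"},
            "greek_stop": {"type": "stop", "stopwords": "_greek_"},
            "greek_stemmer": {"type": "stemmer", "language": "greek"},
        },
        "analyzer": {
            "greek_folded": {
                "tokenizer": "standard",
                "filter": [
                    "greek_lowercase",
                    "greek_stop",
                    "greek_stemmer",
                    "icu_folding",  # provided by the ICU analysis plugin
                ],
            }
        },
    }
}

# The JSON can be used as the "settings" part of an index-creation request.
print(json.dumps(settings, ensure_ascii=False, indent=2))
```

Whether folding belongs before or after the stemmer is worth testing on real data, since the stemmer may behave differently on folded and unfolded input.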

I have unpacked a few of these, and each time there are slight changes between the one-piece version and the unpacked version, usually in the treatment of a few Unicode characters. I did an analysis for ASCII-folding in French (T142620), and my detailed write up is here. The actual code change was done in T144429, and the Gerrit patch is available for reference.

We have since implemented a feature that converts ASCII-folding to ICU-folding if the ICU plugin is available. I think you'll need the ICU plugin to get the non-ASCII folding you want.

Oddly, on Greek Wikipedia it is still the case that αλφα finds άλφα but ἄλφα only finds ἄλφα—we haven't unpacked it yet. English wikis, however, do have ICU-folding enabled, and English Wiktionary has lots of modern and ancient Greek words in it. Searching for αλφα finds both άλφα and ἄλφα, so it seems that ICU folding does the trick.

You can see the config for English Wiktionary here. If you search for "text" on the page (with quotes) you can see the config with ICU Folding enabled.

Some additional thoughts:

  • Depending on your setup, it might make more sense to create a custom Elasticsearch analyzer and use it rather than going through the MediaWiki config. Not sure about that.
  • If you don't want to or can't install the ICU plugin, and you only really care about Greek, and you want all the versions of a letter (with or without diacritics) to be equivalent, you could write a custom character filter to map each of the accented versions to the unaccented version. It'd be tedious, but wouldn't require an additional plugin.
  • If you are using the MediaWiki setup and you want folding on the plain field (search for "plain" in the English Wiktionary config), you might want to use preserve_original and preserve_original_recorder, which are in a custom Cirrus plugin; they do the same thing as ascii_folding_preserve for ICU folding.
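
The character-filter option in the second bullet does not have to be typed by hand. The sketch below generates the rules with Python's unicodedata module, assuming that stripping combining marks after canonical decomposition gives the unaccented letter you want. It walks the Greek and Greek Extended blocks and emits rules in the "source => target" syntax of Elasticsearch's mapping character filter. The names greek_mapping_rules and greek_unaccent are made up for this example.

```python
import unicodedata


def strip_marks(ch: str) -> str:
    """Return ch without combining marks (accents, breathings, etc.)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def greek_mapping_rules() -> list[str]:
    """Build 'accented => unaccented' rules for Greek letters."""
    rules = []
    # Greek and Coptic (U+0370..U+03FF) and Greek Extended (U+1F00..U+1FFF).
    for block in (range(0x0370, 0x0400), range(0x1F00, 0x2000)):
        for codepoint in block:
            ch = chr(codepoint)
            # Skip unassigned code points, punctuation and symbols.
            if not unicodedata.category(ch).startswith("L"):
                continue
            base = strip_marks(ch)
            if base and base != ch:
                rules.append(f"{ch} => {base}")
    return rules


# Hypothetical char_filter definition for the index settings.
char_filter = {
    "greek_unaccent": {"type": "mapping", "mappings": greek_mapping_rules()}
}

print(len(char_filter["greek_unaccent"]["mappings"]), "rules generated")
print(char_filter["greek_unaccent"]["mappings"][:5])
```

Case is left alone here, on the assumption that a lowercase token filter runs later in the analyzer.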

@debt, I suggest closing this ticket, since I think we've answered the question—albeit very tardily. I don't know if we should open a ticket for Greek language wikis to enable ICU folding there or not—I didn't find one, and I don't know if it's a problem there or not. T132637 was for English Wiktionary, which has the desired folding.

A bug, which may be related, still exists for Greek terms (in ALL projects, even in en.wiktionary). Typing anything that ends in an accented letter into the search box does not produce any suggestions that include the last letter (even if they exist, e.g. καλά). Copying and pasting works. Typing anything afterwards (e.g. a space) also works. To the user it seems as if the search is not done on the letters actually typed, and the code is "waiting" for something.

@Xoristzatziki, I have a partial answer for you. If that's not good enough, you should open a separate bug, because the scope is much bigger than accented Greek characters. If you do open a separate bug, please include your operating system and browser info, because I'm pretty sure this is a JavaScript issue.

For my quick test, I'm on a Mac, using the American, French, and Greek keyboards, and I tested in Chrome, Safari, and Firefox. To my surprise, they all behave the same.

If you use dead keys (keys that put an accent or other diacritic on the next letter you type), the JavaScript "keypress" event listener doesn't get the message that anything has happened. I tested this with both Latin and Greek letters on the Mac keyboards.

As I understand it, the Greek keyboard uses a dead key to add ´ and ¨ to vowels. Similarly, the Mac American keyboard has dead keys for several diacritics (I use ´ ¨ ˆ ` ˜ regularly). If I type resumé to search on English Wiktionary, I also don't get any more suggestions for the final é. (BTW, it happens for non-final letters, too, if you pause, but it's easy to miss if you keep typing.)

On the French keyboard, é has its own key, and when I type resumé using that keyboard, it behaves as expected.

On the Mac Greek keyboard (so this probably does not apply to Windows or Linux), I can type ά by typing option+shift+α. If I type καλά this way, it gets suggestions as expected. Similarly, you can use option+shift+<x> to type other accented vowels: 1/έ 2/ί 3/ή 4/ό 0/ύ ./ώ. I didn't see any precomposed versions with diaeresis (i.e., ϊ or ϋ). These non-dead-key versions generate new suggestions.

So, the problem isn't accented characters per se, but rather characters that have to be typed with dead keys, at least on a Mac keyboard. I'm not familiar with the UI code that's handling all this, so I have no idea how easy it would be to fix. Searching online shows a lot of people complaining about this, but no obvious solutions.

The problem is in the code for sure. On the Greek keyboard, the dead key for the accent is typed first, so the last key pressed is a non-dead key. onkeyup works in my tests.

TJones added a subscriber: Jdrewniak.

I think we can close this ticket: on both English Wiktionary and Greek Wiktionary, searches for αλφα match άλφα and ἄλφα. On English Wiktionary, this seems to be thanks to enabling either the icu_normalizer or icu_folding (search for "text" with quotes in the Cirrus Settings Dump). On Greek Wiktionary, the Greek language analyzer is configured (search for "text" with quotes in the Cirrus Settings Dump). I'm not sure which part of the Greek analyzer is doing it, but the config is there. That should be enough for @Panghileri to configure their own wiki in Greek, or more generically with icu_normalizer/icu_folding.

As for @Xoristzatziki's UI dead key bug, I opened T177251. @debt / @Jdrewniak, is this something we can fix in the search box, or is it something higher up the hierarchy of UI elements that will ripple around, possibly causing problems? (I mean, I don't see any problems it would cause, but I always assume there at least might be some.) I didn't add a project tag to T177251 because I don't know which one applies.

debt claimed this task.

Closing this ticket, as it appears that sometime in the last 3 years (since the ticket was opened) this issue has been fixed; thanks for the summary @TJones.

@Xoristzatziki, a new ticket has been opened for the 'dead key' issue that you found, T177251, which will hopefully be fixed in the near future.