
Search in yi: should ignore diacritics and identify ligatures
Closed, Resolved, Public

Description

Author: wiki.pedia

Description:
This relates specifically to all projects using Yiddish (yi).

Yiddish has a number of ligatures. When a search term includes such a ligature it should be able to identify the corresponding term spelt fully without using the ligature.

Likewise searches should ignore the presence of diacritics which some writers use.

Currently Wikimedia projects fail to make this identification. As a result it is necessary to set up numerous synonyms for pages to catch alternative (but essentially identical) spellings of the same word. This applies to almost every word in the language.

By way of comparison, Google search makes the correct identifications. [English Wikimedia projects successfully convert upper-case letters in the middle of words.]

I can supply a list of Unicode codes to be identified.


Version: unspecified
Severity: enhancement

Details

Reference
bz18764

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:36 PM
bzimport set Reference to bz18764.

rainman wrote:

We currently use Unicode decomposition to get rid of all diacritics, but from what you're saying I gather that it doesn't do the job for Yiddish. A table of Unicode characters mapping one form to the other for Yiddish would be very useful.

wiki.pedia wrote:

A simple test by entering a search term with diacritics shows that they are not stripped.

The following should be ignored
HEBREW POINT PATAH 05B7
HEBREW POINT QAMATS 05B8
HEBREW POINT DAGESH OR MAPIQ 05BC
HEBREW POINT RAFE 05BF

The following should be identified with their decomposed forms
HEBREW LIGATURE YIDDISH DOUBLE VAV 05F0 = 05D5 05D5
HEBREW LIGATURE YIDDISH VAV YOD 05F1 = 05D5 05D9
HEBREW LIGATURE YIDDISH DOUBLE YOD 05F2 = 05D9 05D9
HEBREW LIGATURE YIDDISH YOD YOD PATAH FB1F = 05D9 05D9
HEBREW LETTER YOD WITH HIRIQ FB1D = 05D9

These are the most common ones
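
The list above can be written down as a small folding table. The following is only a Python sketch of the requested behaviour, not the code used by Wikimedia search; the names `IGNORED_POINTS`, `EXPANSIONS` and `fold_yiddish` are made up for illustration, and the table covers only the code points listed in this comment.

```python
# Points the reporter asks search to ignore (taken from the list above).
IGNORED_POINTS = {0x05B7, 0x05B8, 0x05BC, 0x05BF}

# Ligatures and precomposed letters, mapped to the plain letters listed above.
EXPANSIONS = {
    0x05F0: "\u05D5\u05D5",  # YIDDISH DOUBLE VAV -> VAV VAV
    0x05F1: "\u05D5\u05D9",  # YIDDISH VAV YOD -> VAV YOD
    0x05F2: "\u05D9\u05D9",  # YIDDISH DOUBLE YOD -> YOD YOD
    0xFB1F: "\u05D9\u05D9",  # YIDDISH YOD YOD PATAH -> YOD YOD
    0xFB1D: "\u05D9",        # YOD WITH HIRIQ -> YOD
}

# str.translate deletes code points mapped to None and expands the rest.
FOLD_TABLE = {**{cp: None for cp in IGNORED_POINTS}, **EXPANSIONS}


def fold_yiddish(text: str) -> str:
    """Return text with the listed points removed and ligatures expanded."""
    return text.translate(FOLD_TABLE)


if __name__ == "__main__":
    # PE + DAGESH, ALEF, RESH, YOD, ZAYIN loses the dagesh.
    print(fold_yiddish("\u05E4\u05BC\u05D0\u05E8\u05D9\u05D6"))
    # DOUBLE VAV ligature, ALEF, NUN, TET becomes VAV VAV, ALEF, NUN, TET.
    print(fold_yiddish("\u05F0\u05D0\u05E0\u05D8"))
```

If the same folding is applied both to indexed text and to the search term, the two spellings compare equal. Other precomposed forms (for example U+FB44, PE WITH DAGESH) are not in the list above and are left untouched by this sketch.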

Assigning to Robert for followup.

rainman wrote:

The decomposed forms you suggest are not part of the unicode standard.

Can you give us some sample search terms with and without diacritics to have something to test with.

wiki.pedia wrote:

For example

פּאריז --> פאריז
װאנט --> וואנט

rainman wrote:

OK, I've added the exceptions you requested on yi projects. Since these are not part of the Unicode standard and I don't know Yiddish, if you want further exceptions you will need to tell us explicitly which ones.

rainman wrote:

It would be even better if you could provide patches like r59327, so these don't need to be retyped.

wiki.pedia wrote:

Thanks for this. When will it become effective?

Will try to provide patches in future.

I am not clear what is not part of the Unicode standard. Is U+05F0 (in װאנט) not a Unicode point? U+05BC?

rainman wrote:

It is deployed on yiwiki/wiktionary/wikisource ... Unicode has a way of decomposing characters into simpler characters, i.e. to remove accents, but your custom decomposition rules are not part of it.
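
To illustrate the distinction, here is a short Python sketch using the standard `unicodedata` module. It is not the search backend's code; `strip_marks` is a hypothetical helper that applies standard decomposition (NFKD) and then drops combining marks. The dagesh is removed this way, but U+05F0 has no decomposition mapping in the Unicode character database, so the ligature survives unchanged.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose with NFKD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    with_dagesh = "\u05E4\u05BC\u05D0\u05E8\u05D9\u05D6"  # PE + DAGESH ...
    with_ligature = "\u05F0\u05D0\u05E0\u05D8"            # DOUBLE VAV ...

    # The combining dagesh (U+05BC) is dropped.
    print(strip_marks(with_dagesh) == "\u05E4\u05D0\u05E8\u05D9\u05D6")
    # The ligature is returned as-is: standard decomposition does not split it.
    print(strip_marks(with_ligature) == with_ligature)
    # Empty string: the character database lists no decomposition for U+05F0.
    print(repr(unicodedata.decomposition("\u05F0")))
```

This is why a mapping such as 05F0 = 05D5 05D5 has to be supplied as an explicit exception rather than coming from normalization.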

wiki.pedia wrote:

I am puzzled by this. I typed
נאװעמבער
into the search box in yiwiki. It does not find the article named
נאוועמבער

However, if I type

NoVember

in the search box in the English Wikipedia, it does find the article named

November

I am not clear what this has to do with Unicode decomposition.

rainman wrote:

I get identical search results for both. But it looks like you want "Go" to go directly to the article ... In that case we would need to modify the TitleKey extension in non-trivial ways, and if you want linking to work, then also MediaWiki internals, again in non-trivial ways.

wiki.pedia wrote:

You are right. I am sorry that I did not express myself sufficiently clearly in the original message.

I hadn't realized that this is so complicated to implement. The strange thing is that it does work the other way round! If I type a word with diacritics, the Go box will produce a dropdown list with the corresponding terms which do contain diacritics.

Thank you for your assistance (and for your patience).