
Suggest results which differ in diacritics (missing ascii normalized lookup)
Closed, Resolved (Public)

Description

Steps to reproduce:

  1. visit http://hu.wikipedia.org
  2. type "kurtvirag" in the search box

Expected: [[hu:Kürtvirág]] is suggested

Actual: no suggestions

This is particularly problematic because people often don't have access to the right type of keyboard, and have only very inconvenient ways of entering characters with diacritics. On mobile, entering diacritics is inconvenient even when the keyboard is set up correctly.

The old behavior was to drop all diacritics for indexing, which was not great, but better than the current one.

The ideal behavior would be to index both the exact and the stripped title, and give more weight to the first; so search suggestions with different diacritics would not crowd out better matches but would still appear if there is no perfect match.
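A minimal sketch of that ranking idea, in Python. It assumes a plain linear scan over a small list of titles rather than a real search index, and the helper names `fold` and `suggest` are made up for illustration. Stripping is done by Unicode decomposition, so letters without a decomposition (such as "ø" or "ß") are left unchanged here.

```python
import unicodedata


def fold(text: str) -> str:
    """Lowercase and strip combining marks, e.g. 'Kürtvirág' -> 'kurtvirag'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def suggest(titles, query, limit=10):
    """Prefix suggestions: exact-diacritic matches rank before folded-only ones."""
    q_exact = query.lower()
    q_folded = fold(query)
    scored = []
    for title in titles:
        if title.lower().startswith(q_exact):
            scored.append((0, title))  # matches the title as written
        elif fold(title).startswith(q_folded):
            scored.append((1, title))  # matches only after stripping diacritics
    scored.sort()
    return [title for _, title in scored[:limit]]


titles = ["Kürtvirág", "Kurtág György", "Äänekoski"]
print(suggest(titles, "kurtvirag"))
print(suggest(titles, "kurt"))
```

With these sample titles, "kurtvirag" still finds Kürtvirág, while "kurt" lists the title that matches as typed ahead of the one that matches only after stripping.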


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=62322

Details

Reference
bz67521

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:34 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz67521.
bzimport added a subscriber: Unknown Object (MLST).

Can the new search be configured per site? There is a discussion about this problem on fiwiki as well, and one of us noticed that the search behaves differently on dewiki:

  1. Go to http://de.wikipedia.org/
  2. Type "aanekos" in the search box.

Result: The search suggests "Äänekoski".

(In reply to Tisza Gergő from comment #0)

Steps to reproduce:

  1. visit http://hu.wikipedia.org
  2. type "kurtvirag" in the search box

Expected: [[hu:Kürtvirág]] is suggested

Actual: no suggestions

The ideal behavior would be to index both the exact and the stripped title,
and give more weight to the first; so search suggestions with different
diacritics would not crowd out better matches but would still appear if
there is no perfect match.

Two solutions:
Better suggestions: Add an ASCII-normalized lookup for suggestions. It looks like German already does this, so I'd just have to figure out how and use it in more places.

Weighted search: Everywhere we search, look both with and without the diacritics; the match with diacritics gets more boost.

Hmmm - so we already perform some weighted search: exact matches are worth more than normalized (non-conjugated, non-declined, etc.) matches. I'm worried that adding another layer would be nasty from a performance perspective. The suggestions option might be faster, but I'm not really sure. I'll have to sleep on it.
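For illustration only, here is one shape the "better suggestions" option could take on an Elasticsearch backend, written as a Python dict. This is a sketch under assumptions, not the actual CirrusSearch configuration: the names `fold_keep_original` and `suggest_folded` are hypothetical, and the choice of a `keyword` tokenizer is just the simplest one for whole-title matching. The `asciifolding` token filter and its `preserve_original` option are standard Elasticsearch features.

```python
import json

# Hypothetical index settings; "fold_keep_original" and "suggest_folded"
# are made-up names, not taken from CirrusSearch.
settings = {
    "analysis": {
        "filter": {
            "fold_keep_original": {
                "type": "asciifolding",
                # Emit both the original token and its ASCII-folded form.
                "preserve_original": True,
            }
        },
        "analyzer": {
            "suggest_folded": {
                "type": "custom",
                "tokenizer": "keyword",  # treat the whole title as one token
                "filter": ["lowercase", "fold_keep_original"],
            }
        },
    }
}

print(json.dumps({"settings": settings}, indent=2))
```

Keeping the original token alongside the folded one covers the lookup, but it does not by itself rank exact matches higher; that would still need a separate boost at query time.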

(In reply to Mikko Silvonen from comment #1)

Can the new search be configured per site?

It certainly can. If the language is in this list then it already is:
arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.

Both Finnish and Hungarian are in the list, so they are getting whatever the Lucene project thinks are good defaults. I'm happy to customize it from there.

In the meantime, I'm setting this to "Normal" priority. It won't be at the top of my list, but it's certainly on it. Feel free to poke the priority if the lack of this makes search horrible for you.

*** Bug 68239 has been marked as a duplicate of this bug. ***

Change 168071 had a related patch set uploaded by Manybubbles:
Prefix search always squashes accents

https://gerrit.wikimedia.org/r/168071

Change 168071 merged by jenkins-bot:
Prefix search always squashes accents

https://gerrit.wikimedia.org/r/168071

Created attachment 16903
Bug 67521, rowiki, testcase 1 (Loïc vs Loic)

At least on rowiki, search does not suggest words that contain diacritics when the query is typed with plain letters instead.

Attached:

99.png (373×392 px, 107 KB)

Created attachment 16904
Search results and suggestions for ”Pedro Proença” (rowiki)

Attached:

88.png (531×749 px, 289 KB)

It is now working fine on rowiki too. Ignore my previous 2 posts.

Sadly, we've now reverted CirrusSearch due to an outage in the underlying system. We'll re-enable it once we figure out what is up, so this will break again. After that, we'll push this change out and rebuild the index, and it should be fixed again.