
Special character "å" in the search menu
Closed, Resolved (Public)

Description

Author: v85.wikipedia

Description:
In the new search menu, the letter "å" is interpreted as its base letter "a". This means that when I start typing a word including the letter "å", it is interpreted as though I had written an "a".

Example: I want to look for the Swedish town "Åmål", and I start typing "Åm": The first hits I get are words beginning with "Am", such as "America", "Amsterdam", "American revolutionary war". In Norwegian, Swedish and Danish, "å" is considered to be a separate letter, and not another version of the letter "a", so it doesn't make sense to treat it as an "a". This problem does not occur for the Dano-Norwegian letters "ø" and "æ" (they are not confused with any other letters), but it does occur for the Swedish "ö" and "ä" (which are interpreted as "o" and "a", respectively).
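
The pattern described above (å, ä and ö collapse to their base letters, while ø and æ do not) is what one would see if the folding step strips combining marks after Unicode decomposition. That mechanism is an assumption for illustration only; nothing in this report confirms how the search backend actually folds characters. A minimal Python sketch, where `strip_marks` is a hypothetical helper:

```python
import unicodedata

def strip_marks(text):
    """Decompose to NFD, then drop combining marks (category Mn).

    Assumption: this stands in for whatever folding the backend performs.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# å, ä and ö decompose into a base letter plus a combining mark, so they fold.
# ø and æ have no canonical decomposition, so this approach leaves them alone.
for word in ["Åmål", "ö", "ä", "ø", "æ"]:
    print(word, "->", strip_marks(word))
```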


Version: unspecified
Severity: normal

Details

Reference
bz24414

Event Timeline

bzimport raised the priority of this task to Low. (Nov 21 2014, 11:10 PM)
bzimport added a project: MediaWiki-Search.
bzimport set Reference to bz24414.

Don't know whether it is the search interface or the backend, but this applies to Finnish as well and probably to more languages too. This kind of normalization can't be done unconditionally for every language. Marking as a bug instead of enhancement.

This happens in the backend, regardless of which skin is being used. Moving from Vector skin to Search.

v85.wikipedia wrote:

Let us take the example word "Åmål": In Monobook, the search results would show both "Amal" and "Åmål"; in Vector, "Amal" and "Åmål" are both rendered as "Amal" in the drop-down menu.

(In reply to comment #3)

Let us take the example word "Åmål": In Monobook, the search results would show
both "Amal" and "Åmål"; in Vector, "Amal" and "Åmål" are both rendered as "Amal"
in the drop-down menu.

That's a duplicate of bug 24237

Entering Åmål on sv.wikipedia.org I get results starting both with A and Å.

I don't consider this a bug but rather a feature, e.g. when you don't have access to a localized keyboard - normalization.

Unless I have misunderstood, I propose to close this as WONTFIX.

(In reply to comment #5)

I don't consider this a bug but rather a feature, e.g. when you don't have
access to a localized keyboard - normalization.

I don't think that's the purpose of this normalization, and it would be the wrong fix for that problem (the correct fix being the UniversalLanguageSelector).

We are in the process of enabling ICU on more wikis. Currently we use ASCII folding, which does not allow us to add such language-specific conditions.
If we decide to enable ICU folding on, for instance, sv.wikipedia.org, searching for åman would only suggest pages starting with åman, ignoring results such as Amanda.
Side note: I'm tempted to merge this task with T132637. These two tasks are highly correlated even if it's not clear at first glance.

If we decide to enable ICU folding on, for instance, sv.wikipedia.org, searching for åman would only suggest pages starting with åman, ignoring results such as Amanda.

That's exactly what is requested here. å is a separate letter in the alphabet and it does not make sense to show things such as Amanda. There is a slightly stronger case for the opposite, showing Åland when typing Aland, for non-native speakers or for users who are unable to type å.

This is only because the completion suggester ranks higher results where the diacritics match.
If you type a longer string you'll start to see suggestions without a diacritic https://sv.wikipedia.org/w/api.php?action=opensearch&format=json&formatversion=2&search=%C3%85man&namespace=0&limit=10&suggest=true
Sadly, as Nikerabbit said, if the use case of displaying Åland when typing Aland is more important, we might have a problem with ICU folding for both completion and fulltext...
Applying a different technique depending on the presence of a diacritic in the query requires differentiating the search analyzer from the index analyzer, i.e. analyzing the search query differently from the content.
But doing so requires indexing the same token in both forms:

  • Åland would be analyzed at index time as aland and åland
  • but it'd be analyzed only as åland at query time
  • searching for aland would still find Åland

(This is what we do for fulltext search today when asciifolding is activated: all Latin characters with a diacritic are folded to their ASCII representation at index time only.)
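
The asymmetric analysis described above can be sketched as follows. This is only an illustration of the idea, not the actual search code: `fold`, `index_forms`, `query_form` and `complete` are hypothetical names, the folding is a simple strip-combining-marks stand-in, and the title list is made up.

```python
import unicodedata

def fold(text):
    """Stand-in for diacritic folding: strip combining marks after NFD."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def index_forms(title):
    """Index time: keep both the original and the folded form."""
    lowered = title.lower()
    return {lowered, fold(lowered)}

def query_form(query):
    """Query time: lowercase only, diacritics are left as typed."""
    return query.lower()

def complete(query, titles):
    """Prefix completion against every indexed form of each title."""
    q = query_form(query)
    return [t for t in titles if any(f.startswith(q) for f in index_forms(t))]

TITLES = ["Åland", "Alanda", "Amanda", "Åman"]  # made-up examples

print(complete("aland", TITLES))  # unaccented query still reaches Åland
print(complete("Åman", TITLES))   # accented query no longer reaches Amanda
```

The design choice is that folding happens on one side only: a query typed without diacritics matches the folded copy in the index, while a query typed with diacritics can only match titles that actually contain them.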

Note that asciifolding is not activated by default on full text searches; it's on by default everywhere only for completion searches.
For instance, for Swedish, no diacritics are folded unless the Swedish analyzer applies some folding itself. This leads to very inconsistent behavior.
Today on Swedish, typing sydvast:

  • it suggests Sydväst on completion (because of asciifolding activated by default for completion)
  • Pressing enter will go to the Sydväst page (because the go feature uses a field where asciifolding is activated)
  • But running the search in SpecialSearch yields nearly no results, although Sydväst should really be the first result
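
A rough way to see why the same query behaves differently in each place is to model the two code paths as two analyzers, one that folds and one that does not. This is a simplified sketch under the assumption that the Swedish fulltext analyzer applies no folding at all; `completion_analyzer` and `fulltext_analyzer` are hypothetical names.

```python
import unicodedata

def fold(text):
    """Stand-in for asciifolding: strip combining marks after NFD."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def completion_analyzer(text):
    """Completion path: lowercase and fold (folding on by default)."""
    return fold(text.lower())

def fulltext_analyzer(text):
    """Fulltext path: lowercase only (assumed: no folding for Swedish)."""
    return text.lower()

title, query = "Sydväst", "sydvast"

# Completion: both sides are folded, so the prefix matches.
completion_hit = completion_analyzer(title).startswith(completion_analyzer(query))

# Fulltext: neither side is folded, so the tokens differ.
fulltext_hit = fulltext_analyzer(query) in fulltext_analyzer(title).split()

print("completion:", completion_hit, "fulltext:", fulltext_hit)
```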

I think I'll file a separate bug for this.


The other bug is this: T155822 :)

@EBernhardson wrote:

Does this already work as expected? The example "åman" is the second result, and all of them start with "åm"

@dcausse wrote:

This is only because the completion suggester ranks higher results where the diacritics match.

Since this came up again yesterday, I wanted to say that I agree with @EBernhardson. I think the original problem is taken care of. Matches with appropriate diacritics rank above those that don't have them, so all the good stuff is at the top of the list, where it should be. That's more important than making sure the not-so-good results are not at the bottom of the list. Don't let the perfect be the enemy of the good.
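
The ranking behavior described here (folded matches are still returned, but matches with the right diacritics come first) can be illustrated with a small sketch. It is not the completion suggester's real scoring; `rank` is a hypothetical function and the titles are made up.

```python
import unicodedata

def fold(text):
    """Stand-in for diacritic folding: strip combining marks after NFD."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def rank(query, titles):
    """Return all folded prefix matches, diacritic-exact matches first."""
    q = query.lower()
    q_folded = fold(q)
    matches = [t for t in titles if fold(t.lower()).startswith(q_folded)]
    # sorted() is stable, so the original order is kept within each group.
    return sorted(matches, key=lambda t: not t.lower().startswith(q))

TITLES = ["Amanda", "Åman", "Amandus", "Åmansboken"]  # made-up examples
print(rank("Åman", TITLES))
```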

@Nikerabbit wrote:

There is a slightly stronger case for the opposite, showing Åland when typing Aland, for non-native speakers or for users who are unable to type å.

Actively encouraging this sort of asymmetry is possible, but probably prohibitively difficult to implement, and not worth it, given the apparent weak need.

On the other hand, when there aren't many competing articles to match, it currently works. It won't work for Aland, because that's already a redirect to Åland and lots of other titles start with Aland-. But the only result for Aboleden is Åboleden. This may stop working if proper ICU folding with exceptions is implemented.

Don't know whether it is the search interface or the backend, but this applies to Finnish as well and probably to more languages too. This kind of normalization can't be done unconditionally for every language. Marking as a bug instead of enhancement.

Nikerabbit is right—we have to set up folding exceptions for every language (with some, like English, presumably allowing pretty much everything to be folded). My criteria for a first draft of what to avoid folding is based on what letters are considered part of the alphabet of the language. Swedish, for example, has å, ä, and ö right there. It's only a first draft, though. Russian, for example, has ё, but it is apparently optional in practice, and е is fine, though ё does get used in citation forms (e.g., encyclopedia article titles and dictionary entries). English Wikipedia often has good documentation on this kind of thing, and we can often check with fluent speakers of the language, too.

I don't consider this a bug but rather a feature, e.g. when you don't have access to a localized keyboard - normalization.

As a quick aside, I don't think this argument works. The Turkmen ÄWERTY keyboard doesn't have C, Q, V, or X. The Azeri QÜERTY keyboard doesn't have W. And of course a typical Russian keyboard doesn't have any Latin characters on it. Merging A and Å in Swedish is like merging V and U or I and J in English—they look kinda similar and they are historically related, but they are now completely different letters. Missing letters on your keyboard for another language is just how it goes.

Anyway, I believe that we have done this correctly for Swedish in T160562: ICU folding is enabled except for å, ä, and ö. (And I think we got Russian ё and е right in T124592).

Given that this task started primarily about å in the languages of the Nordic countries, I suggest we set up folding as below:

  • Danish: fold except for æ, ø and å (Looks like acute accents are optionally used to distinguish homophones or to show stress and should be folded.)
  • Norwegian: fold, except for æ, ø and å. (Looks like accented characters—é, è, ê, ó, ò, â, and ô; and sometimes ü, á, and à in loanwords—are optionally used for homophones and ignored for sorting and should be folded.)
  • Finnish: fold except for å, ä, and ö; it's unclear what to do with w, š and ž:
    • w is officially a variant of v, but a quick test on fiwiki looks like they are distinct in practice.
    • š and ž are certainly used on fiwiki, but there are lots of redirects to article titles with and without the caron (ˇ), or, for š examples I found, expanded to sh. Seems inconsistent.
      • A few quick searches indicate that maybe folding š and ž to s and z would be more helpful than not.
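
The per-language proposal above amounts to "fold everything except a short keep-list". A minimal sketch of that idea, assuming a simple strip-combining-marks fold as a stand-in for ICU folding; the `KEEP` table, its language codes, and `fold_except` are hypothetical names chosen for illustration. A real folding filter may also map letters such as ø and æ, which is why they appear in the keep-lists even though this simplified fold would leave them alone anyway.

```python
import unicodedata

# Letters to leave unfolded per language (from the proposal above; Swedish
# as already done in T160562). Language codes are illustrative assumptions.
KEEP = {
    "sv": "åäö",
    "da": "æøå",
    "no": "æøå",
    "fi": "åäö",
}

def fold_except(text, lang):
    """Lowercase, then strip diacritics from every letter not in KEEP[lang]."""
    keep = set(KEEP.get(lang, ""))
    out = []
    for ch in text.lower():
        if ch in keep:
            out.append(ch)  # protected letter: keep as-is
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if unicodedata.category(c) != "Mn"))
    return "".join(out)

print(fold_except("Åmål", "sv"))    # å kept
print(fold_except("Åmål", "en"))    # no exceptions configured: folded
print(fold_except("Tšekki", "fi"))  # š folded to s, as tentatively suggested
print(fold_except("café", "da"))    # acute accent folded
```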

I could see doing this as one big task, or breaking it up into one for each language. I'm okay with either, but vote for one task (this one!) as they will probably end up traveling together through the process and would eventually all get re-indexed together (and thus the changes made live together), unless we separate them out.

This should be relatively straightforward to implement once we pick out the letters to leave unfolded. Danish, Norwegian, and Finnish all use analyzers provided by Elastic that can be unpacked and customized. I'd run a quick comparative test to make sure there weren't any noticeable regressions and we'd be ready to deploy the changes and get ready to re-index. (Re-indexing can be subject to delays from other projects and tasks that are going on at the same time.)
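
For reference, the Elasticsearch ICU analysis plugin provides an `icu_folding` token filter whose `unicode_set_filter` parameter restricts which characters get folded. A customized Danish analyzer might carry settings along these lines. This is a hedged sketch written as a Python dict, not the configuration actually deployed: the names `danish_folding` and `danish_folded` are made up, and the stop-word and stemming filters of the stock Danish analyzer are left out for brevity.

```python
import json

# Hypothetical index settings: fold everything except æ, ø and å.
settings = {
    "analysis": {
        "filter": {
            "danish_folding": {
                "type": "icu_folding",
                # UnicodeSet syntax: fold any character NOT in this set.
                "unicode_set_filter": "[^æøåÆØÅ]",
            }
        },
        "analyzer": {
            "danish_folded": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "danish_folding"],
            }
        },
    }
}

print(json.dumps(settings, ensure_ascii=False, indent=2))
```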

  • Finnish: fold except for å, ä, and ö; it's unclear what to do with w, š and ž:
    • w is officially a variant of v, but a quick test on fiwiki looks like they are distinct in practice.
    • š and ž are certainly used on fiwiki, but there are lots of redirects to article titles with and without the caron (ˇ), or, for š examples I found, expanded to sh. Seems inconsistent.
      • A few quick searches indicate that maybe folding š and ž to s and z would be more helpful than not.

v and w are separate letters in the alphabet, so I don't think folding them makes sense. š and ž have not officially been part of the alphabet, but there are proposals to add them. They are commonly replaced with sh and zh, sometimes even without the h, as you noticed. I suspect most people in Finland don't know how to type these. There is a dead key, but it is not printed on keyboards.

TJones claimed this task.

This is fixed for Swedish Wikipedia: searching for Åm brings up the exact match Åm as the first suggestion and Åmål as the second suggestion. All other suggestions start with Åm and not Am. Probably fixed by T160562, but definitely working now.