CirrusSearch: near match doesn't prefer exact matches to unicode flattened ones
Closed, Resolved, Public

Description

You can reproduce this by searching Wiktionary for "son". It lands you on "són".


Version: unspecified
Severity: normal

Details

Reference
bz59841

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 2:30 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz59841.

I was thinking about this last night and wondering: would it be bad if we stopped near matches from doing ASCII folding? It already only does this for English, and dropping it would be the simplest way of fixing this from a technical perspective.

The downside is that "go" search would get a little more confusing: you might type into the prefix search box and see "són" as the top response because it is linked more frequently than "son".

Wiktionary has 8 pages that all "near match" son with the current analysis setup:
sơn
Son
són
son
sön
SON
søn
soñ

Even with my proposed near match change it still has three:
Son
son
SON
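
The two counts above can be reproduced with a rough sketch. This is not CirrusSearch's analysis chain: `ascii_fold` is a hypothetical stand-in that only approximates an ASCII folding filter (Unicode decomposition plus a tiny hand-written table for "ø", which has no decomposition), and "near match" is assumed here to mean "equal after lowercasing and optional folding".

```python
import unicodedata

# Letters with no Unicode decomposition that a folding filter would still map
# to ASCII. Only what this example needs is listed (an assumption, not a full table).
EXTRA_FOLDS = {"ø": "o", "Ø": "O"}

def ascii_fold(text: str) -> str:
    """Approximate ASCII folding: decompose, drop combining marks, map leftovers."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(EXTRA_FOLDS.get(ch, ch) for ch in stripped)

def near_match_key(title: str, fold: bool = True) -> str:
    """Key used to compare titles: lowercase, optionally ASCII-folded."""
    key = title.lower()
    return ascii_fold(key) if fold else key

def near_matches(query: str, titles: list[str], fold: bool = True) -> list[str]:
    """Return every title whose key equals the query's key."""
    target = near_match_key(query, fold)
    return [t for t in titles if near_match_key(t, fold) == target]

# The eight Wiktionary titles listed above.
TITLES = ["sơn", "Son", "són", "son", "sön", "SON", "søn", "soñ"]

print(len(near_matches("son", TITLES, fold=True)))   # 8
print(near_matches("son", TITLES, fold=False))       # ['Son', 'son', 'SON']
```

Dropping the folding step shrinks the candidate set but does not make it unique, which is the point of the comment above.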

So I'm pretty sure that is a bad idea. Another proposal: restore sorting by the number of incoming links. This would drop you on "son" as expected.
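
A minimal sketch of the incoming-links proposal, assuming the near matches have already been collected together with a link count for each. The function name `pick_by_incoming_links` and the counts below are made up for illustration; they are not real Wiktionary data.

```python
def pick_by_incoming_links(link_counts: dict[str, int]) -> str:
    """Return the near-match title with the most incoming links."""
    if not link_counts:
        raise ValueError("no near matches to choose from")
    return max(link_counts, key=link_counts.get)

# Hypothetical incoming-link counts for some of the near matches of "son".
hypothetical_counts = {"són": 40, "son": 250, "Son": 30, "SON": 5}

print(pick_by_incoming_links(hypothetical_counts))  # son
```

This only lands on "son" when "son" really is the most linked of the candidates, so it leans on link counts as a proxy for what the user meant.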

Another proposal: if there is more than a single "near match" then declare that there are none and drop the user to the search page. That will give them more options and never _force_ them to the wrong page. It may be less convenient. Also, it can be done with or without removing ASCII folding. Personally I'd prefer to leave the folding in place so prefix matching, which really should have folding, still looks sane.
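
The "more than one near match means none" rule could look roughly like this. `resolve_go` is a hypothetical name, and returning `None` is assumed to mean "show the full search results page instead of redirecting".

```python
from typing import Optional, Sequence

def resolve_go(near_matches: Sequence[str]) -> Optional[str]:
    """Redirect only when the near match is unambiguous.

    Returns the single matching title, or None when there are zero or
    several candidates, in which case the caller shows the search page.
    """
    if len(near_matches) == 1:
        return near_matches[0]
    return None

print(resolve_go(["son"]))                 # son
print(resolve_go(["Son", "son", "SON"]))   # None
```

The rule works the same whether or not folding is applied when the candidates are collected. Folding only changes how often the ambiguous branch is hit.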