When editing sitelinks, target pages are suggested in a misleading order
Closed, ResolvedPublic

Description

In [[Q5607]], (fr) Modem and (fr) MoDem are two different pages, but the editor doesn't see the difference and MoDem takes priority. This only happens on the preview page; if I reload, it is fine.


Version: unspecified
Severity: normal

Details

Reference
bz41635

Event Timeline

bzimport raised the priority of this task from to Low. Nov 22 2014, 1:02 AM
bzimport set Reference to bz41635.
bzimport added a subscriber: Unknown Object (MLST).

This appears to be caused by the ranking/sorting system that generates the suggestions. I'm not sure whether we can do anything about this in Wikibase; it may be a bug in MediaWiki's OpenSearch implementation.

Anyway, here is what happens:

Select "French" as the target language for the link, then type "mode" into the input box. The suggestions will be something like:

Mode
Modene
MoDem
Mode (habillement)
MOD
Mod
Modulation
Modernisme

So, no "Modem" there, only "MoDem". But if you type in "modem", you get:

MoDem
Modem
Modem ADSM
....

The sorting seems erratic to me. Can we just fetch the top 100, sort them alphabetically (ignoring case), and then show the top 10?
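
The proposal above could be sketched roughly as follows. This is an illustration only, not Wikibase code: the function name and both limits are hypothetical, and the input list simply stands in for whatever the search backend returns in relevance order.

```python
def alphabetical_top(suggestions, fetch_limit=100, show_limit=10):
    """Take the best `fetch_limit` hits, sort them case-insensitively,
    and return the first `show_limit` titles.

    `suggestions` is assumed to already be ordered by relevance.
    """
    candidates = suggestions[:fetch_limit]
    # Sort ignoring case; the raw title is a tie-breaker so that
    # "MoDem" and "Modem" always come out in the same order.
    ordered = sorted(candidates, key=lambda title: (title.casefold(), title))
    return ordered[:show_limit]


# The suggestions reported for the input "mode", in the order shown above.
relevance_order = [
    "Mode", "Modene", "MoDem", "Mode (habillement)",
    "MOD", "Mod", "Modulation", "Modernisme",
]
print(alphabetical_top(relevance_order))
```

Note that anything ranked below `fetch_limit` never reaches the sorted list, so a title can still be missing from the place where a user would expect it alphabetically.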

The odd sorting is because the suggestions are sorted by relevance, if I remember correctly. Fetching the top 100 by relevance, then sorting alphabetically and showing a truncated list really does not give any meaningful list at all.

The bug comes from something that tries to turn the selected entry into a case-agnostic selection, and then searches through the list for this entry. The user's selection from the list should be retained with its original upper-/lowercase.
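
A minimal sketch of the behaviour described above, under the assumption that the widget looks the selected title up in the suggestion list again. Both function names are hypothetical and this is not the actual Wikibase code; it only shows why a case-agnostic lookup picks the wrong page when two titles differ only in case.

```python
def resolve_case_agnostic(selected, suggestions):
    """Hypothetical reconstruction of the faulty lookup: the first title
    that matches when case is ignored wins."""
    wanted = selected.casefold()
    for title in suggestions:
        if title.casefold() == wanted:
            return title
    return None


def resolve_exact(selected, suggestions):
    """Keep the user's selection exactly as it was picked from the list."""
    return selected if selected in suggestions else None


# The suggestions reported for the input "modem".
suggestions = ["MoDem", "Modem", "Modem ADSM"]

print(resolve_case_agnostic("Modem", suggestions))  # picks "MoDem"
print(resolve_exact("Modem", suggestions))          # keeps "Modem"
```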

Sorting the result of a prefix search by relevance seems silly to me, but I guess we can't do much about how MWSearch returns that. You are right that just sorting the "best" 100 results doesn't really solve the problem - e.g. you may not see what you are looking for at the right place in the list, even if it exists, because it was not in the top 100. That would be misleading. However, I think it may still be better than what we have now.

Or can we get the search result directly in alphabetical order? That would be nice.

You may set this up as a user preference, but do not turn it on by default. While not completely wrong, it is extremely confusing for the end user. The only thing that works for short lists is a scoring mechanism, which is often a variation of relevance ranking. If you can present the _complete_ list within some subdomain, you can sort alphabetically if you add some visual cue for the scoring. This is often done on time series like newspapers, where you want to search within some timeframe.

The two scoring functions I know of that work on this kind of problem are these. One sorts on full terms: "sort all found terms on prefixed matches on probability or inverse document frequency or a similar function", possibly with some weighting on shorter terms to make absolute matches go first. The other sorts on boundary effects: "sort all found terms on the probability that syllables start within the right side of the boundary (aka within the prefix) and continue into the found term", possibly with some simplification using Markov chains.
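
The first form could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function name is hypothetical, a plain per-title weight stands in for "probability or inverse document frequency or a similar function", the weights are made-up numbers, and dividing by the number of extra characters is just one possible way to favour shorter terms.

```python
def rank_prefix_matches(prefix, weights, limit=10):
    """Rank titles that start with `prefix` (ignoring case) by a score
    that favours high weights and short titles."""
    folded = prefix.casefold()
    matches = [t for t in weights if t.casefold().startswith(folded)]

    def score(title):
        # Extra characters beyond the prefix reduce the score, so a
        # title equal to the prefix keeps its full weight.
        extra = len(title) - len(prefix)
        return weights[title] / (1 + extra)

    # Highest score first; the title itself breaks ties deterministically.
    return sorted(matches, key=lambda t: (-score(t), t))[:limit]


# Made-up weights, purely for illustration.
weights = {"Mode": 90, "MoDem": 80, "Modem": 50, "Modem ADSM": 40}
print(rank_prefix_matches("modem", weights))
```

With a score like this, two titles that differ only in case are separated purely by their weight, which would be consistent with one of them appearing first without any alphabetical reason.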

The first form is the most common and, as I recall from some comments, also the form used in the live search on Wikipedia, aka the existing Lucene-search.