
Wrong ordering of search results
Closed, Resolved, Public

Description

Author: incarus6

Description:
Wikidata's ordering of search results is strange.
If you search for "Wine" in the English version (!), you get the Wine software as the first result, then some people called "Wine", and only as the 20th result the wiki article of the same name.
The same happens with other search terms.

That makes the search difficult.


Version: unspecified
Severity: major
URL: https://en.wikidata.org/w/index.php?search=Canis+lupus&title=Special%3ASearch

Details

Reference
bz43238


Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 12:54 AM
bzimport set Reference to bz43238.
bzimport added a subscriber: Unknown Object (MLST).

incarus6 wrote:

A good example is "mead":

  1st result: disambiguation page
  2nd to 31st results: people and things containing that word
  32nd result: wanted result (the actual "[...]wiki/Mead" article)

In my eyes the search bar is unusable.

Well, there is a JS solution. Directly beside the "normal" search field is a small triangle (in the Vector skin). If you click on it, another search field appears. Its results depend on the interface language: if "German" is set, entering "Wine" finds the item linked to the article "de:Wine"; if "English" is set, it finds the item linked to "en:Wine". This JS is enabled by default. There is also a special page named "Special:ItemByTitle" that you can use.

Regards, eikes

Much of the problem comes from the fact that Wikidata lacks the text needed to compute the relevance of an entry; we need something else. Without it, which entries turn up first and last will be somewhat random.

One way around this is to search in Wikipedia and use the hits and ranking from there as hints for an internal search similar to "ItemByTitle". That way we get working relevance ranking by borrowing the values from Wikipedia. The remaining problem is how to solve this for languages with very few articles.

We could do it the other way around and check relevance in Wikipedia for items found by searching for labels. That would work for all items that have any sitelinks, as we can use the highest-ranking article anyhow. Something like a prefixed term search returning a list of items, then getting the relevance for the Wikipedia articles, then sorting so the highest-ranking articles come first. Mjæ..
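
A minimal sketch of the three steps described above (prefix term search on labels, relevance lookup for the linked Wikipedia articles, sort by the best article). Everything here is an assumption for illustration: the item IDs, the labels, the sitelinks and the scores are made-up in-memory data, and the function names are hypothetical, not part of Wikibase or any search API.

```python
from typing import Dict, List

# Hypothetical stand-ins for Wikidata labels, sitelinks and Wikipedia
# relevance scores. None of these IDs or numbers are real.
LABELS: Dict[str, str] = {
    "Q2": "Wine (software)",
    "Q3": "Winehouse",
    "Q1": "Wine",
}
SITELINKS: Dict[str, List[str]] = {
    "Q1": ["en:Wine", "de:Wein"],
    "Q2": ["en:Wine (software)"],
    "Q3": [],  # an item without sitelinks has no borrowed relevance
}
WIKIPEDIA_SCORE: Dict[str, float] = {
    "en:Wine": 0.9,
    "de:Wein": 0.8,
    "en:Wine (software)": 0.4,
}


def prefix_term_search(prefix: str) -> List[str]:
    """Step 1: return the IDs of items whose label starts with the prefix."""
    p = prefix.casefold()
    return [qid for qid, label in LABELS.items() if label.casefold().startswith(p)]


def best_article_score(qid: str) -> float:
    """Step 2: use the highest-ranking linked article; 0.0 if there is none."""
    scores = (WIKIPEDIA_SCORE.get(title, 0.0) for title in SITELINKS.get(qid, []))
    return max(scores, default=0.0)


def search(prefix: str) -> List[str]:
    """Step 3: sort the label hits by their borrowed relevance, best first."""
    return sorted(prefix_term_search(prefix), key=best_article_score, reverse=True)
```

Items without any sitelink fall to the end of the list, which matches the remark that the approach only works for items that have sitelinks.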

To just throw up the ItemByTitle unmodified is in my opinion not a very good solution.

This should probably be done as a modification of the existing opensearch module as it will operate much faster that way.

Still, at some point we should build our own result sets from searches, but then we need to figure out how to make the relevance ranking work. The reason is that we can't jump out and search in Wikipedia for the stored queries.

incarus6 wrote:

As the first part, we could order the search results by their similarity to the search term. A disambiguation page, if available, can come second, right after the most similar result, as long as it is as similar as the first search result.

The second part can be search results that contain the search term in the title as it is, with no letters before or after it.

The third part can be search results where the search term is contained in the result title in any way.

The fourth part would be results where only the result page contains the search term, but the title of the result doesn't.
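
The four parts above can be sketched as a tiered sort key. This is only an illustration under assumptions: "most similar" is simplified to a case-insensitive exact title match, and the `Result` class, the `tier` and `rank` functions, and the sample data in the usage note are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    title: str
    text: str = ""
    is_disambiguation: bool = False


def tier(result: Result, term: str) -> int:
    """Map a result to one of the four parts (0 is best); 4 means no match."""
    title = result.title.casefold()
    t = term.casefold()
    if title == t:
        return 0  # part 1: title equals the search term
    if re.search(r"(?<!\w)" + re.escape(t) + r"(?!\w)", title):
        return 1  # part 2: term appears as a whole word in the title
    if t in title:
        return 2  # part 3: term appears anywhere in the title
    if t in result.text.casefold():
        return 3  # part 4: term appears only in the page text
    return 4


def rank(results: List[Result], term: str) -> List[Result]:
    """Sort by tier; within a tier, disambiguation pages come after others."""
    return sorted(results, key=lambda r: (tier(r, term), r.is_disambiguation))
```

For example, ranking `Meadow`, a disambiguation page titled `Mead`, `Margaret Mead`, a page `Honey` whose text mentions mead, and the article `Mead` for the term "mead" puts the article first and the disambiguation page second.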

Note that Wikipedia's ranking mechanism is, strictly speaking, not relevance ranking, but that is another discussion.

incarus6 wrote:

My point was to order the results by their similarity to the search term.
The most similar result should always be the wiki article of the same name, and the rest as mentioned in comment 4.

(In reply to comment #6)

> My point was to order the results by their similarity to the search term.
> The most similar result should always be the wiki article of the same name, and the
> rest as mentioned in comment 4.

This is definitely a valid bug; see e.g. the search for "canis lupus", where the correct result is in 18th/19th place.
http://lists.wikimedia.org/pipermail/wikispecies-l/2013-January/000076.html

It may also be useful to order results by the number of pages that link to them. For instance, a few minutes ago I searched for "company" (meaning "business organization") and these were the first results:

  1. Bad Company (13 links)
  2. Ford Motor Company (200 links)
  3. Hyundai Motor Company (64 links)

If we simply order by similarity we might get:

  1. company (disambig) (0 links)
  2. Company (novel) (0 links)
  3. Company (magazine) (0 links)

But most people would prefer this:

  1. company (business) (10000+ links)
  2. company (military) (2 links)
  3. Company (novel) (2 links)

Even without using similarity it would still be an improvement, i.e.:

  1. company (business) (10000+ links)
  2. Ford Motor Company (200 links)
  3. Hyundai Motor Company (64 links)
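
The two orderings above can be reproduced with a short sketch. The labels and link counts are the ones quoted in this comment; `10000` is a placeholder for "10000+", and treating a label as "similar" when it equals the search term after dropping a trailing parenthetical is an assumption made only for this example.

```python
import re
from typing import List, Tuple

# (label, number of pages linking to the item), as quoted in the comment.
# 10000 stands in for "10000+"; it is a placeholder, not a measured value.
CANDIDATES: List[Tuple[str, int]] = [
    ("Bad Company", 13),
    ("Ford Motor Company", 200),
    ("Hyundai Motor Company", 64),
    ("company (disambig)", 0),
    ("company (military)", 2),
    ("Company (novel)", 2),
    ("Company (magazine)", 0),
    ("company (business)", 10000),
]


def base_label(label: str) -> str:
    """Drop a trailing parenthetical such as ' (novel)' and normalise case."""
    return re.sub(r"\s*\([^)]*\)$", "", label).casefold()


def by_links(candidates: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Order by link count only, most linked first."""
    return sorted(candidates, key=lambda c: -c[1])


def by_similarity_then_links(
    candidates: List[Tuple[str, int]], term: str
) -> List[Tuple[str, int]]:
    """Labels matching the term come first; link count breaks ties."""
    t = term.casefold()
    return sorted(candidates, key=lambda c: (base_label(c[0]) != t, -c[1]))
```

With this data, `by_similarity_then_links(CANDIDATES, "company")[:3]` gives the "most people would prefer" list, and `by_links(CANDIDATES)[:3]` gives the links-only list.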

Change 73405 had a related patch set uploaded by Denny Vrandecic:
(bug 43238) Add very simple weighting for entity search (DO NOT MERGE)

https://gerrit.wikimedia.org/r/73405

Change 73405 merged by jenkins-bot:
(bug 43238) Add very simple weighting for entity search

https://gerrit.wikimedia.org/r/73405

A simple weighting and ranking is now merged, based on sitelinks. This should roll out to Wikidata soon, and then we can see whether it improves the current situation. In the long term, it is still the goal to replace it with something Lucene-based.

Verified during the Wikidata demo time on July 17th.

I don't know whether it's related to this bug (or whether it has already been reported), but it seems I can't search for any pages on Wikidata with special characters. See for example https://www.wikidata.org/w/index.php?search=+Ji%C5%99%C3%AD+Polnick%C3%BD&title=Special%3ASearch although the page does exist: https://www.wikidata.org/wiki/Q1428346
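
To help narrow this down, here is a small sketch that decodes the search URL from the comment above using only the Python standard library. It shows that the query string is valid percent-encoded UTF-8 and that the search term begins with an encoded space (`+`). Whether that leading space, or the special characters, has anything to do with the failed search is not established here.

```python
from urllib.parse import parse_qs, quote_plus, urlsplit

# The search URL exactly as given in the comment.
url = (
    "https://www.wikidata.org/w/index.php"
    "?search=+Ji%C5%99%C3%AD+Polnick%C3%BD&title=Special%3ASearch"
)

# parse_qs decodes "+" to a space and percent-escapes as UTF-8.
query = parse_qs(urlsplit(url).query)
search_term = query["search"][0]
page_title = query["title"][0]

# Re-encoding the trimmed term gives the same escapes without the leading "+".
reencoded = quote_plus(search_term.strip())

print(repr(search_term))
print(page_title)
print(reencoded)
```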