
improve sort order in entity selector
Closed, Resolved (Public)

Description

It would be great if the entity selector ordered items and properties more intelligently, placing the ones most likely to be relevant to the current statement at the top.


Version: master
Severity: critical
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=46555

Details

Reference
bz45351

Event Timeline

bzimport raised the priority of this task from to High. (Nov 22 2014, 1:22 AM)
bzimport set Reference to bz45351.
bzimport added a subscriber: Unknown Object (MLST).

As we now have an increasing number of links, the easiest and fastest way might be to count the number of incoming wikilinks to each potential item, and sort them by most incoming links first. Paris (France) will have more incoming links than Paris (the god) or Paris (Texas), which will have more than other obscure uses.
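
A minimal sketch of the ordering suggested above, assuming each candidate already carries a precomputed count of incoming wikilinks. The `Candidate` type, the item identifiers, and the counts are hypothetical values chosen for illustration; they are not taken from Wikidata.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    item_id: str          # hypothetical identifier, not a real item ID
    label: str
    incoming_links: int   # assumed to be precomputed elsewhere


def rank_by_incoming_links(candidates):
    """Return candidates ordered by most incoming wikilinks first."""
    return sorted(candidates, key=lambda c: c.incoming_links, reverse=True)


# Illustrative counts only.
candidates = [
    Candidate("paris-texas", "Paris (Texas)", 300),
    Candidate("paris-france", "Paris (France)", 50000),
    Candidate("paris-god", "Paris (the god)", 1200),
]
ranked = rank_by_incoming_links(candidates)
for c in ranked:
    print(c.label, c.incoming_links)
```

Because `sorted` is stable, candidates with equal counts keep their original relative order, so a secondary criterion could be applied beforehand if needed.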

Addendum: If no hits are available in the current language, try other languages.
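
The fallback could look roughly like this. It is only a sketch: `labels_by_language` is a hypothetical in-memory mapping standing in for whatever index the search backend actually uses, and prefix matching is an assumption about how hits are found.

```python
def search_with_fallback(term, labels_by_language, preferred, fallbacks):
    """Search the preferred language first, then each fallback in order.

    labels_by_language maps a language code to {item_id: label}.
    Returns (language_that_matched, [item_ids]) or (None, []) if nothing matches.
    """
    needle = term.lower()
    for lang in [preferred, *fallbacks]:
        labels = labels_by_language.get(lang, {})
        hits = [item_id for item_id, label in labels.items()
                if label.lower().startswith(needle)]
        if hits:
            return lang, hits
    return None, []


# Hypothetical data: the term only has a label in German.
labels_by_language = {
    "en": {"item-1": "Paris"},
    "de": {"item-1": "Paris", "item-2": "Köln"},
}
result = search_with_fallback("Köln", labels_by_language, "en", ["de", "fr"])
print(result)
```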

The problem is the same as product advice to customers, where customers are the properties and products are the items used. It will also trigger the same scalability issues.

That is, simple counting is not enough to get good guesses about which items should be sorted first. For example, it would mean all municipalities of Brazil (or France) are listed before municipalities in Norway, which is bad if you are trying to find municipalities in Norway.

We're not going to make this sort order perfect. But taking the number of site-links should give a good-enough sort order most of the time. This is what counts.

Number of sitelinks doesn't really make sense in this case. A low number actually is an indication that a high count does not make sense, because the values you are looking for aren't common. That is, it is a feature with negative correlation with your wanted entries. It's a classic automatic data classifier problem.

Addendum: Allow language prefixes, e.g. "de:Berlin", to show the item that has the language link for "Berlin" on de.wikipedia
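
One way the prefix could be split off the query, as a sketch only. The function name `parse_query` and the set of known language codes are assumptions; the real selector may handle this differently. Checking the prefix against known codes keeps ordinary titles that contain a colon from being misread.

```python
def parse_query(query, known_languages):
    """Split "de:Berlin" into ("de", "Berlin").

    Returns (None, query) when there is no recognised language prefix.
    """
    prefix, sep, rest = query.partition(":")
    if sep and rest and prefix in known_languages:
        return prefix, rest
    return None, query


known = {"de", "en", "fr"}          # hypothetical list of language codes
print(parse_query("de:Berlin", known))
print(parse_query("Berlin", known))
print(parse_query("Star Trek: Voyager", known))
```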

(In reply to comment #5)

Number of sitelinks doesn't really make sense in this case. A low number actually is an indication that a high count does not make sense, because the values you are looking for aren't common. That is, it is a feature with negative correlation with your wanted entries. It's a classic automatic data classifier problem.

Not sure I understand. I want Paris, France, to show up on top for the search "Paris", as I am most likely adding a person's place of birth or death, or the location of an object.

Addendum: For each item, show the "is a(n)" field if no description is set.
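
A small sketch of that display rule, assuming the selector has the item's description and the label of its "is a(n)" value at hand. Both parameter names are hypothetical.

```python
def display_description(description, instance_of_label):
    """Prefer the item's own description; otherwise fall back to its "is a(n)" label."""
    if description:
        return description
    if instance_of_label:
        return instance_of_label
    return ""  # nothing to show


print(display_description("capital of France", "city"))
print(display_description(None, "city"))
```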

nilesh wrote:

(In reply to comment #3)

The problem is the same as product advice to customers, where customers are the properties and products are the items used. It will also trigger the same scalability issues.

This is interesting. But I think you meant item=>product and property=>customer, since we are recommending properties to items (then based upon the recommendation scores we can sort the list), much like recommending "products" to "customers".

i) What kind of scalability issues and why?
ii) Do you think this would be a better method (accuracy-wise) for sorting than using incoming wikilinks as a metric?

nilesh wrote:

(In reply to comment #9)

But I think you meant item=>product and property=>customer, since we are recommending properties to items (then based upon the recommendation scores we can sort the list), much like recommending "products" to "customers".

Sorry - my mistake. Please ignore the above section of my comment.

Entity search is now weighted by number of sitelinks.