Suggestion searching
Closed, Resolved, Public

Description

Author: nickpj

Description:
This is an enhancement bug to track the suggestion searching feature request, as
discussed recently on wikitech-l.

To summarize:

  • The behaviour of the current MediaWiki search box would change, so that
    instead of a straight text field, it would be more akin to an electronic
    index that tries to help you find what you are looking for as you type,
    without submitting the whole form.

  • The user in their preferences would specify (opt-in) that they want to use
    suggestion searching. Since suggestion searching uses AJAX, it would
    probably be best to default this to being off so that backwards
    compatibility is retained for older non-JavaScript browsers, or clients
    with slow & expensive & high-latency connections (e.g. mobile phone
    devices).

  • Suggestion searching would show a list of the possible page names matching
    what the user has typed thus far (limited to say the top 10 matches,
    possibly ranked by popularity).

  • As the user types more, the suggestions would become more specific.

  • The user is able to arrow up or arrow down through the list of suggestions
    to select / highlight their choice.

  • Pressing enter should probably open the topmost choice in the list of
    suggestions, or the highlighted suggestion (if the first item is not the
    highlighted one).

  • Potentially the suggestion has autocomplete functionality, whereby the next
    few letters are filled in and highlighted where it seems probable that this
    is what the user is going to type.
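
The matching behaviour described in the list above can be sketched roughly as follows. This is only an illustration in Python, not MediaWiki code: the `suggest` function, the `(title, popularity)` pairs and the sample data are all hypothetical, and a linear scan like this would be far too slow for a real wiki.

```python
from typing import List, Tuple


def suggest(prefix: str, pages: List[Tuple[str, int]], limit: int = 10) -> List[str]:
    """Return up to `limit` titles starting with `prefix`, most popular first."""
    if not prefix:
        return []  # nothing typed yet, so nothing to suggest
    needle = prefix.lower()
    matches = [(title, pop) for title, pop in pages
               if title.lower().startswith(needle)]
    # Descending popularity, then alphabetical so that ties have a stable order.
    matches.sort(key=lambda m: (-m[1], m[0]))
    return [title for title, _ in matches[:limit]]


# Hypothetical sample data: (title, popularity score).
PAGES = [("Paris", 900), ("Parrot", 300), ("Pascal", 500), ("Peru", 400)]
```

Calling `suggest("pa", PAGES)` and then `suggest("par", PAGES)` shows the list becoming more specific as more is typed.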

If it helps to visualize what's being described, there are some screenshots to
give an idea of what it could potentially look like here:
http://nickj.org/images/8/80/03-autocompletion-kicks-in.png and here:
http://nickj.org/images/b/b3/04-found-desired-article.png

One potential implementation for this idea would be the server-side program from
Julien Lemoine. A web interface to this program can be accessed at
http://en.suggest.speedblue.org/ (e.g. the current MediaWiki Search box would
behave something like the search box on this site, except integrated into
MediaWiki), and GPL source code can also be downloaded.

However, there are a few things which would be good to see happen to the above
Suggestion Searching to help integrate it into MediaWiki:

  • Currently the index generation uses the pages-articles.xml +
    all-titles-in-ns0 dump files downloaded from download.wikipedia.org. It
    would probably be better to also be able to generate the indexes directly
    from the database, instead of requiring a dump stage first. This would
    probably be faster, and allow sites which don't currently generate dump
    files to also use this.

  • Currently some articles can't be reached using the search suggest because
    they're "masked" by more popular articles. An example for the English
    Wikipedia would be the "AM" disambiguation page being masked by the pages
    that start with "American" (i.e. you cannot get to the "AM" search result).
    Potentially the exact matches could be included in the search result
    (although _maybe_ they want to be towards the end of the list if they're
    less popular articles, since they're probably not what the user is looking
    for).

  • Currently the index does not include articles outside namespace 0. It would
    probably be best to include other namespaces (e.g. Template:, MediaWiki:,
    Talk:, etc), so that the suggestion searching box would have
    "functional-parity" with the current search box (e.g. should be able to
    type "Template:Cleanup" into the search box, and have it appear in the list
    of possible results).

  • Potential case-sensitive ordering of the results. For example, if the user
    searches for "Adfa" on the English Wikipedia, it lists three results,
    including "ADFA" (listed first) and "Adfa" (the Welsh town, listed later).
    Should "Adfa" come first, because it is an exact match for what the user
    typed?
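
To make the masking and case-ordering points concrete, here is one possible ordering rule, sketched in Python. It is an assumption for discussion, not what the speedblue engine does: exact matches are promoted ahead of everything else (the alternative of placing them near the end is not shown), and the function name, tiers and sample popularity scores are all made up.

```python
from typing import List, Tuple


def suggest_with_exact(query: str, pages: List[Tuple[str, int]],
                       limit: int = 10) -> List[str]:
    """Prefix suggestions in which exact matches can never be masked."""
    needle = query.lower()
    matches = [(t, p) for t, p in pages if t.lower().startswith(needle)]

    def sort_key(item: Tuple[str, int]) -> Tuple[int, int, str]:
        title, pop = item
        if title == query:
            tier = 0  # exact match, same case as typed
        elif title.lower() == needle:
            tier = 1  # exact match, different case
        else:
            tier = 2  # ordinary prefix match
        return (tier, -pop, title)  # within a tier, most popular first

    matches.sort(key=sort_key)
    return [title for title, _ in matches[:limit]]


# Hypothetical popularity scores for the two examples from the list above.
AM_PAGES = [("American", 1000), ("America", 900), ("AM", 20)]
ADFA_PAGES = [("ADFA", 50), ("Adfa", 10), ("Adfa (disambiguation)", 5)]
```

With `limit=2`, a purely popularity-ranked list over `AM_PAGES` would drop "AM", whereas this rule keeps it at the top.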


Version: 1.8.x
Severity: enhancement

Details

Reference
bz7288

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 9:26 PM
bzimport added a project: MediaWiki-Search.
bzimport set Reference to bz7288.
bzimport added a subscriber: Unknown Object (MLST).

ayg wrote:

(In reply to comment #0)
> • The user in their preferences would specify (opt-in) that they want to use
> suggestion searching. Since suggestion searching uses AJAX, it would probably
> be best to default this to being off so that backwards compatibility is
> retained for older non-JavaScript browsers, or clients with slow & expensive &
> high-latency connections (e.g. mobile phone devices).

Surely that's a complete waste of this feature? Non-JavaScript browsers just
wouldn't do anything, of course, and users of high-latency connections can just
ignore the results (since they'll be unhelpful).

There is no need to have it disabled for non-JavaScript browsers. The
JavaScript should be able to handle it so that it falls back gracefully.
Slow connections are a problem though, so an option to disable it should be
available anywhere.

This feature has already been implemented in part by the opensearch action of
the API (http://en.wikipedia.org/w/api.php). For example,
http://en.wikipedia.org/w/api.php?action=opensearch&search=Te will return the
first 10 titles beginning with "Te".
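
As a rough sketch, a client could build that request URL and read the titles out of the reply like this. The URL and its two parameters come from the comment above; the helper names are made up, and the response layout `[query, [titles...]]` is an assumption based on the OpenSearch Suggestions format rather than something stated in this thread. The sample body is invented and no network request is made.

```python
import json
from typing import List
from urllib.parse import urlencode

API_URL = "http://en.wikipedia.org/w/api.php"


def opensearch_url(prefix: str) -> str:
    """Build the opensearch request URL for what the user has typed so far."""
    return API_URL + "?" + urlencode({"action": "opensearch", "search": prefix})


def parse_opensearch(body: str) -> List[str]:
    """Extract suggested titles, assuming a JSON body of [query, [titles...]]."""
    data = json.loads(body)
    return list(data[1])


# Invented response body, for illustration only.
SAMPLE_BODY = '["Te", ["Tea", "Texas", "Tennis"]]'
```

`urlencode` also takes care of escaping spaces and non-ASCII characters in the typed prefix.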

The page titles returned are currently not very relevant. See
http://meta.wikimedia.org/wiki/Proposed_Database_Schema_Changes for a suggested
improvement to search result relevancy.

Moreover, this feature is already being used for the Firefox 2.0 search box.
To use it, visit any MediaWiki site, click on the search engine selector
button, and select "add wiki". Autocomplete will work, except that there is a
500 ms timeout set by default in
firefox_install_dir/components/nsSearchSuggestions.js. Search for the
"_suggestionTimeout: 500" line and set a much higher timeout if you are on a
slow connection.

ayg wrote:

This has been brought up in Mozilla's bug for adding Wikipedia to the default search engine list for Firefox: https://bugzilla.mozilla.org/show_bug.cgi?id=380785. CCing to rainman in case he has any thoughts about implementing this at some point using the relevancy algorithms already existing in Lucene, or if these would be inappropriate. This is maybe better for a built-in approach as suggested at http://www.mediawiki.org/wiki/Proposed_Database_Schema_Changes#table_for_auto-suggest_page_title_search using backlinks or some other metric (in which case redirects need to be counted as well).

rainman wrote:

I think we should be trying to integrate http://suggest.speedblue.org/
Building a prefix tree with articles ordered by rank is probably the
most efficient way to go.
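
For illustration, a prefix tree with rank-ordered results might look like the Python sketch below. It is not the speedblue implementation: the class names are made up, each node simply caches its best `limit` titles, and a higher rank number is assumed to mean a more important article.

```python
from typing import Dict, List, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Best (rank, title) pairs among all titles passing through this node.
        self.top: List[Tuple[int, str]] = []


class SuggestTrie:
    def __init__(self, limit: int = 10) -> None:
        self.root = TrieNode()
        self.limit = limit

    def insert(self, title: str, rank: int) -> None:
        node = self.root
        for ch in title.lower():
            node = node.children.setdefault(ch, TrieNode())
            node.top.append((rank, title))
            # Keep only the highest-ranked entries at every prefix.
            node.top.sort(key=lambda e: (-e[0], e[1]))
            del node.top[self.limit:]

    def lookup(self, prefix: str) -> List[str]:
        node = self.root
        for ch in prefix.lower():
            child = node.children.get(ch)
            if child is None:
                return []  # no title starts with this prefix
            node = child
        return [title for _, title in node.top]
```

A lookup then costs one step per typed character, independent of how many titles are indexed, at the price of extra memory per node.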

Things to do:

  1. Make the suggest engine rebuild its prefix tree from the Lucene index,
     and not from a database dump. Frequent rebuilds from db dumps have proven
     not to be very reliable, and the needed data is already in the index,
     i.e. article titles and their ranks. This way, we only need to worry
     about keeping the main index up-to-date.

  2. Figure out an update scheme that will minimize downtime.
     Rsync+restart could be enough for starters, but it would be nice
     if there were, say, an extra thread that would check the contents
     of some rsync path for updates.

I'll invite Julien (the author of suggest engine) to give some comments
on this as well.

speedblue wrote:

This sounds good.

If you have an efficient way to extract titles and ranks from the index, this
is the best way to get an efficient completion structure.

To minimize the downtime, the best solution is to keep the old prefix
trie loaded until the new one is built and loaded (on a new TCP port, for
example) and to redirect queries to the new version when it is
available. You will have two tries in memory for a short time, but
no downtime.
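
The same idea can be sketched inside a single process, as below. Note that this swaps an in-memory reference rather than starting a second server on a new TCP port as suggested above, and that a plain dict stands in for the trie; the `SuggestService` class and its methods are hypothetical.

```python
import threading
from typing import Callable, Dict, List

Index = Dict[str, List[str]]  # stand-in for the real prefix trie


class SuggestService:
    def __init__(self, index: Index) -> None:
        self._index = index
        self._rebuild_lock = threading.Lock()

    def lookup(self, prefix: str) -> List[str]:
        # Take a reference to whichever index is current; a rebuild in
        # progress never blocks or disturbs this read.
        index = self._index
        return index.get(prefix, [])

    def rebuild(self, build: Callable[[], Index]) -> None:
        with self._rebuild_lock:      # only one rebuild at a time
            new_index = build()       # the old index keeps serving meanwhile
            self._index = new_index   # swap: later lookups see the new index
```

Both indexes are in memory while `build()` runs; the old one is released once no lookup still holds a reference to it.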

I can provide you with some support and improvements for the prefix trie I
implemented; I have some ideas to reduce the memory footprint and improve
performance.

Best Regards.
Julien

rainman wrote:

We could use CLucene to access the lucene index, but I don't know if
they maintain full compatibility with the latest java lucene file-structure
changes. Or, we could use something like this:
http://svn.wikimedia.org/viewvc/mediawiki/branches/lucene-search-2.1/src/org/wikimedia/lsearch/util/ExtractTitles.java?view=markup

This would open the latest consistent version of the lucene index and
print out a title per line: <rank> <namespace> <title> [<redirects>]
This could then be piped into a trie rebuild tool. Extracting the
complete listing for en.wiki takes less than 2 minutes.
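
A trie rebuild tool reading that output might start with a parser along these lines. The exact field encoding is not given in this thread, so the sketch assumes single-space separators, integer rank and namespace, titles without spaces (underscores, as in database keys), and an optional trailing redirects field that is kept as raw text. `TitleRecord` and `parse_titles` are made-up names.

```python
from typing import Iterable, Iterator, NamedTuple


class TitleRecord(NamedTuple):
    rank: int
    namespace: int
    title: str
    redirects: str  # raw trailing field; empty string when absent


def parse_titles(lines: Iterable[str]) -> Iterator[TitleRecord]:
    """Parse '<rank> <namespace> <title> [<redirects>]' lines, skipping bad ones."""
    for line in lines:
        parts = line.rstrip("\n").split(" ", 3)
        if len(parts) < 3:
            continue  # blank or truncated line
        try:
            rank, namespace = int(parts[0]), int(parts[1])
        except ValueError:
            continue  # rank or namespace is not a number
        redirects = parts[3] if len(parts) > 3 else ""
        yield TitleRecord(rank, namespace, parts[2], redirects)
```

In a pipeline the tool would iterate over `parse_titles(sys.stdin)` and insert each record into the trie.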

As things stand now, redirects won't be included. They are in the
index but are not stored in raw form, and thus are not very easy to extract.
But if we were to go ahead and integrate this, I believe I could easily
add them without enlarging the index much, and without hurting performance.

So, if this gets worked out, I would be happy to set up a test
on WMF servers, with some help from the root-access people of course :)

*** Bug 12412 has been marked as a duplicate of this bug. ***

Resolving as FIXED -- $wgEnableMWSuggest is available for 1.13.