
"Did you mean . . ." search feature to automatically spellcheck all search terms
Closed, Declined · Public

Description

Author: NathanLarson32767

Description:
It would be helpful for Wikipedia to detect typos similar to Google. For instance, if I
type in "Nathaz Larson" on Google, it completes the search but also shows a hyperlink
saying "Did you mean Nathan Larson?"

This would save users time re-entering mistyped search strings.


Version: unspecified
Severity: enhancement

Details

Reference
bz974

Event Timeline

bzimport raised the priority of this task to Lowest. · Nov 21 2014, 7:04 PM
bzimport added a project: MediaWiki-Search.
bzimport set Reference to bz974.
bzimport added a subscriber: Unknown Object (MLST).

avarab wrote:

*** Bug 2236 has been marked as a duplicate of this bug. ***

*** Bug 3418 has been marked as a duplicate of this bug. ***

muthus_post wrote:

I think we should integrate Aspell with MediaWiki to suggest spellings in case
the user's typed string is *NOT* an exact match, or has more than one match.
It makes a lot of sense to have something like Aspell generate suggested
strings and show them to the user when those conditions are met.
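
A minimal sketch of what this could look like using PHP's pspell extension (a wrapper around Aspell); the $titleExists callback is a hypothetical stand-in for the wiki's exact-title lookup, and real code would also handle multi-word queries by checking each word separately:

```php
<?php
// Sketch only: suggest corrections via pspell (PHP's Aspell binding)
// when the search term has no exact title match.
$dict = pspell_new('en');

function suggestFor(string $term, $dict, callable $titleExists): array
{
    if ($titleExists($term)) {
        return [];                      // exact match, nothing to do
    }
    if (!pspell_check($dict, $term)) {
        return pspell_suggest($dict, $term);
    }
    return [];                          // spelled correctly, just no article
}
```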

The Lucene search includes a fuzzy-match feature which does something similar;
it's currently disabled for speed reasons.

muthus_post wrote:

If 'speed reasons' means the time to complete a search, then I'd guess it's
not worth disabling; you should enable this feature. If, on the other hand,
you're worried about server usage, etc., then I think it's a real issue.

I think it's rational to assume that someone will be more satisfied if a
search query returns some useful results and takes some time, rather than
turning up nothing quickly. I'm saying that if we can get some relevant
results, it's worth trying for.

time to complete search ~= server usage

avarab wrote:

*** Bug 3638 has been marked as a duplicate of this bug. ***

gangleri wrote:

Is this the same as?

bug 2486: Automatic wiki page name suggestion, similar to "Google Suggest"

Tuukka wrote:

Could we consider an algorithm that takes fewer server resources? For example,
keep a mapping from the "lowercaseletteronly" version of each title to any
matching "Lower-case Letter Onlys".

There could also be such a mapping for What links here, Wiktionary, and Commons
links so the "no such article" page could tell if the links actually lead to
something.
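
A rough sketch of that mapping, assuming ASCII titles and an in-memory array; in practice the map would live in a database table built at index time:

```php
<?php
// Sketch of the normalized-title mapping: a cheap exact lookup on the
// normalized input finds near-miss matches without any edit-distance work.
function normalizeTitle(string $title): string
{
    return strtolower(preg_replace('/[^A-Za-z]/', '', $title));
}

function buildIndex(array $titles): array
{
    $index = [];
    foreach ($titles as $t) {
        $index[normalizeTitle($t)][] = $t;   // many titles can share a key
    }
    return $index;
}

// A failed search then costs a single hash lookup:
function lookup(string $query, array $index): array
{
    return $index[normalizeTitle($query)] ?? [];
}
```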

gangleri wrote:

*** Bug 5533 has been marked as a duplicate of this bug. ***

vossman77 wrote:

*** Bug 3139 has been marked as a duplicate of this bug. ***

petdog wrote:

Can't you just use the levenshtein function (which is built into PHP >= 4.0.1)
against the list of wiki page titles, and choose the [two] minimum-distance
results as the suggested keyword(s)?
I hope that wouldn't be too much work for the servers, because it would surely
be a very good thing for the users :D

Sorry for my bad English, I hope you understand.
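
A minimal sketch of this brute-force idea; as the replies below point out, a linear scan over every title only makes sense for a small wiki:

```php
<?php
// Score every title by Levenshtein distance from the query and keep the
// two closest as suggestions.
function closestTitles(string $query, array $titles, int $keep = 2): array
{
    $scored = [];
    foreach ($titles as $title) {
        $scored[$title] = levenshtein(strtolower($query), strtolower($title));
    }
    asort($scored);                          // smallest distance first
    return array_slice(array_keys($scored), 0, $keep);
}
```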


harald wrote:

*** This bug has been marked as a duplicate of bug 2486 ***

vossman77 wrote:

*** Bug 2486 has been marked as a duplicate of this bug. ***

@petdog: run the levenshtein function against all titles in a wiki for every
search? The English Wikipedia has more than a million entries, and several
searches per second.... That would never work. It may be nice for a very small
wiki though - shouldn't be too hard to write a search plugin that does it.

kop wrote:

Why would it be any harder to put levenshtein function values into a database
and search them than it is to put regular words into a database and search them?
Is the load on the servers so high that you can't levenshtein-ize the user's
input to search against the data?

@Karl: that's not how it works. The levenshtein function takes two parameters:
the search term and the title. You can't pre-compute distances for all possible
search terms.

kop wrote:

My bad. Ah well. But you could still use soundex, although you'd have to find
equivalent functions for languages other than English.
Humm....

While you can't pre-compute distances for all possible search terms, you might
be able to pre-compute distances for search terms that are entered by users and
not found. I'm sure Google is doing something like this, using the user's input
to guide the search process. Alternatively, you could just watch what the users
do: if a search finds nothing, but a follow-up search succeeds and the
Levenshtein distance between the two is small, save the found page as a
suggestion. Or whatever. There's lots of information supplied to Wikipedia by
the users' searches. Research project? Maybe. Unique opportunity, surely.
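
A hedged sketch of the soundex variant suggested above: titles keyed by their Soundex code so phonetically similar titles come back in one lookup, with Levenshtein used only to rank the few candidates. PHP's built-in soundex() is English-oriented, as noted:

```php
<?php
// Build once at index time: map Soundex code -> matching titles.
function buildSoundexIndex(array $titles): array
{
    $index = [];
    foreach ($titles as $t) {
        $index[soundex($t)][] = $t;
    }
    return $index;
}

// At search time: one lookup, then rank the small candidate set.
function phoneticSuggestions(string $query, array $index): array
{
    $candidates = $index[soundex($query)] ?? [];
    usort($candidates, fn ($a, $b) =>
        levenshtein($query, $a) <=> levenshtein($query, $b));
    return $candidates;
}
```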

*** Bug 6778 has been marked as a duplicate of this bug. ***

Eneas wrote:

What about creating a list of unsuccessful search terms and running new entries
each week/month against the levenshtein function? The results are saved in a table.

When you perform a search with a mistyped term that matches an entry in the list
of unsuccessful searches, you get the pre-computed result of the algorithm.
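
A sketch of that batch scheme, using PDO for database access; the search_suggestions table and its columns are invented for illustration, not an existing MediaWiki schema:

```php
<?php
// Search time: one indexed lookup against the precomputed table.
function suggestionFor(PDO $pdo, string $term): ?string
{
    $stmt = $pdo->prepare(
        'SELECT suggestion FROM search_suggestions WHERE term = ?');
    $stmt->execute([$term]);
    return $stmt->fetchColumn() ?: null;
}

// Weekly/monthly job: run Levenshtein offline for each logged miss.
function precomputeSuggestions(PDO $pdo, array $failedTerms, array $titles): void
{
    $insert = $pdo->prepare(
        'REPLACE INTO search_suggestions (term, suggestion) VALUES (?, ?)');
    foreach ($failedTerms as $term) {
        $best = null;
        $bestDist = PHP_INT_MAX;
        foreach ($titles as $title) {
            $d = levenshtein($term, $title);
            if ($d < $bestDist) {
                $bestDist = $d;
                $best = $title;
            }
        }
        if ($best !== null) {
            $insert->execute([$term, $best]);
        }
    }
}
```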

*** Bug 8694 has been marked as a duplicate of this bug. ***

jasonspiro4 wrote:

(In reply to comment #20)

What about creating a list of unsuccessful search terms and running new entries
each week/month against the levenshtein function? The results are saved in a table.

When you perform a search with a mistyped term that matches an entry in the list
of unsuccessful searches, you get the pre-computed result of the algorithm.

Sounds like a fine workaround to me. Site admins: do you think it would work?
(In reply to comment #5)

[...] someone will be more satisfied if a search query returns some useful
results and takes some time, rather than turning up nothing quickly.

I agree. IMO, spell checking is so useful for so many users that even if it
slowed all searches down by twenty-five percent, it would still be worthwhile.
Should I perhaps take a poll of a few fellow Wikipedia readers at my community
college and see what they think?


I am taking the liberty of bumping this bug's priority to High.

Don't go reinventing the wheel. This is a generic query task that could apply to any MySQL database. Lucene is ideal, but I can see why it's disabled.

One relatively straightforward approach is to do the Levenshtein-distance thing, but first use a spatial query to narrow down the number of points you have to evaluate. Search terms (and titles) are associated with vectors specifying the frequency of each letter. We then perform a spatial query, using a spatial index, to find the nearest 50 or so points in this space, and run Levenshtein against each of them for the final ranking.

Unfortunately, the OpenGIS spatial support appears to handle only low numbers of dimensions. Nevertheless, this could be carried out by keeping a separate spatial index "off to the side", outside of MySQL, and updating it as titles are updated. Just an idea.
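
A toy sketch of this prefilter, with a brute-force nearest-neighbour scan standing in for the real spatial index:

```php
<?php
// Map each string to a 26-dimensional letter-frequency vector.
function letterVector(string $s): array
{
    $v = array_fill(0, 26, 0);
    foreach (str_split(strtolower($s)) as $ch) {
        $i = ord($ch) - ord('a');
        if ($i >= 0 && $i < 26) {
            $v[$i]++;
        }
    }
    return $v;
}

// Squared Euclidean distance between two frequency vectors.
function sqDist(array $a, array $b): int
{
    $sum = 0;
    for ($i = 0; $i < 26; $i++) {
        $sum += ($a[$i] - $b[$i]) ** 2;
    }
    return $sum;
}

// Keep the ~50 nearest titles in letter-frequency space, then run
// Levenshtein only on those survivors for the final ranking.
function suggest(string $query, array $titles, int $k = 50): array
{
    $qv = letterVector($query);
    usort($titles, fn ($a, $b) =>
        sqDist($qv, letterVector($a)) <=> sqDist($qv, letterVector($b)));
    $candidates = array_slice($titles, 0, $k);
    usort($candidates, fn ($a, $b) =>
        levenshtein($query, $a) <=> levenshtein($query, $b));
    return $candidates;
}
```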

lowzl wrote:

The developers may already be aware of this, but there is a Levenshtein distance user-defined function in C for MySQL: http://joshdrew.com/mysql_levenshtein_udf-1.0.tar.gz

Though as the previous poster points out, it would still be necessary to narrow down the number of entries to consider for a large database like Wikipedia.

ayg wrote:

Does Rainman's Lucene 2.0 extension address this?

rainman wrote:

Simetrical: Not yet. Probably in 2.1.

Derrick: One way of narrowing down the results is using ngram indexes. Using ngrams does a fair job of determining whether two strings are statistically similar, and it's well-suited for spelling errors because it preserves the linkage between adjacent letters (I'm not sure if this is the case with letter-frequency based spatial indexes as well?). So, you could query an ngram Lucene index, fetch the first 50 results, and then do some sophisticated analysis trying to figure out which word(s) suit the query best.
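
A small illustration of the ngram statistics involved, using character trigrams and Dice similarity; a real Lucene ngram index would retrieve candidates from an inverted index rather than scanning every title like this does:

```php
<?php
// Extract the set of character trigrams of a string.
function trigrams(string $s): array
{
    $s = '  ' . strtolower($s) . '  ';       // pad so edges form trigrams
    $grams = [];
    for ($i = 0; $i + 3 <= strlen($s); $i++) {
        $grams[substr($s, $i, 3)] = true;
    }
    return array_keys($grams);
}

// Dice coefficient: 2 * shared trigrams / total trigrams.
function diceSimilarity(string $a, string $b): float
{
    $ga = trigrams($a);
    $gb = trigrams($b);
    $shared = count(array_intersect($ga, $gb));
    return 2 * $shared / (count($ga) + count($gb));
}

// Rank titles by similarity to the query and keep the best 50.
function ngramCandidates(string $query, array $titles, int $k = 50): array
{
    usort($titles, fn ($a, $b) =>
        diceSimilarity($query, $b) <=> diceSimilarity($query, $a));
    return array_slice($titles, 0, $k);
}
```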

There are some caveats, however. One is that you want spell checking to be context-dependent, i.e. the spell checker should always look at the phrase, not at individual words. Indexing all phrases is probably not feasible; a possible solution might be to build the ngram index in iterations, adding phrases that could be confused with long words and such. Other caveats: you might want to favor spellings that correspond to article titles or to some interwiki link (in which case you might also want a translation), and you might want to do spell checking even when the search returns some results, and so on.

However, this is still theory. I suspect it makes a huge difference when you go from a small system with a few hundred MB of text to one with tens of GB, and you want the spell checker always to find the most relevant articles and words.

*** Bug 14680 has been marked as a duplicate of this bug. ***
*** Bug 15021 has been marked as a duplicate of this bug. ***

Note that the "did you mean" thingy is now active in the Lucene search backend used on Wikipedia. Yay!

However there's not yet support in the default MySQL backend.

rainman wrote:

Note that the votes are for enabling this on WMF wikis (which is done), not necessarily to implement it in MySQL.

In T2974#36573, @brion wrote:

Note that the "did you mean" thingy is now active in the Lucene search backend used on Wikipedia. Yay!

However there's not yet support in the default MySQL backend.

Is it actually possible to do such a thing with the MySQL backend? That is, is there some standard function/library/whatever to do it? Advanced search features are what Lucene, ElasticSearch etc. are for; it's not unreasonable to rule them out from the scope of the default MySQL search if we'd have to build them ourselves.

Advanced search features are what Lucene, ElasticSearch etc. are for; it's not unreasonable to rule them out from the scope of the default MySQL search if we'd have to build them ourselves.

Per Nemo, I'm inclined to decline this task as out of scope for MW core. I don't see anyone clamoring for this feature to be added to vanilla MW's rudimentary built-in search engine. Serious MW installations will install a proper search solution with did-you-mean support.

See my previous comment. MW's built-in search is intentionally basic. If you would like this functionality on your own wiki, you need to install one of the search extensions.