"did you mean" is not working
Closed, ResolvedPublic

Description

When I first started to use "CirrusSearch" I was very happy that the "Search Suggestions" were working out of the box. In recent versions (the last 2-3 weeks) there are no search suggestions anymore. Can I somehow turn them on again?

Thank you
Martin


Version: master
Severity: minor
OS: Linux
Platform: Other

Details

Reference
bz55786

Event Timeline

bzimport raised the priority of this task to Medium (Nov 22 2014, 2:39 AM).
bzimport added a project: CirrusSearch.
bzimport set Reference to bz55786.
bzimport added a subscriber: Unknown Object (MLST).

Where can I see this? On a private wiki instance, or on some Wikimedia website?
What are exact steps to reproduce this?

Yes, a private wiki. I have sent you a direct mail with the link.
I will now also try to reproduce it on one of the Wikimedia test sites.

Thank you!

If you are referring to the search box in the upper right corner: search proposals work for me on your wiki (when I enter "Salz", it proposes one page whose name starts with "Salz"). I am using Firefox 24 here.

Ok, sorry, by "Search Suggestions" I meant: searching for something that does NOT exist or is spelled incorrectly, like "waser" instead of "wasser". I am truly sorry that I did not make myself clear. Yes, "AutoComplete", as I would call it, works pretty well.

Works for me on production: https://en.wikisource.org/w/index.php?title=Special%3ASearch&profile=default&search=pots&fulltext=Search
And in development: http://solr-mw2.instance-proxy.wmflabs.org/w/index.php?search=noble+prize&title=Special%3ASearch

One thing that might have changed from the last time you checked: we only build suggestions from the titles and redirect titles. We used to build suggestions from titles and text. We felt that that produced too many false positives. Also, the search index required to do that took up a bunch of space.

I'm going to attach the query that I always use for debugging suggestions issues to this bug. If you could send it to Elasticsearch and attach the results I'll decipher them for you. So you aren't in suspense: it'll return a bunch of suggestions including the search phrase. Normally CirrusSearch configures Elasticsearch to only return suggestions that have a score of twice what the original search phrase had so you can use the results to figure out if the suggestion that you expected was even being generated and, if so, how it scores.
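The attached query is not reproduced in this thread. As a rough sketch only, a debugging request of that kind could be built along these lines, assuming Elasticsearch's phrase suggester. The helper name and the field names (title.suggest, redirect.title.suggest) are hypothetical and not taken from CirrusSearch. The confidence is set to 0 so that all top candidates come back, including the search phrase itself.

```python
import json


def build_suggest_debug_query(text, fields, size=5):
    """Build a search body that asks the phrase suggester for raw candidates.

    `fields` maps a suggestion name (as it should appear in the response)
    to the index field to build suggestions from. Both are assumptions here.
    """
    suggest = {"text": text}
    for name, field in fields.items():
        suggest[name] = {
            "phrase": {
                "field": field,
                "size": size,
                # 0.0 disables the cutoff, so the input phrase is returned too.
                "confidence": 0.0,
            }
        }
    # "size": 0 on the search itself: only the suggest section is of interest.
    return {"size": 0, "suggest": suggest}


# Hypothetical field names, for illustration only.
HYPOTHETICAL_FIELDS = {
    "title": "title.suggest",
    "redirect": "redirect.title.suggest",
}

body = build_suggest_debug_query("waser", HYPOTHETICAL_FIELDS)
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of a search request to the index or alias being debugged.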

So, options:

  1. I can make generating suggestions from text a configurable thing. Going from off to on would require a reindex.
  2. You can change the suggestion cutoff score and walk the false positive tuning line. The config value is $wgCirrusSearchPhraseSuggestConfidence - just make sure to keep it set to a number. You can change this as much as you like without breaking anything, but if you set it to less than 1 then I believe you'll end up getting your search query back as a suggestion all the time.
  3. There is some kind of problem in Elasticsearch, your setup (did you rebuild the index when you pulled), or gremlins.
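To make option 2 a bit more concrete, here is a minimal sketch of how a confidence cutoff of this kind behaves. It assumes a candidate is kept only if its score is strictly greater than the confidence multiplied by the score of the original phrase. The function name and the scores are made up for illustration. This is not the CirrusSearch or Elasticsearch code.

```python
def filter_suggestions(options, query, confidence=2.0):
    """Keep candidates that score above confidence * score of the query itself."""
    query_score = next(o["score"] for o in options if o["text"] == query)
    threshold = confidence * query_score
    return [o["text"] for o in options if o["score"] > threshold]


# Made-up scores, for illustration only.
options = [
    {"text": "waser", "score": 0.010},
    {"text": "wasser", "score": 0.030},
    {"text": "water", "score": 0.015},
]

print(filter_suggestions(options, "waser", confidence=2.0))
# ['wasser']
print(filter_suggestions(options, "waser", confidence=0.5))
# ['waser', 'wasser', 'water']
```

With a confidence below 1 the original phrase clears its own threshold, which matches the warning about getting the search query back as a suggestion.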

Suggestion test query

attachment suggest_test.json ignored as obsolete

I would go with Option 1, as I only tried to search for words in the "text", not the title. This would solve the problem for me at least! Thanks again for responding in such a fast manner... (as always).

Suggestion test query Result:

{

took: 14
timed_out: false
_shards: {
    total: 8
    successful: 8
    failed: 0
}
hits: {
    total: 0
    max_score: null
    hits: [ ]
}
suggest: {
    title: [
        {
            text: waser
            offset: 0
            length: 5
            options: [
                {
                    text: waser
                    highlighted: waser
                    score: 0.08398465
                }
            ]
        }
    ]
    redirect: [
        {
            text: waser
            offset: 0
            length: 5
            options: [
                {
                    text: waser
                    highlighted: waser
                    score: 0.27871963
                }
            ]
        }
    ]
}

}

Yup - no suggestions are coming up and you would have got your suggestions from the text. Let me see about getting that working again.

Suggestion test query

attachment suggest_test.json ignored as obsolete

I found a problem with the query I posted earlier so I posted a second copy - this second one also builds the suggestions against the text. See if that provides the suggestions you need.

Suggestion test query v2

attachment suggest_test.json ignored as obsolete

Suggestion test query

Sorry about all the updates, just found the obsoletes field on the uploader and wanted to get rid of the duplicates.

attachment suggest_test.json ignored as obsolete

Result:

{

took: 30
timed_out: false
_shards: {
    total: 8
    successful: 8
    failed: 0
}
hits: {
    total: 0
    max_score: null
    hits: [ ]
}
suggest: {
    title: [
        {
            text: waser
            offset: 0
            length: 5
            options: [
                {
                    text: waser
                    highlighted: waser
                    score: 0.08398465
                }
            ]
        }
    ]
    text_suggest: [
        {
            text: waser
            offset: 0
            length: 5
            options: [
                {
                    text: waser
                    highlighted: waser
                    score: 0.10791662
                }
            ]
        }
    ]
    redirect: [
        {
            text: waser
            offset: 0
            length: 5
            options: [
                {
                    text: waser
                    highlighted: waser
                    score: 0.27871963
                }
            ]
        }
    ]
}

}

Suggestion test query

And one more bug. On the upside, the feature is almost done.

Attached:

{

took: 135
timed_out: false
_shards: {
    total: 8
    successful: 8
    failed: 0
}
hits: {
    total: 0
    max_score: null
    hits: [ ]
}
suggest: {
    title: [
        {
            text: waser
            offset: 0
            length: 5
            options: [
                {
                    text: waser
                    highlighted: waser
                    score: 0.08398465
                }
            ]
        }
    ]
    text_suggest: [
        {
            text: waser
            offset: 0
            length: 5
            options: [
                {
                    text: waser
                    highlighted: waser
                    score: 0.10791662
                }
                {
                    text: wasser
                    highlighted: <em>wasser</em>
                    score: 0.042066924
                }
                {
                    text: water
                    highlighted: <em>water</em>
                    score: 0.017357524
                }
                {
                    text: wsser
                    highlighted: <em>wsser</em>
                    score: 0.011847182
                }
                {
                    text: wash
                    highlighted: <em>wash</em>
                    score: 0.009659772
                }
                {
                    text: wassers
                    highlighted: <em>wassers</em>
                    score: 0.00865565
                }
            ]
        }
    ]
    redirect: [
        {
            text: waser
            offset: 0
            length: 5
            options: [
                {
                    text: waser
                    highlighted: waser
                    score: 0.27871963
                }
            ]
        }
    ]
}

}

Hmmm. This:

{
    text: waser
    highlighted: waser
    score: 0.10791662
}
{
    text: wasser
    highlighted: <em>wasser</em>
    score: 0.042066924
}

Says that Elasticsearch thinks that "waser" is still a better option than "wasser". I find that it actually works better for me when searching for phrases. I'm not super sure why at this point. For example, I have a page which contains the phrase "test catapult" but when I search for "catapul" I don't get a suggestion. I do get one when I search for "test catapul" or "tets catapul".

I'll add it to my todo list to figure out why that happens. For now, I'll proceed with making text suggestions configurable.

Change 90132 had a related patch set uploaded by Manybubbles:
Optionally pull suggestions from text

https://gerrit.wikimedia.org/r/90132

With the referenced patch you can turn on getting suggestions from text by setting $wgCirrusSearchPhraseUseText = true; and doing an in place reindex:
php updateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now
php forceSearchIndex.php --forceUpdate

What I do not understand: how can Elasticsearch think "waser" is a better option even if the word "waser" is NOT FOUND at all?

So here are some more tests, this time with a multi-word phrase:
The correct spelling would be "geraspelter Schokolade bestreuen", but I entered "geraspelter Shokolade bestreuen", without the "c" in "Schokolade".

Here is the output:

{

took: 119
timed_out: false
_shards: {
    total: 8
    successful: 8
    failed: 0
}
hits: {
    total: 0
    max_score: null
    hits: [ ]
}
suggest: {
    title: [
        {
            text: geraspelter Shokolade bestreuen
            offset: 0
            length: 31
            options: [
                {
                    text: geraspelter shokolade bestreuen
                    highlighted: geraspelter shokolade bestreuen
                    score: 0.00026727197
                }
            ]
        }
    ]
    text_suggest: [
        {
            text: geraspelter Shokolade bestreuen
            offset: 0
            length: 31
            options: [
                {
                    text: geraspelter schokolade bestreuen
                    highlighted: geraspelter <em>schokolade</em> bestreuen
                    score: 0.011861463
                }
                {
                    text: geraspelte schokolade bestreuen
                    highlighted: <em>geraspelte schokolade</em> bestreuen
                    score: 0.0057594595
                }
                {
                    text: geraspelter shokolade bestreuen
                    highlighted: geraspelter shokolade bestreuen
                    score: 0.0005670466
                }
                {
                    text: geraspelten schokolade bestreuen
                    highlighted: <em>geraspelten schokolade</em> bestreuen
                    score: 0.000060482846
                }
                {
                    text: geraspelt schokolade bestreuen
                    highlighted: <em>geraspelt schokolade</em> bestreuen
                    score: 0.00000123148
                }
                {
                    text: geraspelten shokolade bestreuen
                    highlighted: <em>geraspelten</em> shokolade bestreuen
                    score: 0.0000010980223
                }
                {
                    text: geraspelte shokolade bestreuen
                    highlighted: <em>geraspelte</em> shokolade bestreuen
                    score: 0.0000010932401
                }
                {
                    text: geraspelt 1 schokolade bestreuen
                    highlighted: <em>geraspelt 1 schokolade</em> bestreuen
                    score: 0.0000010556109
                }
                {
                    text: geraspelter schoklolade bestreuen
                    highlighted: geraspelter <em>schoklolade</em> bestreuen
                    score: 6.116578e-7
                }
                {
                    text: geraspelt shokolade bestreuen
                    highlighted: <em>geraspelt</em> shokolade bestreuen
                    score: 5.8213175e-7
                }
                {
                    text: geraspelt 1 shokolade bestreuen
                    highlighted: <em>geraspelt 1</em> shokolade bestreuen
                    score: 4.989969e-7
                }
                {
                    text: geraspelter schokoladen bestreuen
                    highlighted: geraspelter <em>schokoladen</em> bestreuen
                    score: 4.862056e-7
                }
            ]
        }
    ]
    redirect: [
        {
            text: geraspelter Shokolade bestreuen
            offset: 0
            length: 31
            options: [
                {
                    text: geraspelter shokolade bestreuen
                    highlighted: geraspelter shokolade bestreuen
                    score: 0.009769141
                }
            ]
        }
    ]
}

}

Great Nik. Thank you again! I will try it out tomorrow...

(In reply to comment #20)

What I do not understand: how can Elasticsearch think "waser" is a better
option even if the word "waser" is NOT FOUND at all?

So here are some more tests, this time with a multi-word phrase:
The correct spelling would be "geraspelter Schokolade bestreuen", but I entered
"geraspelter Shokolade bestreuen", without the "c" in "Schokolade".
<snip>

It gets it right that time, at least. You may want to try hitting the <wikiname>_content alias rather than the <wikiname> alias. I see that producing better results on my side.

Still, I'll have to look into it.

When using <wikiname>_content and searching for "waser", this is the result, if it is any help:

{

took: 14
timed_out: false
_shards: {
    total: 4
    successful: 4
    failed: 0
}
hits: {
    total: 0
    max_score: null
    hits: [ ]
}
suggest: {
    title: [
        {
            text: waser
            offset: 0
            length: 5
            options: [
                {
                    text: waser
                    highlighted: waser
                    score: 0.06573563
                }
            ]
        }
    ]
    text_suggest: [
        {
            text: waser
            offset: 0
            length: 5
            options: [
                {
                    text: wasser
                    highlighted: <em>wasser</em>
                    score: 0.03317656
                }
                {
                    text: water
                    highlighted: <em>water</em>
                    score: 0.017357524
                }
                {
                    text: wsser
                    highlighted: <em>wsser</em>
                    score: 0.009198759
                }
                {
                    text: wassers
                    highlighted: <em>wassers</em>
                    score: 0.00865565
                }
                {
                    text: waser
                    highlighted: waser
                    score: 0.007820548
                }
                {
                    text: wash
                    highlighted: <em>wash</em>
                    score: 0.0075003416
                }
            ]
        }
    ]
    redirect: [
        {
            text: waser
            offset: 0
            length: 5
            options: [
                {
                    text: waser
                    highlighted: waser
                    score: 0.27871963
                }
            ]
        }
    ]
}

}

And here is the other one:

{

took: 63
timed_out: false
_shards: {
    total: 4
    successful: 4
    failed: 0
}
hits: {
    total: 0
    max_score: null
    hits: [ ]
}
suggest: {
    title: [
        {
            text: geraspelter Shokolade bestreuen
            offset: 0
            length: 31
            options: [
                {
                    text: geraspelter shokolade bestreuen
                    highlighted: geraspelter shokolade bestreuen
                    score: 0.00012816112
                }
            ]
        }
    ]
    text_suggest: [
        {
            text: geraspelter Shokolade bestreuen
            offset: 0
            length: 31
            options: [
                {
                    text: geraspelter schokolade bestreuen
                    highlighted: geraspelter <em>schokolade</em> bestreuen
                    score: 0.009209847
                }
                {
                    text: geraspelte schokolade bestreuen
                    highlighted: <em>geraspelte schokolade</em> bestreuen
                    score: 0.004471939
                }
                {
                    text: geraspelten schokolade bestreuen
                    highlighted: <em>geraspelten schokolade</em> bestreuen
                    score: 0.000036463684
                }
                {
                    text: geraspelt schokolade bestreuen
                    highlighted: <em>geraspelt schokolade</em> bestreuen
                    score: 0.00000123148
                }
                {
                    text: geraspelt 1 schokolade bestreuen
                    highlighted: <em>geraspelt 1 schokolade</em> bestreuen
                    score: 0.0000010556109
                }
                {
                    text: geraspelter schoklolade bestreuen
                    highlighted: geraspelter <em>schoklolade</em> bestreuen
                    score: 6.116578e-7
                }
                {
                    text: geraspelt shokolade bestreuen
                    highlighted: <em>geraspelt</em> shokolade bestreuen
                    score: 5.8213175e-7
                }
                {
                    text: geraspelter shokolade bestreuen
                    highlighted: geraspelter shokolade bestreuen
                    score: 5.2390885e-7
                }
                {
                    text: geraspelten shokolade bestreuen
                    highlighted: <em>geraspelten</em> shokolade bestreuen
                    score: 5.1398877e-7
                }
                {
                    text: geraspelte shokolade bestreuen
                    highlighted: <em>geraspelte</em> shokolade bestreuen
                    score: 5.117502e-7
                }
                {
                    text: geraspelt 1 shokolade bestreuen
                    highlighted: <em>geraspelt 1</em> shokolade bestreuen
                    score: 4.989969e-7
                }
                {
                    text: geraspelter schokoladen bestreuen
                    highlighted: geraspelter <em>schokoladen</em> bestreuen
                    score: 4.862056e-7
                }
            ]
        }
    ]
    redirect: [
        {
            text: geraspelter Shokolade bestreuen
            offset: 0
            length: 31
            options: [
                {
                    text: geraspelter shokolade bestreuen
                    highlighted: geraspelter shokolade bestreuen
                    score: 0.009769141
                }
            ]
        }
    ]
}

}

(In reply to comment #23)

When using <wikiname>_content and searching for "waser", this is the result,
if it is any help:
<snip>

options: [
    {
        text: wasser
        highlighted: <em>wasser</em>
        score: 0.03317656
    }

<snip>

{
    text: waser
    highlighted: waser
    score: 0.007820548
}

<snip>

That is much better. See how "wasser"'s score is four times "waser"'s? That is enough to get it suggested.
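As a quick check, the scores from the two "waser" results in this thread can be plugged into that factor-of-two rule. This is only a sketch: the helper name is made up, the scores are copied from the outputs above, and the cutoff of 2 is the "twice the original search phrase" rule mentioned earlier.

```python
def score_ratio(options, candidate, query):
    """Return how many times higher the candidate scores than the query."""
    scores = {o["text"]: o["score"] for o in options}
    return scores[candidate] / scores[query]


# text_suggest scores copied from the results posted in this thread.
combined_alias = [
    {"text": "waser", "score": 0.10791662},
    {"text": "wasser", "score": 0.042066924},
]
content_alias = [
    {"text": "wasser", "score": 0.03317656},
    {"text": "waser", "score": 0.007820548},
]

for label, options in (("<wikiname>", combined_alias),
                       ("<wikiname>_content", content_alias)):
    ratio = score_ratio(options, "wasser", "waser")
    verdict = "suggested" if ratio > 2.0 else "not suggested"
    print(f"{label}: ratio {ratio:.2f} -> {verdict}")
```

On the combined alias the ratio is well below 1, so "wasser" is dropped. On the content alias it is a little over 4, which clears the cutoff.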

Off the cuff, my guess is that "waser" gets a really high score when you use the <wikiname> alias because every suggestion's score is MAX(per shard score), and the per shard score is based on the number of terms in the shard. The <wikiname> alias combines the <wikiname>_content and <wikiname>_general aliases, which might have vastly different sizes, so you could end up with bogus scores.
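A toy model of that guess, and nothing more: assume each shard scores a term by its share of all terms in that shard, and that the merged score is the maximum over shards. Both the scoring formula and the term counts below are invented for illustration and are not how Elasticsearch actually computes phrase suggester scores.

```python
def shard_score(term, term_counts):
    """Toy per-shard score: the term's share of all terms in the shard."""
    total = sum(term_counts.values())
    return term_counts.get(term, 0) / total


def merged_score(term, shards):
    """Merge per-shard scores by taking the maximum, as guessed above."""
    return max(shard_score(term, counts) for counts in shards)


# Invented term counts: a large content shard and a tiny non-content shard.
content_shard = {"wasser": 300, "waser": 1, "other": 9699}
general_shard = {"waser": 1, "other": 9}

# Content shard alone: the correct spelling wins clearly.
print(shard_score("wasser", content_shard), shard_score("waser", content_shard))

# Both shards merged with MAX: the tiny shard inflates the misspelling.
both = [content_shard, general_shard]
print(merged_score("wasser", both), merged_score("waser", both))
```

In this toy setup one stray "waser" in the small shard is enough to outscore "wasser" after merging, which is the kind of bogus score described above.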

The upshot from the perspective of a user is that suggestions work a lot less well when querying across content and non-content namespaces. Which I think is _reasonably_ rare.

Change 90132 merged by jenkins-bot:
Optionally pull suggestions from text

https://gerrit.wikimedia.org/r/90132

  1. I updated to the newest master on git.
  2. I updated LocalSettings.php with $wgCirrusSearchPhraseUseText = true;
  3. I then ran:
       php updateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now
       php forceSearchIndex.php --forceUpdate

I searched for "waser" and also for "shokolade" but I still did not get any suggestions :-(

Am I forgetting something? Thanks again... Martin

One more thing: I also tried misspelling words that are not in the text but in the title of the page. Also no "did you mean" suggestions :-(

It might be simplest for me to connect to your wiki and es instance and have a look at what is going on. I'm really not sure. Did the script complete successfully? If you'd like me to have a look send me an email with connection information.

I'm sorry this has been so much trouble!

E-Mail is on the way to you...

We worked this out over email - the problem was that the code had not been rebased.

For posterity: suggestions don't work if the first letter isn't right. I'm not filing a bug about that yet but it should be noted.
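One plausible explanation, which is an assumption and not something confirmed in this thread: Elasticsearch's candidate generator has a prefix_length setting (default 1) that requires candidates to share that many leading characters with the input. The sketch below imitates that behaviour with a plain edit distance check. The function names, the vocabulary and the max_edits default are made up for illustration.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def candidates(word, vocabulary, prefix_length=1, max_edits=2):
    """Return vocabulary terms that share the prefix and are within max_edits."""
    prefix = word[:prefix_length]
    return [t for t in vocabulary
            if t != word
            and t.startswith(prefix)
            and edit_distance(word, t) <= max_edits]


vocabulary = ["wasser", "water", "wash"]

print(candidates("waser", vocabulary))                    # first letter correct
print(candidates("aasser", vocabulary))                   # first letter wrong
print(candidates("aasser", vocabulary, prefix_length=0))  # prefix check disabled
```

With the first letter wrong, no term passes the prefix check, so nothing can be suggested no matter how close the rest of the word is.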