
Wrong terms are highlighted / snippet does not contain search phrase
Closed, Resolved, Public

Description

The highlighted terms in a CirrusSearch result are not always the most sensible ones.

Example: when searching for 'partial forms'[1], the top results include (sub)pages of Extension:Semantic_Forms. This is to be expected, since partial forms are a feature of that extension. However, the snippet shown in the search result for these pages does not contain the phrase 'partial forms', even though it occurs on the page. Instead, for all pages the snippet shown is:


Semantic *Forms* - navigation (view) Basics Main page (talk) · Download and installation · Quick start

with 'Forms' being highlighted.

[1] https://www.mediawiki.org/w/index.php?search=partial+forms&button=&title=Special%3ASearch&srbackend=CirrusSearch


Version: unspecified
Severity: normal
Whiteboard: Elasticsearch_Open_Bug
See Also:
https://github.com/elasticsearch/elasticsearch/issues/4351

Details

Reference
bz53529

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 2:02 AM)
bzimport set Reference to bz53529.

Triaging to normal because the results still make sense. It is more important than some of the other bugs I've set to normal. I might have to push those down to low and the low ones to lowest....

This one is fun.

First, we have to tell elasticsearch to order the highlights by score. I was under the impression this is the default. It isn't. Document order is. This is here: https://gerrit.wikimedia.org/r/#/c/82856/
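For illustration, a request body that asks for score-ordered highlights might look like the sketch below. The `order: score` setting is part of the Elasticsearch highlighting options; the field name `text`, the helper name `build_highlight_query`, and the fragment count are assumptions made for this example and are not taken from the Gerrit change.

```python
import json


def build_highlight_query(search_terms, field="text"):
    """Build a search body whose highlight fragments are sorted by score.

    `field` is a hypothetical field name; the real CirrusSearch mapping
    may use different fields.
    """
    return {
        "query": {"match": {field: search_terms}},
        "highlight": {
            # Without this setting, fragments come back in document order.
            "order": "score",
            "fields": {field: {"number_of_fragments": 1}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_highlight_query("partial forms"), indent=2))
```

With only one fragment requested, document order tends to return whatever matches first on the page (here, the navigation box containing "Forms"), which is why the ordering matters for this bug.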

Next, we have to convince elasticsearch to really boost perfect phrase matches. This can't be merged because of a bug in elasticsearch (https://github.com/elasticsearch/elasticsearch/issues/3503) that will be fixed in the next release. The commit has probably atrophied a bit because it has been sitting around but eventually we'll be able to merge it here: https://gerrit.wikimedia.org/r/#/c/79087/

And finally it looks like elasticsearch doesn't take rescores into account when it highlights (https://github.com/elasticsearch/elasticsearch/issues/3630). When that is released and we've merged the phrase boosts, then this bug should be solved.
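As a rough sketch of what a phrase boost via rescore could look like: the body below runs a plain match query and then rescores the top hits with a phrase query. The `rescore` structure (`window_size`, `rescore_query`, `query_weight`, `rescore_query_weight`) follows the Elasticsearch rescore API; the field name `text`, the helper name, the window size, and the weights are illustrative assumptions, and the actual Gerrit change may be built differently.

```python
import json


def build_phrase_rescore_query(search_terms, field="text", window_size=50):
    """Build a search body that boosts exact phrase matches in a rescore pass.

    All defaults here are hypothetical values chosen for the example.
    """
    return {
        # First pass: any document matching the individual terms.
        "query": {"match": {field: search_terms}},
        "rescore": {
            # Only the top `window_size` hits per shard are rescored.
            "window_size": window_size,
            "query": {
                # Second pass: documents containing the exact phrase score higher.
                "rescore_query": {"match_phrase": {field: search_terms}},
                "query_weight": 1.0,
                "rescore_query_weight": 10.0,
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_phrase_rescore_query("partial forms"), indent=2))
```

The rescore changes how documents are ranked; the linked issue is about the highlighter not seeing that rescore query when it scores fragments.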

I'm whiteboarding this Elasticsearch_Open_Bug until https://github.com/elasticsearch/elasticsearch/issues/3630 is merged and I know which release it'll go with.

So just pushing sorting by score [1] seems to have helped the situation quite a bit. I'm not closing this because I'm using it to track the open elasticsearch issue [2]. Also, the work to boost perfect phrase matches [3] is ready to be merged, _but_ merging it would cause Bug 54526, which I filed in preparation for the merge.

[1] https://gerrit.wikimedia.org/r/#/c/82856/
[2] https://github.com/elasticsearch/elasticsearch/issues/3630
[3] https://gerrit.wikimedia.org/r/#/c/79087/

https://github.com/elasticsearch/elasticsearch/issues/3630 has just been closed, and the fix will be released with 0.90.6. There is light at the end of this bug!

Switching to Elasticsearch_0.90.7 because 0.90.6 has an issue that we don't want to suffer, and they are cutting a new release "in a couple of days".

I'm taking this to work on the final leg: when we boost perfect phrase matches, that boost should also be considered when sorting the highlights.

Looks like there is a bug in Lucene, carried over into Elasticsearch, that is stopping me from finishing this. It stops highlighting from considering some boosts. In my case, it stops it from considering the boosts I use to boost perfect phrase matches....

This should (finally) be all fixed.