CirrusSearch: Using * should search exact matches, not stemmed matches
Open, MediumPublic

Description

CirrusSearch: Using * should search exact matches, not stemmed matches. When users put something like * in a search term, they expect the term to be matched as they typed it. It doesn't help that Elasticsearch/Lucene/whatever doesn't analyze fuzzy query terms, which can cause "*chokolade" not to match "schokolade" because the stemmer has removed the "e" from the end of the indexed word.
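
A toy illustration of the mismatch described above. It assumes a deliberately naive stemmer that only strips one trailing "e" (real analyzers do far more) and uses Python's `fnmatch` as a stand-in for Lucene's wildcard matching. The names `toy_stem`, `index_terms` and `wildcard_search` are hypothetical and not part of CirrusSearch.

```python
from fnmatch import fnmatchcase


def toy_stem(word: str) -> str:
    """Naive stand-in for a stemmer: lowercase and drop one trailing 'e'."""
    word = word.lower()
    return word[:-1] if word.endswith("e") else word


def index_terms(words, stemmed: bool):
    """Return the set of terms a (toy) index field would hold."""
    return {toy_stem(w) if stemmed else w.lower() for w in words}


def wildcard_search(pattern: str, terms):
    """Match the raw, unanalyzed pattern against the indexed terms."""
    return sorted(t for t in terms if fnmatchcase(t, pattern.lower()))


words = ["Schokolade", "Kuchen"]
stemmed_field = index_terms(words, stemmed=True)   # holds 'schokolad'
plain_field = index_terms(words, stemmed=False)    # holds 'schokolade'

# The pattern still ends in 'e', but the stemmed term no longer does.
print(wildcard_search("*chokolade", stemmed_field))
print(wildcard_search("*chokolade", plain_field))
```

In this sketch the stemmed field yields no match while the unstemmed field does, which is the behaviour the task asks for when a term contains *.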


Version: unspecified
Severity: normal

Details

Reference
bz56163

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:35 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz56163.
bzimport added a subscriber: Unknown Object (MLST).

Change 94373 had a related patch set uploaded by Manybubbles:
Term containing * match against unstemmed text

https://gerrit.wikimedia.org/r/94373

Change 94374 had a related patch set uploaded by Manybubbles:
Tests for term containing * match unstemmed text

https://gerrit.wikimedia.org/r/94374

It's working! After the git update I did a reindex:
php updateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now
php forceSearchIndex.php --forceUpdate
Then "Sch*kolade" and also "*chokolade" returned results.

Great work everyone!!! And thank you again...

Change 94373 merged by jenkins-bot:
Term containing * match against unstemmed text

https://gerrit.wikimedia.org/r/94373

Change 94374 merged by jenkins-bot:
Tests for term containing * match unstemmed text

https://gerrit.wikimedia.org/r/94374

The problem seems to be back with the master version of CirrusSearch and Elastica and mw 1.24.1.

Ok - it's working for me on the branches I'm on now (basically master everywhere, settings that are mostly like enwiktionary). Can you post the result of adding &cirrusDumpQuery=yes to the end of a search and post these too:

  • http://localhost:8080/w/api.php?action=cirrus-config-dump&format=json
  • http://localhost:8080/w/api.php?action=cirrus-settings-dump&format=json
  • http://localhost:8080/w/api.php?action=cirrus-mapping-dump&format=json

I probably won't need them all but they might help.
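
A small sketch of how the requested debug URLs could be assembled. The `index.php` search URL in the usage line is an assumption for illustration; the `api.php` actions are the ones listed above, and the helper names are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

BASE = "http://localhost:8080/w"  # assumption: same local wiki as the URLs above


def with_dump_query(search_url: str) -> str:
    """Append cirrusDumpQuery=yes to an existing search URL."""
    parts = urlsplit(search_url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    params.append(("cirrusDumpQuery", "yes"))
    return urlunsplit(parts._replace(query=urlencode(params)))


def dump_api_urls(base: str = BASE):
    """Build the three dump URLs requested in the comment above."""
    actions = ["cirrus-config-dump", "cirrus-settings-dump", "cirrus-mapping-dump"]
    return [
        f"{base}/api.php?" + urlencode({"action": action, "format": "json"})
        for action in actions
    ]


# Hypothetical search URL; only the appended parameter matters here.
print(with_dump_query(BASE + "/index.php?search=*chokolade&fulltext=Search"))
for url in dump_api_urls():
    print(url)
```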

Hello Manybubbles, sorry for getting back to you so late.

This is the output of: ?title=Spezial%3ASuche&profile=default&search=*chokolade&fulltext=Search&cirrusDumpQuery=yes

{"description":"full_text search for '*chokolade'","path":"wiki_wiki_koch_content\/page\/_search","params":{"search_type":"dfs_query_then_fetch","timeout":"20s"},"query":{"_source":["id","title","namespace","redirect.*","timestamp","text_bytes"],"fields":"text.word_count","query":{"bool":{"minimum_number_should_match":1,"should":[{"query_string":{"query":"chokolade","fields":["all.plain^1","all^0.5"],"auto_generate_phrase_queries":true,"phrase_slop":0,"default_operator":"AND","allow_leading_wildcard":false,"fuzzy_prefix_length":2,"rewrite":"top_terms_128"}},{"multi_match":{"fields":["all_near_match^2"],"query":"*chokolade"}}]}},"highlight":{"pre_tags":["<span class=\"searchmatch\">"],"post_tags":["<\/span>"],"fields":{"title":{"number_of_fragments":0,"type":"fvh","order":"score","matched_fields":["title","title.plain"]},"redirect.title":{"number_of_fragments":1,"fragment_size":10000,"type":"fvh","order":"score","options":{"skip_if_last_matched":true},"matched_fields":["redirect.title","redirect.title.plain"]},"category":{"number_of_fragments":1,"fragment_size":10000,"type":"fvh","order":"score","options":{"skip_if_last_matched":true},"matched_fields":["category","category.plain"]},"heading":{"number_of_fragments":1,"fragment_size":10000,"type":"fvh","order":"score","options":{"skip_if_last_matched":true},"matched_fields":["heading","heading.plain"]},"text":{"number_of_fragments":1,"fragment_size":150,"type":"fvh","order":"score","no_match_size":150,"matched_fields":["text","text.plain"]},"auxiliary_text":{"number_of_fragments":1,"fragment_size":150,"type":"fvh","order":"score","options":{"skip_if_last_matched":true},"matched_fields":["auxiliary_text","auxiliary_text.plain"]}},"highlight_query":{"query_string":{"query":"chokolade","fields":["title.plain^20","redirect.title.plain^15","category.plain^8","heading.plain^5","opening_text.plain^3","text.plain^1","auxiliary_text.plain^0.5","title^10","redirect.title^7.5","category^4","heading^2.5","opening_text^1.5","text^0.5","auxiliary_text^0.25"],"auto_generate_phrase_queries":true,"phrase_slop":1,"default_operator":"AND","allow_leading_wildcard":false,"fuzzy_prefix_length":2,"rewrite":"top_terms_128"}}},"suggest":{"text":"*chokolade","suggest":{"phrase":{"field":"suggest","size":1,"max_errors":2,"confidence":2,"direct_generator":[{"field":"suggest","suggest_mode":"always","max_term_freq":0.5,"prefix_length":2}],"highlight":{"pre_tag":"<em>","post_tag":"<\/em>"}}}},"stats":["suggest","full_text"],"size":20,"rescore":[{"window_size":8192,"query":{"rescore_query":{"function_score":{"functions":[{"script_score":{"script":"log10((doc['incoming_links'].isEmpty() ? 0 : doc['incoming_links'].value) + 2)","lang":"groovy"}}]}},"query_weight":1,"rescore_query_weight":1,"score_mode":"multiply"}}]}}
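
A minimal sketch for picking the relevant parts out of a dump like the one above. The embedded JSON is a hand-trimmed copy that keeps only the first `should` clause, and `query_string_clauses` is a hypothetical helper. In the dump, the `query_string` clause carries the query `chokolade` (without the leading *) together with `allow_leading_wildcard` set to false.

```python
import json

# Hand-trimmed copy of the dump above: only the first "should" clause is kept.
dump_text = """
{"description": "full_text search for '*chokolade'",
 "query": {"query": {"bool": {"should": [
   {"query_string": {"query": "chokolade",
                     "fields": ["all.plain^1", "all^0.5"],
                     "allow_leading_wildcard": false}}]}}}}
"""


def query_string_clauses(node):
    """Yield every query_string body found anywhere in the structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "query_string" and isinstance(value, dict):
                yield value
            else:
                yield from query_string_clauses(value)
    elif isinstance(node, list):
        for item in node:
            yield from query_string_clauses(item)


dump = json.loads(dump_text)
for clause in query_string_clauses(dump):
    # Shows the query text actually sent and the leading-wildcard setting.
    print(clause["query"], clause["allow_leading_wildcard"])
```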

About the other results: I was too dumb to get those working, sorry. I only get:

No such action
The action specified by the URL is invalid. You might have mistyped the URL, or followed an incorrect link.

And also: I tried the master versions of Elastica and CirrusSearch from today with my current 1.24.1 installation. With those versions the "MW Version" page is totally broken and does not load. Just for your info.

Hmmm...... I'm willing to bet that has something to do with the composer changes. Ping @bd808 on whether folks on 1.24 can use our plugins on master. For the most part Cirrus's master branch should be compatible with 1.24. I've been able to cut 1.24 based releases when people asked simply by merging down to a 1.24 branch, retesting, and fixing one or two small issues.

@SmartK - I'll have a look at the generated query and get right back to you.

@SmartK - can you try with the * coming in the middle of the word or at the end? I believe starting with the * doesn't work for several reasons. Cirrus itself seems to reject it *and* Cirrus sends allow_leading_wildcard=false to Elasticsearch. I imagine we can make that configurable without trouble. The trick is that leading wildcards are known to be slow slow slow, which is an attack vector for us, so we disable them.
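
A simplified model of why a leading wildcard costs so much more than a trailing one. It assumes the term dictionary is just a sorted Python list; Lucene's real data structures are more sophisticated, so treat this only as an illustration of the idea. `prefix_lookup` and `suffix_lookup` are hypothetical names.

```python
from bisect import bisect_left


def prefix_lookup(sorted_terms, prefix):
    """Trailing wildcard ('prefix*'): binary-search to the start of the range."""
    examined = 0
    matches = []
    i = bisect_left(sorted_terms, prefix)
    while i < len(sorted_terms):
        examined += 1
        if not sorted_terms[i].startswith(prefix):
            break  # sorted order guarantees no later term can match
        matches.append(sorted_terms[i])
        i += 1
    return matches, examined


def suffix_lookup(sorted_terms, suffix):
    """Leading wildcard ('*suffix'): sort order does not help, so scan everything."""
    examined = 0
    matches = []
    for term in sorted_terms:
        examined += 1
        if term.endswith(suffix):
            matches.append(term)
    return matches, examined


# 10,000 filler terms plus the one term of interest.
terms = sorted([f"term{i:05d}" for i in range(10_000)] + ["schokolade"])

print(prefix_lookup(terms, "schoko"))      # examines a handful of terms
print(suffix_lookup(terms, "chokolade"))   # examines every term
```

In this model the leading-wildcard lookup has to touch every term in the dictionary, so its cost grows with the size of the index rather than with the number of matches.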

Elastica from git master needs to have Composer run to import the "ruflin/elastica" library or somehow otherwise get that library into the PHP autoloader (eg by cloning mediawiki/vendor to $IP/vendor). The tarball creation process should take care of this as well as far as I know. The git master version of CirrusSearch is using MWLoggerFactory and PSR-3 logging so it needs a 1.25 version of MediaWiki core.

Ah. A shame I guess. For a while Cirrus master was pretty safe to run against 1.24.1. Given that 1.24.1 is LTS it makes sense to have 1.24.1 compatible backports of Cirrus every once in a while.

@Manybubbles: I tried running * in the middle and at the end of a word:

  • at the end it works fine
  • in the middle as well

I just thought it worked in an earlier version, that is why I wanted to report it.
...
...
So I just read my old reports above in this ticket: it seemed to work in November 2013 with an old version of Elasticsearch and an old version of CirrusSearch, if that helps...

Cool. I've filed this as T91666 because I don't feel like reopening this bug :).... I don't know how to prioritize it. Is this actively a problem for you or just something you've noticed that doesn't feel right?

No, it is not actively a problem, just a nice-to-have. Thank you.

Restricted Application added a subscriber: Aklapper.