
CirrusSearch seems to stem the word "used" to "us"!
Closed, ResolvedPublic

Description

CirrusSearch seems to stem the word "used" to "us" sometimes!

<elasticsearch>/nikwiki_general/_analyze?analyzer=text&text=used returns
{
  "tokens": [
    {
      "token": "us",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
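
For anyone reproducing this, here is a minimal sketch of how the request above could be built and its response checked. The host `http://localhost:9200` and both helper names are hypothetical. The query-string form of `_analyze` is the one shown in this report, and other Elasticsearch versions may expect a JSON body instead.

```python
import json
from typing import List
from urllib.parse import urlencode

# Response body copied from the report above.
SAMPLE_RESPONSE = """
{
  "tokens": [
    {
      "token": "us",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
"""


def analyze_url(base: str, index: str, analyzer: str, text: str) -> str:
    """Build the query-string form of the _analyze request (hypothetical helper)."""
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"{base}/{index}/_analyze?{query}"


def analyzed_tokens(response_body: str) -> List[str]:
    """Return only the token strings from an _analyze response body."""
    return [entry["token"] for entry in json.loads(response_body)["tokens"]]


if __name__ == "__main__":
    # Assumed host; the index and analyzer names come from the report.
    print(analyze_url("http://localhost:9200", "nikwiki_general", "text", "used"))
    print(analyzed_tokens(SAMPLE_RESPONSE))
```

Comparing the returned tokens for pairs such as "used" and "us" is a quick way to spot terms that the analyzer collapses together.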


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=54875

Details

Reference
bz54022

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:08 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz54022.

I might be able to fix this by switching stemmers. I'll do some more research tomorrow.

Change 86854 had a related patch set uploaded by Manybubbles:
Tests for places where kstem beats porter stemmer.

https://gerrit.wikimedia.org/r/86854

"The kstem token filter is a high performance filter for english"
http://www.elasticsearch.org/guide/reference/index-modules/analysis/kstem-tokenfilter/

So I don't need to test what the effects are of this change for other languages?

Change 86854 merged by jenkins-bot:
Tests for places where kstem beats porter stemmer.

https://gerrit.wikimedia.org/r/86854

Right, this only affects English.

Unfortunately (or fortunately for a small set of use cases) there aren't as many different options for languages other than English. For English, I believe we have five options, listed in order of how much they increase recall and decrease precision:

  1. No stemming
  2. Minimal (just possessives)
  3. KStem
  4. Porter Stemmer
  5. Porter Stemmer via Snowball

A few other languages have "minimal" (or "light") stemmers in addition to their more aggressive versions. For every language other than English, we currently use the Elasticsearch default, which is the more aggressive version.

Switching from the Elasticsearch default to a customized version isn't hard and we're totally willing to do it.
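
As a rough illustration of what such a customization might look like, here is a sketch of index analysis settings that swap the stemming filter. The analyzer name `text` is borrowed from the `_analyze` call in the description. The rest of the filter chain and the helper name are assumptions, not the actual CirrusSearch configuration. `kstem` and `porter_stem` are built-in Elasticsearch token filter names.

```python
import json


def english_analysis_settings(stemmer_filter: str = "kstem") -> dict:
    """Return index settings for a custom analyzer using the given stemmer filter.

    Hypothetical helper: the tokenizer and filter chain are assumptions,
    not the real CirrusSearch analyzer definition.
    """
    return {
        "analysis": {
            "analyzer": {
                "text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Lowercase first, then apply the chosen stemmer.
                    "filter": ["lowercase", stemmer_filter],
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the settings for the less aggressive (kstem) and Porter variants.
    for name in ("kstem", "porter_stem"):
        print(json.dumps(english_analysis_settings(name), indent=2))
```

Keeping the stemmer as a parameter makes it easy to build one index per candidate and compare their `_analyze` output on a list of problem words.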

Sorry for going off-topic with my stupid questions; mainly I'd like to make a list of possible weaknesses, e.g. for Italian analysis, so that users can specifically test them a bit.

(In reply to comment #7)

> Right, this only affects English.

> Unfortunately (or fortunately for a small set of use cases) there aren't as many different options for languages other than English. I believe we have five options, in order of how much they increase recall and decrease precision:

>   1. No stemming
>   2. Minimal (just possessives)
>   3. KStem
>   4. Porter Stemmer
>   5. Porter Stemmer via Snowball

> A few other languages have "minimal" (or "light") stemmers in addition to their more aggressive versions. In all cases other than English at this point we use the Elasticsearch default which is the more aggressive version.

Our default is standard, i.e. http://www.elasticsearch.org/guide/reference/index-modules/analysis/standard-tokenizer/ , or the language default for those which have one ( http://www.elasticsearch.org/guide/reference/index-modules/analysis/lang-analyzer/ ). So are the stopwords we're using those linked from http://www.elasticsearch.org/guide/reference/index-modules/analysis/snowball-analyzer/ ?

> Switching from the Elasticsearch default to a customized version isn't hard and we're totally willing to do it.

Good! I guess you'll need help from native speakers and that they'll need some pointers from the docs on how to help.
30 languages < 285, so maybe – when you start expanding to many languages – as a starting point cutoff_frequency can be used to replace stopwords lists where one is not available as mentioned in https://gibrown.wordpress.com/2013/05/01/three-principles-for-multilingal-indexing-in-elasticsearch/ ? That would be a possible enhancement to file separately.

Yeah, it is probably worth opening a new bug with specific things, but you are right about help from native speakers.

As far as stopwords go, there is a thing in Elasticsearch called a common_terms query that can be used to roughly simulate having stopwords. In some respects it is better than having stopwords, so folks can turn them off and use it instead. But getting it working with the query syntax that we use now is going to be rough.
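
To make the idea concrete, here is a sketch of the request body such a query might use. It assumes the query meant here is the one Elasticsearch exposes under the `common` key, which may not exist in every version. The field name `text`, the 0.001 cutoff, and the helper name are purely illustrative.

```python
import json


def common_terms_query(field: str, text: str, cutoff_frequency: float = 0.001) -> dict:
    """Build a search body for a `common` (common terms) query.

    Terms above cutoff_frequency are treated as high-frequency terms
    rather than being removed by a stopword list. Hypothetical helper;
    the default cutoff is illustrative, not a tuned value.
    """
    return {
        "query": {
            "common": {
                field: {
                    "query": text,
                    "cutoff_frequency": cutoff_frequency,
                }
            }
        }
    }


if __name__ == "__main__":
    # Assumed field name "text"; the real mapping is not shown in this thread.
    print(json.dumps(common_terms_query("text", "the word used"), indent=2))
```

Because the cutoff is derived from term frequencies in the index, this approach would not need a hand-written stopword list per language.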

Additionally, we probably want to turn CirrusSearch on even for languages that aren't in that 30, mostly because we're likely to be better than lucene-search. Except in Esperanto.