
Quoting terms in CirrusSearch doesn't turn off stemming
Closed, Resolved, Public

Description

Quoting phrases in CirrusSearch doesn't turn off stemming, and the phrase slop is too high. It should probably be 0, which is what people expect.

It might be nice to let users do stemmed phrase searches, maybe with "phrase"~. We could also let them set the phrase slop with something like "phrase"~3 (set slop to 3 and use stemming) or "phrase"3 (set slop to 3).


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=54526

Details

Reference
bz54020

Event Timeline

bzimport raised the priority of this task from to High. Nov 22 2014, 2:08 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz54020.
bzimport added a subscriber: Unknown Object (MLST).

Setting this aside for now as I _can_ implement turning off stemming but then I lose highlighting on the quoted term.

BTW, ~3 is the standard syntax for setting phrase slop so we shouldn't change that. We can still have a syntax that turns on stemming for phrases but I'm not sure what it should be.
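
To make the intended behavior concrete, here is a rough sketch of how quoted phrases could be mapped to a query. It is not the CirrusSearch implementation: the field names `text` (stemmed) and `text.plain` (unstemmed) and the `build_query` helper are assumptions for illustration. It treats a quoted phrase as an unstemmed phrase match with slop 0 and reads "phrase"~N as slop N; the syntax for turning stemming back on is left out because it is still undecided.

```python
import re

# Matches a double-quoted phrase with an optional ~N slop suffix, e.g. "foo bar"~3.
PHRASE_RE = re.compile(r'"([^"]+)"(?:~(\d+))?')

# Hypothetical field names: "text" is analyzed with a stemmer, "text.plain" is not.
STEMMED_FIELD = "text"
UNSTEMMED_FIELD = "text.plain"


def build_query(user_query):
    """Turn a user query string into an Elasticsearch-style bool query dict."""
    clauses = []
    for match in PHRASE_RE.finditer(user_query):
        phrase, slop = match.group(1), match.group(2)
        clauses.append({
            "match_phrase": {
                UNSTEMMED_FIELD: {
                    "query": phrase,
                    # Default to slop 0 unless the user wrote "phrase"~N.
                    "slop": int(slop) if slop else 0,
                }
            }
        })
    # Whatever is left outside the quotes is still searched with stemming.
    rest = " ".join(PHRASE_RE.sub(" ", user_query).split())
    if rest:
        clauses.append({"match": {STEMMED_FIELD: {"query": rest}}})
    return {"bool": {"must": clauses}}
```

With this sketch, `build_query('running "hot dogs"~3 shoes')` yields one phrase clause against the unstemmed field with slop 3 and one ordinary match clause for `running shoes` against the stemmed field.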

Raising importance because someone cared enough about the problem to send an email about it.

I'll add a fix for this even though it'll break highlighting for quoted terms. This is the upstream issue that causes the loss of highlighting: https://github.com/elasticsearch/elasticsearch/issues/3750

Change 85908 had a related patch set uploaded by Manybubbles:
Quotes turn off stemming.

https://gerrit.wikimedia.org/r/85908

Change 85910 had a related patch set uploaded by Manybubbles:
Tests for quotes turning of stemming.

https://gerrit.wikimedia.org/r/85910

Just for posterity:
The proposed solution to this bug, and every other solution I can think of, causes Bug 54526. I'm happy to be told that fixing this isn't worth Bug 54526. In that case I'll make sure the commits for this are held in gerrit until I can fix 54526 upstream _and_ we update to the version with the fix. That will probably take at least a month.

Yes, I know "probably take at least" is very wishy-washy. I can't predict Elasticsearch's release schedule or how long it'll take to fix the bug. I can say that LuceneSearch seems to have figured out some kind of solution to the problem years ago with an old version of Lucene.

Change 85910 merged by jenkins-bot:
Tests for quotes turning of stemming.

https://gerrit.wikimedia.org/r/85910

Change 85908 merged by jenkins-bot:
Quotes turn off stemming.

https://gerrit.wikimedia.org/r/85908