
CirrusSearch should provide a way to find hyphenated words, as Lucene-search always has
Open, Low, Public, Feature

Description

Author: SpontaneousGrumbler

Description:
Before CirrusSearch can be considered a replacement for Lucene-search (not a downgrade), it needs to be able to find hyphenated words, such as "he was assigned to follow-up on the discovery", without finding "follow up". Lucene-search finds both hyphenated and unhyphenated forms if "follow up" is searched; it only finds the hyphenated form if "follow-up" is searched. This allows editors to find and fix cases of improper punctuation. This change will allow CirrusSearch to match what Lucene-search does now. Even nicer would be to provide a way to find an actual space and not hyphenation, such as "He was well-known in Europe."
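
For illustration, the matching behavior requested above can be modeled with a short sketch. This is only a toy model of the described behavior, not lsearchd's actual code; the function name `lucene_style_match` is hypothetical, and the regex-based approach is an assumption made for the example.

```python
import re


def lucene_style_match(query: str, text: str) -> bool:
    """Toy model of the requested behavior (not lsearchd's real implementation).

    A space in the query matches either a space or a hyphen in the text;
    a hyphen in the query matches only a literal hyphen.
    """
    # Split the query into tokens while keeping the separators.
    parts = re.split(r"([ -])", query.strip())
    pattern_parts = []
    for part in parts:
        if part == " ":
            pattern_parts.append(r"[ -]")  # space: accept space or hyphen
        elif part == "-":
            pattern_parts.append(r"-")     # hyphen: accept only a hyphen
        else:
            pattern_parts.append(re.escape(part))
    pattern = r"\b" + "".join(pattern_parts) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


hyphenated = "he was assigned to follow-up on the discovery"
spaced = "please follow up on the discovery"

print(lucene_style_match("follow up", hyphenated))  # space query, hyphenated text
print(lucene_style_match("follow up", spaced))      # space query, spaced text
print(lucene_style_match("follow-up", hyphenated))  # hyphen query, hyphenated text
print(lucene_style_match("follow-up", spaced))      # hyphen query, spaced text
```

Only the last call should report no match, which is the case that lets editors find just the hyphenated form.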


Version: unspecified
Severity: enhancement

Details

Reference
bz70950

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:50 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz70950.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to SpontaneousGrumbler from comment #0)

find hyphenated words, such as "he was
assigned to follow-up on the discovery", without finding "follow up".

It's not about hyphens specifically, tweaked summary. Currently, you can use "insource:".

This has been discussed at https://www.mediawiki.org/wiki/Thread:Help_talk:CirrusSearch/%22Really%22_exact_matches and https://www.mediawiki.org/wiki/Help:CirrusSearch#insource: is now a bit clearer (while https://www.mediawiki.org/wiki/Help:CirrusSearch#Quotes_and_exact_matches is probably a bit confusing).

Certainly "search the pre-tokenized version of the source" is not particularly clear...

SpontaneousGrumbler wrote:

(In reply to Nemo from comment #1)

(In reply to SpontaneousGrumbler from comment #0)

find hyphenated words, such as "he was
assigned to follow-up on the discovery", without finding "follow up".

It's not about hyphens specifically, tweaked summary. Currently, you can use
"insource:".

This has been discussed at
<https://www.mediawiki.org/wiki/Thread:Help_talk:CirrusSearch/%22Really%22_exact_matches> and
https://www.mediawiki.org/wiki/Help:CirrusSearch#insource: is now a bit
clearer (while
https://www.mediawiki.org/wiki/Help:CirrusSearch#Quotes_and_exact_matches
is probably a bit confusing).

Certainly "search the pre-tokenized version of the source" is not
particularly clear...

Anyone else have vertigo after following the discussion from the CirrusSearch help page, where this was first brought up, then to the Bugzilla report, then from there back to the CirrusSearch help page as a proposed solution? The insource: feature is no help at all for this problem. The regex flavor runs for a long time and then falls off the edge of the earth. The other flavor doesn't pay any more attention to hyphens than the straightforward search. Let's stop dodging the issue and get to work fixing the problem. How do I get the summary changed back to "CirrusSearch should provide a way to find hyphenated words"? The updated summary about "exact match" seems to be a setup for deflecting this back to some pie-in-the-sky solution using the insource: feature.

(In reply to SpontaneousGrumbler from comment #2)

The updated summary about "exact
match" seems to be a setup for deflecting this back to some pie-in-the-sky
solution using the insource: feature.

Haha, well put, maybe you're right: but I think not. I changed the summary to cover the other users' scenario as well because (long story short) I think the ElasticSearch "feature" doing this is the same.

I think he's right in that hyphenated words are the only thing that lsearchd has special handling for. There could be more - the code is vast and I haven't read it all - but I don't think there are. I've set the summary back to how SpontaneousGrumbler@gmail.com originally filed it. Are there any constructs other than hyphenated words that have this problem?

The problem with adding lsearchd's support for hyphenated words to Cirrus is that it relies on some pretty gnarly hacks that we can't easily replicate. My hope was that regexes would give you more power to find more things and that they'd be tolerably fast.

At this point I'm not willing to reimplement the hyphenation hack - it's just too much work and it only handles the hyphens. I'm very happy to work to make the regex search faster. Adding another clause (<<insource:"follow-up" insource:/follow-up/>> for example) speeds it up but if there are other regex searches in front of you (there is a queue that all users share) it gets slow again. I can certainly work on that.

Even when Cirrus is the primary search backend for enwiki you'll still be able to use lsearchd for a few months with a url parameter (&srbackend=LuceneSearch) and we'll monitor which queries still hit that system before we disable it entirely. We're in no hurry there.
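
As a rough sketch of the two workarounds mentioned above, the snippet below builds the combined insource: query and a search URL that optionally carries the srbackend parameter. The helper names and the base URL are hypothetical, and the `search` parameter name is an assumption about the wiki's search page; only the query syntax and `srbackend=LuceneSearch` come from this thread. The phrase is assumed to contain no quotes, slashes, or regex metacharacters.

```python
from typing import Optional
from urllib.parse import urlencode


def combined_insource_query(phrase: str) -> str:
    """Pair the plain insource: clause with the regex one.

    Assumes `phrase` has no quotes, slashes, or regex metacharacters,
    so it can be used unescaped in both clauses.
    """
    return f'insource:"{phrase}" insource:/{phrase}/'


def search_url(base: str, query: str, backend: Optional[str] = None) -> str:
    """Build a search URL; `base` and the `search` parameter are assumptions."""
    params = {"search": query}
    if backend is not None:
        params["srbackend"] = backend  # e.g. "LuceneSearch", as mentioned above
    return f"{base}?{urlencode(params)}"


query = combined_insource_query("follow-up")
print(query)
print(search_url("https://example.org/w/index.php", query))
print(search_url("https://example.org/w/index.php", query, backend="LuceneSearch"))
```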

As to the discussion being in three places - I'm not sure what to say. I have trouble keeping track of anything outside of bugzilla.

(In reply to Nik Everett from comment #4)

I think he's right in that hyphenated words are the only thing that lsearchd
has special handling for. There could be more - the code is vast and I
haven't read it all - but I don't think there are. I've set the summary
back to how SpontaneousGrumbler@gmail.com originally filed it. Are there
any constructs other than hyphenated words that have this problem?

In English I can't think of any, but I'd really like to look further into what lsearchd is doing here. I don't think the original request is unreasonable, although I agree that it's not the most straightforward thing for us to implement.

I'm very happy to work to
make the regex search faster. Adding another clause (<<insource:"follow-up"
insource:/follow-up/>> for example) speeds it up but if there are other
regex searches in front of you (there is a queue that all users share) it
gets slow again. I can certainly work on that.

We can always improve insource :)

ryazanov wrote:

(In reply to Chad H. from comment #5)

In English I can't think of any, but I'd really like to look further into
what lsearchd is doing here. I don't think the original request is
unreasonable, although I agree that it's not the most straightforward thing
for us to implement.

English is not the only language in the world. ;–) But even for it, for example, capitalization is another important "exact" thing.

Other things from my experience: some strange people might write, for example, "km\h" instead of "km/h"; sometimes hyphens and dashes are confused in compound words; it might be useful to distinguish between phrases (with spaces), URLs (with dots) or emails and some fancy names (such as "Folding@home").

I don't think that it is very difficult to add a post-filter to the current "exact search" that will check for "truly exact" (character-wise) matches. It shouldn't be difficult to add some modifiers (in the spirit of current "~") to trigger this behavior.
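
A minimal sketch of such a post-filter, assuming the existing "exact search" has already returned candidate page texts as plain strings. The function name and the `case_sensitive` flag are hypothetical and do not correspond to any existing CirrusSearch option.

```python
from typing import Iterable, List


def exact_post_filter(
    candidates: Iterable[str], phrase: str, case_sensitive: bool = True
) -> List[str]:
    """Keep only candidates that contain `phrase` character for character.

    `candidates` stands in for the texts returned by the looser search;
    this step does no tokenizing or normalizing of its own.
    """
    if case_sensitive:
        return [text for text in candidates if phrase in text]
    needle = phrase.casefold()
    return [text for text in candidates if needle in text.casefold()]


pages = [
    "speed in km/h",
    "speed in km\\h",
    "the Folding@home project",
    "folding at home",
]
print(exact_post_filter(pages, "km\\h"))
print(exact_post_filter(pages, "Folding@home"))
```

The cost is one substring scan per candidate, so it only stays cheap while the first-stage search keeps the candidate list small.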

Search maintainers: Is this really high priority? Looks rather like backlog IMHO...

Aklapper lowered the priority of this task from High to Low. Feb 5 2015, 3:11 PM
Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:12 AM