
Cirrus unable to find insource:"mazovia.pl" on pl.wp where the phrase occurs in a URL
Closed, Resolved (Public)

Description

Cirrus is unable to find insource:"mazovia.pl" on pl.wp where the phrase occurs in a URL. insource:/mazovia\.pl/, however, does find it, but takes ages to run. I can't tell whether this is a bug or expected behavior for some reason.

For example, https://pl.wikipedia.org/w/index.php?title=Specjalna%3ASzukaj&profile=default&search=insource%3A%22mazovia.pl%22&fulltext=Search should find https://pl.wikipedia.org/wiki/Elżbieta_Lanc, but doesn't.


Version: master
Severity: normal

Details

Reference
bz70873

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:45 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz70873.
bzimport added a subscriber: Unknown Object (MLST).

It's not right, but it's not unexpected. insource:"" segments words in the same way that we segment regular text. I can't think of a workaround for you at this point either: insource:// is going to be slow unless you have another filter like insource:"" alongside it, but insource:"" is what's failing for you here.
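
To make the segmentation behaviour concrete, here is a minimal sketch (plain Python, not the real CirrusSearch/Elasticsearch analyzer) of why the phrase query misses: a word-boundary tokenizer keeps "www.mazovia.pl" together as one token, so the single query token "mazovia.pl" never appears in the page's token stream.

```
import re

def tokenize(text):
    # Rough stand-in for a word-boundary tokenizer: dots between letters
    # do not split a token, while "/", ":" and whitespace do.
    return [t.lower() for t in re.findall(r"[\w.]+", text)]

page = "Zobacz http://www.mazovia.pl/sejmik oraz inne strony"
print(tokenize(page))
# ['zobacz', 'http', 'www.mazovia.pl', 'sejmik', 'oraz', 'inne', 'strony']

query = tokenize("mazovia.pl")       # ['mazovia.pl']
print(query[0] in tokenize(page))    # False -> the phrase query has nothing to match
```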

Thinking out loud about a solution: I wonder if it's safe to trick the language analyzer by pretending that ".", ":" and "/" are " ". That would cause splits where we want them, I think. I'm not sure that's right for all text in all languages, but maybe?
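
One way that idea could be expressed in Elasticsearch index settings (purely a sketch with invented names, not the actual CirrusSearch configuration) is a char_filter that rewrites ".", ":" and "/" to spaces before the language analyzer tokenizes the text:

```
# Hypothetical index settings; "url_breaker" and "text_url_breaking" are
# made-up names used only for illustration.
url_breaking_settings = {
    "analysis": {
        "char_filter": {
            "url_breaker": {
                "type": "pattern_replace",
                "pattern": "[./:]",
                "replacement": " ",
            }
        },
        "analyzer": {
            "text_url_breaking": {
                "type": "custom",
                "char_filter": ["url_breaker"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}
# With such an analyzer, "http://www.mazovia.pl/sejmik" would become the tokens
# ["http", "www", "mazovia", "pl", "sejmik"], and "AC/DC" would become ["ac", "dc"].
```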

Oh, so URLs are one "segment", and this doesn't find "substrings"? That makes sense.

Splitting on these characters sounds reasonable to me. There are some cases like "AC/DC", but that shouldn't cause any problems, right?

(In reply to Bartosz Dziewoński from comment #2)

> Oh, so URLs are one "segment", and this doesn't find "substrings"? That makes sense.
>
> Splitting on these characters sounds reasonable to me. There are some cases like "AC/DC", but that shouldn't cause any problems, right?

You've got it. The way search works is that all the words are segmented (tokenized), then normalized, and then indexed for quick lookup. The trick is that each language is subtly different, and I only speak English, so I can only validate that the choices make sense there. That makes it hard to propose changes that cross many languages.
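
As a toy illustration of that pipeline (segment, normalize, index), again in simplified Python rather than CirrusSearch code:

```
from collections import defaultdict

def analyze(text):
    tokens = text.split()                           # segment (real analyzers are language-aware)
    return [t.lower().strip(".,") for t in tokens]  # normalize

index = defaultdict(set)   # term -> ids of documents containing it

docs = {
    1: "Radna sejmiku mazowieckiego.",
    2: "Zobacz stronę www.mazovia.pl tutaj.",
}
for doc_id, text in docs.items():
    for term in analyze(text):
        index[term].add(doc_id)

print(index["radna"])            # {1}
print(index.get("mazovia.pl"))   # None: only the whole token "www.mazovia.pl" was indexed
```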

Anyway, I'll see if I can make a tool that makes it easy to look at how words are segmented in your language, and I'll see if I can make it easy to experiment with this a bit.
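
For reference, Elasticsearch's per-index _analyze API already reports the tokens an analyzer produces for a piece of text; a hedged sketch of using it, with placeholder host, index and analyzer names that are not a real CirrusSearch endpoint:

```
import json
import urllib.request

# The host, index name and analyzer below are placeholders for illustration only.
body = json.dumps({"analyzer": "polish",
                   "text": "http://www.mazovia.pl/sejmik"}).encode()
req = urllib.request.Request(
    "http://localhost:9200/plwiki_content/_analyze",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for token in json.load(resp)["tokens"]:
        print(token["token"])   # one line per token the analyzer produced
```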

This was probably fixed by my change to the analyzer to treat "." like a space. It wasn't fixed magically; it came along as the solution to another bug.