
Cirrus unable to find insource:"mazovia.pl" on pl.wp where the phrase occurs in a URL
Closed, Resolved (Public)

Description

Cirrus is unable to find insource:"mazovia.pl" on pl.wp where the phrase occurs in a URL. insource:/mazovia\.pl/, however, does find it, but takes ages to run. I can't tell whether this is a bug or expected behavior for some reason.

For example, https://pl.wikipedia.org/w/index.php?title=Specjalna%3ASzukaj&profile=default&search=insource%3A%22mazovia.pl%22&fulltext=Search should find https://pl.wikipedia.org/wiki/Elżbieta_Lanc, but doesn't.


Version: master
Severity: normal

Details

Reference
bz70873

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:45 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz70873.
bzimport added a subscriber: Unknown Object (MLST).

It's not right, but it's not unexpected. insource:"" segments words in the same way that we segment regular text. I can't think of a workaround for you at this point either: insource:// is going to be slow unless you have another filter like insource:"" alongside it, but insource:"" is what's failing for you here.
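
To make the segmentation behaviour concrete, here is a minimal sketch (plain Python, not the real CirrusSearch/Elasticsearch analyzer) of why the phrase query misses: a word-boundary tokenizer keeps "www.mazovia.pl" together as one token, so the single query token "mazovia.pl" never appears in the page's token stream.

```
import re

def tokenize(text):
    # Rough stand-in for a word-boundary tokenizer: dots between letters
    # do not split a token, while "/", ":" and whitespace do.
    return [t.lower() for t in re.findall(r"[\w.]+", text)]

page = "Zobacz http://www.mazovia.pl/sejmik oraz inne strony"
print(tokenize(page))
# ['zobacz', 'http', 'www.mazovia.pl', 'sejmik', 'oraz', 'inne', 'strony']

query = tokenize("mazovia.pl")       # ['mazovia.pl']
print(query[0] in tokenize(page))    # False -> the phrase query has nothing to match
```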

Thinking out loud about a solution: I wonder if it's safe to trick the language analyzer by pretending that ".", ":" and "/" are " ". That would cause splits where we want them, I think. I'm not sure that's right for all text in all languages, but maybe?
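
One way that idea could be expressed in Elasticsearch index settings (purely a sketch with invented names, not the actual CirrusSearch configuration) is a char_filter that rewrites ".", ":" and "/" to spaces before the language analyzer tokenizes the text:

```
# Hypothetical index settings; "url_breaker" and "text_url_breaking" are
# made-up names used only for illustration.
url_breaking_settings = {
    "analysis": {
        "char_filter": {
            "url_breaker": {
                "type": "pattern_replace",
                "pattern": "[./:]",
                "replacement": " ",
            }
        },
        "analyzer": {
            "text_url_breaking": {
                "type": "custom",
                "char_filter": ["url_breaker"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}
# With such an analyzer, "http://www.mazovia.pl/sejmik" would become the tokens
# ["http", "www", "mazovia", "pl", "sejmik"], and "AC/DC" would become ["ac", "dc"].
```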

Oh, so URLs are one "segment", and this doesn't find "substrings"? That makes sense.

Splitting on these characters sounds reasonable to me. There are some cases like "AC/DC", but that shouldn't cause any problems, right?

(In reply to Bartosz Dziewoński from comment #2)

> Oh, so URLs are one "segment", and this doesn't find "substrings"? That makes sense.
>
> Splitting on these characters sounds reasonable to me. There are some cases like "AC/DC", but that shouldn't cause any problems, right?

You've got it. The way search works is that all the words are segmented (tokenized), then normalized, and then indexed for quick lookup. The trick is that each language is subtly different, and I only speak English, so I can only validate that the choices make sense there. That makes it hard to propose changes that cross many languages.
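
As a toy illustration of that pipeline (segment, normalize, index), again in simplified Python rather than CirrusSearch code:

```
from collections import defaultdict

def analyze(text):
    tokens = text.split()                           # segment (real analyzers are language-aware)
    return [t.lower().strip(".,") for t in tokens]  # normalize

index = defaultdict(set)   # term -> ids of documents containing it

docs = {
    1: "Radna sejmiku mazowieckiego.",
    2: "Zobacz stronę www.mazovia.pl tutaj.",
}
for doc_id, text in docs.items():
    for term in analyze(text):
        index[term].add(doc_id)

print(index["radna"])            # {1}
print(index.get("mazovia.pl"))   # None: only the whole token "www.mazovia.pl" was indexed
```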

Anyway, I'll see if I can make a tool that makes it easy to look at how words are segmented in your language, and I'll see if I can make it easy to experiment with this a bit.
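
For reference, Elasticsearch's per-index _analyze API already reports the tokens an analyzer produces for a piece of text; a hedged sketch of using it, with placeholder host, index and analyzer names that are not a real CirrusSearch endpoint:

```
import json
import urllib.request

# The host, index name and analyzer below are placeholders for illustration only.
body = json.dumps({"analyzer": "polish",
                   "text": "http://www.mazovia.pl/sejmik"}).encode()
req = urllib.request.Request(
    "http://localhost:9200/plwiki_content/_analyze",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for token in json.load(resp)["tokens"]:
        print(token["token"])   # one line per token the analyzer produced
```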

This was probably fixed by my change to the analyzer to treat "." like a space. It wasn't fixed magically; it came along as the solution to another bug.