CirrusSearch word segmentation not useful for JS and CSS pages
Closed, Resolved · Public

Details

Reference
bz63861

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 3:19 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz63861.

This is caused by word segmentation rather than by a problem getting the text into the index. You can find them if you search like so:
https://test.wikipedia.org/w/index.php?title=Special%3ASearch&profile=all&search=tem.getInitial&fulltext=Search

I'll solve this by improving the setting that Cirrus has called "aggressive splittings" and rolling it out to all documents. It might cause some unintended results to show up in other places but they should be sorted below the more exact matches.

This search only returns one of the two results. I can get both if I search for "tem.getInitial OR this.getInitial":
https://test.wikipedia.org/w/index.php?title=Special%3ASearch&profile=all&search=tem.getInitial+OR+this.getInitial&fulltext=Search
But that only works if I know the only prefixes will be "tem." and "this.".

I don't know if this will affect the "aggressive splittings" you mention above.

Helder: yeah, that's the splitting. Without it, "tem.getInitial" and "this.getInitial" are separate terms. With it, the terms are "tem", "getInitial", and "this". I'll push a more precise test case now.
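
To make the effect of the splitting concrete, here is a minimal sketch in Python. It is a toy model, not CirrusSearch's or Elasticsearch's actual analysis chain: the two analyzers, the tiny inverted index, and the sample documents "a.js" and "b.js" are all hypothetical and exist only to illustrate the term lists described above.

```python
import re

def coarse_terms(text):
    """Toy analyzer WITHOUT the splitting: periods stay inside a term,
    so 'tem.getInitial' is indexed as one term. (Assumed behavior for
    illustration only.)"""
    return [t for t in re.split(r"[^0-9A-Za-z_.]+", text) if t]

def split_terms(text):
    """Toy analyzer WITH the splitting: every non-word character,
    including '.', separates terms."""
    return [t for t in re.split(r"[^0-9A-Za-z_]+", text) if t]

def build_index(docs, analyzer):
    """Build a tiny inverted index: lowercased term -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyzer(text):
            index.setdefault(term.lower(), set()).add(doc_id)
    return index

def search(index, query, analyzer):
    """Return ids of documents that contain every term of the analyzed query."""
    terms = [t.lower() for t in analyzer(query)]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical page contents, standing in for the two JS pages.
docs = {
    "a.js": "var x = tem.getInitial();",
    "b.js": "return this.getInitial();",
}

coarse_index = build_index(docs, coarse_terms)
split_index = build_index(docs, split_terms)

print(sorted(search(coarse_index, "getInitial", coarse_terms)))  # []
print(sorted(search(split_index, "getInitial", split_terms)))    # ['a.js', 'b.js']
```

In this toy model a search for "getInitial" finds nothing without the splitting, because only the whole terms "tem.getInitial" and "this.getInitial" are indexed. With the splitting it finds both pages, while "tem.getInitial" still narrows the result to the one page that contains both "tem" and "getInitial".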

Change 125764 had a related patch set uploaded by Manybubbles:
Better test case for word splitting in js

https://gerrit.wikimedia.org/r/125764

Technically this was resolved in https://gerrit.wikimedia.org/r/#/c/125731/ but that extra commit adds a better test case.

It'll require a reindex after deployment, and it only affects English. I'm working on other languages, but that is more complicated, unfortunately.

Change 125764 merged by jenkins-bot:
Better test case for word splitting in js

https://gerrit.wikimedia.org/r/125764

Restricted Application added a subscriber: Aklapper.
He7d3r set Security to None.