
Regression: using unicode normalization analyzer misses results in search
Closed, Resolved (Public)

Description

Hello, since the Unicode normalization analyzer was installed for Hebrew, some expected search results are missing.

How to reproduce:

Compare the results from this search on Wikidata:

https://www.wikidata.org/w/index.php?search=%D7%94%D7%95%D7%9C%D7%99%D7%93%D7%99%D7%99&title=Special%3ASearch&go=%D7%9C%D7%93%D7%A3

to the same search on Hebrew Wikipedia:

https://he.wikipedia.org/w/index.php?search=%D7%94%D7%95%D7%9C%D7%99%D7%93%D7%99%D7%99&title=%D7%9E%D7%99%D7%95%D7%97%D7%93%3A%D7%97%D7%99%D7%A4%D7%95%D7%A9&go=%D7%9C%D7%A2%D7%A8%D7%9A

One would expect the five results shown in the Wikidata search to show up on Hebrew Wikipedia as well, but the first and last Wikidata results don't appear in the Hebrew Wikipedia search results.

Best


Version: unspecified
Severity: normal

Details

Reference
bz66243

Event Timeline

bzimport raised the priority of this task to High. (Nov 22 2014, 3:20 AM)
bzimport added a project: CirrusSearch.
bzimport set Reference to bz66243.

Results 1 and 5 on Wikidata look like results 1 and 2 on hewiki. I wonder if we lost those pages temporarily? That'd be bad.

This is a better comparison:
https://www.wikidata.org/w/index.php?search=%D7%94%D7%95%D7%9C%D7%99%D7%93%D7%99%D7%99&title=Special%3ASearch&go=%D7%9C%D7%93%D7%A3
to
https://he.wikipedia.org/w/index.php?search=%D7%94%D7%95%D7%9C%D7%99%D7%93%D7%99%D7%99&title=%D7%9E%D7%99%D7%95%D7%97%D7%93%3A%D7%97%D7%99%D7%A4%D7%95%D7%A9&go=%D7%9C%D7%A2%D7%A8%D7%9A&fulltext=1
The first result on Wikidata (https://www.wikidata.org/wiki/Q7003270) isn't in the hewiki results. On further digging, the page exists on hewiki (https://he.wikipedia.org/wiki/%D7%A7%D7%9C%D7%99%D7%A4%D7%95%D7%A8%D7%93_%D7%94%D7%95%D7%9C%D7%99%D7%93%D7%99%D7%99), but when I try to fetch it from the search index, it isn't there:
manybubbles@elastic1003:~$ curl localhost:9200/hewiki_content/page/495403
{"_index":"hewiki_content_1401724632","_type":"page","_id":"495403","found":false}
So what is the deal?
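
The check above (fetch a page by ID and look at the "found" field) can be repeated over several page IDs. The following is only a minimal sketch, assuming the response has the JSON shape shown in the curl output; `is_indexed`, `missing_page_ids`, and `fake_fetch` are hypothetical names, and `fake_fetch` is a stand-in for a real HTTP request to the index.

```python
import json
from typing import Callable, Iterable, List


def is_indexed(response_body: str) -> bool:
    """Return True if a GET-document response reports the page as found."""
    doc = json.loads(response_body)
    return bool(doc.get("found", False))


def missing_page_ids(page_ids: Iterable[int],
                     fetch: Callable[[int], str]) -> List[int]:
    """Return the page IDs whose lookup response says found == false."""
    return [pid for pid in page_ids if not is_indexed(fetch(pid))]


def fake_fetch(page_id: int) -> str:
    # Hypothetical stand-in for the curl call; the data is made up
    # except that 495403 is reported missing, as in the output above.
    return json.dumps({
        "_index": "hewiki_content_1401724632",
        "_type": "page",
        "_id": str(page_id),
        "found": page_id != 495403,
    })


missing = missing_page_ids([495402, 495403], fake_fetch)
print(missing)
```

In real use, `fetch` would be replaced by something that performs the same request as the curl command and returns the response body as a string.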

(In reply to Nik Everett from comment #3)

> So what is the deal?

That is rhetorical - I'm going to figure it out.

I added that page back into the index:
manybubbles@terbium:~$ mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki hewiki --fromId 495402 --toId 495403
Indexed 1 pages ending at 495403 at 6/second
Indexed a total of 1 pages at 6/second
manybubbles@terbium:~$

That's just remediation. Now to figure out why it wasn't in there in the first place.

Change 138835 had a related patch set uploaded by Manybubbles:
Add a maintenance script to make the index sane

https://gerrit.wikimedia.org/r/138835

I've written a tool to scan the index and look for insanity. I'm tempted to chalk some of the insanity in the Hebrew index up to the Hebrew analyzer, which was buggy and was in production for two weeks. The tool should heal whatever damage it did. Then we'll run it again a few days later and see if we get _more_ insanity. Anything that turns up then will have the benefit of being recent.
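
One way to picture that kind of scan is as a comparison of two sets of page IDs. This is not the code from the patch, only a minimal sketch under the assumption that "insanity" means the wiki and the index disagree about which pages exist; `find_index_problems` and the sample IDs are hypothetical.

```python
from typing import Dict, Iterable, List


def find_index_problems(db_page_ids: Iterable[int],
                        indexed_page_ids: Iterable[int]) -> Dict[str, List[int]]:
    """Compare page IDs known to the wiki with page IDs present in the index."""
    db = set(db_page_ids)
    indexed = set(indexed_page_ids)
    return {
        # Pages that exist on the wiki but cannot be found by search.
        "missing_from_index": sorted(db - indexed),
        # Index entries that no longer correspond to a page on the wiki.
        "orphaned_in_index": sorted(indexed - db),
    }


# Made-up IDs for illustration only.
report = find_index_problems([10, 11, 12], [10, 12, 99])
print(report)
```

Pages listed under "missing_from_index" would be candidates for re-indexing, and entries under "orphaned_in_index" would be candidates for removal from the index.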

Change 138835 merged by jenkins-bot:
Add a maintenance script to make the index sane

https://gerrit.wikimedia.org/r/138835

Saneitizer seems to have done the trick here. I'm going to claim it was the broken analyzer. If we lose more pages I'll revise that claim.