
Stemming: CirrusSearch does not find Биология for a Биологии query on ru.wp (but LuceneSearch does)
Closed, Resolved (Public)

Description

Author: gryllida

Description:
Examples at https://www.mediawiki.org/wiki/Thread:Help_talk:CirrusSearch/%22Better%22_support_of_searching_in_other_languages


Version: unspecified
Severity: normal

Details

Reference
bz69766

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 3:44 AM)
bzimport added projects: CirrusSearch, I18n.
bzimport set Reference to bz69766.

Paraphrasing the link:
Searched for биологии and биология wasn't in the search results but it should have been.

Other information:
Searching for биология returns both биологии and биология as it should.

gryllida wrote:

Can someone please give this a higher priority?
A gadget (one in MediaWiki:*, not a userscript) is breaking as LuceneSearch is being removed from wikis.

Removed link to outdated plan.

Which exact gadget (link please)? How is it "broken"?

gryllida wrote:

This one: https://ru.wikipedia.org/wiki/MediaWiki:Gadget-wikilinker.js
Its documentation: https://ru.wikipedia.org/wiki/Википедия:Гаджеты/Викиссыльщик

Description of the gadget issues:

TL;DR: the gadget versions actually look up "биолог*", because they do client-side stemming in JS to format wikilinks correctly. CirrusSearch gives odd results: for some reason it misses [[Биология]] on Russian Wikinews, while it finds it on Russian Wikipedia.

This should probably go to a separate bug, or it probably should not - I have not yet analysed this behaviour enough to understand whether it has anything to do with the original issue described in this bug.


The gadget has 3 versions.

  1. The old version (old version in diff [1]) uses the default search engine.

loadXMLDoc(wgServer + wgScriptPath + '/api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&srsearch=' + preparedText);

  2. The "new" version (new version in diff [1]) is explicitly set to use LuceneSearch. This was done in edit [1], with a comment that CirrusSearch gives unreliable results. This version works at Russian Wikipedia, but stopped working at Russian Wikinews around the end of July. It gives an 'HTTP timeout' message.

loadXMLDoc(wgServer + wgScriptPath + '/api.php?action=query&list=search&srbackend=LuceneSearch&srlimit=5&srprop=&srredirects=1&format=json&srsearch=' + preparedText);

  3. A local Russian Wikinews [2] (and maybe other projects) version is exactly the same as version (1).

var xmlDocUrl = '//ru.wikipedia.org/w/api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&origin=' + document.location.protocol + '//' + document.location.hostname + '&srsearch=' + preparedText;

[1] https://ru.wikipedia.org/w/index.php?title=MediaWiki:Gadget-wikilinker.js&diff=60626342&oldid=56265657
[2] https://ru.wikinews.org/wiki/MediaWiki:Gadget-wikilinker.js


I looked at the raw network log in a browser and realised that the scripts all use client-side stemming and look up "биолог*". Sorry, I probably misidentified the issue that broke the gadget. Some more analysis follows below; I would file a new bug, but I have not yet identified the exact issue behind the problem or whether it is a different one.


This means we have these 3 sorts of queries:

  1. api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&srsearch=биолог*
  2. api.php?action=query&list=search&srbackend=LuceneSearch&srlimit=5&srprop=&srredirects=1&format=json&srsearch=биолог*
  3. api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&srsearch=биолог*

OK, 1 and 3 are the same. Forget 3.


This means we have these 2 sorts of queries:

  1. api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&srsearch=биолог*
  2. api.php?action=query&list=search&srbackend=LuceneSearch&srlimit=5&srprop=&srredirects=1&format=json&srsearch=биолог*
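The two query shapes differ only in the srbackend parameter. A minimal sketch of building both, assuming a hypothetical helper named buildSearchUrl (it is not part of the gadget) and using URLSearchParams so that the Cyrillic term is percent-encoded:

```javascript
// Hypothetical helper, not taken from the gadget: builds a list=search API URL.
// The parameter names are the ones used in the queries quoted in this thread.
function buildSearchUrl(apiBase, term, backend) {
  const params = new URLSearchParams({
    action: 'query',
    list: 'search',
    srlimit: '5',
    srprop: '',
    srredirects: '1',
    format: 'json',
    srsearch: term,
  });
  // Only add srbackend when a backend is explicitly requested (query shape 2).
  if (backend) {
    params.set('srbackend', backend);
  }
  return apiBase + '?' + params.toString();
}

const apiBase = 'https://ru.wikinews.org/w/api.php';
const defaultUrl = buildSearchUrl(apiBase, 'биолог*');
const luceneUrl = buildSearchUrl(apiBase, 'биолог*', 'LuceneSearch');
```

Unlike plain string concatenation with preparedText, this encodes the search term, which avoids depending on the browser to escape non-ASCII characters.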

Results:

Russian Wikipedia:

  1. Биология
  2. Биология

Russian Wikinews:

  1. Интервью с исследователем органов чувств Домиником Кларком о шмелях и электрических полях цветков
  2. HTTP timeout

Note that "Биология" also exists at Russian Wikinews (although it is a redirect).

Now Russian Wikinews can no longer use Wikilinker to link to local articles.

Don't worry about filing a new bug. I've got it from here.

This looks to have been caused by us not using Unicode-aware regexes when detecting the * syntax. We have a feature that was supposed to run those prefix queries against the unstemmed copy of the text, but it wasn't kicking in for Cyrillic because PHP hates me.
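The actual patch is in PHP and is not shown in this thread. As an illustration of the same class of bug, here is a JavaScript analogue under that assumption: an ASCII-only pattern fails to recognise a Cyrillic prefix query, while a Unicode-aware one succeeds. The pattern and function names are hypothetical.

```javascript
// ASCII-only check: in JavaScript \w means [A-Za-z0-9_], so Cyrillic never matches.
const asciiPrefixPattern = /^\w+\*$/;

// Unicode-aware check: the u flag enables \p{L} (letters) and \p{N} (numbers).
const unicodePrefixPattern = /^[\p{L}\p{N}_]+\*$/u;

// Returns true when the term looks like a single-word prefix query such as "биолог*".
function isPrefixQuery(term, pattern) {
  return pattern.test(term);
}

const latinDetected = isPrefixQuery('biolog*', asciiPrefixPattern);
const cyrillicMissed = !isPrefixQuery('биолог*', asciiPrefixPattern);
const cyrillicDetected = isPrefixQuery('биолог*', unicodePrefixPattern);
```

With the ASCII-only pattern the Cyrillic term is not treated as a prefix query at all, which matches the symptom described above: the special handling simply never kicks in.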

Merged. It'll go to test wikis and mediawiki.org today, non-wikipedias on Tuesday, and wikipedias on Thursday.

gryllida wrote:

Presumably this is in production now. Sorry, I don't see this working as expected.

http://ru.wikinews.org/w/api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&srsearch=%D0%B1%D0%B8%D0%BE%D0%BB%D0%BE%D0%B3*&format=xml does not return 'Биология' or 'Category:Биология', but it should (especially the former). Instead, it returns "Интервью с исследователем органов чувств Домиником Кларком о шмелях и электрических полях цветков" and other long article names.

Now you've hit something else! Cirrus will only find results from cross namespace redirects if the target of the redirect is included in the search. This finds the category:
http://ru.wikinews.org/w/api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&srsearch=%D0%B1%D0%B8%D0%BE%D0%BB%D0%BE%D0%B3*&format=xml&srnamespace=0,14
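Once the category namespace (14) is included alongside the main namespace (0), the caller has to tell the two kinds of hit apart. A minimal sketch, assuming the usual list=search JSON shape (query.search entries with ns and title fields); the sample object is hand-written for illustration and is not real API output:

```javascript
// Hand-written sample in the shape list=search returns; NOT real API output.
// 'Пример статьи' is a placeholder title meaning "Example article".
const sampleResponse = {
  query: {
    search: [
      { ns: 14, title: 'Категория:Биология' },
      { ns: 0, title: 'Пример статьи' },
    ],
  },
};

// Hypothetical helper: returns the titles of all hits in the given namespace.
function titlesInNamespace(response, ns) {
  const hits = (response.query && response.query.search) || [];
  return hits.filter((hit) => hit.ns === ns).map((hit) => hit.title);
}

const categoryTitles = titlesInNamespace(sampleResponse, 14);
const articleTitles = titlesInNamespace(sampleResponse, 0);
```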

While it's possible for me to fix this, it's a pretty difficult change and hasn't caused _too_ many problems. If possible, can you get the tool to work around this?

gryllida wrote:

Cirrus will only find results from cross namespace redirects if the target of the redirect is included in the search.

Doesn't make sense to me:

  1. For some reason, in your URL I don't see "Биология" in the results, although the page it redirects to ("Category:Биология" = "Категория:Биология") is included. This behavior appears to be inconsistent with your comment.
  2. A search tool should be able to find main namespace pages even if they are redirects. Their title /does/ match, after all. Following the redirect probably makes sense for most end users, but not for those who have no interest in categories. So please consider making "do not follow redirects" an option.

If possible, can you get the tool to work around this?

This wikilinker gadget needs to produce [[биология|биологии]], [[биология|биологией]], [[биология|биология]] links reliably. [[Category:биология|биологии]] sort of links are against project policies. So I guess no, I can't work around this, unless I missed some pretty things, or unless I'm willing to do such an ugly thing as check for "Category:$1" pattern in the result and manually check whether $1 main namespace page exists. One would think that this has to be done server-side. (Lucene Search worked fine with it, btw.)
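For reference, the "ugly" client-side workaround described above could look roughly like this. It is only a sketch: the function names are hypothetical, and pageExists stands in for an extra API request that the gadget would have to make for each category hit.

```javascript
// Category namespace prefixes seen in this thread: canonical and Russian names.
const CATEGORY_PREFIXES = ['Category:', 'Категория:'];

// Returns the title without its category prefix, or null if it has none.
function stripCategoryPrefix(title) {
  for (const prefix of CATEGORY_PREFIXES) {
    if (title.startsWith(prefix)) {
      return title.slice(prefix.length);
    }
  }
  return null;
}

// pageExists is a hypothetical async callback (title -> boolean); a real gadget
// would need to implement it with an additional request to the API.
async function resolveLinkTarget(title, pageExists) {
  const bareTitle = stripCategoryPrefix(title);
  if (bareTitle === null) {
    return title; // not a category hit, usable as a link target as-is
  }
  // Only link to the main namespace page if it actually exists.
  return (await pageExists(bareTitle)) ? bareTitle : null;
}
```

The cost is one additional existence check per category result, which is why doing this server-side would be preferable.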

Sorry for not updating this earlier. The answer is no. Redirects won't appear in Cirrus's results. You're welcome to continue using lsearchd until it's turned off in a few months by adding &srbackend=LuceneSearch to the URL parameters, but Cirrus isn't going to show you the redirect page as a result. It'll always come back as [[Category:биология|биологии]] with Cirrus, and at some point we'll shut down lsearchd and you won't be able to select it any more.

As to searches in the main namespace finding main namespace redirects to the category namespace: it probably should, but it's not going to happen soon.

Sorry this isn't what you wanted to hear.

gryllida wrote:

Can I open a new bug about a redirects=yes param for CirrusSearch? Not having it as a default may be reasonable, of course, but as I said above, the workaround is ugly...

This wikilinker gadget needs to produce [[биология|биологии]],
[[биология|биологией]], [[биология|биология]] links reliably.
[[Category:биология|биологии]] sort of links are against project policies. So
I guess no, I can't work around this, unless I missed some pretty things, or
unless I'm willing to do such an ugly thing as check for "Category:$1"
pattern in the result and manually check whether $1 main namespace page
exists. One would think that this has to be done server-side.

Certainly! It's better to have a bug than not, even if all it does is reference the conversation in this bug.

gryllida wrote:

Ok, filed bug 71491. Thanks! :)