Author: gryllida
Description:
Examples at https://www.mediawiki.org/wiki/Thread:Help_talk:CirrusSearch/%22Better%22_support_of_searching_in_other_languages
Version: unspecified
Severity: normal
Paraphrasing the link:
Searching for биологии did not return биология in the results, but it should have.
Other information:
Searching for биология returns both биологии and биология as it should.
gryllida wrote:
Can someone please give this a higher priority?
A gadget (one in the MediaWiki: namespace, not a userscript) is breaking as Lucene search is being removed from wikis.
gryllida wrote:
This one: https://ru.wikipedia.org/wiki/MediaWiki:Gadget-wikilinker.js
Its documentation: https://ru.wikipedia.org/wiki/Википедия:Гаджеты/Викиссыльщик
Description of the gadget issues:
TL;DR: they actually look up "биолог*", as they do client-side stemming in JS to format wikilinks correctly. Cirrus search gives weird results: it misses [[Биология]] on RU.WN for some reason, while it doesn't miss it on RU.WP.
This should probably go to a separate bug, or it probably should not - I have not yet analysed this behaviour enough to understand whether it has anything to do with the original issue described in this bug.
The gadget has 3 versions.
loadXMLDoc(wgServer + wgScriptPath + '/api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&srsearch=' + preparedText);
loadXMLDoc(wgServer + wgScriptPath + '/api.php?action=query&list=search&srbackend=LuceneSearch&srlimit=5&srprop=&srredirects=1&format=json&srsearch=' + preparedText);
var xmlDocUrl = '//ru.wikipedia.org/w/api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&origin=' + document.location.protocol + '//' + document.location.hostname + '&srsearch=' + preparedText;
[1] https://ru.wikipedia.org/w/index.php?title=MediaWiki:Gadget-wikilinker.js&diff=60626342&oldid=56265657
[2] https://ru.wikinews.org/wiki/MediaWiki:Gadget-wikilinker.js
I looked at the raw net log in a browser and realised that the scripts all use client-side stemming and look up "биолог*". Sorry, I probably misidentified the issue that broke the gadget. Some more analysis follows below; I would file a new bug, but I have not yet identified the exact issue behind the problem or whether it is different.
This means we have these 3 sorts of queries:
OK, 1 and 3 are the same. Forget 3.
This means we have these 2 sorts of queries:
Results:
Russian Wikipedia:
Russian Wikinews:
Note that "Биология" also exists at Russian Wikinews (although it is a redirect).
Now Russian Wikinews can no longer use Wikilinker to link to local articles.
This looks to have been caused by us not using Unicode-aware regexes when detecting the * syntax. We have a feature that is supposed to run those prefix queries against the unstemmed copy of the text, but it wasn't kicking in for Cyrillic because PHP hates me.
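To illustrate the class of bug described above, here is a minimal JavaScript sketch. It is not the CirrusSearch patch (that code is PHP and is not shown in this thread); the pattern and function names are hypothetical. It only shows how an ASCII-only word class fails to recognise a Cyrillic prefix query, while a Unicode-aware one matches it.

```javascript
// Hypothetical detectors for the "prefix*" syntax; neither is CirrusSearch code.
// In JavaScript, \w only covers [A-Za-z0-9_], so Cyrillic letters never match it.
const asciiPrefixQuery = /^\w+\*$/;
// With the `u` flag, \p{L} and \p{N} match letters and digits in any script.
const unicodePrefixQuery = /^[\p{L}\p{N}_]+\*$/u;

// Returns true when `term` looks like a single-word prefix query.
function isPrefixQuery(term, pattern) {
  return pattern.test(term);
}

console.log(isPrefixQuery('biolog*', asciiPrefixQuery));   // true
console.log(isPrefixQuery('биолог*', asciiPrefixQuery));   // false
console.log(isPrefixQuery('биолог*', unicodePrefixQuery)); // true
```

When the detector misses the trailing *, the query would presumably be handled as an ordinary stemmed search rather than as a prefix query against the unstemmed text, which is consistent with the results reported above.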
Merged. It'll go to test wikis and mediawiki.org today, non-wikipedias on Tuesday, and wikipedias on Thursday.
gryllida wrote:
Presumably this is in production now. Sorry, I don't see this working as expected.
http://ru.wikinews.org/w/api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&srsearch=%D0%B1%D0%B8%D0%BE%D0%BB%D0%BE%D0%B3*&format=xml does not return 'Биология' or 'Category:Биология', but it should (especially the former). Instead, it returns "Интервью с исследователем органов чувств Домиником Кларком о шмелях и электрических полях цветков" and other long article names.
Now you've hit something else! Cirrus will only find results from cross namespace redirects if the target of the redirect is included in the search. This finds the category:
http://ru.wikinews.org/w/api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&srsearch=%D0%B1%D0%B8%D0%BE%D0%BB%D0%BE%D0%B3*&format=xml&srnamespace=0,14
While it's possible for me to fix this, it's a pretty difficult change and hasn't caused _too_ many problems. If possible, can you get the tool to work around this?
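For reference, a rough sketch of how a gadget might build the request with the extra namespace. The helper name buildSearchUrl is hypothetical and is not taken from the wikilinker gadget; the parameter values mirror the URL quoted above, including the comma-separated srnamespace value.

```javascript
// Hypothetical helper; parameter names and values are copied from the URL above.
function buildSearchUrl(apiBase, searchTerm, namespaces) {
  const url = new URL(apiBase);
  url.searchParams.set('action', 'query');
  url.searchParams.set('list', 'search');
  url.searchParams.set('srlimit', '5');
  url.searchParams.set('srprop', '');
  url.searchParams.set('srredirects', '1');
  url.searchParams.set('format', 'json');
  // Namespace 0 is the main namespace, 14 is the category namespace.
  url.searchParams.set('srnamespace', namespaces.join(','));
  // searchParams percent-encodes the Cyrillic term for us.
  url.searchParams.set('srsearch', searchTerm);
  return url.toString();
}

const searchUrl = buildSearchUrl('https://ru.wikinews.org/w/api.php', 'биолог*', [0, 14]);
console.log(searchUrl);
```

Note that the results would then contain category titles, which the caller still has to deal with.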
gryllida wrote:
Cirrus will only find results from cross namespace redirects if the target of the redirect is included in the search.
Doesn't make sense to me:
If possible, can you get the tool to work around this?
This wikilinker gadget needs to produce [[биология|биологии]], [[биология|биологией]], [[биология|биология]] links reliably. [[Category:биология|биологии]] sort of links are against project policies. So I guess no, I can't work around this, unless I missed some pretty things, or unless I'm willing to do such an ugly thing as check for "Category:$1" pattern in the result and manually check whether $1 main namespace page exists. One would think that this has to be done server-side. (Lucene Search worked fine with it, btw.)
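The "ugly" client-side workaround described above could look roughly like the sketch below. Everything in it is an assumption for illustration: the function name resolveLinkTarget, the prefix list, and the mainPageExists callback are hypothetical, and a real gadget would need to back that callback with an actual page-existence lookup.

```javascript
// Assumed category prefixes (English and Russian); adjust per wiki.
const CATEGORY_PREFIXES = ['Category:', 'Категория:'];

// Hypothetical helper: maps a search hit to a main-namespace link target.
// `mainPageExists` is an assumed callback (title => boolean).
function resolveLinkTarget(hitTitle, mainPageExists, prefixes = CATEGORY_PREFIXES) {
  const prefix = prefixes.find((p) => hitTitle.startsWith(p));
  if (prefix === undefined) {
    return hitTitle; // no category prefix, use the title as-is
  }
  const bareTitle = hitTitle.slice(prefix.length);
  // Only link to the bare title if such a main-namespace page exists.
  return mainPageExists(bareTitle) ? bareTitle : null;
}

// Stand-in for a real existence check, for demonstration only.
const existingPages = new Set(['Биология']);
const pageExists = (title) => existingPages.has(title);

console.log(resolveLinkTarget('Категория:Биология', pageExists)); // Биология
console.log(resolveLinkTarget('Категория:Химия', pageExists));    // null
```

This costs an extra existence check per category hit, which is why doing it server-side would be preferable.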
Sorry for not updating this earlier. The answer is no: redirects won't appear in Cirrus's results. You're welcome to continue using lsearchd until it's turned off in a few months by adding &srbackend=LuceneSearch to the URL parameters, but Cirrus isn't going to show you the redirect page as a result. It'll always come back as [[Category:биология|биологии]] with Cirrus, and at some point we'll shut down lsearchd and you won't be able to select it any more.
As to searches in the main namespace finding main namespace redirects to the category namespace: it probably should, but it's not going to happen soon.
Sorry this isn't what you wanted to hear.
gryllida wrote:
Can I open a new bug about a redirects=yes param for Cirrus Search? Not having it as a default may be reasonable, of course, but as I said above, the workaround is ugly...
Certainly! It's better to have a bug than not, even if all it does is reference the conversation in this bug.