
Redirects should appear in Cirrus's results
Closed, Declined, Public

Description

Author: gryllida

Description:
[ It is advised that this bug is treated with higher priority, as deprecation of Lucene search has rendered a Wikilinker Gadget unusable in production at some Wikimedia projects. ]

Background

  1. A wikilinker gadget needs to produce [[биология|биологии]], [[биология|биологией]], [[биология|биология]] links reliably. [[Category:биология|биологии]] sort of links are against project policies.
  2. On some projects, such as Russian Wikinews, [[Биология]] is a redirect to [[Category:Биология]], because "Биология" is not a valid news headline and will never be.
  3. Lucene search worked fine because it also returned redirects. But it is deprecated and no longer running in production.

Problem description

  1. http://ru.wikinews.org/w/api.php?action=query&list=search&srlimit=5&srprop=&srredirects=1&format=json&srsearch=%D0%B1%D0%B8%D0%BE%D0%BB%D0%BE%D0%B3*&format=xml does not return 'Биология' or 'Category:Биология', but it should (especially the former). (Instead, it returns "Интервью с исследователем органов чувств Домиником Кларком о шмелях и электрических полях цветков" and other long article names.)
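For reference, the request above can be assembled programmatically. This is only a sketch: `buildSearchUrl` is a hypothetical helper name, and the parameter set is copied from the URL in the report (with a single `format` value instead of the two in the original link).

```javascript
// Hypothetical helper: builds the list=search prefix query from the report.
// The endpoint and parameters mirror the URL quoted above.
function buildSearchUrl(prefix, limit) {
    var params = new URLSearchParams({
        action: 'query',
        list: 'search',
        srsearch: prefix + '*',   // trailing wildcard, as in the report
        srlimit: String(limit),
        srprop: '',               // no extra properties requested
        srredirects: '1',
        format: 'json'
    });
    return 'http://ru.wikinews.org/w/api.php?' + params.toString();
}

var exampleUrl = buildSearchUrl('биолог', 5);
```

`URLSearchParams` takes care of percent-encoding the Cyrillic search term, so the gadget does not have to encode it by hand.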

If possible, can you get the tool to work around this?

This wikilinker gadget needs to produce [[биология|биологии]],
[[биология|биологией]], [[биология|биология]] links reliably.
[[Category:биология|биологии]] sort of links are against project policies. So
I guess no, I can't work around this, unless I missed some pretty things, or
unless I'm willing to do such an ugly thing as check for "Category:$1"
pattern in the result and manually check whether $1 main namespace page
exists. One would think that this has to be done server-side.
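The client-side workaround described above (spot a "Category:$1" title, then check whether $1 exists in the main namespace) might look roughly like this. It is a sketch under assumptions: `preferMainNamespaceTitle` is a hypothetical name, the list of category prefixes is supplied by the caller, and the set of existing main-namespace titles is assumed to have been filled by a separate existence check that is not shown.

```javascript
// Hypothetical helper: if `title` looks like "<category prefix>:X" and X is
// known to exist in the main namespace, return X; otherwise return `title`.
// `mainNamespaceTitles` is a Set assumed to be filled by a separate lookup.
function preferMainNamespaceTitle(title, categoryPrefixes, mainNamespaceTitles) {
    for (var i = 0; i < categoryPrefixes.length; i++) {
        var prefix = categoryPrefixes[i] + ':';
        if (title.indexOf(prefix) === 0) {
            var bare = title.slice(prefix.length);
            if (mainNamespaceTitles.has(bare)) {
                return bare;
            }
        }
    }
    return title;
}
```

This costs an extra lookup for every category hit, which is why the comment above argues that it belongs on the server.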

(From discussion at bug 69766.)

As a result, Russian Wikinews can no longer use Wikilinker to link to local redirect pages.

Proposed change

Please add an option to show redirects in Cirrus Search results, even if this option is off by default.


Version: unspecified
Severity: normal

Details

Reference
bz71491

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:52 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz71491.
bzimport added a subscriber: Unknown Object (MLST).

I can't help but think we've got something backwards here. Wasn't the point of gerrit 118592 to always include them?

gryllida wrote:

Please see problem description, (1).

How can I get "Биология" (and NOT 'Категория:Биология') to appear in these results?

What Cirrus does now is always search for pages by their redirects, but it's always the target of the redirect that is returned and never the redirect itself. Cirrus thinks of redirects as attributes of the target of the redirect and ignores redirect pages themselves. Look at the redirects object in the JSON blob here: https://en.wikipedia.org/wiki/Barack_Obama?action=cirrusdump

The upshot is that when you search you can find the result via the redirect and it'll come back in the redirect field but it'll never come back as a title. Example:
https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&list=search&format=json&srsearch=O%27bama&srprop=snippet|titlesnippet|redirectsnippet|sectionsnippet&srlimit=10&srbackend=CirrusSearch
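To make the "comes back in the redirect field" point concrete, here is a sketch of how a client could read both the target and the matched redirect from each hit. It assumes the request also lists `redirecttitle` in `srprop` and that the field arrives as a plain string; the function name `matchedTitles` is hypothetical, and the exact response shape should be checked against the live API.

```javascript
// Sketch: for each search hit, report the returned page (always the redirect
// target under Cirrus) and the redirect that matched, if the API supplied one.
// Assumes hits may carry a string `redirecttitle` field.
function matchedTitles(searchResults) {
    return searchResults.map(function (hit) {
        return {
            target: hit.title,
            viaRedirect: typeof hit.redirecttitle === 'string' ? hit.redirecttitle : null
        };
    });
}
```

The redirect therefore never replaces `title`; a client that wants it has to look in the extra field.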

I'm honestly not sure what lsearchd does for this. It's similar to Cirrus so far as I can tell:
https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&list=search&format=json&srsearch=O%27bama&srprop=snippet|titlesnippet|redirectsnippet|sectionsnippet&srlimit=10&srbackend=LuceneSearch

but it doesn't seem to always work similarly to Cirrus. If it produced the right result for Wikilinker then it must be different somehow. It's a lot of code to read, and I've read a lot of it, but I don't recall reading this part.

In any case, for now I think the simplest solution for Wikilinker is to set srbackend=LuceneSearch to keep the old behavior. That'll certainly buy us a few months of continued working, and it's reasonably simple.

gryllida wrote:

Yeah, I think it's complicated and I'm sure we'd figure it out. :-)


srbackend=LuceneSearch gives HTTP timeout on Russian Wikinews:

http://ru.wikinews.org/w/api.php?action=query&list=search&srbackend=LuceneSearch&srlimit=5&srprop=&srredirects=1&format=xml&srsearch=биолог*
Истекло время ожидания HTTP-запроса. = HTTP request timed out.

Should I gather community consensus on re-enabling it or can you just do it? I can again file a new bug if necessary.

gryllida wrote:

I repeat: how can I re-enable LuceneSearch on this project?

Enabling Lucene as the default wouldn't change anything that srbackend can't already do.

Two answers to two different questions:

  1. If LuceneSearch is timing out or failing in another way we'll need to fix it, at least for the next few months. The link seems to be working now but I have no doubt that it was failing before. It's pretty difficult to debug it without breaking it worse. That's why we're moving away from it. If it fails again please update the bug or ping us on IRC or something.
  2. This kind of issue isn't going to make us switch LuceneSearch back to the primary for ruwiki or ruwikinews. We're totally willing to switch back if Cirrus hurts more than it helps, but in this case you have a clear workaround and you were relying on a behavior that no one else noticed. And the behavior just doesn't seem helpful outside of your use case.

Truthfully, I've changed Cirrus's behavior to match Lucene's quirks in the past even though it had a pretty niche audience, and I'd do it again in this case too, but what you're asking for would require a huge architectural change to Cirrus, which just isn't worth it.

If/when we do have to make some huge change I'll keep this in mind so it's an option, but I honestly don't think I can do more than that at this point.

A (maybe bad) idea! What if we piped the list of redirects that match the query back through the API? You'd still get the non-redirect page back but it'd come with a redirect list.

gryllida wrote:

The link seems to be working now but I have no doubt that it was failing before. It's pretty difficult to debug it without breaking it worse.

Oops. Works now.

  1. This kind of issue isn't going to make us switch LuceneSearch back to the primary for ruwiki or ruwikinews.

I am not asking for primary. I would like to have it as an /option/. At the time it was timing out, there was no such option.

A (maybe bad) idea! What if we piped the list of redirects
that match the query back through the api. You'd still
get the non-redirect page back but it'd come with a redirect list.

The way the wikilinker gadget works now, with LuceneSearch, is that it takes the first 3 results and chooses the shortest one. Such a rewrite could hurt the relevance of the results.

Snippet from the gadget:

// If the query contained only one word, pick the shortest title from the first three results,
// so that "Англией" yields "Англия" rather than "Англиканство".
// Note: the loop below actually inspects up to five results (indices 0 to 4).
if ( requestTokens === 1 ) {
    var resar = [];

    // Keep only results whose first three characters match those of the query (case-insensitive).
    for ( var j = 0; j <= 4; j++ ) {
        if ( typeof resp.query.search[j] !== 'undefined' && txt.substr( 0, 3 ).toLowerCase() === resp.query.search[j].title.substr( 0, 3 ).toLowerCase() ) {
            resar.push( resp.query.search[j].title );
        }
    }

    // Shortest title first.
    resar.sort( compareStringLengths );

    if ( typeof resar[0] !== 'undefined' ) {
        pageName = resar[0];
    }
}
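A self-contained version of the same selection logic can help when trying it outside the gadget. This is a sketch, not the gadget's code: `pickShortestTitle` is a hypothetical wrapper, and `compareStringLengths` is assumed to sort by ascending length, since its definition is not shown in the thread.

```javascript
// Assumed implementation: the gadget's own compareStringLengths is not shown.
function compareStringLengths(a, b) {
    return a.length - b.length;
}

// Hypothetical wrapper around the gadget's logic: among the first
// `maxResults` hits, keep titles sharing the query's first three characters
// (case-insensitive) and return the shortest one, or null if none match.
function pickShortestTitle(txt, searchResults, maxResults) {
    var prefix = txt.slice(0, 3).toLowerCase();
    var candidates = [];
    for (var j = 0; j < maxResults && j < searchResults.length; j++) {
        var title = searchResults[j].title;
        if (title.slice(0, 3).toLowerCase() === prefix) {
            candidates.push(title);
        }
    }
    candidates.sort(compareStringLengths);
    return candidates.length > 0 ? candidates[0] : null;
}

// Example from the gadget's comment: the shorter "Англия" wins.
var picked = pickShortestTitle('Англией', [
    { title: 'Англиканство' },
    { title: 'Англия' }
], 5);
```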

This could, in theory, be rewritten to pull a list of results plus their redirects and pick the shortest one. But this would not guarantee the best result, as some of the redirect origins could be rather short but irrelevant. See:

lucene search:

  1. [[biology]] (redirect to [[category:biology]])
  2. [[category:biology]]
  3. ...

cirrus search:

  1. [[category:biology]] <- [[biology]] <- [[bio]] <- ...
  2. [[biologists discover world's smallest orchid]] <- ...

With cirrus search, the script for "biology" would return "[[bio|biology]]", but that doesn't mean such a wikilink would be accurate.
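The concern can be reproduced with a small sketch. It assumes a hypothetical response shape in which each hit carries a `redirects` array of redirect titles (no such field is confirmed in this thread), and the function name `pickShortestWithRedirects` is made up for illustration.

```javascript
// Sketch: pool each hit's title with its (hypothetical) redirect list, keep
// titles sharing the query's first three characters, and return the shortest.
function pickShortestWithRedirects(txt, hits) {
    var prefix = txt.slice(0, 3).toLowerCase();
    var pool = [];
    hits.forEach(function (hit) {
        pool.push(hit.title);
        (hit.redirects || []).forEach(function (redirect) {
            pool.push(redirect);
        });
    });
    var matching = pool.filter(function (title) {
        return title.slice(0, 3).toLowerCase() === prefix;
    });
    matching.sort(function (a, b) {
        return a.length - b.length;
    });
    return matching.length > 0 ? matching[0] : null;
}

// The cirrus-style results from the example above.
var cirrusHits = [
    { title: 'category:biology', redirects: ['biology', 'bio'] },
    { title: "biologists discover world's smallest orchid", redirects: [] }
];
var chosen = pickShortestWithRedirects('biology', cirrusHits);
```

Here `chosen` ends up as "bio" rather than "biology": the shortest-title heuristic has no way to tell which redirect is the relevant one.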

Restricted Application added a subscriber: Aklapper.

I have need for "limit search results to non-redirects" and was directed to this phab. The use-case is when trying to resolve page-naming controversies (or more mildly, how to name new pages) I want to look at precedent. And that means the actual pages alone. Conceptually, it would be an option I could add to intitle:, so an additional token in the search-box or a checkbox in the advanced-search mode or an include/exclude-redirects toggle in the results page.

Gehel subscribed.

Closing ticket as Cirrus supports redirects (but not separate results for redirects and pages). Since this has been opened and without activity since 2014, I'm closing as it is more than likely that the use case has changed since then.

I have need for "limit search results to non-redirects" and was directed to this phab. The use-case is when trying to resolve page-naming controversies (or more mildly, how to name new pages) I want to look at precedent. And that means the actual pages alone. Conceptually, it would be an option I could add to intitle:, so an additional token in the search-box or a checkbox in the advanced-search mode or an include/exclude-redirects toggle in the results page.

This seems like a valid feature request, but should really be opened in a new ticket.