CirrusSearch does not find all JavaScript and CSS pages when using insource and intitle syntax
Open, Lowest, Public

Description

Where did all the JavaScript pages go?

See Also:

Details

Reference
bz62733

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:56 AM
bzimport set Reference to bz62733.
bzimport added a subscriber: Unknown Object (MLST).

Wonder if something went wrong with gerrit 115214.

I believe this is caused by us not word-breaking foo.bar into foo and bar. The solution, as I see it, is to use the word_break token filter, _but_ to do that I have to rebuild each analyzer with that filter. That isn't easy: right now, when I want the German analyzer, I can ask for
{"analyzer":{"text":{"type":"german"}}}
but to rebuild it I have to do this:
{"analyzer":{"text":{
    "filter": [
        "standard",
        "lowercase",
        "german_stop",
        "german_normalization",
        "light_german_stemmer"
    ],
    "tokenizer": "standard",
    "type": "custom"
}},"filter":{
    "german_stop": {
        "stopwords": [
            "denn",
            ...
            "eures",
            "dies",
            "bist",
            "kein"
        ],
        "type": "stop"
    }
}}

Except even that doesn't work because german_normalization isn't properly exposed! The pull request I've opened upstream exposes all the stuff I'd need and it creates an endpoint on Elasticsearch designed to spit this back out for easy customization.
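For illustration, here is a minimal Python sketch of what the rebuilt settings might look like once a word-break step is added. It rests on several assumptions that are not from this task: an Elasticsearch version where german_normalization is available as a built-in token filter, the predefined "_german_" stop word list in place of the spelled-out list above, and a mapping char filter (named word_break_helper here) standing in for the word_break idea. The "standard" token filter is left out because newer Elasticsearch versions no longer ship it.

```python
import json

# Assumption: a mapping char filter that rewrites word-joining punctuation to
# spaces. The name and the exact mappings are illustrative, not taken from a
# deployed configuration. "\\u0020" is sent as the literal text \u0020, which
# the mapping filter reads as a space.
WORD_BREAK_HELPER = {
    "type": "mapping",
    "mappings": [". => \\u0020", "_ => \\u0020"],
}


def unpacked_german_settings():
    """Return analysis settings for a hand-built German 'text' analyzer."""
    return {
        "analysis": {
            "char_filter": {"word_break_helper": WORD_BREAK_HELPER},
            "filter": {
                # "_german_" asks Elasticsearch for its predefined list, so
                # the stop words do not have to be copied out by hand.
                "german_stop": {"type": "stop", "stopwords": "_german_"},
                "light_german_stemmer": {
                    "type": "stemmer",
                    "language": "light_german",
                },
            },
            "analyzer": {
                "text": {
                    "type": "custom",
                    "char_filter": ["word_break_helper"],
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "german_stop",
                        "german_normalization",
                        "light_german_stemmer",
                    ],
                }
            },
        }
    }


print(json.dumps(unpacked_german_settings(), indent=2))
```

The printed JSON would go into the index settings at creation time. Whether this chain tokenizes exactly like the built-in german analyzer would still need to be checked.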

Interesting. Wonder if we're running into bug 40612 in a different form then.

Restricted Application added a subscriber: Aklapper.
Deskana renamed this task from CirrusSearch: Where did all the JS pages go? to CirrusSearch does not find all JS pages. Dec 5 2015, 7:10 AM
Deskana renamed this task from CirrusSearch does not find all JS pages to CirrusSearch does not find all JS pages when it should.
Deskana lowered the priority of this task from Medium to Lowest.
Deskana set Security to None.
Deskana moved this task from Needs triage to Search on the Discovery-ARCHIVED board.
He7d3r renamed this task from CirrusSearch does not find all JS pages when it should to CirrusSearch does not find all JavaScript and CSS pages when it should. Dec 23 2015, 1:28 PM
He7d3r updated the task description.

Looked briefly into this; the issue is almost certainly related to the analyzers used for particular languages, as mentioned above. The intitle searches for css and js work on the Italian, Russian, English, Chinese and German wikis, but not on Portuguese, Spanish and probably others.

Deskana renamed this task from CirrusSearch does not find all JavaScript and CSS pages when it should to CirrusSearch does not find all JavaScript and CSS pages when using insource syntax. Dec 31 2015, 3:49 AM
Deskana moved this task from Inbox to Advanced functionality and syntax on the CirrusSearch board.

Notice this is not just about insource (see examples in the description)

Deskana renamed this task from CirrusSearch does not find all JavaScript and CSS pages when using insource syntax to CirrusSearch does not find all JavaScript and CSS pages when using insource and intitle syntax. Jan 1 2016, 7:01 PM

Looked briefly into this; the issue is almost certainly related to the analyzers used for particular languages, as mentioned above. The intitle searches for css and js work on the Italian, Russian, English, Chinese and German wikis, but not on Portuguese, Spanish and probably others.

Something changed, because it no longer works in German: https://de.wikipedia.org/w/index.php?search=intitle%3Acss&profile=advanced&fulltext=1&ns8=1

Another data-point: It does sometimes find some results, but not all of them, e.g. https://fr.wiktionary.org/w/index.php?search=intitle%3Acss&profile=advanced&fulltext=1&ns8=1 (which doesn't find [MediaWiki:Common.css] as it should)

For German, this appears to be related to the analysis chain. Testing a variety of languages shows that not all of them respect the word_break_helper, which converts periods to spaces:

ebernhardson@elastic1020:~$ for lang in ar de es en fa fr it ru zh; do echo -n "${lang}: "; curl -s localhost:9200/${lang}wiki_general/_analyze -d '{"analyzer": "text", "text": "common.css"}' | jq -c '.tokens | map(.token)'; done
ar: ["common.css"]
de: ["common.css"]
es: ["common.css"]
en: ["common","css"]
fa: ["common.css"]
fr: ["comon.cs"]
it: ["common","css"]
ru: ["common.css"]
zh: ["common","css"]
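To make the difference in the output above concrete, here is a toy Python model of the two behaviours. It is only a sketch: the whitespace split stands in for the real tokenizer, the set of rewritten characters is an assumption, and no stemming or stop word removal is modelled (so it will not reproduce the French result).

```python
# Characters rewritten to spaces before tokenizing (assumed for illustration).
WORD_BREAK_TABLE = str.maketrans({".": " ", "_": " "})


def toy_tokenize(text, word_break=False):
    """Lowercase and split on whitespace, optionally breaking on '.' and '_' first."""
    if word_break:
        text = text.translate(WORD_BREAK_TABLE)
    return text.lower().split()


for flag in (False, True):
    print(flag, toy_tokenize("common.css", word_break=flag))
```

With word_break=False the title stays a single token, so a search for css has nothing to match. With word_break=True it splits into common and css, as on the wikis where the helper is applied.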

This is a consequence of Elasticsearch's "monolithic" analyzers. Basically, anywhere we use a built-in analyzer like german instead of breaking it up and spelling out the individual parts, our word_break_helper doesn't get applied. The fix is going to be to break up all of those analysis chains into components, and then evaluate the tokenization changes that result.
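One way the evaluation step could be approached is to run the same sample texts through the old and the new chain and list only the texts whose tokens change. The Python sketch below is hypothetical: the two analyzer functions are simple stand-ins, and in practice they would wrap calls to the _analyze endpoint of the respective indices.

```python
def tokenization_changes(samples, old_analyze, new_analyze):
    """Return (text, old_tokens, new_tokens) for every sample whose tokens differ."""
    changes = []
    for text in samples:
        old_tokens = old_analyze(text)
        new_tokens = new_analyze(text)
        if old_tokens != new_tokens:
            changes.append((text, old_tokens, new_tokens))
    return changes


# Stand-in for a monolithic analyzer: no word breaking on periods.
def monolithic(text):
    return text.lower().split()


# Stand-in for an unpacked chain with periods rewritten to spaces.
def unpacked(text):
    return text.lower().replace(".", " ").split()


samples = ["Common.css", "Hauptseite"]
for text, old, new in tokenization_changes(samples, monolithic, unpacked):
    print(text, old, "->", new)
```

Only texts that tokenize differently are reported, which keeps the review focused on what actually changes when a chain is unpacked.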

The interim fix is that since this ticket was created we have added regex search for titles, so intitle:/css/ will find the appropriate pages.
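As a small sketch of the workaround, the Python snippet below builds a search URL for the regex form of the query. The query parameters are copied from the search links quoted earlier in this task; the function name and its arguments are made up for illustration.

```python
from urllib.parse import urlencode


def title_regex_search_url(host, pattern, namespace=8):
    """Build a search URL for intitle:/pattern/ limited to one namespace.

    Namespace 8 is the MediaWiki namespace, as in the links quoted above.
    """
    params = {
        "search": f"intitle:/{pattern}/",
        "profile": "advanced",
        "fulltext": "1",
        f"ns{namespace}": "1",
    }
    return f"https://{host}/w/index.php?{urlencode(params)}"


print(title_regex_search_url("de.wikipedia.org", "css"))
```

The regex runs against the title text itself, so it does not depend on how the language analyzer tokenized Common.css.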

Interesting. Thanks for the details.

The interim fix is that since this ticket was created we have added regex search for titles, so intitle:/css/ will find the appropriate pages.

Ah, right! Thanks again!

MPhamWMF subscribed.

Closing out low/est priority tasks over 6 months old with no activity within the last 6 months in order to clean out the backlog of tickets we will not be addressing in the near term. Please feel free to reopen if you think a ticket is important, but bear in mind that given current priorities and resourcing, it is unlikely for the Search team to pick up these tasks for the indefinite future. We hope that the requested changes have either been addressed by or made irrelevant by work the team has done or is doing -- e.g. upgrading Elasticsearch to a newer version will solve various ES-related problems -- or will be subsumed by future work in a more generalized way.

Reverting misguided closure.