
CirrusSearch: intitle:¢ returns no results despite there being a redirect at [[¢]]
Open, MediumPublic

Description

What it says in the summary. :-)


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=33824

Details

Reference
bz61080

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 2:52 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz61080.

Bleh. It looks like that symbol is turned into a text boundary by the standard analyzer, which isn't nice. I wonder if I should introduce another search just against the lowercased version of the title; that should help boost things like:
Symbol page titles like this one
Exact, in-order title matches

Change 112566 had a related patch set uploaded by Manybubbles:
Use near_match to also search pages

https://gerrit.wikimedia.org/r/112566

Did what I said about adding an extra analyzer. It helps. Note that intitle:¢ won't find ¢ because ¢ is a redirect.

Change 112566 merged by jenkins-bot:
Use near_match to also search pages

https://gerrit.wikimedia.org/r/112566

EBernhardson added a project: good first task.
EBernhardson subscribed.

intitle: currently only hits the title field; it should be updated to query redirect.title as well.
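As an illustration of what querying both fields might look like, here is a hypothetical Elasticsearch query sketch. The field names title and redirect.title come from the discussion above; the query shape itself is illustrative and not the actual CirrusSearch implementation:

```json
{
  "query": {
    "multi_match": {
      "query": "¢",
      "fields": [ "title", "redirect.title" ]
    }
  }
}
```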

Restricted Application added a subscriber: TerraCodes.
EBernhardson set Security to None.
Deskana lowered the priority of this task from Medium to Low.Dec 8 2016, 11:02 PM
Deskana moved this task from needs triage to search-icebox on the Discovery-Search board.

intitle now queries the redirect titles, but this bug is still not fixed. It looks like the analyzers throw away this token:

GET /enwiki_content/_analyze
{ "text": "¢", "analyzer": "plain" }

The results: {"tokens":[]}

Same for text, short_text, plain, and plain_search; maybe others. @TJones: we probably, at some point, need to look into the English analysis chain and see what other tokens are being thrown away in our plain search.
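To see why the token disappears: the standard tokenizer only emits runs of letters and digits, and U+00A2 CENT SIGN is a currency symbol (Unicode category Sc), so it is treated as a word break and never becomes a token. The following is a rough Python sketch of that behavior, not the actual Lucene tokenizer:

```python
import unicodedata

def letters_only_tokens(text):
    """Rough sketch of a letter/digit-based tokenizer: keep runs of
    letters and digits, and treat everything else (punctuation,
    symbols like the cent sign) as a break character and discard it."""
    tokens, current = [], []
    for ch in text:
        # Unicode general categories starting with 'L' are letters,
        # 'N' are numbers; everything else ends the current token.
        if unicodedata.category(ch)[0] in ("L", "N"):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

# U+00A2 is category 'Sc' (Symbol, currency), not a letter, so an
# input consisting of only that character tokenizes to nothing.
print(unicodedata.category("\u00a2"))        # Sc
print(letters_only_tokens("\u00a2"))         # []
print(letters_only_tokens("10\u00a2 coin"))  # ['10', 'coin']
```

This matches the _analyze output above: the ¢ survives the char filters but produces an empty token list.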

Bleh. It looks like that symbol is turned into a text boundary by the standard analyzer which isn't nice.

There's your problem! This is also the problem that prompted T211824: Investigate a “rare-character” index. The tokenizer, probably as a too-clever shortcut, treats a lot of interesting non-text characters like they are punctuation and tosses them.

{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [
      {
        "name" : "word_break_helper",
        "filtered_text" : [
          "¢"
        ]
      },
      {
        "name" : "kana_map",
        "filtered_text" : [
          "¢"
        ]
      }
    ],
    "tokenizer" : {
      "name" : "standard",
      "tokens" : [ ]   <---------RIGHT THERE!!
    },
    "tokenfilters" : [
      {
        "name" : "aggressive_splitting",
        "tokens" : [ ]
      },
...
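For reference, step-by-step output like the above can be produced with the explain option of the _analyze API (index and analyzer names as used earlier in this task):

```json
GET /enwiki_content/_analyze
{ "text": "¢", "analyzer": "plain", "explain": true }
```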

We could look into using some other tokenizer (maybe the ICU tokenizer? I'd have to check), but it's not going to be trivial, and for other languages we'd have to unpack their analyzer before we could swap out the tokenizer, so it's a big mess.
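Another direction, sketched here purely as an assumption (this is not what CirrusSearch does, and the placeholder token and analyzer names are made up), would be a mapping char filter that rewrites rare symbols into tokenizable placeholders before the standard tokenizer runs:

```json
{
  "settings": {
    "analysis": {
      "char_filter": {
        "rare_symbol_map": {
          "type": "mapping",
          "mappings": [ "\u00a2 => cent_sign" ]
        }
      },
      "analyzer": {
        "rare_symbol_plain": {
          "type": "custom",
          "char_filter": [ "rare_symbol_map" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
```

A per-character mapping obviously doesn't scale to every interesting symbol, which is part of what the rare-character index investigation in T211824 is about.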

Wow, I didn't realize we threw away so many interesting tokens. Unfortunate, but it seems this task can become a child of the other, to be considered "some day".

TJones raised the priority of this task from Low to Medium.Aug 27 2020, 8:20 PM