
CirrusSearch: intitle:¢ returns no results despite there being a redirect at [[¢]]
Open, MediumPublic

Description

What it says in the summary. :-)


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=33824

Details

Reference
bz61080

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 2:52 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz61080.

Bleh. It looks like that symbol is turned into a text boundary by the standard analyzer, which isn't nice. I wonder if I should introduce another search just against the lowercased version of the title; that should help boost things like:
Symbol page titles like this one
Exact, in-order title matches

Change 112566 had a related patch set uploaded by Manybubbles:
Use near_match to also search pages

https://gerrit.wikimedia.org/r/112566

Did what I said about adding an extra analyzer. It helps. Note that intitle:¢ won't find ¢ because ¢ is a redirect.

Change 112566 merged by jenkins-bot:
Use near_match to also search pages

https://gerrit.wikimedia.org/r/112566

EBernhardson added a project: good first task.
EBernhardson subscribed.

intitle: currently only hits the title field; it should be updated to query redirect.title as well.
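As an illustration of what querying both fields might look like, here is a hypothetical Elasticsearch query sketch. The field names title and redirect.title come from the discussion above; the query shape itself is illustrative and not the actual CirrusSearch implementation:

```json
{
  "query": {
    "multi_match": {
      "query": "¢",
      "fields": [ "title", "redirect.title" ]
    }
  }
}
```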

Restricted Application added a subscriber: TerraCodes.
EBernhardson set Security to None.
Deskana lowered the priority of this task from Medium to Low.Dec 8 2016, 11:02 PM
Deskana moved this task from needs triage to search-icebox on the Discovery-Search board.

intitle now queries the redirect titles, but this bug is still not fixed. It looks like the analyzers throw away this token:

GET /enwiki_content/_analyze
{ "text": "¢", "analyzer": "plain" }

The results: {"tokens":[]}

Same for text, short_text, plain, and plain_search; maybe others. @TJones: we probably, at some point, need to look into the English analysis chain and see what other tokens are being thrown away in our plain search.
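To see why the token disappears: the standard tokenizer only emits runs of letters and digits, and U+00A2 CENT SIGN is a currency symbol (Unicode category Sc), so it is treated as a word break and never becomes a token. The following is a rough Python sketch of that behavior, not the actual Lucene tokenizer:

```python
import unicodedata

def letters_only_tokens(text):
    """Rough sketch of a letter/digit-based tokenizer: keep runs of
    letters and digits, and treat everything else (punctuation,
    symbols like the cent sign) as a break character and discard it."""
    tokens, current = [], []
    for ch in text:
        # Unicode general categories starting with 'L' are letters,
        # 'N' are numbers; everything else ends the current token.
        if unicodedata.category(ch)[0] in ("L", "N"):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

# U+00A2 is category 'Sc' (Symbol, currency), not a letter, so an
# input consisting of only that character tokenizes to nothing.
print(unicodedata.category("\u00a2"))        # Sc
print(letters_only_tokens("\u00a2"))         # []
print(letters_only_tokens("10\u00a2 coin"))  # ['10', 'coin']
```

This matches the _analyze output above: the ¢ survives the char filters but produces an empty token list.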

Bleh. It looks like that symbol is turned into a text boundary by the standard analyzer which isn't nice.

There's your problem! This is also the problem that prompted T211824: Investigate a “rare-character” index. The tokenizer, probably as a too-clever shortcut, treats a lot of interesting non-text characters like they are punctuation and tosses them.

{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [
      {
        "name" : "word_break_helper",
        "filtered_text" : [
          "¢"
        ]
      },
      {
        "name" : "kana_map",
        "filtered_text" : [
          "¢"
        ]
      }
    ],
    "tokenizer" : {
      "name" : "standard",
      "tokens" : [ ]   <---------RIGHT THERE!!
    },
    "tokenfilters" : [
      {
        "name" : "aggressive_splitting",
        "tokens" : [ ]
      },
...
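For reference, step-by-step output like the above can be produced with the explain option of the _analyze API (index and analyzer names as used earlier in this task):

```json
GET /enwiki_content/_analyze
{ "text": "¢", "analyzer": "plain", "explain": true }
```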

We could look into using some other tokenizer (maybe the ICU tokenizer? I'd have to check), but it's not going to be trivial, and for other languages we'd have to unpack their analyzer before we could swap out the tokenizer, so it's a big mess.
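Another direction, sketched here purely as an assumption (this is not what CirrusSearch does, and the placeholder token and analyzer names are made up), would be a mapping char filter that rewrites rare symbols into tokenizable placeholders before the standard tokenizer runs:

```json
{
  "settings": {
    "analysis": {
      "char_filter": {
        "rare_symbol_map": {
          "type": "mapping",
          "mappings": [ "\u00a2 => cent_sign" ]
        }
      },
      "analyzer": {
        "rare_symbol_plain": {
          "type": "custom",
          "char_filter": [ "rare_symbol_map" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
```

A per-character mapping obviously doesn't scale to every interesting symbol, which is part of what the rare-character index investigation in T211824 is about.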

Wow, I didn't realize we threw away so many interesting tokens. Unfortunate, but it seems this task can become a child of the other, to be considered "some day".

TJones raised the priority of this task from Low to Medium.Aug 27 2020, 8:20 PM