What it says in the summary. :-)
Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=33824
Bleh. It looks like that symbol is turned into a text boundary by the standard analyzer, which isn't nice. I wonder if I should introduce another search just against the lowercased version of the title; that should help boost things like:
Symbol page titles like this one,
Exact, in-order title matches
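For concreteness, an extra "match the title almost verbatim" field could be built on a lowercase keyword-style analyzer. This is only a sketch of the idea (the names and filter list here are illustrative, not the actual CirrusSearch mapping):

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "near_match": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
```

Because the keyword tokenizer emits the whole input as a single token, a title consisting only of a symbol like ¢ survives analysis instead of being dropped, and exact in-order matches score well against it.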
Change 112566 had a related patch set uploaded by Manybubbles:
Use near_match to also search pages
Did what I said about adding an extra analyzer. It helps. Note that intitle:¢ won't find ¢ because the ¢ page is a redirect.
intitle: currently only hits the title field; it should be updated to query redirect.title as well.
intitle now queries the redirect titles, but this bug is still not fixed. It looks like the analyzers throw away this token:
GET /enwiki_content/_analyze { "text": "¢", "analyzer": "plain" }
The results: {"tokens":[]}
Same for text, short_text, plain, plain_search. Maybe others. @TJones We probably, at some point, need to look into the english analysis chain and see what other tokens are being thrown away in our plain search.
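To illustrate why the token disappears (this is a rough Python stand-in for what the standard tokenizer effectively does here, not the actual Lucene code): a tokenizer that only emits runs of word characters discards symbol-only input entirely, because ¢ is a currency symbol, not a letter or digit.

```python
import re
import unicodedata

def standard_like_tokenize(text):
    """Crude approximation of the standard tokenizer: keep runs of
    word characters, drop everything else (punctuation and symbols)."""
    return re.findall(r"\w+", text, re.UNICODE)

# "¢" is Unicode category Sc (currency symbol), so it is never part
# of a token -- mirroring the empty _analyze result above.
print(unicodedata.category("¢"))          # Sc
print(standard_like_tokenize("¢"))        # []
print(standard_like_tokenize("50¢ off"))  # ['50', 'off']
```

The real tokenizer follows the Unicode text segmentation rules rather than `\w`, but the outcome for a bare ¢ is the same: zero tokens, so nothing is ever indexed for that title.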
There's your problem! This is also the problem that prompted T211824: Investigate a “rare-character” index. The tokenizer, probably as a too-clever shortcut, treats a lot of interesting non-text characters like they are punctuation and tosses them.
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [
      { "name" : "word_break_helper", "filtered_text" : [ "¢" ] },
      { "name" : "kana_map", "filtered_text" : [ "¢" ] }
    ],
    "tokenizer" : {
      "name" : "standard",
      "tokens" : [ ]   <---------RIGHT THERE!!
    },
    "tokenfilters" : [
      { "name" : "aggressive_splitting", "tokens" : [ ] },
      ...
We could look into using some other tokenizer (maybe the ICU tokenizer? I'd have to check), but it's not going to be trivial, and for other languages we'd have to unpack their analyzer before we could swap out the tokenizer, so it's a big mess.
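To make the unpacking step concrete, swapping the tokenizer would mean rebuilding the packed analyzer as a custom one with `icu_tokenizer` dropped in. This is a hypothetical sketch only: the filter chain shown is illustrative, the real one would have to be copied out of the actual language analyzer, and whether `icu_tokenizer` even keeps symbol tokens like ¢ would still need to be verified (it also requires the analysis-icu plugin):

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "text": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": [ "lowercase", "kstem" ]
        }
      }
    }
  }
}
```

Doing this per language is the messy part: each packed analyzer (english, french, ...) hides its own char filters and token filters, and all of them would need to be enumerated before the tokenizer could be replaced.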
Wow, I didn't realize we threw away so many interesting tokens. Unfortunate, but it seems this task can become a child of the other, to be considered "some day".