
Merging Unicode similar-looking characters in internal search (apostrophes, "x" and "×", etc)
Closed, Resolved · Public

Description

When doing a search with the apostrophe character U+0027 ("apostrophe/single quote"), available on most keyboards, results should also match the other Unicode apostrophe-like characters, such as the typographically preferred apostrophe U+2019.

In 2009 there was a discussion about "Different apostrophe signs and MediaWiki internal search"; see
http://www.gossamer-threads.com/lists/wiki/wikitech/169177
This doesn't seem to have been implemented.

This is related to bug T38313 for autocompletion.

Basically, both indexing and searching should convert all apostrophes to U+0027, so that articles containing U+2019, for example, would match when searching with U+0027, U+2019, or any other apostrophe.

From the 2009 discussion, the list of apostrophes was:
U+0027 APOSTROPHE
U+2018 LEFT SINGLE QUOTATION MARK
U+2019 RIGHT SINGLE QUOTATION MARK
U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+2032 PRIME
U+00B4 ACUTE ACCENT
U+0060 GRAVE ACCENT
U+FF40 FULLWIDTH GRAVE ACCENT
U+FF07 FULLWIDTH APOSTROPHE

I would add other characters for which U+0027 is often used as an accessible substitute, such as some modifier letters and the saltillo (a combined folding sketch follows the list):
U+02B9 MODIFIER LETTER PRIME
U+02BB MODIFIER LETTER TURNED COMMA
U+02BC MODIFIER LETTER APOSTROPHE
U+02BD MODIFIER LETTER REVERSED COMMA
U+02BE MODIFIER LETTER RIGHT HALF RING
U+02BF MODIFIER LETTER LEFT HALF RING
U+0384 GREEK TONOS
U+1FBF GREEK PSILI
U+A78B LATIN CAPITAL LETTER SALTILLO
U+A78C LATIN SMALL LETTER SALTILLO
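
As a rough illustration only (hypothetical code, not an existing MediaWiki function), folding all of the characters listed above to U+0027 could be done with a single regular expression in PHP; applied at both index time and query time, it would make these characters interchangeable in matches, at the cost of no longer distinguishing them:

function foldApostropheLikeCharacters( $text ) {
	// Map every apostrophe-like character listed above to the plain ASCII
	// apostrophe U+0027; the /u modifier makes \x{....} match Unicode
	// code points in UTF-8 input.
	$pattern = '/[\x{2018}\x{2019}\x{201B}\x{2032}\x{00B4}\x{0060}\x{FF40}\x{FF07}' .
		'\x{02B9}\x{02BB}\x{02BC}\x{02BD}\x{02BE}\x{02BF}\x{0384}\x{1FBF}\x{A78B}\x{A78C}]/u';
	return preg_replace( $pattern, "'", $text );
}

// e.g. foldApostropheLikeCharacters( "Prickett’s Charge" ) returns "Prickett's Charge"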

WebKit-based browsers already do this kind of stripping and merge U+0027, U+2018, U+2019, and U+FF07. However, there are many cases where merging all the proposed characters would help regular keyboard input.

The proposed solution in 2009 was to use a strip function:

function stripForSearch( $string ) {
	// Fold U+2019 (its UTF-8 bytes E2 80 99) to a plain ASCII apostrophe,
	// then let the parent language class do its usual stripping.
	$s = preg_replace( '/\xe2\x80\x99/', '\'', $string );
	return parent::stripForSearch( $s );
}

Version: unspecified
Severity: enhancement
See Also:
T58080
T59242
T38313
T61666

Details

Reference
bz39501

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 12:59 AM
bzimport set Reference to bz39501.
bzimport added a subscriber: Unknown Object (MLST).

[Merging "MediaWiki extensions/Lucene Search" into "Wikimedia/lucene-search2", see bug 46542. You can filter bugmail for: search-component-merge-20130326 ]

  • Bug 47881 has been marked as a duplicate of this bug.

Widening scope a tiny bit. If we're going to do this it should be done all at once.

AntiSpoof's sort of the idea I'm thinking here.

Repurposing into a Cirrus bug as lsearchd has been end-of-lifed and won't be fixed further.

Chad,

Were you thinking this should be done in Cirrus for all languages by pushing analysis configuration to Elasticsearch? Something along those lines would be pretty flexible, allowing us, for example, to boost perfect matches of the typed Unicode characters above the squashed ones. I'm not saying that is a good idea, just something that is possible.

(In reply to comment #6)

Chad,

Were you thinking this should be done in Cirrus for all languages by pushing analysis configuration to Elasticsearch? Something along those lines would be pretty flexible, allowing us, for example, to boost perfect matches of the typed Unicode characters above the squashed ones.

Yeah that was pretty much my thinking.

I'm not saying that is a good idea, just something that is possible.

I think it's a good idea, eventually. I set priority so low on purpose :)

Added the "see also" bug. I think we should do this when we pull the Unicode plugin into Elasticsearch.

Looks like apostrophes came up on The Daily WTF: http://thedailywtf.com/Articles/Lightspeed-is-Too-Slow-for-MY-Luggage.aspx (specifically http://img.thedailywtf.com/images/14/q1/e95/Pic-5.jpg).

(In reply to comment #6)

Were you thinking this should be done in Cirrus for all languages by pushing analysis configuration to Elasticsearch? Something along those lines would be pretty flexible, allowing us, for example, to boost perfect matches of the typed Unicode characters above the squashed ones.

We already do some input normalization at some level of the stack (for example, multiple underscores get squashed and input such as "AbrAhAm LincoLn" works if there's a redirect at "Abraham lincoln").

It's difficult to look at the provided screenshot and not think that the software has failed our readers. Unless you think these should be MediaWiki page redirects (#REDIRECT)? I think we should do better normalization for search inputs.

Any rough idea how big of a project this would be to implement?

(In reply to comment #9)

We already do some input normalization at some level of the stack (for example, multiple underscores get squashed and input such as "AbrAhAm LincoLn" works if there's a redirect at "Abraham lincoln").

To be more explicit on these points:

https://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=AbrAhAm+LincoLn

https://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=_____AbrAhAm_____LincoLn_____

We may be able to implement apostrophe normalization at the same level.

I'll have a look at this when I can. For now I'll leave the component set to CirrusSearch. It looks like PHP implements the same normalization components that I can use in Elasticsearch (http://php.net/manual/en/class.normalizer.php), so I'll have to evaluate doing that normalization there as well. I imagine if we do it in PHP it'll have to be optional, because the Normalizer class requires PHP 5 >= 5.3.0 and PECL intl >= 1.0.0.
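
For what it's worth, a guarded use of the intl Normalizer could look like the sketch below. Note that NFKC normalization only covers compatibility characters such as the fullwidth apostrophe U+FF07; it leaves U+2019 untouched, so it would complement rather than replace the apostrophe mapping discussed in this task:

// Sketch only: optional Unicode normalization via PECL intl, skipped on
// installs that don't have the extension.
if ( class_exists( 'Normalizer' ) ) {
	// NFKC folds U+FF07 FULLWIDTH APOSTROPHE to U+0027 but leaves
	// U+2019 RIGHT SINGLE QUOTATION MARK as-is.
	$text = Normalizer::normalize( $text, Normalizer::FORM_KC );
}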

In case anyone comes to this from http://thedailywtf.com/Articles/Lightspeed-is-Too-Slow-for-MY-Luggage.aspx#Pic-5, they should have a look at Bug 59666 which should plug that particular embarrassing hole.

Restricted Application added a subscriber: Aklapper.
debt added subscribers: TJones, debt.

Hey @TJones - can you take a look at this and see if it's already been done, or what level of effort it would take to finish it up? Thanks!

There are a lot of connected issues here. I’ll try to untangle some of them.

In Elasticsearch, there are two particularly relevant steps in processing text: tokenizing, which breaks up text into tokens to be indexed (usually what we think of as words), and ascii-folding, which converts non-plain ascii into plain ascii if possible—though, for example, you can’t convert Chinese characters into plain ascii because there’s no reasonable mapping.

The rules Elasticsearch uses for tokenizing and other processing can differ by language, so I’ve only tested these on the English analysis chain for now.

A normal apostrophe is treated as a word break, so looking at prickett’s (from the prickett’s charge in the article from the Daily WTF), we get prickett and s as our terms to be indexed. Searching for prickett’s charge actually searches for three tokens: prickett s charge. The obvious title comes up because that phrase in that exact order is the title of the article, which is usually a very good result.
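
(For anyone who wants to reproduce this: Elasticsearch's _analyze API reports the tokens an analyzer produces. The sketch below is illustrative only; it assumes a locally reachable cluster, uses the built-in english analyzer rather than the on-wiki analysis chain described here, and the exact _analyze request format varies by Elasticsearch version.)

$ch = curl_init( 'http://localhost:9200/_analyze' );
curl_setopt_array( $ch, array(
	CURLOPT_RETURNTRANSFER => true,
	CURLOPT_HTTPHEADER => array( 'Content-Type: application/json' ),
	CURLOPT_POSTFIELDS => json_encode( array(
		'analyzer' => 'english',
		'text' => "prickett’s charge",
	) ),
) );
$response = json_decode( curl_exec( $ch ), true );
curl_close( $ch );
// $response['tokens'] lists the resulting tokens; comparing the output for
// U+0027 and U+2019 shows whether the two are tokenized the same way.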

Many of the apostrophe-like characters listed above also serve as word breaks in English. The ones listed here that are not word breaks include all the listed modifier letters, and the small saltillo—oddly, the capital saltillo is a word break. Of course, in other languages, the analysis could be different, though I checked Greek and the separate tonos is still a word breaker. (I think it’s because it’s not a modifier mark, since all the vowels with tonos have precomposed Unicode characters—but I’m guessing.)

For characters that are not word breaks, ascii-folding often does what you’d want, but not always. Ascii-folding is currently enabled on English Wikipedia, so searching for pïćkętt‘s čhãrgè works like you’d want. In my (not quite done) research into French (T142620), Turkish dotted-I (İ) is properly folded to I by the default French analysis chain, but not by the explicit ascii-folding step. The French stemmer does some ascii-folding, but generally not as much as the explicit ascii-folding step (dotted-I notwithstanding).

In general, the Elasticsearch ascii-folding is pretty good.

The tokenizer is causing some of these problems, particularly with the multiplication mark, ×, which is a non-word character, and so acts as a word break. When using the multiplication symbol, 3×4 is tokenized as two tokens: 3 4; while when using an x, 3x4 is tokenized as three tokens: 3 x 4.

We are currently doing explicit ascii-folding for English and Italian, and we’re adding it for French (which will come with BM25). Some probably happens in other language-specific analysis chains, but we don’t know exactly what or where without testing.

It is possible to add any of these others—x for ×, I for İ—as Elastic character filters, which just uniformly map one character to another, but that could have unintended consequences. They would definitely no longer distinguish between the mapped characters—so we couldn’t apply them universally, since in Turkish, the distinction between I and İ matters.
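
To make that concrete, here is a sketch of such a mapping character filter, written as the PHP array that would be serialized into an index's analysis settings. The char_filter and analyzer names are invented for illustration, and this is not the actual CirrusSearch configuration; as noted above, a filter like this erases the distinction between the mapped characters wherever it is applied:

$analysisSettings = array(
	'analysis' => array(
		'char_filter' => array(
			'similar_char_fold' => array(
				'type' => 'mapping',
				// Each entry uniformly maps one character to another
				// before tokenization.
				'mappings' => array(
					'× => x',   // U+00D7 MULTIPLICATION SIGN to plain x
					'’ => \'',  // U+2019 to U+0027
				),
			),
		),
		'analyzer' => array(
			'folded_text' => array(
				'type' => 'custom',
				'char_filter' => array( 'similar_char_fold' ),
				'tokenizer' => 'standard',
				'filter' => array( 'lowercase', 'asciifolding' ),
			),
		),
	),
);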

There can always be problems with particular “non-native” characters and particular symbols that the default tokenizing and ascii-folding don’t handle as well as we’d like. More issues will come up, but I’d consider closing this specific task: it was based on the behavior of lsearchd, which is no longer around; all of the original apostrophe-like characters now behave like apostrophes; and we are looking into ICU folding (T137830), which is more appropriate for languages that don’t use the Latin alphabet (it’s already enabled for Greek).

I'd like to clarify the scope of this ticket in terms of relevant characters and languages. I don't think it's a good idea to universally merge all of these apostrophe-like characters for all languages. What makes sense in English and a few other European languages isn't necessarily universal.

For English, all of the original apostrophe-like characters listed are behaving similarly (they are all word-breakers). It's not clear that converting modifier characters to word-breaking characters is a good plan, either for English or in general.

As for the broader folding of similar-looking Unicode characters, sometimes that's good and sometimes it's not, and it is being addressed more generally by the ICU folding efforts (T137830 and T146402).
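
(For context, with the analysis-icu plugin installed, ICU folding is just another token filter in the analysis chain. A minimal illustrative analyzer definition, not the actual CirrusSearch configuration, might look like this; icu_folding covers roughly what asciifolding does plus much wider Unicode folding:)

$icuFoldedAnalyzer = array(
	'type' => 'custom',
	'tokenizer' => 'standard',
	// icu_folding comes from the analysis-icu plugin.
	'filter' => array( 'lowercase', 'icu_folding' ),
);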

Can we close this very broad ticket as having completed a first approximation, and wait for remaining issues to come up on a case-by-case basis in specific languages?

Deskana assigned this task to TJones.
Deskana subscribed.

Can we close this very broad ticket as having completed a first approximation, and wait for remaining issues to come up on a case-by-case basis in specific languages?

That makes sense to me. Thanks @TJones!