
feature request: replace forbidden characters with lookalike UTF8 signs in the wikipedia search input control
Closed, ResolvedPublic

Description

Author: michael.manner

Description:
Replace forbidden characters with lookalike Unicode characters in the Wikipedia search input field [alt-F].

Here are some alternatives:
major:

  • # → ⧣ EQUALS SIGN AND SLANTED PARALLEL (U+29E3, &#10723;)

With this replacement it would be possible to find article titles like "C#".

minor:

  • < → ‹ SINGLE LEFT-POINTING ANGLE QUOTATION MARK (U+2039, &#8249;)
  • > → › SINGLE RIGHT-POINTING ANGLE QUOTATION MARK (U+203A, &#8250;)
  • | → ∣ DIVIDES (U+2223, &#8739;)
  • { → ❴ MEDIUM LEFT CURLY BRACKET ORNAMENT (U+2774, &#10100;)
  • } → ❵ MEDIUM RIGHT CURLY BRACKET ORNAMENT (U+2775, &#10101;)

no alternatives found:

  • [
  • ]

Only CJK characters would be available, but they aren't supported by a large number of fonts.
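
As a rough illustration of the request, the substitution could be expressed as a simple translation table. This is only a sketch in Python: the names `LOOKALIKES` and `replace_forbidden` are hypothetical and not part of MediaWiki, and the code points are the ones proposed in the lists above. The square brackets are left untouched because no replacement was found for them.

```python
# Hypothetical translation table built from the proposal above.
# Keys are the forbidden ASCII characters, values the suggested lookalikes.
LOOKALIKES = str.maketrans({
    "#": "\u29E3",  # ⧣
    "<": "\u2039",  # ‹
    ">": "\u203A",  # ›
    "|": "\u2223",  # ∣
    "{": "\u2774",  # ❴
    "}": "\u2775",  # ❵
})


def replace_forbidden(query: str) -> str:
    """Swap each forbidden character in a search query for its lookalike."""
    return query.translate(LOOKALIKES)


if __name__ == "__main__":
    print(replace_forbidden("C#"))
```

With such a table, a search for "C#" would be sent as "C" followed by U+29E3, while "[" and "]" would pass through unchanged.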


Version: unspecified
Severity: enhancement

Details

Reference
bz36954

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 12:22 AM
bzimport set Reference to bz36954.
bzimport added a subscriber: Unknown Object (MLST).

This sounds like something that would get in the way of AntiSpoof.

mr.heat wrote:

*** This bug has been confirmed by popular vote. ***
TJones claimed this task.
TJones subscribed.

I'm going to close this because it was written before we moved to Elasticsearch. The current behavior of Elasticsearch is the same for both these characters and their proposed normalization: all of them are ignored during tokenization. In general, we have implemented ICU Normalization for English-language projects, so most non-punctuation characters are normalized well.
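
To make the tokenization point concrete, here is a toy sketch in Python. It is not Elasticsearch's tokenizer, only a stand-in that keeps runs of word characters and drops punctuation and symbols; `toy_tokens` is a hypothetical name. Under that assumption, the original character and its proposed lookalike disappear in the same way, so the substitution would not change what gets indexed or matched.

```python
import re


def toy_tokens(text: str) -> list:
    """Toy stand-in for a tokenizer: keep lowercased runs of word characters.

    Punctuation and symbol characters (such as "#" or U+29E3) are not word
    characters, so they are dropped rather than indexed.
    """
    return re.findall(r"\w+", text.lower())


if __name__ == "__main__":
    # Both queries reduce to the same token list in this toy model.
    print(toy_tokens("C#"))
    print(toy_tokens("C\u29E3"))
```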

If the goal is to be able to find these specific characters, see T211824: Investigate a “rare-character” index.