
Pages with precomposed (accented) characters should match unaccented search query
Closed, Resolved, Public

Description

At the Vietnamese Wikipedia, most pages (and their titles) include words with
precomposed, accented Unicode characters. (See [[Precomposed character]] and
[[Vietnamese alphabet]].) However, users who search for articles at the
Vietnamese Wikipedia often enter queries with the unaccented base characters,
with the expectation that MediaWiki will understand their query. MediaWiki
neither strips combining characters (Bug 1836) nor converts the precomposed
characters in existing pages to their base ASCII characters (e.g., ô→o and ậ→a)
when searching page titles or text, so the search feature consistently returns
disappointing results.
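
For illustration only, the folding described above can be sketched in Python. This is not MediaWiki's implementation. The names `fold_diacritics`, `search_key`, and `EXTRA_FOLDS` are hypothetical, and the sketch assumes that decomposing to NFD and dropping combining marks is an acceptable approximation for Vietnamese. Letters such as đ have no canonical decomposition, so they need an explicit mapping.

```python
import unicodedata

# Letters with no canonical decomposition. This mapping is an assumption
# made for the sketch, not a list taken from MediaWiki.
EXTRA_FOLDS = str.maketrans({"đ": "d", "Đ": "D"})


def fold_diacritics(text: str) -> str:
    """Decompose to NFD, drop combining marks, then apply the extra folds."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a nonzero class for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(EXTRA_FOLDS)


def search_key(text: str) -> str:
    """Case-insensitive, diacritic-insensitive key for comparing titles."""
    return fold_diacritics(text).casefold()


if __name__ == "__main__":
    # The accented title and the unaccented query produce the same key.
    print(search_key("Việt Nam") == search_key("viet nam"))
```

Indexing titles under such a key, in addition to the original form, would let the unaccented query find the accented page without changing the stored title.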

Steps to reproduce:

  1. Search for "viet nam" or "Viet Nam" (without the quotes) at the Vietnamese
     Wikipedia

Expected results:
[[vi:Việt Nam]] is the first result, or at least somewhere in the results.

Actual results:
"Việt Nam" is nowhere to be found.


Version: unspecified
Severity: normal

Details

Reference
bz5752

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 9:13 PM)
bzimport added a project: MediaWiki-Search.
bzimport set Reference to bz5752.
bzimport added a subscriber: Unknown Object (MLST).

Please see also Bug 1836, Comment 3:

Perhaps the search function should ignore diacritics in article titles when the
user has entered a query that contains no diacritics. If the user has entered
diacritics, the software should respect that. It would also be nice if there
were a MediaWiki message in which a list of diacritics could be customized per
wiki or locale, since different languages distinguish letters and diacritics
differently.
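
A rough Python sketch of the rule proposed in that comment, under stated assumptions: `title_matches`, `has_diacritics`, `strip_marks`, and the `ignorable` parameter are hypothetical names, and `ignorable` merely stands in for the per-wiki list of diacritics the comment suggests. "Diacritic" here means a Unicode combining mark after NFD decomposition, so letters such as đ are not covered.

```python
import unicodedata


def has_diacritics(text):
    """True if the text contains any combining mark once decomposed."""
    return any(unicodedata.combining(ch)
               for ch in unicodedata.normalize("NFD", text))


def strip_marks(text, ignorable=None):
    """Drop combining marks; if `ignorable` is given, drop only those marks."""
    kept = []
    for ch in unicodedata.normalize("NFD", text):
        is_mark = unicodedata.combining(ch) != 0
        if is_mark and (ignorable is None or ch in ignorable):
            continue
        kept.append(ch)
    # Recompose so that any marks that were kept compare consistently.
    return unicodedata.normalize("NFC", "".join(kept))


def title_matches(query, title, ignorable=None):
    """Substring match that respects diacritics only when the query has them."""
    q = unicodedata.normalize("NFC", query).casefold()
    t = unicodedata.normalize("NFC", title).casefold()
    if has_diacritics(query):
        # The user typed diacritics, so they must match.
        return q in t
    # Unaccented query: ignore the (configured) diacritics on both sides.
    return strip_marks(q, ignorable) in strip_marks(t, ignorable)
```

With this sketch, `title_matches("viet nam", "Việt Nam")` is true, while a query typed with a different tone mark, such as "Viết Nam", does not match "Việt Nam".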

*** This bug has been marked as a duplicate of 1836 ***

This is not the same as Bug 1836. That bug is for ignoring combining (but
separate) diacritical characters in Unicode; this bug is for converting
precomposed characters, which might be a lot more complicated.

No, that's the same thing.

*** This bug has been marked as a duplicate of 1836 ***