
Search results for utf-8 strings
Closed, Resolved; Public

Description

Author: pch13

Description:
Patch against languages/Language.php to let it return results for utf-8 terms

Before being entered into the searchindex table, UTF-8 encoded strings are converted to a special notation, e.g. dämon becomes du8c3a4mon. The search form applies the same transform, but with an uppercase U8 escape, so the search fails in MySQL.

Attached patch lets utf-8 search terms return results.


Version: 1.11.x
Severity: enhancement

Attached:

Details

Reference
bz17146

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:26 PM
bzimport added a project: MediaWiki-Search.
bzimport set Reference to bz17146.
bzimport added a subscriber: Unknown Object (MLST).

The form returns results just fine for me...

SearchUpdate::doUpdate() takes the output of Language::stripForSearch() and does further processing to strip markup etc. This includes running it through strtolower() to make it entirely lowercase.

This extra lowercasing is *not* done by Special:Search, which produces the discrepancy you noted -- only the input data is being lowercased.
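The discrepancy can be sketched in a few lines. This is an illustration in Python, not MediaWiki's PHP code: the names `escape_utf8`, `strip_for_search`, `index_form` and `query_form` are hypothetical, and the escape format (a `U8` prefix followed by the hex of the character's UTF-8 bytes) and the lowercasing inside the strip step are inferred from the examples and the logged query in this report.

```python
import re

# Any character outside the ASCII range.
_NON_ASCII = re.compile(r"[^\x00-\x7f]")


def escape_utf8(text: str) -> str:
    """Replace each non-ASCII character with 'U8' + hex of its UTF-8 bytes."""
    return _NON_ASCII.sub(
        lambda m: "U8" + m.group(0).encode("utf-8").hex(), text
    )


def strip_for_search(text: str) -> str:
    # Assumed stand-in for Language::stripForSearch(): fold case, then escape.
    return escape_utf8(text.lower())


def index_form(text: str) -> str:
    # The indexing path lowercases the stripped text once more,
    # which also lowercases the 'U8' marker.
    return strip_for_search(text).lower()


def query_form(text: str) -> str:
    # The search form uses the stripped text as-is, so 'U8' stays uppercase.
    return strip_for_search(text)


print(index_form("dämon"))      # du8c3a4mon
print(query_form("dämon"))      # dU8c3a4mon
print(query_form("ééé FUNKY"))  # U8c3a9U8c3a9U8c3a9 funky
```

The two forms differ only in the case of the escape marker, which is why the mismatch is invisible to a case-insensitive comparison.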

Searching "ééé FUNKY" hits this query:

SQL: SELECT /* WikiSysop */ page_id, page_namespace, page_title FROM page,searchindex WHERE page_id=si_page AND MATCH(si_title) AGAINST('+U8c3a9U8c3a9U8c3a9 +funky' IN BOOLEAN MODE) AND page_is_redirect=0 AND page_namespace IN ('0') LIMIT 20

However the backend search engine is case-insensitive so it shouldn't make a difference. :)

Worth going ahead and fixing though, just in case. Applied on trunk (for 1.15) in r46629

pch13 wrote:

Thank you Brion, I guess that should not break other people's installations. I moved a MediaWiki installation between servers and doctored the mysqldump so that the new location stores MediaWiki's UTF-8 strings in "utf8-...-ci" columns (was iso...). On import, some rows produced duplicate key errors, so I gave everything a "utf-bin" charset instead. Except for the search form, the wiki works fine so far.
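
A rough sketch of why the case of the U8 marker would only matter on a setup like this one. It assumes that the binary charset makes the full-text match case-sensitive, which is a guess about this installation and not something confirmed in the thread. The helper `tokens_match` is hypothetical, and plain string comparison is only an approximation of MySQL collation behaviour.

```python
def tokens_match(indexed: str, query: str, case_sensitive: bool) -> bool:
    """Compare an indexed token with a query token.

    case_sensitive=True approximates a binary collation;
    case_sensitive=False approximates a case-insensitive (_ci) one.
    """
    if case_sensitive:
        return indexed == query
    return indexed.casefold() == query.casefold()


indexed = "du8c3a4mon"  # token as stored in the index (from the report)
query = "dU8c3a4mon"    # token as produced by the search form

print(tokens_match(indexed, query, case_sensitive=False))  # True
print(tokens_match(indexed, query, case_sensitive=True))   # False
```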