Page MenuHomePhabricator

Special:Listfiles doesn't find anything when search term contains non-ascii characters
Closed, DeclinedPublic

Description

Special:Listfiles doesn't find anything if the search term contains german umlauts (the letters äöüÄÖÜß) - for example, searching for "Möhre" will give zero results, but I have files like "Möhre-Wilde-Blüte.jpg" and "Pfalz-Möhren-1.jpg".

The problem seems to be somewhere in $dbr->buildLike which replaces the "ö" with "\xc3\xb6". This "\xc3\xb6" doesn't match anything in the database.

At the moment, I'm using the patch pasted below as workaround - but it can give too many results (searching for "möhr" will also find "mähr" or even "mxyr").
Additionally, my patch replaces everything except the listed chars, so you might get even more results if a user searches for a special character.
(I can live with that for now - better than finding nothing ;-)

I'm also not sure where this should be fixed - I added my workaround to SpecialListfiles.php to avoid unintentional side effects, but it would probably be better to fix the code in $dbr->buildLike.

Patch with workaround (see limitations/known issues above)

===================================================================
--- SpecialListfiles.php        (Revision 86040)
+++ SpecialListfiles.php        (Arbeitskopie)
@@ -34,11 +34,13 @@ class ImageListPager / __construct()
                }
                $search = $wgRequest->getText( 'ilsearch' );
                if ( $search != '' && !$wgMiserMode ) {
+$search = preg_replace('/[^a-zA-Z0-9_ .-]/', '@@uml@@', $search);
                        $nt = Title::newFromURL( $search );
                        if( $nt ) {
                                $dbr = wfGetDB( DB_SLAVE );
-                               $this->mQueryConds = array( 'LOWER(img_name)' . $dbr->buildLike( $dbr->anyString(), 
-                                       strtolower( $nt->getDBkey() ), $dbr->anyString() ) );
+$cond = $dbr->buildLike( $dbr->anyString(), strtolower( $nt->getDBkey() ), $dbr->anyString() );
+$cond = str_replace('@@uml@@', '_', $cond);
+                               $this->mQueryConds = array( 'LOWER(img_name)' . $cond );
                        }
                }

Version: 1.16.x
Severity: normal

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 12:01 AM
bzimport set Reference to bz32207.
bzimport added a subscriber: Unknown Object (MLST).

I can't reproduce this on 1.16 or trunk; the ö passed through buildLike just fine:

brion@stormcloud:~/pages/rel1.16$ php maintenance/eval.php 
> return wfGetDB(DB_MASTER)->buildLike("Pfalz-M\xc3\xb6hren-1.jpg");
 LIKE 'Pfalz-Möhren-1.jpg'

(I used the \xc3\xb6 in the double-quoted string here to ensure it passes through the terminal ok; that is just a plain UTF-8 ö in the string.)

Is there anything special about your database configuration? Are you using MySQL or one of the less-well supported databases? Any special options?

Interestingly it works for me on the shell - the output is "LIKE 'Möhre'".

I'm using MySQL 5.0.67. My config isn't very special IMHO, maybe except this:

[client]
default-character-set=latin1
[mysqld]
default-character-set=latin1
default-collation=latin1_german1_ci

Note that this is just a default, and any client (including mediawiki) can specify the charset to use when connecting to MySQL.

On the mediawiki side, my configuration is quite boring and doesn't contain anything related to the charset.

In the meantime I noticed that PHP's error_log() escapes special characters (like umlauts) - if I just echo out the query, it contains "Möhre" in valid UTF-8.
In other words: there must be something wrong on the way between mediawiki and mysql. Let me check...

show create table page

[...]

`page_title` varchar(255) character set latin1 collate latin1_bin NOT NULL,

[...]
) ENGINE=MyISAM AUTO_INCREMENT=6244 DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci

In other words: The page_title is latin1 (aka ISO-8815-1) in the database, which matches my mySQL defaults.

select page_title from page where page_title like '%hre%';

+----------------------------------------------------+

page_title

+----------------------------------------------------+

Möhre

[...]

Yes, the UTF-8 sequence for "ö" is really displayed as two bytes :-(
Looks like mediawiki didn't tell mysql that it will hand over UTF-8 strings, and MySQL handled them as ISO-8859-1 then...

In case it matters: The wiki was started in 2009 (IIRC MediaWiki 1.14) and updated since then.

Let me post an update on this:

The problem is that the column contains utf8, but is labeled as latin1.

I could fix this by modifying the charset in the database. The trick is to change the field to varbinary first and then to varchar utf8. The varbinary step is needed to avoid that MySQL does an automatic charset conversion which would result in double-encoded utf8.

In SQL, this means:

alter table t modify column f varbinary(255);
alter table t modify column f varchar(255) charset utf8;

That's the easy part. I then found out that I had this problem in various tables and columns, so it took me some hours to do this fix on all of them.

To sum it up: This looks an upgrade problem from a (very?) old version. AFAIK the openSUSE wiki was hit by a similar problem (sorry, I don't know exactly in which MediaWiki version).

I created a patch here: https://gerrit.wikimedia.org/r/#/c/238745/ please give some feedback if its work for you to.

Ciencia_Al_Poder renamed this task from Special:Listfiles doesn't find anything when search term contains umlauts to Special:Listfiles doesn't find anything when search term contains non-ascii characters.Nov 7 2017, 10:21 AM
Ciencia_Al_Poder updated the task description. (Show Details)
Ciencia_Al_Poder removed a subscriber: wikibugs-l-list.

There are reports of it not even working for ASCII characters due to a bad lowercase conversion on a varbinary column

See this topic where a workaround is provided and is actually very popupar, given the amount of thank comments received by the author of the workaround: Topic:Pwjch65m0lbr7n8e.

There are reports of it not even working for ASCII characters due to a bad lowercase conversion on a varbinary column

See this topic where a workaround is provided and is actually very popupar, given the amount of thank comments received by the author of the workaround: Topic:Pwjch65m0lbr7n8e.

Indeed this workaround is working! Then why wouldn't it be patched into the next MW version?

I had problems with uppercase chars (which are ascii of course, so the ticket description is a bit off scope).

I can confirm, that the patch in https://phabricator.wikimedia.org/T254251 (Topic:Pwjch65m0lbr7n8e) for ImageListPager.php is working for Special:FileList.

You also need to patch NewFilesPager.php for Special:NewFiles, as noted by @Ljonka (outdated patch: https://gerrit.wikimedia.org/r/#/c/238745/ ).

I made the following changes in MW 1.34.2:

diff --git a/httpdocs/includes/specials/pagers/ImageListPager.php b/httpdocs/includes/specials/pagers/ImageListPager.php
index 5de3ecb5..ca5438f5 100644
--- a/httpdocs/includes/specials/pagers/ImageListPager.php
+++ b/httpdocs/includes/specials/pagers/ImageListPager.php
@@ -152,7 +152,7 @@ class ImageListPager extends TablePager {
                        $nt = Title::newFromText( $this->mSearch );
                        if ( $nt ) {
                                $dbr = wfGetDB( DB_REPLICA );
-                               $conds[] = 'LOWER(' . $prefix . '_name)' .
+                               $conds[] = 'CONVERT(' . $prefix . '_name USING utf8)' .
                                        $dbr->buildLike( $dbr->anyString(),
                                                strtolower( $nt->getDBkey() ), $dbr->anyString() );
                        }
diff --git a/httpdocs/includes/specials/pagers/NewFilesPager.php b/httpdocs/includes/specials/pagers/NewFilesPager.php
index 9a78c5dd..e63cbe5d 100644
--- a/httpdocs/includes/specials/pagers/NewFilesPager.php
+++ b/httpdocs/includes/specials/pagers/NewFilesPager.php
@@ -131,7 +131,7 @@ class NewFilesPager extends RangeChronologicalPager {
                                        strtolower( $likeObj->getDBkey() ),
                                        $dbr->anyString()
                                );
-                               $conds[] = "LOWER(img_name) $like";
+                               $conds[] = "CONVERT(img_name USING utf8) $like";
                        }
                }

Any chance this will make it in MW 1.35? Looks like a lot of people run into this.

I just tried MW 1.35.0_rc.2 and the problem persists.

Luckily my patch from the previous post still works:

$ patch -p1 < src/patches/0001-fixed-Special-FileList-and-Special-NewFiles.patch
patching file httpdocs/includes/specials/pagers/ImageListPager.php
Hunk #1 succeeded at 161 (offset 9 lines).
patching file httpdocs/includes/specials/pagers/NewFilesPager.php
Hunk #1 succeeded at 127 (offset -4 lines).
Reedy subscribed.

Removing MW-1.35-release as it's not a blocker to the release, nor is it anything new in the release.

Please submit the patch into gerrit to get it reviewed

@Gruniversal: Thanks for taking a look at the code. Please use developer access to submit the proposed code changes as a Git branch directly into Gerrit which makes it easier to review and provide feedback. If you don't want to set up Git/Gerrit, you can also use the Gerrit Patch Uploader. Thanks again!

Change 630340 had a related patch set uploaded (by Gerrit Patch Uploader; owner: David Gruner):
[mediawiki/core@master] fixes a bug that prevents file searches with uppercase chars or umlauts (see: https://phabricator.wikimedia.org/T34207)

https://gerrit.wikimedia.org/r/630340

Change 630340 had a related patch set uploaded (by Ammarpad; owner: David Gruner):
[mediawiki/core@master] Fix a bug that prevents file searching with uppercase chars or umlauts

https://gerrit.wikimedia.org/r/630340

Hi everybody. It looks like @Gruniversal's patchset is in an unmergeable state due to it being applied to master, and the global wfGetDB being deprecated (as I am seeing evidence of here). I'm new to MediaWiki development, and still in the initial stages of confusion about Gerrit, so unfortunately I can't suggest a solution for that.

For anyone interested in applying the patch now, this worked for me on a 1.35.3 installation:

cd /path/to/wiki
curl https://gerrit.wikimedia.org/r/changes/mediawiki%2Fcore~630340/revisions/2/patch?download | base64 -d | patch -p1 --dry-run
# use 'base64 -D' on macOS; remove '--dry-run' if it doesn't complain

From what I can tell, 1.36 is actively replacing the wfGetDB( DB_REPLICA )s as well (e.g., around line 204 of includes/specials/pagers/ImageListPager.php). So until or unless this gets backported, people interested in applying the fix for 1.36 would have to do it by hand. No big deal, it's just a few lines, after all.

I will lastly add that this bug affects every uploaded file, even strictly-ASCII filenames, if the search term would/should match a capital letter. Being as MediaWiki forces the first letter of the upload filename to upper-case, this means that you cannot search for any upload by the first few letters of the filename on any third-party MediaWiki, using the out-of-the-box search facilities.

With tons and tons of heartfelt respect for the hard and mostly-thankless work you all do, it seems like this should be a higher-priority issue, given the amount of time it's been affecting third-party wikis (since 1.16). I mean, searching for uploaded files by name seems essentially broken to the end user unless you happen to know about this patch.

Change 630340 abandoned by Umherirrender:

[mediawiki/core@master] Fix a bug that prevents file searching with uppercase chars or umlauts

Reason:

The corresponding code/feature was removed from core - I553b25794962f71d7588d735ced88a2dc3812093

https://gerrit.wikimedia.org/r/630340

Umherirrender subscribed.

The used functionality was removed from core with T298722