
Can't upload file with non-ASCII name (eg cyrillic) on Windows host
Closed, Resolved (Public)

Description

Author: vershigora

Description:
I'm running MediaWiki under Apache2 on Windows 2k, and I can't upload a file with
a Russian name. The file name becomes wrong when MediaWiki saves it to disk, so the
link to the file becomes wrong -> 404. I think the solution is to convert the Cyrillic
file name into translit (http://en.wikipedia.org/wiki/Cyr), but I'm not a very good
PHP programmer.

Sorry for my English.


Version: 1.20.x
Severity: normal
OS: Windows XP
Platform: PC
URL: http://meta.wikimedia.org/wiki/Image:Bug_1780_non_ascii_%C3%A4%C3%B6%C3%BC%C3%9F.png
See Also:
https://bugs.php.net/bug.php?id=33350

Revisions and Commits

Event Timeline

bzimport raised the priority of this task to Low. (Nov 21 2014, 8:18 PM)
bzimport set Reference to bz1780.
bzimport added a subscriber: Unknown Object (MLST).

May be a similar issue to bug 362; the OS and filesystem expect an encoding different from the one they are getting (in this case, UTF-8).

vershigora wrote:

Similar, but not the same. The file was created, but with the wrong name.
Should be: Вера.jpg
But it is: ??????????.jpg (I can't paste the real name, because it contains wrong
characters)

jeluf wrote:

can you provide a link to your wiki that we could use for testing?

vershigora wrote:

limp.iceberg-m.ru:81/wiki/

  • Bug 3724 has been marked as a duplicate of this bug. ***

gunter.schmidt wrote:

I have the same bug with V.1.6.5.

Try to upload any image with the name: Bug_1780_non_ascii_äöüß.png (hope you can read this on your system)

I tried to show you on mediawiki, but the bug is not there!
http://meta.wikimedia.org/wiki/Image:Bug_1780_non_ascii_%C3%A4%C3%B6%C3%BC%C3%9F.png

Maybe 1.7 works differently?

That's because our site doesn't run on Windows servers.

codemonk wrote:

The problem persists in MediaWiki 1.8 on Windows XP. Generally everything works
fine on Windows, except this bug, that is very disturbing. Is it possible to do
something around it, or is it a fatal incompatibility forever?

codemonk wrote:

I've got a temporary solution (at least for my MediaWiki 1.8.2 on Windows XP), though it is far from perfect and relies on the iconv function.

Firstly,
In the SpecialUpload.php file, in the processUpload() function, right before the closing brace of the last "if( $this->saveUploadedFile(..." block, update the source code as follows:

...

} else {
  $wgOut->showFileNotFoundError( $this->mUploadSaveName );
}
rename( $this->mSavedFile, iconv( 'UTF-8', 'CP1251', $this->mSavedFile ) ); # NEW
}

...

Secondly,
In the Image.php file, in the reallyRenderThumb() function, in the middle of the "elseif ( $wgUseImageMagick ) {..." block, update the source code as follows:

...
wfDebug("reallyRenderThumb: running ImageMagick: $cmd\n");
if (file_exists(iconv('UTF-8', 'CP1251', $thumbPath)) == false) # NEW
  rmdir( substr_replace($thumbPath, '', strrpos($thumbPath, "/"))); # NEW
mkdir( substr_replace( iconv('UTF-8', 'CP1251', $thumbPath), '', # NEW
  strrpos(iconv('UTF-8', 'CP1251', $thumbPath), "/"))); # NEW
$cmd = iconv ('UTF-8', 'CP1251', $cmd); # NEW
wfProfileIn( 'convert' );
...

If you use something other than ImageMagick for image processing, you should transfer the second code fragment to appropriate block and adapt it to that program, if required.

IMPORTANT: If your Windows uses a code page other than Windows-1251, then in the code above you should change 'CP1251' to your code page identifier. And DO NOT use this code on non-Windows machines.
  • Bug 11758 has been marked as a duplicate of this bug. ***

Created attachment 4734
A basic configurable workaround for this bug

The patch adds a global configuration variable $wgLocalFilesystemCharsetOverride that can be set to the charset of the local file system (e.g. 'CP1250'), and all names of the uploaded files are converted to this charset (using iconv) when talking with the filesystem. However, this works correctly only when the destination filename contains only characters from this charset, so this is not a perfect solution.

But the support for file uploads on Windows (and other OSes) is limited in many other ways (there is no filename syntax checking other than stripping path components, which is far from being sufficient on Windows), anyway.

The correct solution to this might depend on the mysterious image backend rewrite. ;-)


Yeah, this would still break with other chars, or if iconv() isn't present... the generated URLs might be wrong, too; depends what charset the web server is going to be expecting!
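
Both failure modes (no iconv, or a name with characters outside the target charset) can be detected before anything is written to disk. A minimal sketch, assuming a hypothetical helper `convertNameForFilesystem()` that is neither part of MediaWiki nor of the attached patch:

```php
<?php
/**
 * Hypothetical helper (not from the patch): convert a UTF-8 file name to a
 * legacy charset, returning null when that cannot be done losslessly.
 */
function convertNameForFilesystem( string $utf8Name, string $charset ): ?string {
	if ( !function_exists( 'iconv' ) ) {
		return null; // iconv extension is not available
	}
	// iconv() returns false (and raises a notice) when it rejects the input;
	// the notice is suppressed because the failure is handled right here.
	$converted = @iconv( 'UTF-8', $charset, $utf8Name );
	if ( $converted === false ) {
		return null;
	}
	// Round-trip check: guards against iconv implementations that
	// substitute or drop characters instead of failing.
	$back = @iconv( $charset, 'UTF-8', $converted );
	return $back === $utf8Name ? $converted : null;
}
```

A caller could refuse the upload (or fall back to another naming scheme) whenever this returns null. It says nothing about which charset the web server expects in URLs, which is the other concern raised above.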

  • Bug 14924 has been marked as a duplicate of this bug. ***

dj.bauch wrote:

(In reply to comment #13)

> *** Bug 14924 has been marked as a duplicate of this bug. ***

Thanks for redirecting me from bug 14924. The patch attachment for this bug, with code page set to CP1250 in LocalSettings.php seems to fix most of the problems I've been seeing with images on IIS6/SQL Server/Windows 2003/Mediawiki 1.13 -- including the one I identified in my bug submission and several others, such as the recent POTD Image:CT of brain of Mikael Häggström large.png and Image:Bandeira do Município do Rio de Janeiro.png. It does not, however fix all of them. For example:
Image:Ostredok, Veľká Fatra (SVK) - NW slope.jpg (http:.../index.php?title=Image:Ostredok%2C_Ve%C4%BEk%C3%A1_Fatra_%28SVK%29_-_NW_slope.jpg) image still does not show up.
Image:Hors d'œuvre (Bosnian).jpg (Image:Hors_d%27%C5%93uvre_%28Bosnian%29.jpg) causes iconv to complain [function.iconv]: Detected an illegal character in input string in W:\Inetpub\wwwroot\mediawiki\includes\filerepo\File.php on line 68

  • Bug 15863 has been marked as a duplicate of this bug. ***

DJ, CP1250 is for Central Europe and doesn't include the "œ" character, hence the failure.

"Ostredok, Veľká Fatra (SVK) - NW slope.jpg" presumably ought to work, but it's hard to debug without an instance to check... However...

My suspicions:

  1. It's possibly safest to just create UTF-8 URLs -- that is, don't try to encode the generated URLs to the locale charset. IIS is probably smart enough to detect UTF-8 and load the files correctly (the filesystem stores filenames as UTF-16 Unicode.)
  2. Suddenly I'm not sure whether you actually want the "ANSI" codepage or the "OEM" codepage for filesystem storage. *shudder*

Ugh.

The best thing would probably just be to have a switch to encode filenames in some nice ASCII-safe hex encoding, rather than mess around with charsets.
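
For illustration, such an ASCII-safe hex encoding could look roughly like the sketch below. The function names and the choice of "~" as escape character are assumptions made for this example, not anything MediaWiki implements:

```php
<?php
/**
 * Hypothetical reversible, ASCII-only file name encoding (illustration only).
 * Every byte outside [A-Za-z0-9._-] becomes "~XX" with XX in upper-case hex.
 * The patterns have no /u modifier, so they work byte by byte.
 */
function encodeNameAscii( string $name ): string {
	return preg_replace_callback(
		'/[^A-Za-z0-9._-]/',
		static function ( array $m ): string {
			return sprintf( '~%02X', ord( $m[0] ) );
		},
		$name
	);
}

/** Inverse of encodeNameAscii(). */
function decodeNameAscii( string $encoded ): string {
	return preg_replace_callback(
		'/~([0-9A-F]{2})/',
		static function ( array $m ): string {
			return chr( hexdec( $m[1] ) );
		},
		$encoded
	);
}
```

Because "~" itself is escaped, decoding is unambiguous. The price is long, unreadable names on disk, and the generated URLs would have to use the encoded form as well.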

fran wrote:

The problem is in PHP's handling, or lack thereof, of Unicode. NTFS uses UTF-16 internally, as Brion pointed out; the problem is that the Win32 API provides separate wchar_t-oriented versions of stdio functions (like _wfopen()) for working with Unicode filenames, while the traditional char versions (like fopen()) translate the current legacy 8-bit code page into the corresponding Unicode representation for backwards compatibility. Unfortunately, PHP's innards are completely eight-bit and have no knowledge of wchar_t stdio, so PHP is limited to characters in the current code page. :/ Using setlocale() to change the code page to UTF-8 might work, but setlocale() looks very brittle and ugly.

Indeed, mangling Unicode characters to ASCII in a predictable way is probably the best/only way to work around it.

dumpHtml uses a fun hack that shells out to a VBScript to rename files to a Unicode destination... That's probably not the nicest way to do it in active use. ;)

http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/DumpHTML/rename-hack.vbs?view=markup

Even if we used such a hack to *create* files, we couldn't *manipulate* them again without doing really weird crap like looking up the 8.3 version of the file path. So ASCII mangling is definitely going to be the safest thing.

dj.bauch wrote:

Brion, et al.,
Thanks for your attention. I'm hoping that the official mechanism does change to one that's more compatible with Windows. In the meantime, I've switched from CP1250 to 'ISO-8859-1//TRANSLIT' as the character set that gives me the best results. Most images work now, but not all. This also doesn't fix problems with filenames that have '%' in the name. Sometimes that appears to be used to indicate the degree of transparency of some icons on Wikipedia, and I've had no luck getting those to display.

  • Bug 23028 has been marked as a duplicate of this bug. ***

Apparently PHP 6 will have full unicode support:

http://bugs.php.net/bug.php?id=46990

I can't believe that something like PHP still has bugs like this. I ran into it today trying to help a user understand why his images were not working, and first we suspected it was just instantcommons, but eventually tracked it down to this issue.

PHP6 is dead, so who knows when this will be fixed.

In the meantime, I'd suggest adding a warning to Special:Upload when wfIsWindows() and you try to upload a file with unicode in the name.
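
The condition for such a warning is small. A rough sketch, using a hypothetical function name and PHP_OS_FAMILY (available from PHP 7.2) in place of wfIsWindows() so that it runs outside MediaWiki:

```php
<?php
/**
 * Hypothetical check (illustrative name, not MediaWiki code): should the
 * upload form warn about this file name on this operating system?
 *
 * @param string $name File name as UTF-8
 * @param string $osFamily Defaults to the running OS, e.g. "Windows" or "Linux"
 */
function shouldWarnAboutName( string $name, string $osFamily = PHP_OS_FAMILY ): bool {
	// Any byte above 0x7F means the name is not pure ASCII.
	$hasNonAscii = preg_match( '/[^\x00-\x7F]/', $name ) === 1;
	return $osFamily === 'Windows' && $hasNonAscii;
}
```

The OS family is a parameter only so that the check can be exercised on a non-Windows machine.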

Bryan.TongMinh wrote:

I forbade uploading files with non-ASCII names on Windows in r88165.

paolobenve wrote:

Well, this isn't a fix, it's a limitation...

Bryan.TongMinh wrote:

It's a fix in the sense that it is no longer possible to upload a file which then can't be viewed anymore. A proper fix would be to make PHP use wide character functions.

Reopening -- doesn't seem to fix it, just makes some of your pages platform-dependent.

Bryan.TongMinh wrote:

(In reply to comment #26)

> Reopening -- doesn't seem to fix it, just makes some of your pages
> platform-dependent.

A way to fix this would be to make filenames on disk no longer map to titles. We have a bug open for that somewhere.

sumanah wrote:

Thank you for the patch, Mormegil.
(In reply to comment #12)
Adding the "reviewed" keyword. Also adding the internationalization keyword so the internationalisation/localisation team knows to look at this bug.

Not really an i18n bug, it's an issue with filerepo.

Bryan.TongMinh wrote:

This is now finally fixable with the filebackend!

I'm thinking about writing a custom backend which implements [[quoted-printable]] encoding. Any opinions on the encoding to use? It's a pity that the filebackend implements a listFiles method, otherwise we could have simply used a one-way hashing function.

(In reply to comment #30)

> This is now finally fixable with the filebackend!
>
> I'm thinking about writing a custom backend which implements
> [[quoted-printable]] encoding. Any opinions on the encoding to use? It's a
> pity that the filebackend implements a listFiles method, otherwise we could
> have simply used a one-way hashing function.

Why not add that to FSFileBackend in the form of configurable escape/unescape functions? The default ones could just pass through the raw input. One issue with any encoding scheme is handling URLs correctly, so users get file/thumbnail URLs that actually are mapped to the encoded file names. I suppose a redirection module could be used. img_auth and thumb_handler would cover some of the obvious cases, though they don't handle RANGE requests. Another option would be a redirector module which would redirect requests to the encoded URL. CDN caching would be slightly trickier in any case.
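
To make the escape/unescape idea concrete, here is a rough sketch. The class `PathNameCodec` and the two closures are hypothetical and do not exist in FSFileBackend; the example pair is only quoted-printable-like ("=XX" for "=" and for every non-ASCII byte), not a full quoted-printable implementation:

```php
<?php
/**
 * Hypothetical illustration of configurable escape/unescape hooks.
 * Not part of FSFileBackend.
 */
class PathNameCodec {
	/** @var callable */
	private $escape;
	/** @var callable */
	private $unescape;

	public function __construct( ?callable $escape = null, ?callable $unescape = null ) {
		// Default: pass the raw input through unchanged.
		$identity = static function ( string $s ): string {
			return $s;
		};
		$this->escape = $escape ?? $identity;
		$this->unescape = $unescape ?? $identity;
	}

	/** Name as it should be written to disk. */
	public function toDisk( string $name ): string {
		return ( $this->escape )( $name );
	}

	/** Name as read back from disk, e.g. when listing files. */
	public function fromDisk( string $name ): string {
		return ( $this->unescape )( $name );
	}
}

// Quoted-printable-style pair: "=" and every non-ASCII byte become "=XX".
$qpEscape = static function ( string $s ): string {
	return preg_replace_callback( '/[=\x80-\xFF]/', static function ( array $m ): string {
		return sprintf( '=%02X', ord( $m[0] ) );
	}, $s );
};
$qpUnescape = static function ( string $s ): string {
	return preg_replace_callback( '/=([0-9A-F]{2})/', static function ( array $m ): string {
		return chr( hexdec( $m[1] ) );
	}, $s );
};
```

A Windows host would construct `new PathNameCodec( $qpEscape, $qpUnescape )`, everything else the default. Since the mapping is reversible, a file listing can still be translated back to titles. The sketch does not address the URL mapping problem described above.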

It's hard to resist saying "just use Linux" though...

That said, it would be nice if FileRepo stored files based on hash and used a redirection or service layer to make readable URLs to files anyway. It would solve a lot of problems like weird race conditions, the poor performance and lack of atomicity for file moves/deletes/undeletes and re-uploads (especially for large files or if there are many versions), and issues like this bug as well (what characters a system allows). That's another story though...
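
As a rough picture of what hash-based storage names could look like: the sketch below hashes a caller-supplied key and builds a two-level directory path. The function name, the layout and the choice of key are all assumptions for illustration (a real design might hash the file contents instead); this is not the FileRepo layout.

```php
<?php
/**
 * Hypothetical sketch of hash-based storage names (not the FileRepo layout):
 * derive an ASCII-only relative path from the SHA-1 of a key such as the title.
 */
function hashedStoragePath( string $key ): string {
	$hash = sha1( $key ); // 40 lower-case hex characters
	$dot = strrpos( $key, '.' );
	$ext = $dot === false ? '' : strtolower( substr( $key, $dot + 1 ) );
	// Keep the extension only if it is plain ASCII letters and digits.
	$suffix = preg_match( '/^[a-z0-9]+$/', $ext ) === 1 ? ".$ext" : '';
	return $hash[0] . '/' . substr( $hash, 0, 2 ) . '/' . $hash . $suffix;
}
```

Whatever the title contains, the path on disk consists of hex digits, slashes and a short extension, so the filesystem's charset no longer matters. Mapping readable URLs onto such paths is the job of the redirection or service layer mentioned above.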

orbartal wrote:

How to fix the bug in Hebrew (and in any other language that Windows supports)

  1. In Windows, change the language for non-Unicode programs to your local MediaWiki language, i.e. the language of the file names you wish to upload. Usually it is the same as the $wgLanguageCode language. See how on this link.
  2. On Windows, PHP passes file names to the filesystem in a language-specific encoding (code page), not in ASCII or UTF-8. Check the appropriate encoding for your language. For Hebrew I used windows-1255.
  3. Edit the MediaWiki core code and make these 4 changes. Be sure to use the encoding for your language rather than windows-1255, which I used for Hebrew; you might need something else.

a. Remove (or comment out) the test added by Bryan Tong Minh that prevents uploading files with non-ASCII names on Windows. The remaining changes fix the bug, so that filter is no longer required.
See details: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/88165
MediaWiki/includes/upload/UploadBase.php line 756.
b. Go to the source file MediaWiki/includes/filebackend/FSFileBackend.php. In class FileBackendStore, in function FileBackendStore::doStoreInternal at line 206, add the following lines:

if (strtoupper(substr(PHP_OS, 0, 3)) == 'WIN')
{
	$charSetArr = array("ASCII", "JIS", "EUC-JP", "UTF-8", "UTF-16", "windows-1251",
		"ISO-8859-1", "GBK");
	if (mb_detect_encoding($dest, $charSetArr) == "UTF-8")
	{
		$dest = iconv("UTF-8", "windows-1255", $dest);
	}
}
Just before the command that copies the file to the path:
$ok = copy( $params['src'], $dest );

Now you can upload files and images with Hebrew names, but you can't view them as thumbnails. Two more similar code fixes are required to complete the task.

c. Go to the source file MediaWiki\includes\filerepo\file\File.php. In class File, in function File::transform at line 623, add the following lines:
if (strtoupper(substr(PHP_OS, 0, 3)) == 'WIN')
{
	$charSetArr = array("ASCII", "JIS", "EUC-JP", "UTF-8", "UTF-16", "windows-1251",
		"ISO-8859-1", "GBK");
	if (mb_detect_encoding($thumbPath, $charSetArr) == "UTF-8")
	{
		$thumbPath = iconv("UTF-8", "windows-1255", $thumbPath);
	}
}
Right after the command that returns the full path of the thumbnail file:
$thumbPath = $this->getThumbPath( $thumbName ); // final thumb path
d. Go to the source file MediaWiki\includes\media\Bitmap.php. In class BitmapHandler, in function BitmapHandler::transformGd at line 548, add the following lines:
if (strtoupper(substr(PHP_OS, 0, 3)) == 'WIN')
{
	$charSetArr = array("ASCII", "JIS", "EUC-JP", "UTF-8", "UTF-16", "windows-1251",
		"ISO-8859-1", "GBK");
	if (mb_detect_encoding($params['srcPath'], $charSetArr) == "UTF-8")
	{
		$params['srcPath'] = iconv("UTF-8", "windows-1255", $params['srcPath']);
	}
}
Right before the command that tests whether the file exists in that location:
if ( !file_exists( $params['srcPath'] ) )
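
The same snippet is pasted three times in steps b to d, so it could be factored into one helper. A rough sketch follows; the function `toLocalFsPath()` is hypothetical and not MediaWiki API, and it swaps mb_detect_encoding() for a plain UTF-8 validity check, which is an assumption about what the detection step is meant to achieve:

```php
<?php
/**
 * Hypothetical helper factoring out the repeated snippet. Illustrative only.
 *
 * @param string $path Path as used by MediaWiki (UTF-8)
 * @param string $charset Code page of the Windows host, e.g. "windows-1255"
 * @param string $os Defaults to PHP_OS; a parameter so it can be tried elsewhere
 */
function toLocalFsPath( string $path, string $charset, string $os = PHP_OS ): string {
	if ( strtoupper( substr( $os, 0, 3 ) ) !== 'WIN' ) {
		return $path; // non-Windows hosts keep UTF-8 names
	}
	// An empty pattern with /u matches only if $path is valid UTF-8.
	if ( preg_match( '//u', $path ) !== 1 ) {
		return $path; // probably converted already; leave untouched
	}
	// Notice suppressed: a failed conversion is handled by the fallback below.
	$converted = @iconv( 'UTF-8', $charset, $path );
	// Fall back to the original path if a character is not in $charset.
	return $converted === false ? $path : $converted;
}
```

Each of the three places would then reduce to a single call, for example `$dest = toLocalFsPath( $dest, 'windows-1255' );`, and the code page would only need changing in one spot.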

orbartal wrote:

How to upload file with non-ASCII name on Windows host

How to enable upload file with non-ASCII name on Windows host with just 3 simple changes to the wiki server.


orbartal wrote:

enable upload file with non-ASCII name on Windows host

How to enable upload file with non-ASCII name on Windows host with just 3 simple changes to the wiki server.


Bryan.TongMinh wrote:

(In reply to comment #31)

> (In reply to comment #30)
>
> > This is now finally fixable with the filebackend!
> >
> > I'm thinking about writing a custom backend which implements
> > [[quoted-printable]] encoding. Any opinions on the encoding to use? It's a
> > pity that the filebackend implements a listFiles method, otherwise we could
> > have simply used a one-way hashing function.
>
> Why not add that to FSFileBackend in the form of configurable escape/unescape
> functions? The default ones could just pass through the raw input. One issue
> with any encoding scheme is handling URLs correctly, so users get
> file/thumbnail URLs that actually are mapped to the encoded file names. I
> suppose a redirection module could be used. img_auth and thumb_handler would
> cover some of the obvious cases, though they don't handle RANGE requests.
> Another option would be a redirector module which would redirect requests to
> the encoded URL. CDN caching would be slightly trickier in any case.
>
> It's hard to resist saying "just use Linux" though...
>
> That said, it would be nice if FileRepo stored files based on hash and used a
> redirection or service layer to make readable URLs to files anyway. It would
> solve a lot of problems like weird race conditions, the poor performance and
> lack of atomicity for file moves/deletes/undeletes and re-uploads (especially
> for large files or if there are many versions), and issues like this bug as
> well (what characters a system allows). That's another story though...

I would not add a complicated redirector, but just modify File::getUrl() to apply the encoding. I can't really find out though if there currently is any interaction between filerepo and filebackend regarding the file url.

So I found an old upstream bug from 2005 on the low-level API problem here:
https://bugs.php.net/bug.php?id=33350

Added a comment that this is still a live issue. :)

Bryan.TongMinh wrote:

As an alternative to hacking filebackend, we could wrap the FileSystemObject using PHP's COM extension. If somebody really wants to put effort into this ;)

Change 125573 had a related patch set uploaded by Aaron Schulz:
[WIP] Added path encoding to FileBackendStore for Windows support

https://gerrit.wikimedia.org/r/125573

Change 132298 had a related patch set uploaded by Aaron Schulz:
Added better path encoding to FileBackend for Windows

https://gerrit.wikimedia.org/r/132298

Change 125573 abandoned by Aaron Schulz:
Added path encoding to FileBackendStore for Windows support

Reason:
Mostly not needed since given the SHA1 storage name patch, which also handles the same problem and more

https://gerrit.wikimedia.org/r/125573

  • Bug 68268 has been marked as a duplicate of this bug. ***

dgiim wrote:

I am using MediaWiki in a Korean environment.

When will this be completely fixed?

I have worked around the upload problem with a hack.

But I cannot see the thumbnails.

Please help me.

orbartal wrote:

Try using the instructions in the PDF file "How to fix the bug in Hebrew". They work for all languages, not just for Hebrew, and they fix the thumbnail bug as well. Tell me if it works. If it doesn't, I will try to help you solve it.

dgiim wrote:

First of all, thank you for the quick response, orbartal.

To display thumbnails, I changed File.php as follows:

...
$thumbPath = $this->getThumbPath( $thumbName ); // final thumb path
// CP949 is the Windows code page for Hangul, the Korean script.
$thumbPath = iconv( "UTF-8", "CP949", $thumbPath );
...

I also changed Bitmap.php as follows:

...
$params['srcPath'] = iconv( "UTF-8", "CP949", $params['srcPath'] );
if ( !file_exists( $params['srcPath'] ) ) {
...

Currently, uploading files with Hangul names works well. However, no thumbnail is displayed; instead, the error 'filemissing' is shown in the thumbnail's place.

Please help me!


  • MediaWiki Version: 1.23.3
  • System: Windows 7 (hangul)

Thank you.

(In reply to Gerrit Notification Bot from comment #40)

> Change 125573 abandoned by Aaron Schulz:
> Added path encoding to FileBackendStore for Windows support
>
> Reason:
> Mostly not needed since given the SHA1 storage name patch, which also
> handles the same problem and more
>
> https://gerrit.wikimedia.org/r/125573

That patch has been abandoned, but I have asked on the changeset whether the patch might still be useful for older versions of MediaWiki which have this bug.

Change 125573 restored by Aaron Schulz:
Added path encoding to FileBackendStore for Windows support

Reason:
Rebasing (then closing again)

https://gerrit.wikimedia.org/r/125573

Change 125573 abandoned by Aaron Schulz:
Added path encoding to FileBackendStore for Windows support

https://gerrit.wikimedia.org/r/125573

epriestley added a commit: Unknown Object (Diffusion Commit). (Mar 4 2015, 8:23 AM)

This should be fixed in PHP 7.1, but I have not tested it:

Anyone affected who could re-test with PHP 7.1, now that it's out?

Change 382074 had a related patch set uploaded (by Brion VIBBER; owner: Brion VIBBER):
[mediawiki/core@master] Support uploads with UTF-8 names on Windows

https://gerrit.wikimedia.org/r/382074

I've confirmed it's working with PHP 7.1; needed a small patch to disable the check for Windows if PHP is new enough: https://gerrit.wikimedia.org/r/382074
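
The shape of such a version gate, as a rough sketch only: the function below is hypothetical and is not the code from the Gerrit change. It takes the OS family and PHP version as parameters so it can be tried on any machine.

```php
<?php
/**
 * Hypothetical version gate (illustrative, not the merged patch): is the
 * "ASCII names only" upload restriction still needed on this host?
 *
 * @param string $osFamily e.g. PHP_OS_FAMILY ("Windows", "Linux", ...)
 * @param string $phpVersion e.g. PHP_VERSION
 */
function needsAsciiOnlyUploads( string $osFamily, string $phpVersion ): bool {
	// Only Windows hosts running PHP older than 7.1 keep the restriction.
	return $osFamily === 'Windows' && version_compare( $phpVersion, '7.1.0', '<' );
}
```

On a live host this would be called as `needsAsciiOnlyUploads( PHP_OS_FAMILY, PHP_VERSION )`, although PHP_OS_FAMILY itself only exists from PHP 7.2, so older versions would need a check on PHP_OS instead.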

Change 382074 merged by jenkins-bot:
[mediawiki/core@master] Support uploads with UTF-8 names on Windows

https://gerrit.wikimedia.org/r/382074

@brion: As the patch is merged, should this task get closed as resolved?

brion claimed this task.