
Chinese needs sensible fallback character encoding set
Closed, Resolved, Public

Description

Author: zayoo

Description:

English

Why is there a "$this->load();" call on line 1495, in "function fallback8bitEncoding()" of languages/Language.php, in MediaWiki 1.13.3?
I'm a Simplified Chinese user and our default charset is gb2312. When we use Mozilla Firefox to visit a MediaWiki site and search for something in Chinese, then click into the address bar and press Enter, we get a page with a garbled title.
That is (not only in Firefox but also in IE), if you click a link to "http://zh.wikipedia.org/w/index.php?title=首页", it works fine; but when you type it into the address bar and press Enter, you get an empty page with the title "Ê×Ò³".
While trying to solve the problem I found a workaround: deleting "$this->load();" on line 1495 of languages/Language.php. After that, everything works. Is this line needed (for some other language), or can it be deleted?

zayoo

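Simplified Chinese (简体中文)

What does the statement $this->load(); on line 1495 (in function fallback8bitEncoding()) of languages/Language.php in MediaWiki 1.13.3 do?
I am a Simplified Chinese user, and our default character set is gb2312. When browsing a MediaWiki site with Firefox and searching in Chinese, clicking the browser's address bar and pressing Enter leads to a page with a garbled title.
That is (in both Firefox and IE), clicking "http://zh.wikipedia.org/w/index.php?title=首页" opens the normal page, whereas typing it into the address bar and pressing Enter leads to an error page titled "Ê×Ò³".
While trying to solve this, I found that deleting "$this->load();" on line 1495 of languages/Language.php solves the problem completely. So is this statement really necessary (for other languages), and can it be removed?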

zayoo


Version: 1.13.x
Severity: normal
OS: Windows Server 2003
Platform: PC
URL: /languages/Language.php

Details

Reference
bz17020

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:24 PM
bzimport set Reference to bz17020.
bzimport added a subscriber: Unknown Object (MLST).

It is enough to describe the problem; there is no need to guess at the cause, which may mislead developers.

Now, this seems to be a problem with fallback encodings. It looks like Chinese does not have any fallback encoding specified, so I guess it uses the default one, which is windows-1252. This happens when non-UTF-8 text is input, which is the case when not following links.
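The windows-1252 guess is consistent with the garbled title in the report. As a minimal sketch (assuming PHP's iconv extension recognises the names 'GB2312' and 'WINDOWS-1252'; the variable names are only illustrative), the GB2312 bytes of "首页" read as windows-1252 give exactly "Ê×Ò³":

```php
<?php
// GB2312 bytes for "首页", as a browser with a GB2312 default would send them.
$raw = "\xCA\xD7\xD2\xB3";

// Misreading those bytes as windows-1252 yields the garbled title from the report.
$garbled = iconv( 'WINDOWS-1252', 'UTF-8', $raw );

// Reading the same bytes as GB2312 recovers the intended title.
$intended = iconv( 'GB2312', 'UTF-8', $raw );

echo $garbled, "\n";  // Ê×Ò³
echo $intended, "\n"; // 首页
```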

Can you check whether setting $fallback8bitEncoding to gb2312 in the appropriate Messages file helps? Alternatively, you can configure your browser to use UTF-8 by default.
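For illustration only, the setting would be a single assignment in the language's message file. The file name below is an assumption (the thread does not name it), and whether a localisation cache has to be rebuilt afterwards is not covered here:

```php
<?php
// Assumed location: languages/messages/MessagesZh_hans.php (hypothetical for this thread).
// Encoding assumed for titles that arrive as non-UTF-8 bytes.
$fallback8bitEncoding = 'gb2312';
```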

zayoo wrote:

Changing the default charset has no effect, and setting $fallback8bitEncoding failed (maybe I don't know how to set it).

I've found that this line converts the charset when the title entered is not UTF-8:

return $this->iconv( $this->fallback8bitEncoding(), "utf-8", $s );
(Language.php, line 1491, function checkTitleEncoding( $s ))

I don't know what $this->fallback8bitEncoding() returns here, and I can't get it printed. But when I leave the first parameter empty, that is iconv( "", "utf-8", $s );, it also works fine; maybe it then uses the server's default charset for the input.

Either deleting $this->load(); in function fallback8bitEncoding() or using return $this->iconv( "", "utf-8", $s ); works fine.

I am not able to test function load() because it is widely used and deals with the cache. I just found that when $this->load(); is present, there may be a redirect (which causes the problem) when a non-UTF-8 title is entered, and when it is deleted, the redirect disappears.

What is important: a programmer who only uses single-byte characters cannot think like those who use multi-byte characters. (haha~)

Finally, I want to say that I'm not good at English, this is my first time reporting a bug (is it one?) to a team of programmers, and I'm one of those GNU/Windows users; I haven't mastered Linux or even PHP. So my ideas may be strange and may have troubled you.

My wiki now has $this->load(); deleted and works fine, while other Chinese wikis (including zh.wikipedia.org) do not:
http://www.ipal.org.cn/i/
(Warning: Chinese) It can only be tested if you use Chinese.

I'll discuss this further with other Chinese programmers and test new wikis in a virtual machine, with the modification made before installation or with the language changed.

Yours faithfully,
zayoo

The Chinese case is probably also complicated by the existence of separate simplified and traditional Chinese locales.

zayoo's testing with the unset stuff may indicate that in some cases iconv() does autodetection (and is actually working in this case!), which is spiffy if so... don't know how reliable that will be, though. Additionally, this may or may not work depending on what actual iconv or mb_string configs are set up. This'll need some research and testing...

Any takers? CC-ing Shinjiman and philip, as I assume they have more knowledge on the matter than us Westerners do :)

Just added the fallback encoding for the following message files (as r49829):

Traditional Chinese (zh-Hant): Windows-950 -> CP950 -> Big5
Simplified Chinese (zh-Hans): Windows-936 -> CP936 -> GB2312
Hong Kong Chinese (zh-HK) : Big5-HKSCS

P.S. For the generic Chinese language (zh), more than one code page is in use, so it is worth considering more than one encoding as fallback. For example, try the first encoding; if a page is found, go to it; otherwise try the second encoding, and so on.
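A rough sketch of the "try each encoding in turn" idea follows. The function name toUtf8WithFallbacks() and the candidate list are hypothetical, not MediaWiki code, and it assumes PHP's iconv extension recognises the encoding names passed in. It only checks that the bytes decode, not that a matching page exists: the same bytes can be valid in both GB2312 and Big5, so the order of candidates decides the result and a successful conversion does not prove the guess was right.

```php
<?php
/**
 * Hypothetical helper: return $s as UTF-8, trying each candidate
 * encoding in order. Returns null if none of them decodes cleanly.
 */
function toUtf8WithFallbacks( $s, array $candidates ) {
	// An empty pattern with the /u flag matches only if $s is valid UTF-8.
	if ( preg_match( '//u', $s ) === 1 ) {
		return $s;
	}
	foreach ( $candidates as $encoding ) {
		// iconv() returns false (and raises a notice) on an illegal
		// input sequence; the @ suppresses that notice.
		$converted = @iconv( $encoding, 'UTF-8', $s );
		if ( $converted !== false ) {
			return $converted;
		}
	}
	return null;
}

// GB2312 bytes for "首页", tried as GB2312 first and Big5 second.
$title = toUtf8WithFallbacks( "\xCA\xD7\xD2\xB3", array( 'GB2312', 'BIG5' ) );
echo $title, "\n";
```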

Is there a reliable way to identify whether the encoding conversion appears to be successful that can distinguish these?

Alternatively, can we make use of things like Accept-Language headers to aid in our guess?

The Accept-Language header can be changed through the user's browser preferences, but it can still give a rough hint of which non-Unicode encoding they intended.
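As a sketch of how such a hint might be derived: the function guessFallbackEncoding() and its mapping table are hypothetical. The zh-Hant, zh-Hans and zh-HK encodings follow the list given for r49829 above; the zh-TW and zh-CN entries are added assumptions. For simplicity it takes the first listed tag that matches and ignores the q-values.

```php
<?php
/**
 * Hypothetical helper: guess a fallback encoding from an
 * Accept-Language header value. Returns null when no Chinese
 * variant in the table is listed.
 */
function guessFallbackEncoding( $acceptLanguage ) {
	// Illustrative mapping only, with tags in lower case.
	$map = array(
		'zh-hk'   => 'Big5-HKSCS',
		'zh-tw'   => 'Big5',
		'zh-hant' => 'Big5',
		'zh-cn'   => 'GB2312',
		'zh-hans' => 'GB2312',
	);
	foreach ( explode( ',', strtolower( $acceptLanguage ) ) as $item ) {
		// Drop any ";q=..." weight and surrounding whitespace.
		$parts = explode( ';', $item );
		$tag = trim( $parts[0] );
		if ( isset( $map[$tag] ) ) {
			return $map[$tag];
		}
	}
	return null;
}

echo guessFallbackEncoding( 'zh-CN,zh;q=0.9,en;q=0.8' ), "\n"; // GB2312
```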

zayoo wrote:

This problem remains on 1.13.5, and may remain on 1.14.0.

An error occurred only once after $this->load(); in function fallback8bitEncoding() was deleted. I have now changed return $this->iconv( $this->fallback8bitEncoding(), "utf-8", $s ); into return $this->iconv( "", "utf-8", $s ); on two servers, and they have worked for a long time.

Further testing shows that when the server is Windows (English) and the client is Windows (Chinese), it also works fine, just as when the server is also Chinese.

What's the status of this bug in MediaWiki 1.16?

zayoo wrote:

It works very well in MediaWiki 1.16 when both client and server are Chinese and the non-Unicode language is set to Chinese (PRC), using both IE and Firefox. I will test it on an English system soon.

PHP warnings about something undefined occurred several times, so I added this to LocalSettings.php:

// Provide defaults for server variables that may be undefined.
if ( !isset( $_SERVER['REQUEST_URI'] ) ) {
    if ( !isset( $_SERVER['SCRIPT_NAME'] ) ) {
        $_SERVER['SCRIPT_NAME'] = '';
    }
    // Rebuild the request URI from the script name and query string.
    $_SERVER['REQUEST_URI'] = $_SERVER['SCRIPT_NAME'];
    if ( isset( $_SERVER['QUERY_STRING'] ) ) {
        $_SERVER['REQUEST_URI'] .= "?" . $_SERVER['QUERY_STRING'];
    }
}
if ( !isset( $_SERVER['REQUEST_METHOD'] ) ) {
    $_SERVER['REQUEST_METHOD'] = 'GET';
}

By the way, when I use IIRF for URL rewriting, Chinese sometimes turns into garbled text (reproducibly for certain titles), especially when the number of Chinese characters is odd or there are ASCII letters in the title, both when clicking a link and when typing into the address bar. It occurs only when the server's non-Unicode language is Chinese, and goes away when it is set to English. It is a bug in PHP (not MediaWiki), but it never occurs on ASP pages (why?). Fortunately, the following works for most titles. In iirf.ini:

RewriteRule ^/$ /i/index.php?title=%E9%A6%96%E9%A1%B5 [L,QSA]
RewriteRule ^/zh[/]*$ /i/index.php?title=%E9%A6%96%E9%A1%B5 [L,QSA]
RewriteRule ^/zh/(.*)[_\x20]\((.*)\)$ /i/index.php?title=$1|||$2|| [L,QSA]
RewriteRule ^/zh/(.*)$ /i/index.php?title=$1| [L,QSA]

and in LocalSettings.php:

// Undo the "|" placeholder markers added by the IIRF rewrite rules.
if ( isset( $_GET['title'] ) ) {
    $_GET['title'] = str_replace( "|||", "_(", $_GET['title'] );
    $_GET['title'] = str_replace( "||", ")", $_GET['title'] );
    $_GET['title'] = str_replace( "|", "", $_GET['title'] );
}

I know it's not a good method. I need some help.

I would also like Extension:SpecialUploadLocal for 1.16.0. I tried to port it but failed; there are too many classes involved in uploading. I think it would be better to make this an official (built-in) feature.

Another question: can MediaWiki ignore upper/lower case and/or Traditional/Simplified Chinese differences in titles?

zayoo wrote:

Everything is OK now, and the PHP bug has even been fixed in a new PHP version. This topic can be closed.