
Keep simplified Chinese characters out of zh-tw please
Closed, ResolvedPublic

Description

Hello. There are a few non-Taiwan variant characters that have crept
into the Taiwan message files.

This is a different bug than just plain translating.

It involves adding another layer of caution to catch things even a good translator would miss.

The translations are fine; it is just that a final test must be added
to catch very similar-looking wrong characters.

They are not Unicode variants, no, but instead characters that have
never been seen before in Taiwan, as shown by the fact they don't make
the round trip to big5 and then back to Unicode. They are typos for
the common Taiwan version. For example, they are not simplified
Chinese 钩 present in GB2312, nor the common Taiwan version present in
big5, 鉤, but instead a third variant: 鈎.

What I am hoping you will do is add a test to make sure such
characters don't creep in again.

The test should say "Non Taiwan characters found in file ...;
Please pick the Taiwan versions (e.g., replace 鈎 with 鉤) before
this version of MediaWiki can be released" and then die(1);
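
As a rough illustration of the kind of test requested here, a minimal Python sketch follows. Python is used only for illustration (MediaWiki itself would need PHP), and the sketch assumes that Python's built-in big5 codec is an acceptable stand-in for iconv's Big5 table. The names non_big5_chars and main are hypothetical. Like the iconv round trip in this report, it is crude: it flags every character Big5 cannot encode, not only Chinese ones.

```python
import sys


def non_big5_chars(path):
    """Return (line number, character) pairs that the big5 codec cannot encode."""
    found = []
    with open(path, encoding="utf-8") as handle:
        for lineno, line in enumerate(handle, start=1):
            for ch in line:
                try:
                    ch.encode("big5")
                except UnicodeEncodeError:
                    found.append((lineno, ch))
    return found


def main(paths):
    """Print a complaint for each offending file; return 1 if any was found."""
    status = 0
    for path in paths:
        bad = non_big5_chars(path)
        if bad:
            chars = " ".join(sorted({ch for _, ch in bad}))
            print(
                f"Non Taiwan characters found in file {path}: {chars}; "
                "please pick the Taiwan versions before this version "
                "of MediaWiki can be released",
                file=sys.stderr,
            )
            status = 1
    return status
```

A release script could call sys.exit(main(sys.argv[1:])) so that a non-zero exit status blocks the release, which is the die(1) behaviour asked for above.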

Here is the makefile I used:
d=/var/lib/mediawiki/languages/messages
v:$d/MessagesZh_hant.twdiff $d/MessagesZh_tw.twdiff
%.twdiff:%.php
	iconv -ct big5 $?|iconv -f big5|diff -U0 $? -|sed /^@@/d

Note that it is crude, in that it also catches superscript numbers
etc., though all we want to be on the lookout for is the Chinese
characters.

And here are the results. You will notice the missing characters
are the ones that didn't make the round trip to big5 and back.

No I'm not just asking you to correct those characters and forget this
bug.

I'm saying that a test needs to be added to always catch such things
before each MediaWiki release can proceed.

Also consider extending the test to MessagesZh_classical.php etc.

(Lastly, this is not a diff to be applied to anything!)

make v
iconv -ct big5 /var/lib/mediawiki/languages/messages/MessagesZh_hant.php|iconv -f big5|diff -U0 /var/lib/mediawiki/languages/messages/MessagesZh_hant.php -|sed /^@@/d

--- /var/lib/mediawiki/languages/messages/MessagesZh_hant.php 2009-03-01 23:31:04.000000000 +0800
+++ - 2009-03-05 08:07:19.263443332 +0800
-/ Traditional Chinese (‪中文(繁體)‬)
+/ Traditional Chinese (中文(繁體))
-'usercssjsyoucanpreview' => "'''提示:''' 在保存前請用'顯示預覧'按鈕來測試您新的 CSS/JS 。",
+'usercssjsyoucanpreview' => "'''提示:''' 在保存前請用'顯示預'按鈕來測試您新的 CSS/JS 。",
-'edit-hook-aborted' => '編輯被鈎取消。
+'edit-hook-aborted' => '編輯被取消。
-'post-expand-template-argument-category' => '包含着略過模板參數的頁面',
+'post-expand-template-argument-category' => '包含略過模板參數的頁面',
-'timezonetext' => '¹輸入當地時間與伺服器時間(UTC)的時差。',
+'timezonetext' => '輸入當地時間與伺服器時間(UTC)的時差。',
-'timezoneoffset' => '時差¹:',
+'timezoneoffset' => '時差:',
-Template:消歧义
-Template:消除歧义
+Template:消歧
+Template:消除歧
-'protect-cascadeon' => '以下的{{PLURAL:$1|一個|多個}}頁面包含着本頁面的同時,啟動了連鎖保護,因此本頁面目前也被保護,未能編輯。您可以設定本頁面的保護級別,但這並不會對連鎖保護有所影響。',
+'protect-cascadeon' => '以下的{{PLURAL:$1|一個|多個}}頁面包含本頁面的同時,啟動了連鎖保護,因此本頁面目前也被保護,未能編輯。您可以設定本頁面的保護級別,但這並不會對連鎖保護有所影響。',
-'trackbackremove' => '([$1删除])',
+'trackbackremove' => '([$1除])',
-'version-parserhooks' => '語法鈎',
+'version-parserhooks' => '語法',
-'version-hooks' => '鈎',
+'version-hooks' => '',
-'version-parser-function-hooks' => '語法函數鈎',
+'version-parser-function-hooks' => '語法函數',
-'version-hook-name' => '鈎名',
+'version-hook-name' => '名',
iconv -ct big5 /var/lib/mediawiki/languages/messages/MessagesZh_tw.php|iconv -f big5|diff -U0 /var/lib/mediawiki/languages/messages/MessagesZh_tw.php -|sed /^@@/d

--- /var/lib/mediawiki/languages/messages/MessagesZh_tw.php 2009-03-01 06:04:42.000000000 +0800
+++ - 2009-03-05 08:07:19.292322989 +0800
-/ Chinese (Taiwan) (‪中文(台灣)‬)
+/ Chinese (Taiwan) (中文(台灣))
- * @author לערי ריינהארט
+ * @author
-'usercssjsyoucanpreview' => "'''提示:''' 在保存前請用'顯示預覧'按鈕來測試您新的 CSS/JS 。",
+'usercssjsyoucanpreview' => "'''提示:''' 在保存前請用'顯示預'按鈕來測試您新的 CSS/JS 。",
-'timezoneoffset' => '時差¹',
+'timezoneoffset' => '時差',
-Template:消歧义
-Template:消除歧义
+Template:消歧
+Template:消除歧
-'protect-cascadeon' => '以下的{{PLURAL:$1|一個|多個}}頁面包含着本頁面的同時,啟動了連鎖保護,因此本頁面目前也被保護,未能編輯。您可以設定本頁面的保護級別,但這並不會對連鎖保護有所影響。',
+'protect-cascadeon' => '以下的{{PLURAL:$1|一個|多個}}頁面包含本頁面的同時,啟動了連鎖保護,因此本頁面目前也被保護,未能編輯。您可以設定本頁面的保護級別,但這並不會對連鎖保護有所影響。',
-'trackbackremove' => '([$1删除])',
+'trackbackremove' => '([$1除])',


Version: 1.15.x
Severity: normal

Details

Reference
bz17794

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:33 PM
bzimport set Reference to bz17794.
bzimport added a subscriber: Unknown Object (MLST).

There needs to be a way that does not produce lots of false positives. Otherwise PHP's iconv function could be used.

OK, your wish is my command!

$ make v
perl -C -plwe 's/\P{Han}//g;s/./$&\n/g' /var/lib/mediawiki/languages/messages/MessagesZh_hant.php|sort -u > tmpA
iconv -ct big5 tmpA|iconv -f big5|sort -u|comm -31 - tmpA|xargs
义 删 着 覧 鈎
perl -C -plwe 's/\P{Han}//g;s/./$&\n/g' /var/lib/mediawiki/languages/messages/MessagesZh_tw.php|sort -u > tmpA
iconv -ct big5 tmpA|iconv -f big5|sort -u|comm -31 - tmpA|xargs
义 删 着 覧

d=/var/lib/mediawiki/languages/messages
v:$d/MessagesZh_hant.twdiff $d/MessagesZh_tw.twdiff
%.twdiff:%.php
	perl -C -plwe 's/\P{Han}//g;s/./$$&\n/g' $?|sort -u > tmpA
	iconv -ct big5 tmpA|iconv -f big5|sort -u|comm -31 - tmpA|xargs

I'm sure all the items in my Makefile could be done with PHP 'preg'
stuff and arrays.
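
For comparison, the same Han-only round trip can be sketched in Python instead of PHP. This is only an illustration under two assumptions: Python's big5 codec stands in for iconv's Big5 table, and "Han" is approximated by characters whose Unicode name starts with CJK UNIFIED IDEOGRAPH or CJK COMPATIBILITY IDEOGRAPH, which is narrower than Perl's \p{Han}. The function names is_han and non_big5_han are hypothetical.

```python
import unicodedata


def is_han(ch):
    """Approximate Perl's \\p{Han} using the Unicode character name."""
    name = unicodedata.name(ch, "")
    return name.startswith(("CJK UNIFIED IDEOGRAPH", "CJK COMPATIBILITY IDEOGRAPH"))


def non_big5_han(text):
    """Return the sorted unique Han characters that fail the Big5 round trip."""
    suspects = set()
    for ch in text:
        if not is_han(ch):
            continue  # skip superscripts, Hebrew, punctuation and so on
        try:
            round_trip = ch.encode("big5").decode("big5")
        except UnicodeEncodeError:
            round_trip = None
        if round_trip != ch:
            suspects.add(ch)
    return sorted(suspects)


if __name__ == "__main__":
    # The superscript is ignored; only the simplified character is reported.
    print(non_big5_han("時差¹ 消歧义 消歧義"))
```

Filtering to Han characters first is what removes the false positives (superscript numbers, the Hebrew author name) seen in the earlier diff output.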

Also you might want to add a normalization check if you don't have one
already.

Here's an example of normalization. Note it wouldn't catch the
characters mentioned earlier in this bug. Also you don't want to
convert blindly as here, but make a diff to catch them...

#!/usr/bin/perl

# use best Unicodes, at least so iconv -f utf8 -t big5
# won't hit any illegal chars.
# Copyright : http://www.fsf.org/copyleft/gpl.html
# Author : Dan Jacobson http://jidanni.org/
# Created On : 2006
# Last Modified On: Wed Nov 5 08:54:53 2008
# Update Count : 24

use strict;
use warnings FATAL => 'all';
use open qw/:std :encoding(utf8)/;
use Unicode::Normalize q(decompose);
while(<>){

$_=decompose($_);
s/没/沒/g;
s/━/-/g;
s/«/《/g; #ㄍ
s/ / /g;
print;

}

# Local Variables:
# compile-command: "echo 老老參參歷歷|normalize"
# End:
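
Rather than converting blindly as the script above does, a check could simply report what normalization would change. A minimal Python sketch follows, with some loudly stated assumptions: it uses NFC from the standard unicodedata module rather than the decompose call in the Perl script, it tests one code point at a time (so it catches compatibility ideographs but not combining sequences), and normalization_diffs is a hypothetical name.

```python
import unicodedata


def normalization_diffs(text):
    """Return (line number, original, normalized) for code points NFC would replace."""
    diffs = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            normalized = unicodedata.normalize("NFC", ch)
            if normalized != ch:
                diffs.append((lineno, ch, normalized))
    return diffs


if __name__ == "__main__":
    # U+F900 is a compatibility ideograph that normalizes to U+8C48.
    for lineno, before, after in normalization_diffs("老\n\uf900"):
        print(f"line {lineno}: U+{ord(before):04X} -> {after!r}")
```

As noted above, a check like this would not catch 鈎, 着 and the other characters from this bug, because they are ordinary code points that normalization leaves alone.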

Bug #17859 asks for removal of the current crop of Simplified etc. Chinese that
accidentally has entered Traditional translations.

That bug is in addition to this bug. This bug instead asks for permanent tests to be put in place to stop such characters creeping in in the future.

Also, the test is useless if there is no translator to fix it.

I suggest you start contributing to the zh-tw localisation on http://translatewiki.net instead of ordering others to fix alleged issues. There is a fallback chain which substitutes non-localised messages. Whenever the message is localised, it will be used instead of the message from the fallback.

There are plenty of possibilities to talk to the currently active zh translators.

Closed as WONTFIX.

Never mind. I'll just run my tests at home and submit the diffs.

Thank you very much for fixing my bug.
I sent patches to the translatewiki staff and they were applied.

At one point all the simplified characters were cleaned up.

However now some are creeping back in:

GET http://radioscanningtw.jidanni.org/index.php?title=Special:Allmessages\&uselang=zh-tw|
w3m -dump -T text/html>/tmp/allmess
perl -C -plwe 's/\P{Han}//g;s/./$&\n/g' /tmp/allmess|sort -u > /tmp/tmpA
iconv -ct big5 /tmp/tmpA|iconv -f big5|sort -u|comm -31 - /tmp/tmpA|xargs
义 着 鈎

This is caused by "holes in what a zh-tw site sees, where the
underlying simplified characters shine through."

I'm not sure if the above is an exact test. Please tell me a better
test if there is one.

This test finds much more:

GET http://translatewiki.net/wiki/Special:AllMessages?uselang=zh-tw|\
w3m -dump -T text/html>/tmp/allmess
perl -C -plwe 's/\P{Han}//g;s/./$&\n/g' /tmp/allmess|sort -u > /tmp/tmpA
iconv -ct big5 /tmp/tmpA|iconv -f big5|sort -u|comm -31 - /tmp/tmpA|xargs
个 为 义 删 动 变 号 嘅 对 将 录 户 护 无 时 显 来 标 欢 没 着 码 称 组 讨 记 论 评 译 语 说 跃 输 过 这 选 鈎 页 项

Anyway, keep simplified Chinese characters out of zh-tw please.
I don't know if Hong Kong likes them, but they are highly inappropriate for Taiwan.

If there is a better way to report this tell me. I have already sent many patches to translatewiki and they were applied... but then fell back off.

Special:Allmessages also shows messages from the fallback language.
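
To make the fallback behaviour concrete, here is a small Python sketch of how a lookup could fall through to another language when zh-tw has no translation of its own. It is a toy model and not MediaWiki's actual code: the dictionaries, the chain order in FALLBACKS and the function name resolve_message are all assumptions made for illustration. Only the two message values are taken from the diffs in this report.

```python
# Hypothetical per-language message tables (values copied from this report).
MESSAGES = {
    "zh-tw": {"timezoneoffset": "時差¹"},
    "zh-hant": {"version-hooks": "鈎"},
}

# Assumed fallback order; the real chain may differ.
FALLBACKS = {"zh-tw": ["zh-hant"]}


def resolve_message(key, lang, messages, fallbacks):
    """Return (language that supplied the text, text), walking the fallback chain."""
    for code in [lang, *fallbacks.get(lang, [])]:
        text = messages.get(code, {}).get(key)
        if text is not None:
            return code, text
    return None, None


if __name__ == "__main__":
    # zh-tw has no 'version-hooks', so the zh-hant text is shown instead.
    print(resolve_message("version-hooks", "zh-tw", MESSAGES, FALLBACKS))
```

Under this model a zh-tw page can show a non-Taiwan character even though the zh-tw message file itself is clean, which matches the "holes" described earlier.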

OK, then could you please make traditional versions of these.
$ egrep '义|着|鈎' /tmp/allmess

disambiguationspageTemplate:消除含糊 Template:消歧义
對話Template:消除歧义 Template:消歧義
edit-hook-aborted編輯被鈎取消。它並無給出解釋。
post-expand-template-argument-category包含着略過模板參數的頁面
version-hook-name鈎名
version-hooks
version-parser-function-hooks語法函數鈎
version-parserhooks語法鈎

Use
s/义/義/g;
s/着/著/g;
s/鈎/鉤/g;
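
The three substitutions above can also be written as a single translation table. The Python sketch below is only an illustration: TW_FIXES and to_taiwan_forms are hypothetical names, and it assumes these replacements are wanted only for zh-tw text.

```python
# One-to-one replacements suggested in this report, for zh-tw text only.
TW_FIXES = str.maketrans({"义": "義", "着": "著", "鈎": "鉤"})


def to_taiwan_forms(text):
    """Replace the three flagged characters with their common Taiwan forms."""
    return text.translate(TW_FIXES)


if __name__ == "__main__":
    print(to_taiwan_forms("編輯被鈎取消。"))
```

Because the replacement is blind, it is safer to diff the result against the original and review each change than to overwrite the message files directly.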

Traditional Chinese (zh-Hant) is not intended only for use in Taiwan (zh-TW), so characters like 着 and 鈎 will not be changed to 著 and 鉤 there; those two replacements are made for zh-TW only. Both 著 and 鉤 are already in place in the zh-TW localisation.

Thank you, however they still are not getting in:
$ w3m -dump http://test.wikipedia.org/wiki/Special:Version?uselang=zh-tw |egrep '鉤|鈎|alpha'
MediaWiki$ 1.15alpha (r48811)
語法鈎
語法函數鈎
$ w3m -dump http://radioscanningtw.jidanni.org/index.php?title=特殊:版本資訊 |egrep '鉤|鈎|alpha'
MediaWiki$ 1.15alpha (r49146)

鈎名 利用於