Pywikibot transliteration should support Chinese transliteration
Closed, ResolvedPublic

Description

Author: nzmoihue

Description:
https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/userinterfaces/transliteration.py should support more scripts, such as Korean, Chinese or Malayalam (ml). The jQuery.ime transliteration keyboards at https://github.com/wikimedia/jquery.ime/tree/master/rules can be used as a basis for developing it, for example https://github.com/wikimedia/jquery.ime/blob/master/rules/ml/ml-transliteration.js

I am using the output of this code in my gadget http://www.wikidata.org/wiki/MediaWiki:Gadget-SimpleTransliterate.js (http://commons.wikimedia.org/wiki/File:Wikidata_Transliteration_Gadget.png), which is why I would like it to be developed a little more.


Version: unspecified
Severity: enhancement
See Also:
T75410

Details

Reference
bz56524

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 22 2014, 2:33 AM
bzimport set Reference to bz56524.
bzimport added a subscriber: Unknown Object (????).

Change 97040 had a related patch set uploaded by Ladsgroup:
Improving transliteration support

https://gerrit.wikimedia.org/r/97040

Change 97044 had a related patch set uploaded by Ladsgroup:
Improving transliteration support

https://gerrit.wikimedia.org/r/97044

Change 97040 merged by jenkins-bot:
Improving transliteration support

https://gerrit.wikimedia.org/r/97040

Change 97044 merged by jenkins-bot:
Improving transliteration support

https://gerrit.wikimedia.org/r/97044

nzmoihue wrote:

Reopened for Chinese transliteration

Can you give me a list of the Chinese characters that need to be added to this list?

(For future reference, defining the exact scripts to be supported in a bug request is welcome. If it's just about "support more", then a report can easily become unfixable as comments broaden its scope.)

I checked that source but I couldn't find the dictionary file. [1] says there is a file named CTLauBig5.tit, but there isn't. Can you be more precise about the dictionary?

[1] http://cpansearch.perl.org/src/KAWASAKI/Lingua-ZH-Romanize-Pinyin-0.23/lib/Lingua/ZH/Romanize/DictZH.pm

nzmoihue wrote:

There is no one-to-one "dictionary" there, which is why I CCed the original writer of the transliteration. Also have a look at https://github.com/axgle/pinyin

Created attachment 16327
Python translation of https://github.com/axgle/pinyin

(In reply to [no longer active user] from comment #14)

There is no one-to-one "dictionary" there, which is why I CCed the original
writer of the transliteration. Also have a look at
https://github.com/axgle/pinyin

{{done}} Translated the four scripts to Python. See attachment.

However, there is a bug that causes 6651 Chinese characters to be transliterated as 'zuo'.
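A bug of this kind can be spotted by counting how many characters share each reading. The following is only a minimal sketch: the function name `suspicious_readings`, the threshold and the miniature table are hypothetical and not taken from the attachment, and two entries are deliberately wrong to mimic the problem.

```python
from collections import Counter


def suspicious_readings(table, threshold=0.2):
    """Return readings assigned to more than `threshold` of all characters."""
    counts = Counter(table.values())
    total = len(table)
    return {reading: n for reading, n in counts.items() if n / total > threshold}


# Hypothetical miniature table; '字' and '汉' are deliberately wrong here
# to imitate a lookup bug that collapses many characters onto one reading.
sample = {'中': 'zhong', '文': 'wen', '做': 'zuo', '字': 'zuo', '汉': 'zuo'}

print(suspicious_readings(sample, threshold=0.5))  # {'zuo': 3}
```

A single reading that covers a large share of the table is a strong hint that the lookup falls through to a default entry.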

I suppose we can add this, but what's the intended use case? We support full unicode console output (and input, but transliteration is output-only) on all systems.

Change 157498 had a related patch set uploaded by Zhuyifei1999:
Improving transliteration support for Chinese

https://gerrit.wikimedia.org/r/157498

(In reply to Merlijn van Deen from comment #17)

I suppose we can add this, but what's the intended use case? We support full
unicode console output (and input, but transliteration is output-only) on
all systems.

Yes, that is indeed a hard question. Why do we still have transliteration.py?

Change 157498 merged by jenkins-bot:
Improving transliteration support for Chinese

https://gerrit.wikimedia.org/r/157498

Reopen if there is more to be done.

(In reply to John Mark Vandenberg from comment #21)

Reopen if there is more to be done.

zh-hant (Traditional Chinese) needed.

I will repeat my question:

I suppose we can add this, but what's the intended use case? We support full unicode console output (and input, but transliteration is output-only) on all systems.

/why/ is it needed?

(In reply to Merlijn van Deen from comment #23)

I will repeat my question:

I suppose we can add this, but what's the intended use case? We support full unicode console output (and input, but transliteration is output-only) on all systems.

/why/ is it needed?

I suppose that when the output is somehow ASCII-limited (for some reason, the log files written by the grid engine on Tool Labs are an example of this), transliterated output could be more useful than a pile of question marks or other unreadable code.
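To illustrate the difference, here is a minimal sketch that compares plain ASCII replacement with a table-based fallback. The helper `to_ascii` and the two-entry table are hypothetical and are not Pywikibot's actual API.

```python
def to_ascii(text, table):
    """Replace non-ASCII characters using `table`, falling back to '?'."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.append(table.get(ch, '?'))
    return ''.join(out)


# Hypothetical two-entry table, for illustration only.
table = {'中': 'zhong', '文': 'wen'}
text = 'Page 中文 saved'

# What an ASCII-limited log would show without transliteration.
print(text.encode('ascii', 'replace').decode('ascii'))  # Page ?? saved
# The same line with a transliteration fallback.
print(to_ascii(text, table))  # Page zhongwen saved
```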

Most of the useful ones have been added since 603ab87d916f73c5981b03ae6deb269795306276.

Is there anything left to do here?

I close this now. Please feel free to reopen it if there is something left to do.

Stang subscribed.

I believe some further work is needed here:

  • More character support - the current list is still incomplete
  • Handling Chinese polyphonic (heteronym) characters, such as 行, which has two pronunciations, hang and xing. I have no idea how to handle such cases, as this is a context-free process.

It will benefit a gadget on Wikidata.

  • More character support - the current list is still incomplete

Can you give an example, a description, or an implementation in any language that could be adopted? Without further information this cannot be completed.

  • Handling Chinese polyphonic (heteronym) characters, such as 行, which has two pronunciations, hang and xing. I have no idea how to handle such cases, as this is a context-free process.

Polyphonic characters cannot be handled in Pywikibot. Transliteration is applied to arbitrary fragments of output text, with no context or semantic processing. The only option would be to transliterate a character in combination with its neighbouring characters. 行 is already transliterated as xing.
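The idea of using neighbouring characters could be sketched as a word-level lookup that is tried before the per-character table. This is only an illustration under assumptions: the tables and the function `transliterate` below are hypothetical and do not reflect how Pywikibot's transliteration.py works.

```python
# Hypothetical tables, for illustration only.
CHAR_TABLE = {'行': 'xing', '银': 'yin', '人': 'ren'}
WORD_TABLE = {'银行': 'yinhang'}  # two-character words with a fixed reading


def transliterate(text):
    """Try a two-character word match first, then single characters."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in WORD_TABLE:
            out.append(WORD_TABLE[pair])
            i += 2
        else:
            # Unknown characters are passed through unchanged.
            out.append(CHAR_TABLE.get(text[i], text[i]))
            i += 1
    return ''.join(out)


print(transliterate('行人'))  # xingren
print(transliterate('银行'))  # yinhang
```

Such a word table only helps for combinations it lists; a character that appears alone still gets its single default reading.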

Xqt claimed this task.

I am closing this as invalid because there is no description of what remains to be done. An implementation needs a list of Unicode characters and their transliterations. Please feel free to reopen it if there is something left to do and you have such a table or an existing implementation in any language.
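As a rough sketch of what such a table could look like, the snippet below parses tab-separated lines of code point and reading into a dictionary. The file format and the name `parse_table` are assumptions made for illustration, not an agreed format for this task.

```python
def parse_table(lines):
    """Parse lines of the form 'U+XXXX<TAB>reading' into a dict."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        codepoint, reading = line.split('\t')
        table[chr(int(codepoint[2:], 16))] = reading
    return table


# Hypothetical sample input.
sample_lines = ['# codepoint<TAB>reading', 'U+4E2D\tzhong', 'U+6587\twen']
print(parse_table(sample_lines))  # {'中': 'zhong', '文': 'wen'}
```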