Language conversion is not applied in documents delivered by the Collection extension
Closed, Invalid (Public)

Assigned To

None

Authored By

	• bzimport
	Mar 3 2012, 1:22 AM

Description

Author: yaoziyuan

Description:
After the fix for T35430, the Chinese Wikipedia community says there is still another problem that prevents them from adopting the latest MediaWiki version that provides PDF/ebook creation for the Chinese Wikipedia.

The remaining problem is this: because the wikitext of the Chinese Wikipedia mixes simplified and traditional Chinese (mainland contributors tend to edit in simplified Chinese, while contributors from Taiwan and Hong Kong tend to edit in traditional Chinese), it needs to be converted to all-simplified or all-traditional before being displayed or rendered into PDFs.

Version: unspecified
Severity: major
See Also:
http://web.archive.org/web/20111002213849/http://code.pediapress.com/wiki/ticket/574

Details

Reference: bz34919

Related Objects

Status	Subtype	Assigned	Task
Invalid		None	T36919 Language conversion is not applied in documents delivered by the Collection extension
Open		None	T43716 [EPIC] Support language variant conversion in Parsoid
Open		None	T21044 Document LanguageConverter
Open		None	T53587 Parsoid needs to run findVariantLink or some equivalent thing
Invalid		• GWicke	T48658 Tpl-style encapsulation for <include> and lang-variant conversions
Resolved		liangent	T45547 MediaWiki needs a fictitious variant for English for easier variant development work
Resolved		thiemowmde	T156280 Wikibase assumes English doesn't have a variant
Open		None	T54661 Preprocessor/Parser irregularities with -{...}- variant constructs.
Resolved		cscott	T146304 Preprocessor should handle -{...}- variant constructs in template arguments
Resolved		cscott	T153761 Incorrect parser output if -{{ appears in wikitext
Resolved		• Elitre	T165175 Support communications around the preprocessor fixups
Resolved		cscott	T146305 Parser should protect -{...}- variant constructs in links
Resolved		cscott	T54192 Markups in alt param of <gallery> are "eaten" during parsing
Resolved		cscott	T54190 <gallery> with |link=<external link> doesn't work on wikis with LanguageConverter
Resolved		cscott	T153135 doBlockLevels breaks with embedded language converter markup
Resolved		cscott	T153140 -{ ... }- markup breaks tables
Open		None	T153265 Language converter source text and language names cannot use <nowiki> escaping.
Duplicate	BUG REPORT	None	T353501 new Parsoid cannot parse the converter wikitext syntax
Resolved		cscott	T153341 Export LanguageConverter enabled status in page info from core
Open		None	T204966 Production use of LanguageConverter for read views of Phase 2A languages
Open		None	T204968 Production use of LanguageConverter for read views of Phase 2B languages
Open		None	T204969 Production use of LanguageConverter for read views of Phase 2C languages
Open		None	T222328 [extlink] parsing - link cannot contain language variant or extension tags
Resolved	BUG REPORT	Jgiannelos	T305383 [BUG] Kazakh Wikipedia Character mapping
Open		None	T320733 Support and document how language conversion work with multidirectional wikitext <=> HTML conversion on language-conversion-supported extensions.

Event Timeline

• bzimport raised the priority of this task from to Medium. Nov 22 2014, 12:19 AM

• bzimport added projects: Collection, I18n.

• bzimport set Reference to bz34919.

• bzimport added a subscriber: Unknown Object (MLST).

• bzimport created this task. Mar 3 2012, 1:22 AM

Language converter is not only used on zhwiki.

volker.haas wrote:

Is the conversion to all-simplified or all-traditional done for "regular" display in the browser, and is this therefore only a problem with the PDFs at the moment? If that is the case:

How is the conversion done for the browser?
Can someone provide a minimal example with simplified and traditional Chinese?
What would be a good starting point to read in order to understand the problems of simplified vs. traditional Chinese and the conversion methods?

yaoziyuan wrote:

The Chinese Wikipedia itself already has a simplified <-> traditional Chinese automatic conversion tool for displaying. It is explained here:

http://meta.wikimedia.org/wiki/Automatic_conversion_between_simplified_and_traditional_Chinese

An example of the conversion in action:

Simplified: http://zh.wikipedia.org/zh-cn/%E4%BA%94%E4%BB%A3%E5%8D%81%E5%9B%BD

Traditional: http://zh.wikipedia.org/zh-tw/%E4%BA%94%E4%BB%A3%E5%8D%81%E5%9B%BD
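
The two example URLs differ only in the variant path segment. As a minimal sketch of that pattern (the helper name `variant_url` and its `host` default are assumptions made for illustration, not part of MediaWiki), the URLs can be built like this:

```python
from urllib.parse import quote


def variant_url(title: str, variant: str, host: str = "zh.wikipedia.org") -> str:
    """Build a variant-specific page URL following the pattern of the examples above.

    Hypothetical helper: `variant` is the path segment such as "zh-cn" or "zh-tw".
    """
    # Wiki URLs use underscores for spaces; the title is percent-encoded as UTF-8.
    encoded_title = quote(title.replace(" ", "_"))
    return f"http://{host}/{variant}/{encoded_title}"


simplified = variant_url("五代十国", "zh-cn")
traditional = variant_url("五代十国", "zh-tw")
print(simplified)
print(traditional)
```

Fetching either URL returns HTML that the server has already converted, which is what the browser displays.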

(In reply to comment #2)

Is the conversion to all-simplified or all-traditional done for "regular" display in the browser - and therefore only a problem with the PDFs at the moment? If that is the case:

how is the conversion done for the browser

can someone provide a minimal example with simplified and traditional chinese

what would be a good start to read in order to understand the problematic of simplified vs. traditional chinese and conversion methods

Technically, the language conversion process runs after the normal parsing process. This means that if you parse the article in your own way (to generate a PDF), you have to apply the conversion to your parser's result yourself. Note that the current converter (in languages/LanguageConverter.php) is designed only to convert HTML.
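
To illustrate what "convert HTML after parsing" could look like outside MediaWiki, here is a rough Python sketch that converts only the text nodes of an HTML fragment and leaves tags and attributes alone. It is not how LanguageConverter.php is implemented; the class name, the function name and the three-character table are assumptions chosen for the example.

```python
import html
from html.parser import HTMLParser

# Illustrative one-to-one table (traditional -> simplified); NOT the real on-wiki table.
TO_SIMPLIFIED = str.maketrans({"國": "国", "華": "华", "語": "语"})


class TextNodeConverter(HTMLParser):
    """Re-emit HTML unchanged except for text nodes, which are run through a table."""

    def __init__(self, table):
        super().__init__(convert_charrefs=True)
        self.table = table
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Emit the raw start tag so attributes are left untouched.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references were decoded by the parser, so re-escape after converting.
        # Limitation of this sketch: <script>/<style> bodies would be escaped too,
        # and comments and doctype declarations are dropped.
        self.out.append(html.escape(data.translate(self.table), quote=False))


def convert_html(fragment, table=TO_SIMPLIFIED):
    parser = TextNodeConverter(table)
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)


print(convert_html('<p title="國">中華民國 &amp; 國語</p>'))
```

A renderer with its own parse tree would apply the same idea to its text nodes rather than to serialized HTML.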

yaoziyuan wrote:

I'm sure there are many PHP-based simplified/traditional Chinese conversion libraries.

(In reply to comment #5)

I'm sure there are many PHP-based simplified/traditional Chinese conversion
libraries.

mwlib (the wikitext parser and PDF generator used by Extension:Collection) is not written in PHP. Besides, you have to consider conversion markup such as -{}-.
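
As a hedged illustration of why the -{}- markup matters, the Python sketch below handles only two simple forms: `-{text}-` (emit verbatim, do not convert) and `-{zh-hans:X; zh-hant:Y}-` (pick the text for the requested variant). Flags, nesting, fallback chains and interaction with links or templates are deliberately ignored, and the function names are invented for this example.

```python
import re

# Non-greedy match of a single, non-nested -{ ... }- block.
MARKUP = re.compile(r"-\{(.*?)\}-", re.DOTALL)

# Variant codes recognised by this sketch.
ZH_VARIANTS = ("zh-hans", "zh-hant", "zh-cn", "zh-tw", "zh-hk", "zh-mo", "zh-sg", "zh-my")


def pick_variant(body, variant, variants=ZH_VARIANTS):
    """Return the text a -{...}- body should produce for `variant`."""
    rules = {}
    for part in body.split(";"):
        code, sep, value = part.partition(":")
        if sep and code.strip() in variants:
            rules[code.strip()] = value.strip()
    if not rules:
        # Plain -{text}-: protect the text from conversion.
        return body
    # Simplification: fall back to the first listed rule if the variant is missing.
    return rules.get(variant, next(iter(rules.values())))


def resolve_markup(text, variant, convert):
    """Resolve -{...}- blocks and run `convert` on everything outside them."""
    out = []
    pos = 0
    for match in MARKUP.finditer(text):
        out.append(convert(text[pos:match.start()]))
        out.append(pick_variant(match.group(1), variant))
        pos = match.end()
    out.append(convert(text[pos:]))
    return "".join(out)


def to_simplified(s):
    # Stand-in converter with a one-character illustrative table.
    return s.translate(str.maketrans({"國": "国"}))


sample = "中國的-{zh-hans:计算机; zh-hant:電腦}-和-{國}-"
print(resolve_markup(sample, "zh-hans", to_simplified))
```

Any converter used by a PDF renderer would need at least this much handling before the plain text conversion runs, otherwise protected text gets converted and the markup itself leaks into the output.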

volker.haas wrote:

The conversion script doesn't exactly look trivial: http://svn.wikimedia.org/doc/LanguageConverter_8php_source.html

Does anybody have an idea how to get the conversion done without having to reimplement the language converter in Python in a form suitable for mwlib?

yaoziyuan wrote:

Google for an existing Python-based conversion library?

ralf_wikimedia wrote:

or just ask for patches?

yaoziyuan wrote:

Google Translate also offers simp. <-> trad. Chinese conversion. Maybe call its API?

(In reply to comment #10)

Google Translate also offers simp. <-> trad. Chinese conversion. Maybe call its
API?

Even in LanguageConverter.php, more code goes into tasks such as parsing conversion markup, picking out the parts that should be converted, reading the on-site conversion tables, and handling page links than into actually converting the text.
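
To make the point concrete, the text conversion step itself can be quite small. The following Python sketch does a greedy longest-match replacement over a phrase table; the technique is an assumption chosen for illustration rather than a description of LanguageConverter.php, and the table entries are made up for the example.

```python
def convert_longest_match(text, table):
    """Replace substrings using `table`, always preferring the longest key at each position."""
    max_len = max(map(len, table), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            # No table entry starts here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Illustrative table only: single characters plus one phrase-level entry.
TABLE = {"國": "国", "電": "电", "腦": "脑", "電腦": "计算机"}

print(convert_longest_match("中國電腦電", TABLE))
```

Everything else listed above (markup, link handling, loading the on-wiki tables) has to wrap around a step like this one.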

yaoziyuan wrote:

I increasingly believe such features would be better implemented on the client side, e.g. as a "site to PDF ebook" program that converts a given site (blog, wiki, pages up to a certain depth from a start page, etc.) to a PDF.

yaoziyuan wrote:

If you do it too "back end"-wise, you have too much processing in the middle, like this Chinese conversion thing.

volker.haas wrote:

The problem with the "client-side" approach is that every client needs to re-implement these specific features (like the simple/traditional conversion).

If we ever use HTML as the base for PDF rendering this problem will be solved as long as MediaWiki takes care of the transformation. In the meantime I'd happily accept a patch for the problem, but I lack the time to implement the simple/traditional conversion.

yaoziyuan wrote:

(In reply to comment #14)

The problem with the "client-side" approach is that every client needs to
re-implement these specific features (like the simple/traditional conversion).

No, because simple/traditional conversion is already taken care of by the Chinese Wikipedia on the server side.

If we ever use HTML as the base for PDF rendering this problem will be solved
as long as MediaWiki takes care of the transformation. In the meantime I'd
happily accept a patch for the problem, but I lack the time to implement the
simple/traditional conversion.

That's exactly why I think third-party client-side or browser-side PDF/ebook creation solutions would provide what PediaPress hasn't provided.

barabbas wrote:

FYI, before LanguageConverter.php, there was a quick'n'dirty trial in LanguageZh.php: https://bugzilla.wikimedia.org/show_bug.cgi?id=5343

(In reply to Liangent from comment #6)

Besides you have to consider conversion markups such as
-{}-.

The test case provided by Nikola in http://web.archive.org/web/20111002213849/http://code.pediapress.com/wiki/ticket/574 is still valid:
https://sr.wikipedia.org/w/index.php?title=%D0%9F%D0%BE%D1%81%D0%B5%D0%B1%D0%BD%D0%BE:%D0%9A%D1%9A%D0%B8%D0%B3%D0%B0&bookcmd=render_article&arttitle=%D0%9A%D0%BE%D1%80%D0%B8%D1%81%D0%BD%D0%B8%D0%BA%3A%D0%9D%D0%B8%D0%BA%D0%BE%D0%BB%D0%B0+%D0%A1%D0%BC%D0%BE%D0%BB%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%2FCollection+bugs&oldid=2610141&writer=rdf2latex

Created attachment 16595
Корисник:Никола Смоленски/Collection bugs.pdf

Serbian test case PDF as produced by [[mw:OCG]]/rdf2latex/new PDF rendering.

Attached:

7c13ce116ae4562497364dd6d0be4567608265d5 (42 KB)

He7d3r awarded a token. Nov 24 2014, 1:17 PM

Shizhao added a project: MediaWiki-Language-converter. May 12 2015, 1:05 PM

Shizhao set Security to None.

MarkAHershberger unsubscribed. May 12 2015, 3:32 PM

Yes, this is a side-effect of the fact that Parsoid still lacks support for language converter. But I'm working on it!

Shizhao added a subtask: T125033: [DO NOT USE] Chinese Wikimedia projects (tracking) [superseded by #Chinese-Sites]. Jan 28 2016, 2:11 AM

Restricted Application added a project: Internet-Archive. Jan 28 2016, 2:11 AM

Restricted Application added a subscriber: Aklapper.

Liuxinyu970226 removed a subtask: T125033: [DO NOT USE] Chinese Wikimedia projects (tracking) [superseded by #Chinese-Sites]. Jan 28 2016, 4:03 AM

Liuxinyu970226 added a parent task: T125033: [DO NOT USE] Chinese Wikimedia projects (tracking) [superseded by #Chinese-Sites].

Shizhao mentioned this in T128425: Introducing the Book Creator in Chinese Wikipedia. Mar 1 2016, 2:23 AM

Aklapper added a project: Chinese-Sites. Dec 21 2016, 9:58 AM

Restricted Application added a subscriber: Stang. Dec 21 2016, 9:58 AM

Aklapper removed a parent task: T125033: [DO NOT USE] Chinese Wikimedia projects (tracking) [superseded by #Chinese-Sites]. Dec 21 2016, 10:03 AM

zhuyifei1999 moved this task from Backlog to Extensions/Skins on the Chinese-Sites board. Dec 21 2016, 8:08 PM

Liuxinyu970226 updated the task description. Dec 29 2016, 7:40 AM

Liuxinyu970226 removed a subscriber: • wikibugs-l-list.

Liuxinyu970226 changed the status of subtask T43716: [EPIC] Support language variant conversion in Parsoid from Open to Stalled. Jan 1 2017, 5:44 AM

Legoktm changed the status of subtask T43716: [EPIC] Support language variant conversion in Parsoid from Stalled to Open. Jan 1 2017, 10:04 AM

Liuxinyu970226 mentioned this in T158467: Re-enable Collection on Sranan Wikipedia (srnwiki). Feb 18 2017, 9:22 AM

Apologies for copying this sentence here, which @Aklapper posted in many OCG-related tasks:

As already announced in Tech News, OfflineContentGenerator (OCG) will not be used anymore after October 1st, 2017 on Wikimedia sites. OCG will be replaced by Electron. You can read more on mediawiki.org.

Shall we focus on T167603? Or will this problem still exist even if the PDF features are lost?

Amire80 moved this task from Untriaged to Script & term conversion on the I18n board. Feb 4 2018, 10:48 AM

Harej removed a project: Internet-Archive. Nov 2 2021, 7:32 PM

Stang moved this task from Extensions/Skins to Closed on the Chinese-Sites board. Nov 3 2021, 3:34 PM

Stang moved this task from Closed to Extensions/Skins on the Chinese-Sites board.

Winston_Sung moved this task from Backlog to Extensions/Skins on the MediaWiki-Language-converter board. Mar 18 2023, 4:22 AM

Restricted Application added a subscriber: Ericliu1912. Mar 18 2023, 4:22 AM

Closing as obsolete - collection export functionality has been dead for years. It appears the same bug exists as T167603 for the maintained PDF functionality, so this ticket serves no continued purpose.

Stang moved this task from Extensions/Skins to Closed on the Chinese-Sites board. Nov 5 2023, 8:03 PM

Stang unsubscribed. Nov 8 2023, 10:33 PM

	F9210: 7c13ce116ae4562497364dd6d0be4567608265d5
	Nov 22 2014, 12:19 AM
