Don't add space characters between transcluded pages in Chinese Wikisource
Closed, Resolved (Public)

Description

Author: wmr89502270

Description:
Chinese does not use spaces to separate words, so there is no need to add a space automatically when lines and pages are merged.

For example, in this page:
https://zh.wikisource.org/wiki/Page:Real_Story_of_Red_China_Land_Reform_-_NARA_-_5730064.jpg

The entered text is:

這裏印出的八張照片,是由一
個逃亡的共幹從大陸偷帶到香港的
。照片所拍攝的事實發生在廣東佛
岡縣,時間是民國四十一年七月廿
....

The displayed text is:

這裏印出的八張照片,是由一 個逃亡的共幹從大陸偷帶到香港的 。照片所拍攝的事實發生在廣東佛 岡縣,時間是民國四十一年七月廿 ....

The space should be removed.
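
The effect can be reproduced outside MediaWiki. The following is a minimal Python sketch, assuming that each line break in the entered text is treated as ordinary HTML whitespace and therefore collapses to a single space when rendered. `collapse_whitespace` is a hypothetical helper written for this illustration, not MediaWiki code.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Mimic HTML inline rendering: every run of whitespace becomes one space.

    Hypothetical helper for illustration only.
    """
    return re.sub(r"\s+", " ", text).strip()


# First three lines of the entered text from the report above.
entered = (
    "這裏印出的八張照片,是由一\n"
    "個逃亡的共幹從大陸偷帶到香港的\n"
    "。照片所拍攝的事實發生在廣東佛"
)

# Each line break turns into a visible space in the middle of the sentence.
displayed = collapse_whitespace(entered)
print(displayed)
```

For English text the inserted space is what the reader expects between words; for Chinese text it lands in the middle of a sentence.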

In this page:

https://zh.wikisource.org/wiki/%E9%91%84%E6%83%85

When the text from scanned pages is merged, a space is added. This space should be removed too.


Version: unspecified
Severity: normal

Event Timeline

bzimport raised the priority of this task to Low. (Nov 22 2014, 2:38 AM)
bzimport added projects: ProofreadPage, I18n.
bzimport set Reference to bz58729.
bzimport added a subscriber: Unknown Object (MLST).

I feel this is based on how HTML works.

(In reply to comment #2)

Is it a new issue?

I don't think so.

Seems that there needs to be a configurable option for a wiki to have a space or not to have a space between transcluded pages. Presumably set in the MW: namespace.

indeed

@wmr: No progress yet because nobody has written a patch yet. You are very welcome to use developer access to submit a proposed code change as a Git branch directly into Gerrit which makes it easier to review it quickly and provide feedback. Thanks!

Can anyone do it for me?

I think it's as easy as changing line 231 from

$out .= " ";

to

$out .= "{{:MediaWiki:Word-separator}}";

and <del>setting the content of MediaWiki:Word-separator as

&#32;

by default through a way that I don't know</del>.

It already existed.

I might have time to do this later today.

Just to clarify: is it correct that there are *no* situations in which a space character should be added between transcluded pages? (It sounds like it, but I just wanted to make sure.)

And the word-separator system message sounds like a good option. Makes me wonder, though, if there's some other part of MediaWiki that already needs to know this info, that we could use. Does anyone know of such a thing?

@Samwilson,

Thank you.

And the word-separator system message sounds like a good option. Makes me wonder, though, if there's some other part of MediaWiki that already needs to know this info, that we could use. Does anyone know of such a thing?

Yes. https://translatewiki.net/w/i.php?title=MediaWiki:Word-separator

A space is required when merging pages or lines in English, but not in Chinese.

I have just realized that my proposed change only stops a space being added between pages, not between lines.

The space between lines is added on all pages regardless of namespace, so fixing that problem would involve changing something other than this proofreading extension.

I think for inter-line spaces, you just have to remove them manually; that's a standard HTML sort of thing. On English Wikisource, it's policy to remove mid-paragraph line breaks.

Samwilson renamed this task from "Please stop add space automatically when merge lines and pages in Chinese Wikisource" to "Don't add space characters between transcluded pages in Chinese Wikisource". (Dec 5 2016, 1:38 AM)

If it's part of ProofreadPage, the message should probably be something more specific e.g. proofreadpage-page-separator. Sound okay?

Looks good to me.

On English Wikisource, it's policy to remove mid-paragraph line breaks.

Would you mind providing a citation for this statement?

In T60729#2845977, @wmr wrote:

Would you mind providing a citation for this statement?

I shouldn't have said 'policy' exactly, but rather 'convention'. See https://en.wikisource.org/wiki/Help:Beginner%27s_guide_to_typography#Paragraphs_and_sentences

Change 430811 had a related patch set uploaded (by Candalua; owner: Candalua):
[mediawiki/extensions/ProofreadPage@master] PagesTagParser: Make the page separator a configurable system message

https://gerrit.wikimedia.org/r/430811

Seeing that this task was still stalled, I finally gathered the courage to prepare a patch myself, following wmr's and Sam Wilson's suggestions.
I named the new system message MediaWiki:Proofreadpage page separator.

I'm not sure if using All-and-every-Wikisource is ok or not, as its task description says "Please do not report language specific tasks under this project"

Well, the "real" task is "make the separator configurable", which applies to all Wikisources, although most of them will just use the default.

Change 430811 abandoned by Candalua:
PagesTagParser: Make the page separator a configurable system message

https://gerrit.wikimedia.org/r/430811

Change 431100 had a related patch set uploaded (by Candalua; owner: Candalua):
[mediawiki/extensions/ProofreadPage@master] PagesTagParser: Make the page separator a configuration variable

https://gerrit.wikimedia.org/r/431100

Ok, second try. This new patch uses a configuration variable rather than a system message, as suggested by Tpt.
Projects that want to suppress the space between pages will have to set $wgProofreadPagePageSeparator = "".
By default the value will be &#32; as before.
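
To illustrate the idea of a configurable separator, here is a minimal Python sketch. It is not the actual PagesTagParser code: `join_pages` and `DEFAULT_PAGE_SEPARATOR` are hypothetical names, and the sketch assumes the separator is simply placed between consecutive page texts.

```python
from typing import Iterable

# Stands in for the default described above (&#32; is the entity for a plain space).
DEFAULT_PAGE_SEPARATOR = " "


def join_pages(pages: Iterable[str], separator: str = DEFAULT_PAGE_SEPARATOR) -> str:
    """Concatenate transcluded page texts with a configurable separator.

    Hypothetical illustration; the real extension builds wikitext instead.
    """
    return separator.join(pages)


# Default behaviour: a space between pages, as English text expects.
english = join_pages(["end of page one", "start of page two"])

# A wiki that sets the separator to the empty string gets no space at all.
chinese = join_pages(["是由一", "個逃亡的共幹"], separator="")
```

Keeping the space as the default means wikis that never touch the setting see no change in behavior.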

I would suggest that configuring this for a wiki requires community consensus, followed by a Wikimedia site request.

Guessing that this will be described on the Extension homepage.

Change 431100 merged by jenkins-bot:
[mediawiki/extensions/ProofreadPage@master] PagesTagParser: Make the page separator a configuration variable

https://gerrit.wikimedia.org/r/431100

@Billinghurst Yes, the change made in ProofreadPage only makes the ProofreadPage behavior configurable; it does not change anything by itself. The change to zhwikisource should be done through a site request after a consensus on zhwikisource.

Chinese Wikisource has already reached a consensus to enable this feature as soon as possible. I have notified the community at https://zh.wikisource.org/wiki/Wikisource:%E5%86%99%E5%AD%97%E9%97%B4#%E5%B0%86%E9%A1%B5%E9%9D%A2%E4%B9%8B%E9%97%B4%E7%9A%84%E7%A9%BA%E6%A0%BC%E7%A7%BB%E9%99%A4%E7%9A%84%E4%BB%A3%E7%A0%81%E5%B7%B2%E9%83%A8%E7%BD%B2%E4%BA%8E%E7%BB%B4%E5%9F%BA%E6%96%87%E5%BA%93%EF%BC%8C%E9%9C%80%E8%A6%81%E5%A4%A7%E5%AE%B6%E6%8A%95%E7%A5%A8%E5%90%AF%E7%94%A8 . In fact, a gadget that does the very same job has been available for months at Chinese Wikisource. It was not enabled by default because of performance concerns about running massive regular expression replacements over the entire article. This patch is the most efficient and easiest solution to the problem.
Korean, Japanese and Vietnamese Wikisources might also be interested in this improvement. Can anybody notify them?
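
The gadget's source is not shown in this task, so the following is only a guess at the kind of regular expression replacement described above, written as a minimal Python sketch. The character ranges and the name `strip_cjk_spaces` are assumptions made for illustration; the real gadget may work differently.

```python
import re

# Assumption: "CJK" here covers CJK punctuation (U+3000-U+303F), CJK Unified
# Ideographs (U+4E00-U+9FFF) and fullwidth forms (U+FF00-U+FFEF).
CJK = r"\u3000-\u303f\u4e00-\u9fff\uff00-\uffef"

# Whitespace that has a CJK character directly on both sides.
SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{CJK}])[ \t\r\n]+(?=[{CJK}])")


def strip_cjk_spaces(text: str) -> str:
    """Remove whitespace between two CJK characters; leave other spaces alone."""
    return SPACE_BETWEEN_CJK.sub("", text)


cleaned = strip_cjk_spaces("是由一 個逃亡的共幹從大陸偷帶到香港的 。照片")
untouched = strip_cjk_spaces("Latin words keep their spaces")
```

A replacement like this has to scan the whole rendered article, which is consistent with the performance concern mentioned above; not emitting the separator in the first place avoids that work.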

I created a separate task for the site request related to zh.source: T194875.

I will notify (in English) the other communities about making similar site requests, and then I think we can close this task.

I opened a similar request T195873 on behalf of the Japanese community.

The Koreans were notified but so far they did not answer.

I didn't ask the Vietnamese, but they are probably best off with the space separator, as most of their texts seem to be in the Latin-based alphabet.

This can be considered as resolved, can't it? Respective gerrit commit merged (and deployed, because respective train picked it up automatically), what are the actionables here? Only the individual communities? Can (and should) be handled in dedicated tasks IMHO.

Candalua claimed this task.

Yes, I'm closing it as resolved.

Removing the spaces produced by line breaks is still badly needed on Chinese Wikisource, because line breaks are kept in the source text to help proofreading.

How about adding a new option in "pages" to replace HTML line breaks with a customized string? Like

<pages index="file.pdf" from="18" to="24" breaks="" />

to replace breaks with nothing, which means to remove the line breaks.
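
As a rough illustration of this proposal only (the `breaks` attribute is a suggestion in this comment, not an existing ProofreadPage feature), the replacement could behave like the following Python sketch. `replace_breaks` is a hypothetical name, and the default of keeping line breaks unchanged is an assumption.

```python
def replace_breaks(page_text: str, breaks: str = "\n") -> str:
    """Replace every line break in a page's text with the `breaks` string.

    Hypothetical sketch of the proposed option; with the assumed default
    ("\n") interior line breaks are left as they are.
    """
    return breaks.join(page_text.splitlines())


page = "這裏印出的八張照片,是由一\n個逃亡的共幹從大陸偷帶到香港的"

# breaks="" removes the line breaks, so no space can appear at render time.
merged = replace_breaks(page, breaks="")

# breaks=" " would reproduce what English text needs.
spaced = replace_breaks("first line\nsecond line", breaks=" ")
```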

wmr subscribed.
wmr renamed this task from "Don't add space characters between transcluded pages in Chinese Wikisource" to "Don't add space characters between transcluded pages and customized HTML line break replacement in Chinese Wikisource". (Jun 5 2020, 2:12 PM)
wmr reopened this task as Open.
wmr raised the priority of this task from Low to Medium.

@wmr: Please do not broaden the scope or change the priority of existing tickets. Follow https://www.mediawiki.org/wiki/How_to_report_a_bug - thanks.

Aklapper renamed this task from "Don't add space characters between transcluded pages and customized HTML line break replacement in Chinese Wikisource" to "Don't add space characters between transcluded pages in Chinese Wikisource". (Jun 5 2020, 2:25 PM)
Aklapper closed this task as Resolved.
Aklapper lowered the priority of this task from Medium to Low.