
Advanced editor toolbar i18n messages with markup should be avoided
Closed, DeclinedPublic

Description

In the Usability Initiative extension, EditTool section, there are
(imho repetitive) messages with:

  1. wikitext markup,
  2. HTML markup, which usually would be generated by the parser.

Imho, it is bad coding style to have them, and to have them translated.
They should be automatically generated.
Pure technical changes in the parser output should not lead to
the necessity to amend and retranslate (lots of) i18n messages
elsewhere.

I believe there should be purely textual messages.
Wikitext markup could be added programmatically.
(Alternatively, have wikitext-only messages.)
Parser calls should be used to convert them to HTML.
Maybe a special hook or parameter, "convert sample/snippet",
would be required to achieve this goal, but having that is
certainly worth the added safety, plus a few hundred hours of
translators' labour.


Version: unspecified
Severity: enhancement

Details

Reference
bz19190

Event Timeline

bzimport raised the priority of this task from to Lowest.Nov 21 2014, 10:42 PM
bzimport added projects: WikiEditor, I18n.
bzimport set Reference to bz19190.
bzimport added a subscriber: Unknown Object (MLST).

Hopefully there can be a solution to this that makes sense, as I see all your points. However, the current implementation is the way it is for the following reasons.

  • The messages with wikitext in them are input examples, and need to be unparsed wikitext because they are illustrating the type of syntax one would use in a wiki page.
  • The messages with HTML in them are output examples, which would initially seem a great candidate for using the parsemag option with wfMsg; however, the code the parser outputs is not always the same as the code we want in the examples, as the examples are merely visual representations of what the output will look like, and are not intended to be fully functional. Additionally, these bits of code may include images which are bundled with the toolbar itself so as not to depend on the existence of an uploaded image in the wiki. This is especially important when distributing the toolbar to many different wikis - remember this is not only for the English Wikipedia, and it will also be used by non-Wikimedia projects.
  • The messages are being inserted (without parsing) into a JavaScript loadGM() function. The client-side gM() function can perform $1 replacements, but cannot perform any parsing. Eventually the insertion of messages into loadGM() will not be done with PHP; rather, the messages will be replaced inline in a JavaScript file by a script server (just before minification and packing), and the limitations of this future implementation are dictating the current one. This is an important transitional component of this software, and unless the script server is made to perform actual parsing, the messages used on the client should remain as they are right now.

It may be possible to make the text strings (which need actual translation, compared to the HTML parts, which don't) their own independent messages, allowing translators to perform their work without stepping around HTML. This could potentially be done using the client-side $1 replacement functionality as well. However, this will also introduce even more individual messages, which begin to lose context by the time the translation is taking place, possibly making them more difficult to translate.
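For illustration, that split might look like the following; the message key and the gM() internals here are hypothetical, with only the $1-replacement behaviour taken from the description above:

  // Hypothetical sketch: the translatable text lives in its own message with
  // placeholders, and the wikitext/HTML markup is supplied programmatically.
  var messages = {
      'edittool-bold-example': 'Your text $1here$2' // illustrative key and text
  };

  function gM( key, params ) {
      var text = messages[ key ] || '<' + key + '>';
      params = params || [];
      for ( var i = 0; i < params.length; i++ ) {
          // Replace every occurrence of $1, $2, ... with the supplied value.
          text = text.split( '$' + ( i + 1 ) ).join( params[ i ] );
      }
      return text;
  }

  // Wikitext flavour: gM( 'edittool-bold-example', [ "'''", "'''" ] )
  //   -> "Your text '''here'''"
  // HTML flavour:     gM( 'edittool-bold-example', [ '<b>', '</b>' ] )
  //   -> "Your text <b>here</b>"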

My understanding, especially when discussing this type of software (JavaScript UI), is that there is an intention to move towards simpler internationalization messages (meaning those which do not require full parsing).

Suggestions on alternative implementations are welcome.

mdale wrote:

As Trevor identifies, we need to transition to less parsing in the msg text for msgs included in JavaScript. Otherwise it becomes impossible to do localized interfaces in JavaScript without lots of external hits to the server for msg lookups.

For example, in a search results interface you have the result msg text:
"{{PLURAL:$1|1 result|$1 results}} found".

We presently can't use the MediaWiki {{PLURAL}} since we have to do the $1 swap in JS. But it would not be hard to do basic parsing of "PLURAL" in the JavaScript. There are a few low-hanging fruits that would cover a good percentage of the Language functions. But in general we will need to be aware of which msgs we include in the JS, as they will have to use a specific subset of the Language helper functions.
Likewise it's not hard to parse [[link|linkName]] into <a href="' + wgArticlePath.replace( '$1', link ) + '">linkName</a>. I think we can get very decent coverage this way.
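A rough sketch of that kind of lightweight client-side parsing, assuming the $1 swap has already happened and a two-form (singular/plural) language; the function name is illustrative, and wgArticlePath is the usual MediaWiki global of the form '/wiki/$1':

  // Minimal expansion of {{PLURAL:n|one|many}} and [[target|text]] in a message
  // that has already had $1 substituted with a number. Two-form languages only.
  function expandBasicMagic( msg, wgArticlePath ) {
      // {{PLURAL:5|1 result|5 results}} -> pick the singular iff the number is 1.
      msg = msg.replace( /\{\{PLURAL:([0-9]+)\|([^|}]*)\|([^|}]*)\}\}/g,
          function ( all, num, singular, plural ) {
              return parseInt( num, 10 ) === 1 ? singular : plural;
          } );
      // [[Some page|link text]] -> <a href="/wiki/Some_page">link text</a>
      msg = msg.replace( /\[\[([^|\]]+)\|([^\]]+)\]\]/g,
          function ( all, target, text ) {
              return '<a href="' +
                  wgArticlePath.replace( '$1', target.replace( / /g, '_' ) ) +
                  '">' + text + '</a>';
          } );
      return msg;
  }

  // expandBasicMagic( '{{PLURAL:5|1 result|5 results}} found', '/wiki/$1' )
  //   -> "5 results found"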

I propose we do the following:

  1. We port some of the simpler parser functions to JS and try to stick to that subset of functionality for messages that are included in the JavaScript interfaces.
  2. We need to write some maintenance scripts so that we don't duplicate msg storage, ie we want to write the msg text once in the js file and have a script that copies it to MessagesEn.php (with big comment headers that identify these msgs as copied from the js, and to modify them in the js instead of here).

I don't think it's desirable or necessary to port parser functions to JS. A better solution IMO would be to use the allmessages API module to grab the messages we need in the language we need. Magic word expansion could be added to the allmessages module as an optional parameter, which would take care of everything except the messages that rely on substitution being done before parsing (like {{PLURAL:$1|foo|bar}}, which can't easily be done in JS because some languages have more than two cases); those would be pretty much impossible without hitting the server a lot.
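For illustration only, the client side of that could look roughly like this; the allmessages module and its ammessages/amlang parameters exist, but the response handling here is a sketch:

  // Sketch: fetch a batch of messages in a given language via the API.
  function loadMessages( keys, lang, callback ) {
      var url = '/w/api.php?action=query&meta=allmessages&format=json' +
          '&amlang=' + encodeURIComponent( lang ) +
          '&ammessages=' + encodeURIComponent( keys.join( '|' ) );
      var xhr = new XMLHttpRequest();
      xhr.open( 'GET', url );
      xhr.onload = function () {
          var data = JSON.parse( xhr.responseText ), msgs = {};
          ( data.query.allmessages || [] ).forEach( function ( m ) {
              msgs[ m.name ] = m[ '*' ]; // message text arrives under the '*' key
          } );
          callback( msgs );
      };
      xhr.send();
  }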

2. We need to write some maintenance scripts so that we don't duplicate msg storage, ie we want to write the msg text once in the js file and have a script that copies it to MessagesEn.php (with big comment headers that identify these msgs as copied from the js, and to modify them in the js instead of here).

I think keeping messages in JS is a very bad idea; messages should all be in the messages file. We could then have a PHP script that generates JS containing the messages and cache that aggressively both client-side and server-side.
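A sketch of what such generated JS might contain (file name, message keys, and values are purely illustrative), reusing the loadGM() function mentioned earlier; the generator would emit this once per language and serve it with long-lived cache headers:

  // Hypothetical output of a message-generating PHP script, e.g.
  // <script src="messages.php?lang=de&urid=12345"></script>
  loadGM( {
      'wikieditor-toolbar-tool-bold': 'Fett',
      'wikieditor-toolbar-tool-italic': 'Kursiv',
      'wikieditor-toolbar-tool-link': 'Link'
  } );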

The current approach will be shifting, as there is another concern about help: depending on the configuration of the local wiki, different help topics, such as references for instance, may be irrelevant and not supported. The direction Help should be going is similar to the direction of Special Characters, which is that it is supported by software everyone has in common, but configured by messages each wiki has in particular. This would place the help into the MediaWiki:Common.js space for instance, where more parsing can take place if needed, but more importantly each wiki can make its own set of help resources. This transition will likely take place this week, as the architecture is still being sketched out.

mdale wrote:

To address Roan's comments:

What we are trying to achieve with the grouping is fewer round trips to the server. This is always a trade-off between how much can be cached and how much extra stuff you want to grab at once (at the cost of having to grab that whole package again once any item of that package changes).

Certainly central libraries like jQuery and our core helper library could be grabbed once and cached indefinitely per URID (the script loader presently requests core libraries separately).

I know it kind of sucks to send out a new JavaScript package every time a user language interface update occurs. On the other hand, it also kind of sucks to go out to the server on a separate round trip to get messages. Assuming that the average interface module's code and its messages are well under 30k (minified and gzipped) .. depending on your connection speed ... the time it takes to do an extra round trip to the server can more than kill the benefit of not re-downloading the packaged gzip-compressed JS ... so it's a trade-off...

It would not be hard to retool the JS loader to grab the msgs in a separate API request if we want to try that. (Does the API support gzipped output?) Also we would want to add the concept of URIDs (unique request IDs) to the API (should be easy to add: if $wgRequest->urid is included, send the cache-forever header). With the script-loader we check the latest version of any JavaScript file or MediaWiki page included and append a URID to the URL. This way we can cache forever on the client and just update the URID when the msg text, svn version, or version of the MediaWiki JS pages changes. Without this the client has to regularly issue requests and check for a 304 response from the server (another round trip). Does the API support 304 responses?

Modern browsers will be doing requests in parallel, so grouping is less beneficial. For maximum caching one could propose we don't group anything (that way we only download what's new). Unfortunately that makes it a lot harder to line up URIDs (unique request IDs) per resource, so you end up doing a bunch of 304 response checks.

The ultimate setup would be a single resource that included a URID for every JS / language / CSS resource in the system and that could be regularly 304-checked. That way, whenever any JS file or any of its text messages gets updated, we have a new URID for it; otherwise we use the older URID, and the client knows never to check the server for an update, since the resource was sent out with a cache-forever header.
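As a tiny illustration of the URID idea (the version inputs are the ones described above; the composite variable name is otherwise hypothetical):

  // Cache-forever URL: any change to the code or its messages yields a new URID,
  // so the client never needs to revalidate a URL it has already fetched.
  var urid = wgStyleVersion + '_' + latestMessageRevision; // hypothetical composite
  document.write( '<script src="jsScriptLoader.php?class=mv_embed&urid=' +
      urid + '"><\/script>' );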

But for the meantime I think the script-loader offers advantages over the current setup.

The problem with ~not~ storing the msgs in the JS is that it makes it harder to support stand-alone operation. Although if we are going out to the server in a separate round trip anyway.. then it's not so big a deal. But the idea of putting the msgs in the JS is that it works without the script server, and it's easier to debug that way, since you load the actual JS files, which include the fallback English msgs used by the interface libraries you're including in a given set of interface interactions.

Also, the JavaScript interfaces are being designed to support 'stand-alone' operation; this will make them easier to integrate into other CMSes and environments, and will support remote embedding of things like videos with a timed text interface with minimal round trips to Wikimedia servers.

In the case of PLURAL: we probably have to modify the way that is calculated on the server by having an array representation, ie array( '1-4' => X, '5' => Y, '6-11' => Z ) (instead of having a PHP function with switch statements); then we can package that array into the JS and replace accordingly.... We should probably never mix code with the content/translation representation .. This needs to be fixed for this case and others that will arise in the future.
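A sketch of that data-driven idea (the ranges and form indices are purely illustrative, not any language's real rules):

  // Ship the plural ranges as data instead of a PHP switch statement,
  // then pick the form index on the client.
  var pluralRanges = [
      { from: 1, to: 1, form: 0 },        // e.g. singular
      { from: 2, to: 4, form: 1 },        // e.g. paucal
      { from: 5, to: Infinity, form: 2 }  // e.g. plural
  ];

  function pluralFormIndex( n, ranges ) {
      for ( var i = 0; i < ranges.length; i++ ) {
          if ( n >= ranges[ i ].from && n <= ranges[ i ].to ) {
              return ranges[ i ].form;
          }
      }
      return 0; // fallback for values outside all ranges
  }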

(In reply to comment #5)

To address Roan's comments:

What we are trying to achieve with the grouping is fewer round trips to the server. This is always a trade-off between how much can be cached and how much extra stuff you want to grab at once (at the cost of having to grab that whole package again once any item of that package changes).

Yeah; fortunately, Wikipedia's i18n doesn't change very often (only at scap
time and when someone changes a MediaWiki: message that happens to be used in
JS).

It would not be hard to retool the JS loader to grab the msgs in a separate api
request if we want to try that.

I'll write a proof of concept-like thingy tomorrow.

(Does the API support gzipped output?)

Yes.

Also we
would want to add the concept of urid's (unique request id) to the API (should
be easy to add in 'if $wgRequest->urid is included send the cache forever
header').

Presently, the API supports URIDs through the requestid parameter, and
URL-based caching through the maxage and smaxage parameters (which sadly don't
work on Wikipedia because of the Squids, will poke people about that).

With the script-loader we check the latest version of any JavaScript file or MediaWiki page included and append a URID to the URL. This way we can cache forever on the client and just update the URID when the msg text, svn version, or version of the MediaWiki JS pages changes. Without this the client has to regularly issue requests and check for a 304 response from the server (another round trip). Does the API support 304 responses?

No.

Modern browsers will be doing requests in parallel, so grouping is less beneficial. For maximum caching one could propose we don't group anything (that way we only download what's new). Unfortunately that makes it a lot harder to line up URIDs (unique request IDs) per resource, so you end up doing a bunch of 304 response checks.

The ultimate setup would be a single resource that included a URID for every JS / language / CSS resource in the system and that could be regularly 304-checked. That way, whenever any JS file or any of its text messages gets updated, we have a new URID for it; otherwise we use the older URID, and the client knows never to check the server for an update, since the resource was sent out with a cache-forever header.

What about (ab)using $wgStyleVersion for this purpose, or introducing a similar
var?

But for the meantime I think the script-loader offers advantages over the current setup.

It sure does, but it lacks parsemag functionality, which we're gonna need as
well.

The problem with ~not~ storing the msgs in the JS is that it makes it harder to support stand-alone operation. Although if we are going out to the server in a separate round trip anyway.. then it's not so big a deal. But the idea of putting the msgs in the JS is that it works without the script server, and it's easier to debug that way, since you load the actual JS files, which include the fallback English msgs used by the interface libraries you're including in a given set of interface interactions.

One of the ideas I had was to write a .php script that generates message JS, so
you'd do <script src="foo.php?blah"></script> . Of course that PHP script would
set some pretty aggressive caching headers.

Also, the JavaScript interfaces are being designed to support 'stand-alone' operation; this will make them easier to integrate into other CMSes and environments, and will support remote embedding of things like videos with a timed text interface with minimal round trips to Wikimedia servers.

In the case of PLURAL: we probably have to modify the way that is calculated on the server by having an array representation, ie array( '1-4' => X, '5' => Y, '6-11' => Z ) (instead of having a PHP function with switch statements); then we can package that array into the JS and replace accordingly.... We should probably never mix code with the content/translation representation .. This needs to be fixed for this case and others that will arise in the future.

About this: do we really *need* PLURAL in JS messages? Isn't there some way
that we can trade functionality for performance and just not support it?

(In reply to comment #6)

About this: do we really *need* PLURAL in JS messages? Isn't there some way
that we can trade functionality for performance and just not support it?

Need any other response here but *cough*? If so: yes.

mdale wrote:

(In reply to comment #6)

Yeah; fortunately, Wikipedia's i18n doesn't change very often (only at scap
time and when someone changes a MediaWiki: message that happens to be used in
JS).

yea .. at any rate it should be no more computationally expensive than rendering wiki pages and the normal interface updates that have to update all the HTML output.

It would not be hard to retool the JS loader to grab the msgs in a separate api
request if we want to try that.

I'll write a proof of concept-like thingy tomorrow.

if you take a look at branches/new-upload/phase3/js2/mwEmbed/jsScriptLoader.php & branches/new-upload/phase3/js2/mwEmbed/mv_embed.js you can see the doLoad function.

Every JS file in mwEmbed has a list of all the msgs it includes in the header (children can assume their parents' msgs are available). It would be easier to go through the script server, which already parses that list. That way, instead of waiting for the script to come back to you with the list of msgs you need, you can just send the class name to the script server and get back the entire list of msgs needed at the same time. If the point is to use the API entry point, maybe add a hook for jsClass msg look-up to the allmessages API module? ... Although without fancy expiration stuff working it's probably better to just group them and package the msgs into the js in a single request (as it's doing presently).

(Does the API support gzipped output?)

Yes.

cool

Presently, the API supports URIDs through the requestid parameter, and
URL-based caching through the maxage and smaxage parameters (which sadly don't
work on Wikipedia because of the Squids, will poke people about that).

yea, it would be good to get that working.. it will be needed for the script server as well.

Does the API support 304 responses?

No.

okay.

What about (ab)using $wgStyleVersion for this purpose, or introducing a similar
var?

The script server currently uses a combination of wgStyleVersion and the latest page version. I want to tie it to an svn version check though, so we don't have to maintain a global wgStyleVersion var, and so we can lay the groundwork for the ultimate setup described below.

But for the meantime I think the script-loader offers advantages over the current setup.

It sure does, but it lacks parsemag functionality, which we're gonna need as
well.

yep. The point is we generally don't know what we are going to swap until we swap it in reaction to some interface interaction. So basic swapping is useful.. and basic JS MediaWiki parsing would be even more useful, and would help keep things in sync with the PHP msg handling. (I already did a port of formatSize() in mv_embed.js, as I needed it for showing the user upload progress.)

One of the ideas I had was to write a .php script that generates message JS, so
you'd do <script src="foo.php?blah"></script> . Of course that PHP script would
set some pretty aggressive caching headers.

see: branches/new-upload/phase3/js2/mwEmbed/jsScriptLoader.php - it caches forever, since it will always be requested with a fresh URID when things change. ..we could add an ?onlymsg param so that it would only send the msg text, using the class look-up ..

About this: do we really *need* PLURAL in JS messages? Isn't there some way
that we can trade functionality for performance and just not support it?

It seems to me this would be a good thing to fix.... Obviously we will have to work around the issue in the meantime... But this will impede other efforts, like abstracting parts of the parser to lower-level code (if that effort ever gets revived), and in theory fixing it would make the language msgs more maintainable, no?

(In reply to comment #8)

(In reply to comment #6)

Yeah; fortunately, Wikipedia's i18n doesn't change very often (only at scap
time and when someone changes a MediaWiki: message that happens to be used in
JS).

yea .. at any rate it should be no more computationally expensive than rendering wiki pages and the normal interface updates that have to update all the HTML output.

It would not be hard to retool the JS loader to grab the msgs in a separate api
request if we want to try that.

I'll write a proof of concept-like thingy tomorrow.

if you take a look at branches/new-upload/phase3/js2/mwEmbed/jsScriptLoader.php & branches/new-upload/phase3/js2/mwEmbed/mv_embed.js you can see the doLoad function.

Ah, it seems you're already doing exactly what I had in mind.

(In reply to comment #5)

To address Roan's comments:
In the case of PLURAL: we probably have to modify the way that is calculated on the server by having an array representation, ie array( '1-4' => X, '5' => Y, '6-11' => Z ) (instead of having a PHP function with switch statements); then we can package that array into the JS and replace accordingly.... We should probably never mix code with the content/translation representation .. This needs to be fixed for this case and others that will arise in the future.

The approach sounds pretty reasonable to me.
It could be implemented as a single pre-parse call that is executed on messages before they are downloaded to the JS package, be it as a bundle, be it individually.

Currently, we have SITENAME, PLURAL, GENDER, and GRAMMAR in general, and, if wikis individualize their messages, arbitrary parser functions. Leaving the latter aside for now, this would be a workable approach:

{{SITENAME}} - replaced by pre-parse.
{{GRAMMAR with {{SITENAME}} }} - evaluated and replaced by pre-parse.
{{GRAMMAR with other constant }} - evaluated and replaced by pre-parse.
{{GRAMMAR with something variable }} - open, but is it used anywhere?

{{GENDER}}
Currently handled by a case list of individual named constants.
Evaluation is based on user names, whose gender is looked up whenever a message is being rendered.
It needs special individual cases added for a few languages anyway, which simply means having more than 3 choices. The addition of "polite forms" of address (such as differentiating between youth, standard, and respected old age, e.g.) has been suggested for languages having that.

Can most easily be switched to enumerated constants via pre-parse.
Implementing an array approach would be a snap if usernames, which currently are passed as parameters, could be replaced by something like index(gender(username)). See below.

{{PLURAL}}
Plural often does not treat fractions well at the moment, but it works with integers and it works for 0.0.
There are several individual implementations, usually with either a simple expression, or a compound of them in a series of "if"s. Imho, the evaluation of plural could be split into two steps:

  1. call a function that returns an index,
  2. look up the final result via this index.

A general function returning an index could imho be implemented both in PHP and JS.
It would work through an array of arithmetic expressions having one parameter, returning the index of the first giving a non-zero, or true, result; at the end of the array, return one more.
The function would be fed with the value to check, and an array like ( 0: "x != 1" ) for the English language, because it only has singular and plural. Some languages in the former Yugoslavia have a dual in addition, so their array would be ( 0: "x != 1", 1: "x != 2" ), and the logic of English 1st, 2nd, 3rd, 4th, ... 11th, ... 21st, ... would be represented by ( 0: "( 1 == ( x mod 10 ) ) && ( 11 != ( x mod 100 ) )", 1: "( 2 == ( x mod 10 ) ) && ( 12 != ( x mod 100 ) )", 2: "( 3 == ( x mod 10 ) ) && ( 13 != ( x mod 100 ) )" ).
Implementers of GRAMMAR can likely pretty easily provide lists of expressions in both PHP and JS notation (even as messages in the MediaWiki namespace) - all the rest would be standard.

Coming back to GENDER above, and GRAMMAR too, this split into 1. get index, 2. look up result
would lessen the work to be done for those, since step 2 (look up result) would be identical
for each of them, streamlining the code for both the PHP and JS implementations.
As Roan states, this would be easily reusable in the future, should we require more variability.
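A minimal sketch of that two-step scheme, with the arithmetic expressions written as JS predicates; the predicate ordering and names are illustrative and must match the order of forms in the message:

  // Step 1: an index function driven by an array of one-parameter predicates.
  // Step 2: a plain lookup of the form at that index.
  var pluralPredicates = {
      en: [ function ( x ) { return x === 1; } ],   // singular, else plural
      dual: [ function ( x ) { return x === 1; },   // a language with a dual
              function ( x ) { return x === 2; } ]  // simply adds one predicate
  };

  function formIndex( x, predicates ) {
      for ( var i = 0; i < predicates.length; i++ ) {
          if ( predicates[ i ]( x ) ) {
              return i; // index of the first predicate that holds
          }
      }
      return predicates.length; // "at the end of the array, return one more"
  }

  function pickForm( x, forms, predicates ) {
      // Fall back to the last form if a message supplies fewer forms.
      return forms[ Math.min( formIndex( x, predicates ), forms.length - 1 ) ];
  }

  // pickForm( 5, [ '1 result', '$1 results' ], pluralPredicates.en )
  //   -> "$1 results"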

Some comments:

I do not find it acceptable not to support plural. It should be part of the bare minimum of i18n support of any software (and the array approach is not going to work for that). As to how to do it, I see three alternatives:

  1. call MediaWiki to parse it,
  2. implement an i18n library in JavaScript,
  3. implement a system that can generate the code for PHP and JavaScript from a common source (i.e. a language-neutral i18n library, or just a library with different language versions kept in sync).

3) is of course a huge project (and 2) too), but on the other hand it is stupid that every software needs to re-implement it. 1) is probably too slow, as discussed above.

GENDER is problematic, because it depends on external data (gender of username X). We should also support number formatting.

(with big comment headers that identify these msgs as copied from the js and to modify them in the js instead of here)

This concerns me a bit. The whole thing is a huge piece of code with only a few developers who know it. It is a lot to learn just to improve the wording of a few messages. Although grep and JavaScript string syntax escaping are probably enough for most cases.

"In the case PLURAL: we probably have to modify the way that is calculated on
the server via having a array representation ie( array('1-4':X, '5':Y,
'6-11':Z) (instead of having a php function with switch statements)"

Please note that some messages contain (and do need) several occurrences of PLURAL: with DISTINCT numeric values. Such messages are not necessarily splittable into distinct resources (due to language grammars, or the meaning of translated FULL sentences, which may require reordering some items).

If all that JavaScript has to do is parse PLURAL: items, with their $n parameters, I think this is not complicated to implement, because such parsing will be extremely simple, provided that there's a policy about its presence and encoding in messages. It will just look like this basic regexp, which all JavaScript engines will handle correctly:

/\{\{PLURAL:\$([0-9]+)\|([^}]*)\}\}/

The only restriction is that the part between the pipe and the first closing brace should not contain any wiki markup, or characters like newlines, pipes, or braces (however, these characters may be transmitted by the server as numeric entities if they are really present in the source wiki or PHP code). Such a policy is enforceable in translatable resources sent to and received from translatewiki.net, or by correct documentation of the messages to translate.

Then the content of $2 (the text between the first pipe and the first closing brace) should be splittable immediately on the pipe character into a basic array.
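Putting those two steps together, a sketch built directly on that regexp; the per-locale form selection is the open question discussed next, so pluralFormIndexFor is purely hypothetical here:

  // Expand each {{PLURAL:$n|...}} occurrence: resolve $n against the supplied
  // arguments, split the captured body on the pipe, and pick a form.
  function expandPlural( msg, args, pluralFormIndexFor ) {
      return msg.replace( /\{\{PLURAL:\$([0-9]+)\|([^}]*)\}\}/g,
          function ( all, argNum, body ) {
              var n = Number( args[ parseInt( argNum, 10 ) - 1 ] ),
                  forms = body.split( '|' ),
                  i = pluralFormIndexFor( n );
              return forms[ Math.min( i, forms.length - 1 ) ];
          } );
  }

  // expandPlural( '{{PLURAL:$1|1 result|$1 results}} found', [ 5 ],
  //     function ( n ) { return n === 1 ? 0 : 1; } )
  //   -> "$1 results found"  (the $1 swap itself happens separately)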

The difficulty will be to implement the plural rules according to locale (which values take which plural form, and how many forms are needed: consider singular, plural, dual, few, many, other...): how many locales does the JavaScript need to support?

Can these rules be encoded in a way that lets JavaScript handle the plural rules correctly for all locales supported by the wiki?

The same could be used for GENDER. The wiki can also provide the appropriate external data to the JavaScript, along with the message, as additional properties, without forcing the JavaScript to perform another AJAX or JSON query to the server each time it detects messages containing a GENDER or PLURAL substitution function.

Another difficulty comes with plural forms that cause a change of grammatical case (notably in Slavic languages), which also depends on how the sentences containing these conditional plural forms are created: other parts of the sentences may need to be changed. But it's impossible to predict which part will be affected (notably if there are several GENDER or PLURAL occurrences in the message). Should we consider GRAMMAR? Probably not for JavaScript.

Finally there's the problem of wikis that use:

  • multiple scripts (including Chinese, for converting between simplified and traditional ideographs). This requires a complex script to correctly handle the dynamic message formatting (or character substitutions).
  • RTL scripts (Hebrew, Arabic, ...), because they are in fact using a mix of scripts. The correct rendering of formatted messages often requires specific BiDi controls for embedding some variable items (this is really complex in the presence of BiDi-neutral or weak characters, notably for final common punctuation); for example, in an RTL wiki, a message that starts or ends with Latin letters (possibly in the variable part of the sentence) will cause these characters without strong directionality to be displayed in the wrong place, or the whole sentence may appear broken or reordered, creating confusion.

Currently, MediaWiki does not handle BiDi gracefully, and offers no easy way to support correct BiDi embedding of variable elements in the middle of a sentence, and no easy way to restore the default directionality after the variable part. Unicode offers BiDi controls, but they are NOT recommended in HTML, which should use <element dir=""> overrides, or CSS bidi properties.

The solution seems simple but it is not: before and after the variable parts of the message, in the same HTML block element, there needs to be a <span dir=""> element to embed the static parts, but most resources are not prepared this way. This has to be done in translatewiki.net when translating those resources with variable positions whose content directionality is ambiguous, variable, or unknown (for example user names, page names, native foreign language names from {{language:}}, or resources autotranslated via {{int:}}). This affects in fact all wikis, including English ones, not just those with a default RTL locale.

mdale wrote:

PLURAL support was added to JavaScript by porting the PHP functions over to JavaScript per bug 20962. That code is used in the upload wizard and is part of the mwEmbed support extension that integrates with the new resource loader (previously known as JS2). The gender bug 20964 is still open.

An array or XML representation of the transforms is complicated; it's much easier to just port over the PHP to JavaScript. Unfortunately MediaWiki's transforms don't match the standard CLDR transforms, so again it's best to just match the PHP functions, to maintain consistent transformations across implementations.

This seems to be only about a handful of messages, which could potentially save translators the copying and pasting of some code.
https://translatewiki.net/w/i.php?title=Special%3ATranslate&taction=proofread&group=ext-wikieditor&limit=1000&task=reviewall
If some of those messages require actual action, like GRAMMAR support for the single message which uses SITENAME, or if there are actual errors, it's probably best to report individual bugs.

There are some translations that require their own markup (e.g. to include superscripts, or for some characters that cannot be produced without markup). They are preferable to using "compatibility" characters. This includes the possibility of splitting some long paragraphs (e.g. into lists of items), or adding relevant links (specific to a given language), or adding paragraphs for supplementary instructions (e.g. related to local policies on a specific wiki).

I hope that you don't want to forbid markup in translations. Note also that some messages, if you remove markup and split elements into distinct translatable items, would then require reordering depending on the language (and this may not work as intended).

Basically, a translatable item should avoid splitting sentences into a simple concatenated patchwork (otherwise you would need another translatable item to specify the composition format and the order of replaceable variables, but things will in fact be more complex for translators who have difficulties creating appropriate formats for their language).

Translation units should then remain a single paragraph, or a single list item (without the leading bullet or number), but not necessarily a single sentence in the same paragraph, when those sentences are closely related (with terms like "it", "this", "that"). Frequently a good translation does not necessarily use the same number of sentences (some languages don't support syntaxic constructions, like embedded propositions used in European languages, ande will need multiple sentences ; other language can easily combine multiple closely related short sentences into a single one which is clearer for readers).