
initialization of the Language object is very heavy
Open, Medium, Public

Description

Sometimes a lot of languages need to be processed in one batch. Initializing a Language object for each of them wastes a lot of memory. Some refactoring is needed there: perhaps a separate lightweight class just for getting simple language info, perhaps lazier initialization, etc.


Version: unspecified
Severity: normal

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 12:54 AM
bzimport set Reference to bz41103.
bzimport added a subscriber: Unknown Object (MLST).

Being more specific on the use case than "sometimes" would be very helpful.

As a first action, you may want to add profiling calls to provide some more detailed insight.
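A minimal sketch of what such profiling calls could look like in plain PHP. The helper name `profileCall` is hypothetical and is not part of MediaWiki; it only uses the built-in `microtime()`, `memory_get_usage()` and `memory_get_peak_usage()` functions, and the workload shown is a stand-in for whatever Language-related code is being measured.

```php
<?php
/**
 * Hypothetical helper (not a MediaWiki function): run a callable and
 * report wall-clock time and memory growth around it.
 */
function profileCall( callable $work ): array {
    $memBefore = memory_get_usage();
    $start = microtime( true );

    $result = $work();

    return [
        'seconds'   => microtime( true ) - $start,
        // Growth while $result is still held, so retained data is counted.
        'bytes'     => memory_get_usage() - $memBefore,
        'peakBytes' => memory_get_peak_usage(),
        'result'    => $result,
    ];
}

// Stand-in workload; in eval.php this could wrap a loop that builds
// Language objects instead.
$stats = profileCall( static function () {
    return range( 1, 100000 );
} );

printf(
    "%.4f s, %d bytes retained, %d bytes peak\n",
    $stats['seconds'],
    $stats['bytes'],
    $stats['peakBytes']
);
```

Wrapping each suspect step separately (object construction, first message access, direction lookup) would show which one accounts for the memory.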

(In reply to comment #1)

Being more specific on the use case than "sometimes" would be very helpful.

In both cases I need to retrieve the dir value for a long list of languages.

See the blocked bugs.

(In reply to comment #2)

(In reply to comment #1)

Being more specific on the use case than "sometimes" would be very helpful.

In both cases I need to retrieve the dir value for a long list of languages.

Seems like a good candidate to stuff in cache object.

I have been thinking for some time about whether we should make a static function for getting the direction, so you have an array of language codes that are RTL instead of putting $rtl = true; in the Messages files. So like: static function isRtl( $langcode ) { $rtl = array ( ... ); return in_array( $langcode, $rtl ); }

Benefits:

  • Then you don't need a Language object for such a simple boolean (in many cases, when making <div lang="" dir=""> you just have a language code and you currently need to call a Language object just for the direction.)
  • We can add this info very easily for languages that don't have a Messages file yet.

What do you think?
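The idea above, written out as a self-contained sketch. The class name `LanguageDirection` and the `getDir()` helper are hypothetical, and the list of codes is a small illustrative subset, not the full set of RTL languages MediaWiki knows about.

```php
<?php
/**
 * Hypothetical static lookup; not an existing MediaWiki class.
 */
final class LanguageDirection {
    /** Illustrative subset only; a real list would cover every RTL code. */
    private const RTL_CODES = [ 'ar', 'fa', 'he', 'ur' ];

    /** True if the given language code is in the RTL list. */
    public static function isRtl( string $langCode ): bool {
        return in_array( strtolower( $langCode ), self::RTL_CODES, true );
    }

    /** Value suitable for an HTML dir="" attribute. */
    public static function getDir( string $langCode ): string {
        return self::isRtl( $langCode ) ? 'rtl' : 'ltr';
    }
}

// Only a language code is needed, no Language object.
foreach ( [ 'en', 'he', 'ar' ] as $code ) {
    printf(
        "<div lang=\"%s\" dir=\"%s\"></div>\n",
        $code,
        LanguageDirection::getDir( $code )
    );
}
```

The trade-off is that the direction would then live in one central array rather than in each Messages file.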

I considered this, although I'd love to make as little duplication as possible. To be smarter, I would:

  1. Investigate what makes the current Language so heavy and try to make it lazy.
  2. Try to merge it with the YAML langdb, which is used in the Universal Language Selector.

(In reply to comment #5)

I considered this, although I'd love to make as little duplication as possible.

How would that be duplication? (I very much dislike duplication, to be clear)
We can make isRtl() static (with benefits mentioned above) regardless of fixing the heavy initialization of Language.

*** Bug 41596 has been marked as a duplicate of this bug. ***

Caching the direction (LTR/RTL) values for all languages solves only a very limited issue while leaving the larger one unsolved: we sometimes need information about *languages* without needing the messages. For example, for Wikidata we also need the custom to-lowercase function for each language, to normalize search terms. I expect there are more things we will need to know about languages.

So, I don't think caching specific values is a solution (although we can do that in addition). We really need to be able to get Language objects without loading all the messages. This could be done by lazy initialization - only loading the messages when they are first used.
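A rough sketch of the lazy-initialization pattern described here, in plain PHP. The class `LazyLanguageSketch` and its loader callback are hypothetical; this is not how MediaWiki's Language or LocalisationCache is implemented, only an illustration of deferring the message load until the first message is requested.

```php
<?php
/**
 * Hypothetical illustration only; not MediaWiki's Language class.
 * Cheap properties are available immediately, messages load on demand.
 */
final class LazyLanguageSketch {
    /** @var string */
    private $code;
    /** @var bool */
    private $rtl;
    /** @var callable Returns an array of message key => text. */
    private $messageLoader;
    /** @var array|null Null until the first message is requested. */
    private $messages = null;

    public function __construct( string $code, bool $rtl, callable $messageLoader ) {
        $this->code = $code;
        $this->rtl = $rtl;
        $this->messageLoader = $messageLoader;
    }

    public function getCode(): string {
        return $this->code;
    }

    public function isRtl(): bool {
        return $this->rtl;
    }

    public function getMessage( string $key ): ?string {
        if ( $this->messages === null ) {
            // First use: run the expensive load exactly once.
            $this->messages = ( $this->messageLoader )();
        }
        return $this->messages[$key] ?? null;
    }
}

// Count how often the (pretend) expensive load actually runs.
$loadCount = 0;
$lang = new LazyLanguageSketch( 'he', true, static function () use ( &$loadCount ) {
    $loadCount++;
    return [ 'example-key' => 'example text' ];
} );

$lang->isRtl();                    // does not trigger the loader
$lang->getMessage( 'example-key' ); // triggers the loader once
$lang->getMessage( 'example-key' ); // served from the already loaded array
```

With this shape, a batch job that only reads the direction of many languages never pays for loading their messages.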

p.selitskas wrote:

Amir suggested that I put this here:

https://gerrit.wikimedia.org/r/#/c/35383/ (I don't think it's supposed to be merged, but please keep it in mind)

The most used ones in Wikidata are language names, direction, lists, truncating, and inserting an end marker. I suspect that other functions could become more important later on.

p.selitskas wrote:

Is there that much to make lazy in terms of messages? I can see only $preloadedMessages loaded in LocalisationCache for other languages in ViewItem (plus everything needed for wgContLang/wgUserLang; for $preloadedMessages the fallback chain is respected, thus loading every language in a fallback chain). RTL, as stated above, is not a big deal either.

On the other hand, even $preloadedMessages for, let's say, 25 languages... how much memory does it take? We can postpone the self::getLocalisationCache() call until the first message is requested, but the effect of such an "optimization" will be diminished, because $rtl, fallback encodings, namespaces, etc. all belong to the Messages file, which is loaded and managed by LocalisationCache (which will load $preloadedMessages anyway).

Writing a work-around for LocalisationCache (if the issue is actually in $preloadedMessages) is the worst-case resolution for the issue imho.

(In reply to comment #8)

... We really need to be able to get Language objects without
loading all the messages. This could be done by lazy initialization - only
loading the messages when they are first used.

Strongly agreed on that.

For the majority of languages other than English, loading the messages includes loading two fallback language message files, too, see:
http://www.mediawiki.org/wiki/File:MediaWiki_fallback_chains.svg
100% coverage is rare. Yet, most often, the untranslated messages are also the ones hardly ever used. So loading messages *is* heavy and should be lazy, as should loading fallback language messages.

[replacing wikidata keyword by adding CC - see bug 56417]

bugzilla.wikimedia wrote:

(In reply to comment #8)

... We really need to be able to get Language objects without
loading all the messages. This could be done by lazy initialization - only
loading the messages when they are first used.

That's what we do already, since MW 1.16, except for the "preload" key, which is a small set of messages that are preloaded for optimal parser cache hit performance. Are you saying that preloading should be deferred? Or are you talking about something else, like MessageCache?

Sample test case for eval.php:

ini_set( 'memory_limit', '100M' );
$langs = Language::fetchLanguageNames();
foreach ( $langs as $l => $name ) { Language::factory( $l ); }

Fatal error: Allowed memory size of 104857600 bytes exhausted (tried to allocate 4666 bytes) in /www/dev.translatewiki.net/docroot/w/includes/cache/LocalisationCache.php on line 1326

And it takes multiple seconds too.

There was talk of rewriting the Commons template Template:Dir (https://commons.wikimedia.org/wiki/Template:Dir ), the most transcluded template on Commons, to use an "isRTL" Lua function, so as to keep things in sync. Template:Dir is transcluded on 32,721,515 pages, often many times per page. But that would only be possible if the Language object were a little lighter.

On my system the command

time echo 'foreach ( Language::fetchLanguageNames() as $l => $name ) { print Language::factory( $l )->getHtmlCode(); }' | php maintenance/eval.php

takes the following time with a cleared file cache:

real    2m54.465s
user    2m14.753s
sys     0m38.718s

and with a filled cache:

real    0m7.041s
user    0m3.213s
sys     0m3.759s

The process consumes up to 1.5 GB of memory.

Change 259509 had a related patch set uploaded (by Nikerabbit):
LocalisationCache: stop unconditional preloading for all languages

https://gerrit.wikimedia.org/r/259509

The process consumes up to 1.5 GB of memory.

What version of MediaWiki? What storage backend? I no longer see high memory consumption.

Change 259509 abandoned by Nikerabbit:
LocalisationCache: stop unconditional preloading for all languages

https://gerrit.wikimedia.org/r/259509

What version of MediaWiki? What storage backend? I no longer see high memory consumption.

  • MediaWiki from the current git master version with SQLite.
  • The 1.5 GB footprint is after deleting the cache files:
rm /tmp/l10n_cache-*

Ok, regenerating the localisation cache is heavy on memory, but that is not why this bug was filed. I actually think this bug is fixed and LU recache could be its own thing.

I think the Language object is still very heavy. Maybe a new lightweight object just for the language code would be an improvement.

So I thought as well, but I haven't found any proof that it is still heavy.

https://gerrit.wikimedia.org/r/331208 proposes introducing a new PHP class, LanguageCode. This class could also take over some of the code from the Language class. Perhaps LanguageCode can be a lightweight, better-performing part of Language. Please have a look at this change and suggest what can be done as next steps.

Part of task T135845, currently in a debug state in s:fr:Module:Central, can already:
: automatically accumulate new translations and new languages
: automatically accumulate from the main module, its sub-modules, and central libraries
: automatically accumulate from central modules/libraries and their */I18N sub-modules
: using the function versioning.bind_modules( mainmodule )
: using the function versioning.add_i18n( t, module_name, module_tab )

It also displays missing translations; search the display for "translate.missing_translations()".

Can you transpose this code in gerrit?

@Fomafix LanguageCode is useful to identify a language, but we also need lightweight access to some properties of the language: most importantly, whether it's ltr or rtl; but also the date and number format, or the language name in different languages. This information should probably be provided by standalone services.

Exactly. Is https://gerrit.wikimedia.org/r/331208 the right step in this direction? What are the next steps?

I think this comment from @tstarling at T85461 is also relevant here:

Somewhat off topic, but speaking of lightweight Language object construction, I think having $wgLangObjCacheSize = 10 by default is actually bit rot; I don't think there's any reason for that anymore. Language objects used to hold message arrays, and presumably that is the reason for it being so low.