
Use a different format for l10n_cache (or document why the current one is the best one)
Closed, Invalid · Public

Description

Our current l10n_cache model seems to use serialised PHP arrays as the storage mechanism for localisation strings. This makes perfect sense if we assume that all use cases for retrieving the data are centred around PHP, which, for production, they are. Unfortunately it's tremendously frustrating from a research perspective. As an example, let's use namespace names and aliases, which are stored in l10n_cache and accessible via the MediaWiki API.

Namespace names and aliases are a relatively common thing to need to retrieve, at least for me, for things like adding namespace-level granularity to our request logs or user-agent data.

Fortunately for our machines and unfortunately for our researchers, the research and analytics machines are, very deliberately, not connected to the internet directly (with the exception of stat1, which is being decommissioned). Accordingly, the API option is not available to us; if we want to retrieve namespace names, we need to use the l10n_cache table.

Doing this requires us to use a language with a PHP unserialiser available (Python has one; R does not), roll our own if one isn't available, or write something incredibly hacky where we read the data in, de-serialise it and save it in a more usable format /through/, say, PHP or Python. This is an unattractive proposition because it makes for less readable code, which is a concern not only for transparency but also in the situation where the code is 'productionised' by the analytics engineers, which requires it to be workable in Java.
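For illustration, here is a minimal sketch of that hacky workaround, assuming the Python packages phpserialize and pymysql are available on the analytics host and that we are reading the stock l10n_cache table (lc_lang, lc_key, lc_value); the host, credentials and output path are placeholders:

```
# A minimal sketch of the workaround described above. Assumes the Python
# packages "phpserialize" and "pymysql" are available and that we are
# reading the stock l10n_cache table (lc_lang, lc_key, lc_value).
# Host, credentials and output path are placeholders.
import json

import phpserialize
import pymysql

conn = pymysql.connect(host="analytics-db", user="research",
                       password="...", database="enwiki")
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT lc_value FROM l10n_cache "
            "WHERE lc_lang = %s AND lc_key = %s",
            ("en", "namespaceNames"),
        )
        (blob,) = cur.fetchone()
finally:
    conn.close()

# lc_value holds a serialised PHP array (namespace id => localised name);
# unserialise it and re-save as JSON for downstream tools.
namespaces = phpserialize.loads(blob, decode_strings=True)
with open("namespace_names_en.json", "w") as f:
    json.dump({int(k): v for k, v in namespaces.items()}, f,
              ensure_ascii=False, indent=2)
```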

Can we switch away from serialised PHP to, say, JSON objects? If not, why not? Is there documentation of the rationale for using serialised PHP anywhere?


Version: 1.23.0
Severity: enhancement

Details

Reference
bz61802

Event Timeline

bzimport raised the priority of this task to Low. · Nov 22 2014, 3:04 AM
bzimport set Reference to bz61802.
bzimport added a subscriber: Unknown Object (MLST).

We are not using serialized files for the localisation cache. The cache is either in CDB format or in the SQL database.

Are you saying that the strings in the database are serialized?

(In reply to Niklas Laxström from comment #1)

> Are you saying that the strings in the database are serialized?

Indeed (or, at least, the arrays appear to be)

But we still lack a use case. One server in the world with DB but without API isn't really convincing. The kind of information you need (like content namespace or not) also has nothing to do with l10n, so there is no reason to believe it would last forever in there if it's even available.

On the "why", that's the format used in other tables too, for instance log_params, so your question should be rephrased as: document why and where the DB tables contain serialised data.

(In reply to Nemo from comment #4)

> But we still lack a use case. One server in the world with DB but without API isn't really convincing.

It's more than "one server"; it's going to be all our analytics machines, and that's the crucial element. If the scenario were reversed - some strange universe in which we had one production machine and the rest were analytics-based - the argument wouldn't hold any water. Part of this, sure, is that production is 90% of our use cases, but that's only part of it. The other part is simply that research and analytics /are/ a use case, and one of increasing importance. We now have 6 researchers, 4 analytics engineers, a director-level position and a Product Manager; arguing that this is not worth addressing, either by justifying it or solving it, simply because those people can get their work done on only a few machines, ignores the tremendous resources being thrown behind analytics.

A use case was provided with the original post ;p.

> The kind of information you need (like content namespace or not) also has nothing to do with l10n, so there is no reason to believe it would last forever in there if it's even available.

It's absolutely available; please do me the courtesy of assuming I did basic research and attempted to solve the problem through other means before submitting the bug. If you want to check yourself, look for namespaceNames and namespaceAliases in lc_key.

Sure, namespaceNames are not /directly/ a localisation problem - they're wiki-based as well as language-based - but they are, most of the time, language-based.

More importantly, whether or not they last forever (which seems a very strange test; /nothing/ we do lasts forever - well, other than the projects; that's kind of why we're here), they're currently stored there. The solution to the problem is for the language engineering team to either (a) explain why serialised PHP is the best plausible way to store this data or (b) change it. I'm not asking for a solution proof against any possible permutation of future events, because that's impossible, but the fact of the matter is that this table is the source of the issue as it stands.

> On the "why", that's the format used in other tables too, for instance log_params, so your question should be rephrased as: document why and where the DB tables contain serialised data.

Sure, if documentation was all I was asking for ;p. log_params is less crucial; to my knowledge the only thing we tend to use it for is retrieving the patrol status of a page (that is, whether it was patrolled automatically or manually), and that's something you can extract with a minimal amount of effort because it's not a complex piece of data - it's just the 1 or 0 closest to the end of the string.
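For what it's worth, a rough sketch of that extraction, assuming (as described above) that a patrol log_params blob ends with a 0/1 'auto' flag; the helper name is made up for illustration:

```
# Rough sketch: pull the trailing 0/1 "auto" flag out of a patrol
# log_params blob, per the description above. Assumes the legacy
# newline-separated format; other formats would need their own handling.
import re

def patrol_was_automatic(log_params: bytes) -> bool:
    match = re.search(rb"([01])\s*$", log_params)
    if match is None:
        raise ValueError("no trailing 0/1 auto flag found")
    return match.group(1) == b"1"
```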

For a definite answer you need to ask Tim, who wrote the code. My guess is that it just followed the pattern of BagOStuff, which serializes values.

I will undermine your use case a bit though:

  1. You should not be accessing a cache directly; it may not even exist.
  2. WMF is using CDB files, not the DB, for the l10n cache.
  3. Why don't you spin up MediaWiki and ask it to provide the information you need? No network is needed for that.

The language engineering team does not have the knowledge or capacity to address all i18n/l10n-related issues [1], but we will do our best to help you and make you depend less on us.

[1] We spend the largest portion of our time on feature development.

By the way, Toolserver has a special table for namespace names, and I'm told Labs recently got such a feature too. https://wiki.toolserver.org/view/Toolserver_database

(In reply to Niklas Laxström from comment #6)

> For a definite answer you need to ask Tim, who wrote the code. My guess is that it just followed the pattern of BagOStuff, which serializes values.
>
> I will undermine your use case a bit though:
>
>   1. You should not be accessing a cache directly; it may not even exist.

Sure; if you can point to a better way to automatically get to this data on a machine with no connection to the internet, I'm happy to hear it.

>   2. WMF is using CDB files, not the DB, for the l10n cache.

Then why is the table there? It's a MediaWiki feature we don't use?

>   3. Why don't you spin up MediaWiki and ask it to provide the information you need? No network is needed for that.

Sure; that's not an easily replicable solution, though. To use that scenario: every time the table is updated, I'd need to spin up a local MediaWiki instance, query it, retrieve the data, save that to a file, manually transfer the file to [stat1001/stat1002/analytics*/delete-as-applicable]...and so would anyone else looking at doing things with granularity at the namespace level. That kind of approach would be more easily done by just querying the APIs in a loop, retrieving the data as JSON files, and transferring those...the problem being the 'automation' bit.
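Something like the following rough sketch is what I mean by the loop approach (assuming outbound HTTP were available; the wiki list is illustrative only):

```
# Sketch of the "query the APIs in a loop" idea: fetch namespace names and
# aliases as JSON via action=query&meta=siteinfo. Requires the "requests"
# package and outbound HTTP, which is exactly what the analytics hosts lack.
import json

import requests

WIKIS = ["en.wikipedia.org", "de.wikipedia.org", "commons.wikimedia.org"]

for host in WIKIS:
    resp = requests.get(
        "https://%s/w/api.php" % host,
        params={
            "action": "query",
            "meta": "siteinfo",
            "siprop": "namespaces|namespacealiases",
            "format": "json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    with open("namespaces_%s.json" % host, "w") as f:
        json.dump(resp.json()["query"], f, ensure_ascii=False, indent=2)
```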

> The language engineering team does not have the knowledge or capacity to address all i18n/l10n-related issues [1], but we will do our best to help you and make you depend less on us.

Thanks :).

(In reply to Nemo from comment #7)

> By the way, Toolserver has a special table for namespace names, and I'm told Labs recently got such a feature too. https://wiki.toolserver.org/view/Toolserver_database

Huh; interesting. Probably not directly applicable, because there's no direct connection betwixt the analytics machines and the toolserver dbs to my knowledge, but the way they extracted that could be useful.

(In reply to Oliver Keyes from comment #9)

> Huh; interesting. Probably not directly applicable, because there's no direct connection betwixt the analytics machines and the toolserver dbs to my knowledge, but the way they extracted that could be useful.

Or you could start using the infrastructure everyone uses instead of reinventing the wheel.

(In reply to Nemo from comment #10)

> (In reply to Oliver Keyes from comment #9)
> > Huh; interesting. Probably not directly applicable, because there's no direct connection betwixt the analytics machines and the toolserver dbs to my knowledge, but the way they extracted that could be useful.
>
> Or you could start using the infrastructure everyone uses instead of reinventing the wheel.

You mean, Tool Labs? Sure, I'll just poke Legal and see how they feel about moving our request logs to Labs. Oh, wait ;p.

(In reply to Oliver Keyes from comment #8)

> >   2. WMF is using CDB files, not the DB, for the l10n cache.
>
> Then why is the table there? It's a MediaWiki feature we don't use?

Database tables are created unconditionally. The store can be changed with global configuration ($wgLocalisationCacheConf), and WMF uses CDB files distributed to the application servers for speed.

Can you tell us a bit more about where you need the namespaces? Are you parsing the wikitext? Doesn't, for example, the pagelinks table have the namespace resolved to the numerical ID already?

(In reply to Niklas Laxström from comment #12)

> Can you tell us a bit more about where you need the namespaces? Are you parsing the wikitext? Doesn't, for example, the pagelinks table have the namespace resolved to the numerical ID already?

Nope, the RequestLogs - which are unfortunately totally detached from MediaWiki proper.

Then I don't see any solution currently other than option 3 that I mentioned. If you don't want to install MediaWiki on the analytics servers, the second-best thing would be to automate it with some kind of script, keeping the constraints in mind.

After getting the namespaces, you should be prepared to mirror some of the transformations MediaWiki handles: spaces/pluses/underscores, URL encoding, charset encoding, case insensitivity, Unicode normalisation, redirect pages... Some of these cause an actual redirect; some of them don't.
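As a very partial sketch of what that post-processing might look like (not exhaustive: redirects, charset detection and the full case rules are ignored, and a namespace-name-to-id mapping is assumed to exist already):

```
# Partial sketch of the normalisations listed above, for matching raw
# request-log titles against retrieved namespace names. Not exhaustive:
# redirects, charset detection and full case rules are ignored.
import unicodedata
from urllib.parse import unquote_plus

def normalise_title(raw: str) -> str:
    title = unquote_plus(raw)                    # URL-decode, '+' as space
    title = title.replace("_", " ").strip()      # underscores display as spaces
    title = unicodedata.normalize("NFC", title)  # Unicode normalisation
    if title:
        title = title[0].upper() + title[1:]     # first-letter case folding
    return title

def split_namespace(title: str, ns_name_to_id: dict) -> tuple:
    """ns_name_to_id maps localised namespace name -> numeric id (assumed)."""
    if ":" in title:
        prefix, rest = title.split(":", 1)
        ns_id = ns_name_to_id.get(normalise_title(prefix))
        if ns_id is not None:
            return ns_id, rest.strip()
    return 0, title  # main namespace
```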

demon subscribed.

I see nothing to be done here.