Our current l10n_cache model seems to use serialised PHP arrays as the storage mechanism for localisation strings. This makes perfect sense if we assume that all use cases for retrieving the data are centred around PHP, which, for production, they are. Unfortunately it's tremendously frustrating from a research perspective. As an example, let's use namespace names and aliases, which are stored in l10n_cache and accessible via the MediaWiki API.
Namespace names and aliases are a relatively common thing to need to retrieve, at least for me, for tasks like introducing granularity into our request logs or UA data.
Fortunately for our machines and unfortunately for our researchers, the research and analytics machines are, very deliberately, not connected to the internet directly (with the exception of stat1, which is being decommissioned). Accordingly, the API option is not available to us: if we want to retrieve namespace names, we need to use the l10n_cache table.
Doing this requires us to use a language with a PHP unserialiser available (Python has one, R does not), roll our own if one isn't available, or write something incredibly hacky where we read the data in, deserialise it, and save it in a more usable format /through/, say, PHP or Python. This is an unattractive proposition because it makes for less readable code, which is a concern not only for transparency but also when the code is 'productionised' by the analytics engineers, since it then needs to be workable in Java.
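To make the "roll our own" cost concrete, here is a minimal sketch of deserialising PHP-serialised data in Python. It covers only integers, booleans, strings, and arrays, assumes ASCII content (PHP string lengths are byte counts, so multi-byte UTF-8 would need a bytes-based version), and in practice a library such as phpserialize would be used instead; this is illustrative only.

```python
def php_unserialize(data, pos=0):
    """Parse one PHP-serialised value starting at pos; return (value, next_pos).

    Handles only i: (int), b: (bool), s: (string), and a: (array); real
    l10n_cache blobs may contain other types and multi-byte strings.
    """
    tag = data[pos]
    if tag == 'i':                       # integer: i:123;
        end = data.index(';', pos)
        return int(data[pos + 2:end]), end + 1
    if tag == 'b':                       # boolean: b:1;
        return data[pos + 2] == '1', pos + 4
    if tag == 's':                       # string: s:3:"foo";
        colon = data.index(':', pos + 2)
        length = int(data[pos + 2:colon])
        start = colon + 2                # skip the opening :"
        return data[start:start + length], start + length + 2  # skip the closing ";
    if tag == 'a':                       # array: a:2:{key;value;...}
        colon = data.index(':', pos + 2)
        count = int(data[pos + 2:colon])
        pos = colon + 2                  # skip the opening :{
        result = {}
        for _ in range(count):
            key, pos = php_unserialize(data, pos)
            value, pos = php_unserialize(data, pos)
            result[key] = value
        return result, pos + 1           # skip the closing }
    raise ValueError("unsupported type %r at offset %d" % (tag, pos))


# Hypothetical namespace-style data, not the real l10n_cache layout:
blob = 'a:2:{i:0;s:5:"Media";i:1;s:7:"Special";}'
namespaces, _ = php_unserialize(blob)
print(namespaces)  # {0: 'Media', 1: 'Special'}
```

Even this toy version is a few dozen lines of fiddly index arithmetic, which is exactly the kind of code that hurts readability and portability.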
Can we switch away from serialised PHP to, say, JSON objects? If not, why not? Is there documentation of the rationale for using serialised PHP anywhere?
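For comparison, a hypothetical JSON encoding of the same kind of namespace data could be read from Python, R, Java, or nearly anything else with nothing beyond the standard library. The blob below is made up for illustration and does not reflect the actual l10n_cache schema:

```python
import json

# Hypothetical JSON equivalent of a PHP-serialised namespace array;
# illustrates portability, not the real storage layout.
blob = '{"0": "Media", "1": "Special"}'

namespaces = json.loads(blob)
print(namespaces["1"])  # Special
```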
Version: 1.23.0
Severity: enhancement