
Update list of most often used messages for MediaWiki core at Wikimedia
Closed, Resolved (Public)

Description

The file wikimedia-mostused-2011.txt should be updated, so that the translation group core-0-mostused at translatewiki.net can be updated.

Blog post: http://laxstrom.name/blag/2015/02/19/prioritizing-mediawikis-translation-strings/


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=63415

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:07 AM
bzimport set Reference to bz63416.
bzimport added a subscriber: Unknown Object (MLST).

I'm moving this out of the Translate component. Perhaps we could use EventLogging together with the new hook in the message cache and suitable sampling as a way to collect data this time.
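To illustrate the sampling idea, here is a minimal sketch. It does not use the real message cache hook or the EventLogging API; `make_sampled_recorder`, `estimate_totals`, and `sample_rate` are hypothetical names invented for this example. It only shows how recording a random fraction of lookups keeps the volume manageable while still allowing totals to be estimated.

```python
import random
from collections import Counter


def make_sampled_recorder(sample_rate, rng=None):
    """Return (record, counts); record(key) counts about sample_rate of its calls.

    Hypothetical helper: in practice record() would be called from whatever
    hook fires on each message key lookup.
    """
    rng = rng or random.Random()
    counts = Counter()

    def record(key):
        # rng.random() is in [0, 1), so a rate of 1.0 records every lookup
        # and a rate of 0.0 records none.
        if rng.random() < sample_rate:
            counts[key] += 1

    return record, counts


def estimate_totals(counts, sample_rate):
    """Scale sampled counts back up to estimated total lookups per key."""
    return {key: round(n / sample_rate) for key, n in counts.items()}
```

With `sample_rate=0.01`, roughly one lookup in a hundred would be recorded; the ranking of frequently used keys should be largely unaffected, while rarely used keys become unreliable.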

(In reply to Niklas Laxström from comment #1)

I'm moving this out of the Translate component. Perhaps we could use
EventLogging together with the new hook in the message cache and suitable
sampling as a way to collect data this time.

For this high volume this sounds like complete overkill. For previous versions, this took about 30 minutes of Tim Starling's time to gather the data, IIRC.

Tim is difficult to reproduce on demand.

It's been three weeks since this bug was filed, and years since data was collected last time.

(In reply to Niklas Laxström from comment #3)

Tim is difficult to reproduce on demand.

It's been three weeks since this bug was filed, and years since data was
collected last time.

That's not because Tim is hard to get a hold of, it's because I'm bad at scheduling an update.

(In reply to Siebrand Mazeland from comment #4)

That's not because Tim is hard to get a hold of, it's because I'm bad at
scheduling an update.

Typically it happens at the hackathon, so we have hopes. :)

Copying here a comment I made some time ago on https://meta.wikimedia.org/wiki/Talk:Language_proposal_policy: «the WMF could give assistance in updating the "500 most used" core messages, and adding a second "1500 most used" (or whatever) core+extensions messages (measured in a way that ensures the most used messages on e.g. Wikisource only are all included too)». Whatever is measured in the end (even just core), it would be nice to ensure it represents all Wikimedia projects (maybe Tim's method already does but we don't know).

(In reply to Siebrand Mazeland from comment #2)

For previous
versions, this took about 30 minutes of Tim Starling's time to gather the
data, IIRC.

I was wondering, was that a live hack or is there a commit on some wmf branch in SVN? The code may also be available to shell users on Tim's home directory on fenari or whatever; it takes a couple minutes to look it up.

(In reply to Nemo from comment #6)

(In reply to Siebrand Mazeland from comment #2)

For previous
versions, this took about 30 minutes of Tim Starling's time to gather the
data, IIRC.

I was wondering, was that a live hack or is there a commit on some wmf
branch in SVN? The code may also be available to shell users on Tim's home
directory on fenari or whatever; it takes a couple minutes to look it up.

There's a patch set somewhere in Subversion, I think (or maybe in Gerrit already). I say that, because I think I remember he reverted that to stop profiling.

Nikerabbit raised the priority of this task from Medium to High. Jan 24 2015, 3:44 PM
Nikerabbit added a project: translatewiki.net.
Nikerabbit set Security to None.

Over the years, multiple people have been begging me to get a refreshed list enabled at translatewiki.net. Let's get this bug fixed for real. That means devising a reproducible solution and not hunting old hacks or waiting for someone to ping someone. I will try to take advantage of the developer summit to get this fixed.

Talked with Ori briefly about this. He said using tcpdump on one of the memcached hosts could be a solution (though unless it is scripted, that doesn't seem like an easy way to update the list). Hopefully I will get more details later.

Tim told me that he just hacked some logging for an hour. Ori said he would teach me how to gather the data when I have time.

gerritbot subscribed.

Change 190777 had a related patch set uploaded (by Nikerabbit):
Temporarily log message key lookups on four app servers

https://gerrit.wikimedia.org/r/190777

Patch-For-Review

My recollection of the conversation here was that any list of the most used messages would invariably include messages used in the standard user interface (sidebar messages, footer messages, etc.), which really brought into question the value of such a report. The concern is that a high level of noise would drown out any signal. I don't really care if we do a sampled collection of hits, I just personally doubt there will be much utility in the results. That said, I'm pretty detached from the interface translation process and the people closer to it seem to really think that a refreshed report will help.

What you call "noise" is precisely the signal being sought.

The list will be post processed to filter out noise as well as we can. I don't understand your point about "would invariably include messages used in the standard user interface". That's exactly what we want: the most important messages (using requests/visibility as a proxy). The core itself has 3000 messages, many of them obscure things like exif info, rare error messages or disabled features. Not to mention crucial extensions, which also contain important messages. Finding those in the midst of 24k messages is what we are doing here, to help the translators prioritize their time and work.

The list will be post processed to filter out noise as well as we can. I don't understand your point about "would invariably include messages used in the standard user interface". That's exactly what we want: the most important messages (using requests/visibility as a proxy).

Thank you for clarifying. As someone who's not as closely involved in the translation process and who was largely relying on the task description here, this wasn't obvious to me. :-)

Finding those in the midst of 24k messages is what we are doing here, to help the translators prioritize their time and work.

Right... I think everyone is of the view that ideally all of the messages should be translated in every available language.

Ordering by most used feels a bit artificial and strange to me, but @Nemo_bis pointed out that https://meta.wikimedia.org/wiki/Language_proposal_policy does (in my opinion, bizarrely) reference this list:

As a baseline, it is recommended that you begin by translating the "most used MediaWiki messages".

And this is apparently a policy created/followed by the Language committee (LangCom). It's even marked as a global policy... wild.

In any case, if we're going to continue to update the most used messages list regularly, I agree with you that we should find a way to make the process more automated.

Change 190777 merged by jenkins-bot:
Temporarily log message key lookups on four app servers

https://gerrit.wikimedia.org/r/190777

I was tracking this in real-time and after a few minutes the frontrunners did not change, so I ran this for half an hour total.

During this time, there were 15,324,011 message key lookups in total. I have attached the full list. The top 100 are:

Message key                        Hits
word-separator                     643974
interlanguage-link-title           552468
parentheses                        512980
brackets                           468465
mainpage                           365891
scribunto-doc-page-name            352744
editsectionhint                    314825
editsection                        250348
pipe-separator                     170227
cite_reference_link_suffix         161050
cite_reference_link_prefix         161049
cite_references_link_suffix        143605
cite_references_link_prefix        143605
red-link-title                     139745
comma-separator                    125058
talkpagelinktext                   111075
aboutsite                          108211
rc-change-size-new                 103157
pagetitle                          101121
february                           94908
disclaimers                        93210
conversion-ns-1                    88336
privacy                            86732
conversion-ns6                     85100
conversion-ns14                    81630
conversion-ns3                     81542
conversion-ns10                    81302
conversion-ns2                     81288
conversion-ns1                     81276
conversion-ns15                    81248
conversion-ns7                     81242
conversion-ns8                     81238
conversion-ns11                    81236
conversion-ns829                   81234
conversion-ns828                   81234
conversion-ns13                    81234
conversion-ns12                    81234
conversion-ns9                     81233
cite_reference_link                80971
conversion-ns100                   80934
conversion-ns101                   80922
conversion-ns-2                    79108
contribslink                       72930
cite_reference_link_key_with_num   70317
mobile-frontend-editor-edit        66118
printableversion                   58057
feb                                57425
cite_references_link_one           55100
january                            53215
aboutpage                          50192
search                             49969
privacypage                        49852
disclaimerpage                     49844
cirrussearch-boost-templates       49372
december                           49309
wikilove-err-not-logged-in         48929
july                               47898
november                           47888
october                            47664
september                          47353
march                              47290
wikibase-repo-name                 47256
august                             47226
june                               46604
april                              46554
may_long                           46498
anonnotice                         46478
jan                                46396
dec                                45668
nov                                45255
oct                                45165
mar                                44989
aug                                44813
sep                                44712
jun                                44657
apr                                44531
jul                                44449
may                                44443
searchbutton                       43511
opensearch-desc                    43383
wikimedia-developers-url           43366
wikimedia-developers               43366
retrievedfrom                      43366
ffeed-enable-sidebar-links         43366
mobile-frontend-view               43364
site-atom-feed                     42878
conversion-ns4                     41778
conversion-ns5                     41757
interlanguage-link-title-langonly  41479
tooltip-search-fulltext            40841
tooltip-search                     40841
searchsuggest-search               40841
accesskey-search-fulltext          40837
accesskey-search                   40837
pt-login                           40064
pt-createaccount                   40064
sitenotice                         37918
colon-separator                    37430
contact-url                        37234
size-kilobytes                     36492
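
For anyone repeating this later, a ranking like the one above can be produced from raw lookup logs with a few lines of code. This is only a sketch under an assumed log format (one message key per line); the actual format written by the temporary logging patch may differ, and `top_message_keys` and `format_table` are hypothetical helper names.

```python
from collections import Counter


def top_message_keys(lines, limit=100):
    """Count message keys (assumed: one key per log line) and return the top ones.

    Returns a list of (key, hits) pairs, most frequent first.
    """
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common(limit)


def format_table(rows):
    """Render (key, hits) pairs as two whitespace-aligned plain-text columns."""
    width = max((len(key) for key, _ in rows), default=0)
    return "\n".join(f"{key.ljust(width)}  {hits}" for key, hits in rows)
```

Usage would be along the lines of `print(format_table(top_message_keys(open(path))))`, where `path` points at the collected log file.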

Change 191072 had a related patch set uploaded (by Nemo bis):
Add 2015 list of most used MediaWiki messages

https://gerrit.wikimedia.org/r/191072

Patch-For-Review

When I look at https://translatewiki.net/w/i.php?title=Translating:MediaWiki/Most_used_messages_table,_2015&oldid=6022702 , it's not without significance that the first message standing out is mobile-frontend-editor-edit, 12th in the ranking but with an order of magnitude fewer translations than its neighbours. Thanks to this list, we're going to fix it!

Change 191072 merged by jenkins-bot:
Add 2015 list of most used MediaWiki messages

https://gerrit.wikimedia.org/r/191072

Thanks to everyone involved, especially Ori and Nemo_bis for their time, so that I was able to focus on actually setting this message group up. It's now deployed to production.

Not sure how index got into the list; the only index message translatewiki.net knows about is a message from an obscure/unmaintained IndexFunction extension.

I guess this is thanks to ru.wikipedia (and a few other wikis) using MediaWiki:Index to include “A–Z index” link in the sidebar?

Not sure how index got into the list; the only index message translatewiki.net knows about is a message from an obscure/unmaintained IndexFunction extension.

I guess this is thanks to ru.wikipedia (and a few other wikis) using MediaWiki:Index to include “A–Z index” link in the sidebar?

That's plausible, let's remove "index". It's fine to keep improving the list with further patches, we made it a bit longer on purpose.

cleanup-most-used.php should perhaps be amended to check this stuff as well.

cleanup-most-used.php should perhaps be amended to check this stuff as well.

How? List the primary group it belongs to?
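
One way to picture "list the primary group" is the sketch below. It is not how cleanup-most-used.php works today; `check_most_used`, the `groups_by_key` mapping, and the group ids in the usage note are hypothetical and only illustrate flagging keys whose primary group is unexpected.

```python
def check_most_used(keys, groups_by_key, allowed_groups):
    """Split keys into (kept, flagged) by the primary group each belongs to.

    groups_by_key maps a message key to an ordered list of group ids, the
    first being treated as its primary group (an assumption of this sketch).
    flagged holds (key, primary_group) pairs for manual review; the primary
    group is None when the key is not known at all.
    """
    kept, flagged = [], []
    for key in keys:
        groups = groups_by_key.get(key, [])
        primary = groups[0] if groups else None
        if primary in allowed_groups:
            kept.append(key)
        else:
            flagged.append((key, primary))
    return kept, flagged
```

For example, with a made-up mapping `{"mainpage": ["core"], "index": ["ext-indexfunction"]}` and `allowed_groups={"core"}`, the key `index` would end up in the flagged list together with its primary group, so a human can decide whether it is accidental reuse.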

Jhs, see https://gerrit.wikimedia.org/r/192288 for metadata-fields which you reported (I can't find you in gerrit).

How? List the primary group it belongs to?

I've been thinking more about it but have no better idea so far. What could be used to compare contents as LocalisationUpdate would do? Perhaps get a cdb l10n file from production?

What exactly would you compare? That is just accidental message reuse.

Change 200822 had a related patch set uploaded (by Mormegil):
Remove 'index' from most-used list

https://gerrit.wikimedia.org/r/200822

Change 200822 merged by jenkins-bot:
Remove 'index' from most-used list

https://gerrit.wikimedia.org/r/200822