
Enable parallel processing of stub dump and full archive dump for same wiki.
Closed, Declined · Public

Description

Years ago Wikistats used to process the full archive dump for each wiki, the dump which contains the full text of every revision of every article. Only that type of dump file can yield word count, average article size, and some other content-based metrics. For a list of affected metrics see all partially empty columns at e.g. http://stats.wikimedia.org/EN/TablesWikipediaEN.htm (first table).

As the dumps grew larger and larger this was no longer possible on a monthly schedule, at least for the largest Wikipedia wikis. Processing the English full archive dump takes more than a month now by itself. Some very heavy regexps are partially to blame.

Many people have asked when the missing metrics will be revived. A pressing case was brought forward in the first days of 2014 in https://nl.wikipedia.org/wiki/Overleg_gebruiker:Erik_Zachte#Does German Wikipedia have a crisis? For example: "Can you find out if the growth of average size has significantly changed in 2013?"

At the moment there is limited parallelism within Wikistats dump processing. Two wikis from different projects can be processed in parallel, as each project has its own set of input/output folders. But processing two Wikipedia wikis at the same time could cause interference, as there are some project-wide csv files. Not to mention processing the stub and full archive dump for the same wiki at the same time, where all files for that wiki would be updated by two processes.

The simplest solution is to schedule full archive dump processing on a different server than stub dump processing (e.g. stat1 instead of stat1001?) and merge the few metrics that can be only collected from the full archive dumps into the csv files generated from the stub dumps.

This merge would require a separate script, which can fetch a csv file from one server and merge specific columns into the equivalent csv files on another server.

This/these csv file(s) should be protected against concurrent access (semaphore? how?), or the merge step should be part of the round-robin job which processes dumps whenever they become available (the latter being slightly less safe, as there is a theoretical chance that concurrent access could still occur, since extra runs are occasionally scheduled manually).
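For illustration, a minimal sketch of such a guard using an advisory file lock, assuming the merge step and the regular Wikistats run execute on the same host. Wikistats itself is Perl; this is only a Python sketch, and the path and helper name are hypothetical.

```python
# Hedged sketch only: guard a shared csv with an advisory lock (Linux/Unix).
# The csv path and the 'locked' helper are hypothetical, not part of Wikistats.
import fcntl
from contextlib import contextmanager

@contextmanager
def locked(csv_path):
    """Hold an exclusive lock on a sidecar .lock file while the csv is rewritten."""
    with open(csv_path + '.lock', 'w') as lockfile:
        fcntl.flock(lockfile, fcntl.LOCK_EX)   # blocks until any other holder releases it
        try:
            yield
        finally:
            fcntl.flock(lockfile, fcntl.LOCK_UN)

# usage in the merge step (hypothetical path):
# with locked('/path/to/StatisticsMonthly.csv'):
#     ...read, merge, and rewrite the csv...
```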


Version: unspecified
Severity: normal

Details

Reference
bz60826

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 2:53 AM
bzimport set Reference to bz60826.
bzimport added a subscriber: Unknown Object (MLST).

bingle-admin wrote:

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1429

I'd really like to see if we can use Hadoop for further processing of the dumps.

We can easily set up a Hadoop instance in Labs -- anybody interested in taking a crack at this?

-Toby

As discussed with Toby off-line, given the current functionality, replacing it with Hadoop will not be so simple. Possibly opportune, but some caution about the ETA seems warranted.

The new job will need to incorporate several filters: in Wikistats, countable namespaces are determined dynamically; redirects are filtered out with awareness of language-specific tags harvested from php files and WikiTranslate; and dumps need to be vetted for validity (ideally such housekeeping would be done by the dump process itself, but given the low bandwidth for dump maintenance over many years that might take a while, so right now the ugly approach of parsing html status files is used). Also, word count is far from the straightforward function implemented in some languages: markup, headers, links etc. are stripped first, and for some languages the current approach is aware of ideographic scripts and their different content density. This list is probably not exhaustive. Any rebuild will probably be less ambitious in some aspects (e.g. word count), but it will not be trivial.
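Purely to illustrate why word count is not trivial, a much simplified sketch in Python (the real Wikistats code is Perl and handles many more cases; the markup-stripping rules and script ranges below are assumptions):

```python
# Simplified word count sketch; NOT the Wikistats implementation.
import re

IDEOGRAPHIC = re.compile(r'[\u3040-\u30ff\u4e00-\u9fff]')   # kana + CJK ideographs (rough range)

def word_count(wikitext):
    text = re.sub(r'<[^>]+>', ' ', wikitext)                         # strip html-ish tags
    text = re.sub(r'\[\[(?:[^|\]]*\|)?([^\]]*)\]\]', r'\1', text)    # keep link labels only
    text = re.sub(r'(?m)^=+.*?=+\s*$', ' ', text)                    # drop section headers
    text = re.sub(r"'{2,}", '', text)                                # bold/italic markup
    ideographs = len(IDEOGRAPHIC.findall(text))
    other_words = len(re.findall(r'\w+', IDEOGRAPHIC.sub(' ', text)))
    # crude allowance for ideographic scripts: count each character as a word
    return other_words + ideographs
```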

Set importance to high, as this is a widely deplored bug and I have been getting mails about it every few months since 2010.

First step is done:

adapting Wikistats scripts

* new argument -F to force processing of full archive dumps (regardless of dump size)
* Wikistats can now handle segmented dumps (which BTW differ in file name between wp:de and wp:en); e.g. see the first 100 lines or so of http://dumps.wikimedia.org/enwiki/20140304/
* Wikistats can now also detect, for segmented dumps, whether an error occurred during dump generation, by parsing the dump job status report ('index.html') and looking for 'failed' in the appropriate sections; if found, it switches to the other dump format, and if neither dump format is valid, it falls back to an older dump (see the sketch after this list)
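A rough Python illustration of that status check (the actual code is Perl; the page layout assumed here, one list item per dump job, and the example job name are assumptions):

```python
# Sketch: check the dump status page for a failed job. The HTML layout and
# the job name below are assumptions; adjust to the real index.html structure.
import re
import urllib.request

def dump_job_failed(dump_dir_url, job_name):
    """Return True if the named job is marked 'failed' on the dump status page."""
    url = dump_dir_url.rstrip('/') + '/index.html'
    html = urllib.request.urlopen(url).read().decode('utf-8', 'replace')
    for item in re.findall(r'<li[^>]*>(.*?)</li>', html, re.DOTALL):
        if job_name in item and 'failed' in item.lower():
            return True
    return False

# e.g. (hypothetical job name):
# if dump_job_failed('http://dumps.wikimedia.org/enwiki/20140304', 'pages-meta-history'):
#     ...fall back to the other dump format or an older dump...
```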

Second step has started

collect counts from full archive dumps for Wikipedias only on stat1

* this will probably run for several weeks

Third step needs to be done

merge results from stat1 into stat1002

* make a small script that merges values (missing values only) from
  stat1:/[..]/StatisticsMonthly.csv into
  stat1002:/[..]/StatisticsMonthly.csv
  as part of the monthly Wikistats cycle stat1002:/[..]/count_report_publish_wp.sh (see the sketch below)
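A minimal sketch of such a merge in Python, assuming (this is an assumption about the layout of StatisticsMonthly.csv) that rows in both files share the same leading key columns and that only empty cells in the target should be filled. The actual merge script would be part of the Perl/shell Wikistats tooling; file names below are placeholders.

```python
# Sketch only: fill empty cells in the target csv from the source csv.
# Key columns and csv layout are assumptions about StatisticsMonthly.csv.
import csv

def merge_missing(source_path, target_path, key_cols=2):
    with open(source_path, newline='') as f:
        source = {tuple(row[:key_cols]): row for row in csv.reader(f)}

    with open(target_path, newline='') as f:
        rows = list(csv.reader(f))

    for row in rows:
        src = source.get(tuple(row[:key_cols]))
        if not src:
            continue
        for i, cell in enumerate(row):
            if cell == '' and i < len(src) and src[i] != '':
                row[i] = src[i]          # take the value collected from the full archive run

    with open(target_path, 'w', newline='') as f:
        csv.writer(f).writerows(rows)

# merge_missing('StatisticsMonthly_from_stat1.csv', 'StatisticsMonthly.csv')
```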

This task has not seen updates for 16 months. Is this still high priority?

Closing this ticket, as Wikistats version 1 is dead (see https://stats.wikimedia.org/Wikistats_1_announcements.htm). If this ticket is still a valid bug report or feature request for Wikistats 2, please reopen. Thanks a lot!