
Dump stats: switch to persistent stats rather than monthly regenerated stats
Closed, Declined (Public)

Description

On Wed, Mar 6, 2013 at 10:47 PM, Erik Zachte wrote:

I just realized that the YoY +2%, which LIMN shows and which I reported earlier, will probably be around 0% on the next report. As always, the latest active editor count shrinks 1% – 2% in the subsequent month, as more quick deletions of Jan 2012 content will still happen.

ErikM March 13, 2013:

I’d like to get this “deletion drift” issue on the mid-term agenda for the analytics team. While I understand why you chose this approach with WikiStats, I think the drawbacks outweigh the benefits. Having comparative data from month to month continually shift does materially impact our ability to plan and to understand trends in the data. Freezing the data at month-end and only making corrections in cases of genuine errors in measurement seems much preferable to me.

I think with such an approach, we can simply take as a given that some % of TAE are not making constructive contributions. That is a given anyway as we can’t account for quality of edits.

I realize this is very non-trivial given the way the data pipeline currently works, but I at least want to be on record that we should aim for numbers being frozen at measurement point and only corrected in case of measurement errors, as a design characteristic. That applies to article counts and other such measures as well—if we measured 3M articles in January 2012, and 2.5M in February 2012 because 500K articles were deleted (absurd example), that does not negate our 3M article measurement from January.


Version: unspecified
Severity: normal
URL: http://lists.wikimedia.org/pipermail/analytics/2013-March/000467.html

Details

Reference
bz46198

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 1:21 AM
bzimport set Reference to bz46198.
bzimport added a subscriber: Unknown Object (MLST).

Pros and cons of permanent stats have been discussed endlessly over the years. The current inclination among the analytics team and key users is to favor permanent stats. Updating historic stats due to new insights is deemed less important than giving the user a sense of stability. Updating due to bug fixing is still on the table.

Technically it could be added as a feature to the current Wikistats scripts, as follows: a runtime argument tells Wikistats whether to update all historic months or only a range of months (default: last month only).

Then all or some of the routines in WikiCountsOutput.pm that update the csv files would need to be adjusted, so that they add or replace data for a given wiki only for the given period rather than for all months.

A minimal implementation would be to do this only for key metrics in StatisticsMonthly.csv.
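As a rough illustration of the period-limited update described above, here is a minimal sketch in Python (the actual Wikistats scripts are Perl, and nothing below is taken from them). The column names `wiki` and `month`, the `YYYY-MM` month format, and the function name `merge_period` are all assumptions made for the example; the real layout of StatisticsMonthly.csv may differ.

```python
def merge_period(existing_rows, new_rows, first_month, last_month):
    """Merge freshly computed rows into stored rows for a month range only.

    Rows are dicts with hypothetical keys "wiki" and "month" ("YYYY-MM").
    Months outside [first_month, last_month] keep their stored (frozen)
    values; months inside the range are added or replaced.
    """
    # Index the stored rows by (wiki, month) so they can be overwritten.
    merged = {(r["wiki"], r["month"]): r for r in existing_rows}

    for row in new_rows:
        # "YYYY-MM" strings sort chronologically, so string comparison works.
        if first_month <= row["month"] <= last_month:
            merged[(row["wiki"], row["month"])] = row

    # Return rows in a stable order: by wiki, then by month.
    return [merged[key] for key in sorted(merged)]


stored = [
    {"wiki": "en", "month": "2013-01", "editors": "100"},
    {"wiki": "en", "month": "2013-02", "editors": "110"},
]
recomputed = [
    {"wiki": "en", "month": "2013-01", "editors": "98"},
    {"wiki": "en", "month": "2013-02", "editors": "108"},
    {"wiki": "en", "month": "2013-03", "editors": "120"},
]

# Default behaviour in the proposal: only the last month is written.
result = merge_period(stored, recomputed, "2013-03", "2013-03")
```

With the range limited to the last month, the January and February figures stay as originally stored, and only March is added. Passing a wider range would correspond to the "update all historic months" mode, for example after a bug fix.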

As the future of the dump-based Wikistats scripts is uncertain (Hadoop will likely take over), costs may outweigh benefits.

(In reply to Erik Zachte from comment #2)

As future of dump based Wikistats scripts is uncertain [...]
costs may outweigh benefits.

Did you consider the "archives" alternative, i.e. updating everything but archiving the old HTML instead of wiping it? If space is really an issue, I believe most space is taken by category trees and a few other things, so it would be enough to refrain from archiving those.
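The "archives" idea could look roughly like the following Python sketch, which copies the current HTML output into a dated folder before it is regenerated and skips the bulky files. The function name `archive_html`, the directory layout, and the `Category*` file pattern are hypothetical; which files actually hold the category trees would need to be checked against the real output.

```python
import shutil
from pathlib import Path


def archive_html(live_dir, archive_root, label, skip_patterns=("Category*",)):
    """Copy the current HTML output to archive_root/label before regeneration.

    Files or directories matching skip_patterns (hypothetical names for the
    space-heavy category trees) are left out of the archive.
    """
    target = Path(archive_root) / label
    # copytree creates the target directory and fails if it already exists,
    # which prevents overwriting an earlier snapshot by accident.
    shutil.copytree(live_dir, target, ignore=shutil.ignore_patterns(*skip_patterns))
    return target
```

A monthly run would call something like `archive_html("out/EN", "archive/EN", "2013-03")` before rewriting the live pages, so older reports stay available unchanged.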

Closing this ticket as Wikistats version 1 is dead (see https://stats.wikimedia.org/Wikistats_1_announcements.htm). If this ticket is still a valid bug report or feature request for Wikistats 2, please reopen it. Thanks a lot!