Story: WikimetricsUser downloads large CSV
Closed, Resolved | Public

Description

Example:
https://metrics.wmflabs.org/reports/result/437b30dd-f535-4d7e-b460-eefabaa07b2a.csv (won’t download)
https://metrics.wmflabs.org/reports/result/437b30dd-f535-4d7e-b460-eefabaa07b2a.json


Version: unspecified
Severity: major
Whiteboard: u=WikimetricsUser c=Wikimetrics p=8 s=2014-10-16

Details

Reference
bz71255

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:54 AM
bzimport set Reference to bz71255.

Dan's comment:
This seems related to the size of the download. I tried downloading a smaller result and it was fine. The CSV takes longer to generate than the JSON because the native storage format is JSON. Most likely it would download eventually, but it might take an unreasonable amount of time. The fix would be to optimize the conversion to CSV, perhaps by moving to a streaming converter like the one Yuvi implemented in Quarry.
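
For illustration only, here is a minimal sketch of what a streaming JSON-to-CSV conversion could look like. It is not the Quarry or Wikimetrics code; the function name `stream_csv`, the `bytes_added` column, and the shape of `results` are all assumptions made for the example.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, Tuple


def stream_csv(rows: Iterable[Tuple[str, Dict[str, float]]],
               metrics: Iterable[str]) -> Iterator[str]:
    """Yield CSV text one line at a time instead of building the whole file."""
    metrics = list(metrics)
    buffer = io.StringIO()
    writer = csv.writer(buffer)

    def flush() -> str:
        # Hand back what csv.writer just wrote, then reset the buffer.
        line = buffer.getvalue()
        buffer.seek(0)
        buffer.truncate(0)
        return line

    # Header row first, then one row per user.
    writer.writerow(["user"] + metrics)
    yield flush()
    for user, values in rows:
        writer.writerow([user] + [values.get(m, "") for m in metrics])
        yield flush()


# Hypothetical per-user results; the real Wikimetrics schema may differ.
results = {"Alice": {"bytes_added": 120}, "Bob": {"bytes_added": 45}}
chunks = list(stream_csv(results.items(), ["bytes_added"]))
```

A generator like this can be handed to a web framework that supports streamed responses, so the first rows reach the client while later rows are still being converted, rather than the whole file being built before anything is sent.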

Created attachment 16742
Cohort of student editors from fall 2014 on enwiki

A cohort of 2095 usernames (2090 are valid)

In many cases it never downloads at all. After many minutes, the user gets a 504 gateway error.

For example: https://metrics.wmflabs.org/reports/result/c04cf328-f198-4a12-81ec-9cb4badefc46.csv

vs. json which works fine: https://metrics.wmflabs.org/static/public/1504496.json

This is a cohort of 2090 users, with individual results for bytes added over a span of about 45 days.

Unfortunately, using JSON is not a viable alternative because it does not include the usernames like CSV does.

Change 167356 had a related patch set uploaded by Nuria:
i[WIP] Improving retrieval of user names on cvs report

https://gerrit.wikimedia.org/r/167356

Bug was reproducible running "pages created" with per-user results on the cohort attached to the bug.

The code changes fix the performance issues. Next we need to do some refactoring to see whether similar changes can be applied to the JSON report.
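
This thread does not show what change 167356 does internally. Purely as an illustration of one common way to speed up name retrieval for a large cohort (an assumption, not the actual patch), the sketch below fetches all user names in a single query instead of one query per user. The `users` table, its columns, and the `names_for` helper are hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical users table, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Carol")],
)


def names_for(conn, user_ids):
    """Return {user_id: user_name} using one query for the whole batch."""
    ids = list(user_ids)
    if not ids:
        return {}
    # One "?" placeholder per id; very long id lists may need to be chunked
    # to stay under the database's bound-parameter limit.
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        "SELECT user_id, user_name FROM users WHERE user_id IN (%s)" % placeholders,
        ids,
    )
    return dict(rows)


names = names_for(conn, [1, 3])
```

With a cohort of about 2000 users, a batched lookup of this kind replaces thousands of round trips with a handful, which is the sort of change that can take a report from minutes to seconds.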

Verified in staging that the CSV report for 2000 users (with per-user results) is created in a couple of seconds.

Tested a variety of reports with the cohort attached to this bug, and all of them run in seconds. Made sure to test the timeseries report too.

Change 167356 merged by Milimetric:
Improves retrieval of user names on csv report

https://gerrit.wikimedia.org/r/167356

Will be deployed after the sprint demo.