Include the dumps from http://dumps.wikimedia.org/other/pagecounts-raw/ in /public/datasets/public/
Version: unspecified
Severity: enhancement
This could be done but it's 3.1T (and will only get bigger); is there space for this, Ryan?
There is space, though 3.1T is big enough that I'd like to see if we can somehow manage to share access to a single copy rather than duplicate it around.
(In reply to comment #5)
... wait, that's already accessible through HTTP; why doesn't that suffice?
So everyone downloads their own copy, 3.1T worth, and puts it where?
It makes sense to me that we have one shared copy accessible to the lab projects.
If folks don't need the whole thing but only the most recent x days/weeks, I can arrange for that, as we do with the dumps, to save space.
I can't remember who initially asked me about this.
I imagine a year would suffice; how large would that be?
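As a rough answer, pagecounts-raw goes back to late 2007, so by the time of this thread the 3.1T spans roughly six years; a naive average puts one year at about half a terabyte, though recent years run larger as traffic grows. The six-year figure is an assumption for illustration:

```shell
# Back-of-envelope estimate only: 3.1T total (from this thread)
# divided by an assumed ~6 years of data. Recent years will be
# above this average.
awk 'BEGIN { printf "avg per year: %.2f T\n", 3.1 / 6 }'
```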
Talked with apergos about it, and I've gotten a good idea of what needs doing. Will try to do a puppet patchset in a while.
Assuming that 3.1T on NFS won't be an issue... :D
@Yuvi, Apergos: can we please coordinate this? The Analytics team is working on a MySQL setup with this data.
We have an RFC for making this pageview data queryable: https://www.mediawiki.org/wiki/Analytics/Hypercube
Change 91293 had a related patch set uploaded by Yuvipanda:
dumps: Copy pagecounts data to public labs nfs too
Change 91293 merged by ArielGlenn:
dumps: Copy pagecounts data to public labs nfs too
So the new labs NFS is not as big as the old one, and the pagecounts will only grow in size. I'd like to revisit how far back we keep the old files. Marc, what available space do we have and what gets used?
Folks, if I don't hear back on this soon I'm going to whack files so only the last year is there; we're down to 300 GB or so on that filesystem.
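For what it's worth, a retention sweep like the one described above could be sketched as below. The path and the 365-day window are assumptions (the real cleanup may key off filenames rather than mtimes), and this version only prints what it would delete:

```shell
#!/bin/sh
# Hypothetical retention sweep for the shared pagecounts copy.
# The directory and the 365-day window are assumptions, not the
# actual values used on labstore.
PAGECOUNTS_DIR="${PAGECOUNTS_DIR:-/public/dumps/pagecounts-raw}"
RETENTION_DAYS="${RETENTION_DAYS:-365}"

# Dry run: print the files that would be removed. Replace
# "echo rm --" with "rm -f --" once the list looks right.
if [ -d "$PAGECOUNTS_DIR" ]; then
    find "$PAGECOUNTS_DIR" -type f -mtime +"$RETENTION_DAYS" \
        -exec echo rm -- {} \;
fi
```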
Just to clarify, are we talking about:
scfc@tools-login:~$ df -h /public/dumps/pagecounts-raw
Filesystem                       Size  Used Avail Use% Mounted on
labstore.svc.eqiad.wmnet:/dumps  9.1T  8.1T  1.1T  89% /public/dumps
scfc@tools-login:~$
If so, I'd rather add a terabyte or two than delete stuff until [[mw:Analytics/Hypercube]] is available.
On the dumps FS? Fairly hard: it doesn't live on the shelves since it didn't need the same level of redundancy and fully fills its raid.
There are three things there, however, so perhaps we can move one to another filesystem. The dumps currently occupy 4.4T, and the pagecounts 3.7T.
In a pinch, I have the /scratch filesystem which has the same properties and has some 7T available.
There will be a new server allocated for dumps and pagecounts which will give us several times more space than we need, and will return that space from labstore1001 back to the usable pool.
More news soon.
(In reply to Tim Landscheidt from comment #24)
New server tracked in RT #7578.
I can't see that. I assume it's in procurement.
There's several more relevant tickets e.g. RT 7948, RT 8090
None have been updated for a couple weeks; maybe wikimania/travel got in the way.
Looks like we got at least as far as an OS install, then ran into puppet/DNS issues, and then I'm not sure what happened.
After a couple of odd issues with the underlying filesystem that took us several days to fix, the server is back online with the dumps.
The pagecounts aren't /quite/ done copying yet but up to 2014 and should be done soon.
For reference, the canonical location is:
/public/dumps/pagecounts-raw/