
Include pagecounts dumps in datasets
Closed, Resolved · Public

Description

Include the dumps from http://dumps.wikimedia.org/other/pagecounts-raw/ in /public/datasets/public/


Version: unspecified
Severity: enhancement

Details

Reference
bz48894

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 1:37 AM
bzimport added a project: Toolforge.
bzimport set Reference to bz48894.

Unattended for weeks; boosting priority.

This could be done but it's 3.1T (and will only get bigger); is there space for this, Ryan?

There is space, though 3.1T is big enough that I'd like to see if we can somehow manage to share access to a single copy rather than duplicate it around.

... wait, that's already accessible through HTTP; why doesn't that suffice?

(In reply to comment #5)

... wait, that's already accessible through HTTP; why doesn't that suffice?

So everyone downloads their own copy, 3.1T worth, and puts it where?

It makes sense to me that we have one shared copy accessible to the lab projects.

If folks don't need the whole thing but only the most recent x days/weeks, I can arrange for that, as we do with the dumps, to save space.

Did we decide folks need the whole 3.1T?

I can't remember who initially asked me about this.

I imagine a year would suffice; how large would that be?
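One way to answer the size question would be to sum a year's worth of the pagecounts tree with du. This is only a sketch: the real mount point and the year-month directory layout (2013-01/, 2013-02/, ...) are assumptions based on the upstream dumps layout, and a small temp tree stands in for the NFS mount so the commands run as-is.

```shell
# Demo tree standing in for /public/pagecounts/pagecounts-raw (hypothetical path)
root=$(mktemp -d)
mkdir -p "$root"/2013-01 "$root"/2013-02
# Two 1 MiB placeholder files in place of real hourly dumps
head -c 1048576 /dev/zero > "$root"/2013-01/pagecounts-20130101-000000.gz
head -c 1048576 /dev/zero > "$root"/2013-02/pagecounts-20130201-000000.gz
# Sum all of 2013's directories, in KB
year_kb=$(du -sk "$root"/2013-* | awk '{sum += $1} END {print sum}')
echo "2013 total: ${year_kb} KB"
```

Against the real tree, `du -sch "$root"/2013-*/` would print per-month subtotals plus a grand total.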

If it is no problem getting all 3.1T of it in Labs, we should!

Talked with apergos about it, and I've gotten a good idea of what needs doing. Will try to do a puppet patchset in a while.

Assuming that 3.1T on NFS won't be an issue... :D

@Yuvi, Apergos: can we please coordinate this? The Analytics team is working on a MySQL setup with this data.

Change 91293 had a related patch set uploaded by Yuvipanda:
dumps: Copy pagecounts data to public labs nfs too

https://gerrit.wikimedia.org/r/91293

Change 91293 merged by ArielGlenn:
dumps: Copy pagecounts data to public labs nfs too

https://gerrit.wikimedia.org/r/91293

Now available in /public/pagecounts/pagecounts-raw/

So the new labs NFS is not as big as the old one, and the pagecounts will only grow in size. I'd like to revisit how far back we keep the old files. Marc, what available space do we have and what gets used?

Folks, if I don't hear back on this soon I'm going to whack files so only the last year is there; we're down to 300 GB or so on that filesystem.
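The pruning proposed above could be done with a find-based sweep: delete pagecounts files whose mtime is older than 365 days. The real path is hypothetical here; a temp tree with one back-dated file is used so the commands are runnable as written (the relative `-d '2 years ago'` syntax assumes GNU touch).

```shell
# Stand-in for /public/pagecounts/pagecounts-raw (hypothetical path)
root=$(mktemp -d)
touch -d '2 years ago' "$root/pagecounts-old.gz"   # should be pruned
touch "$root/pagecounts-new.gz"                    # should survive
# Delete pagecounts files not modified in the last 365 days
find "$root" -type f -name 'pagecounts-*' -mtime +365 -delete
ls "$root"
```

Running the same `find` with `-print` instead of `-delete` first is the usual dry-run safeguard before letting it loose on 3+ TB of data.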

Just to clarify, are we talking about:

scfc@tools-login:~$ df -h /public/dumps/pagecounts-raw
Filesystem Size Used Avail Use% Mounted on
labstore.svc.eqiad.wmnet:/dumps 9.1T 8.1T 1.1T 89% /public/dumps
scfc@tools-login:~$

If so, I'd rather add a terabyte or two than delete stuff until [[mw:Analytics/Hypercube]] is available.

I don't know how feasible that is. Marc?

On the dumps FS? Fairly hard: it doesn't live on the shelves since it didn't need the same level of redundancy, and it fully fills its RAID.

There are three things there, however, so perhaps we can move one to another filesystem. The dumps currently occupy 4.4T, and the pagecounts 3.7T.

In a pinch, I have the /scratch filesystem which has the same properties and has some 7T available.

There will be a new server allocated for dumps and pagecounts which will give us several times more space than we need, and will return that space from labstore1001 back to the usable pool.

More news soon.

  • Bug 67909 has been marked as a duplicate of this bug.

New server tracked in RT #7578.

(In reply to Tim Landscheidt from comment #24)

New server tracked in RT #7578.

I can't see that. I assume it's in procurement.

There are several more relevant tickets, e.g. RT 7948 and RT 8090.

None have been updated for a couple of weeks; maybe Wikimania travel got in the way.

Looks like we got at least as far as an OS install, then issues with puppet/DNS, and then I don't know.

After a couple of odd issues with the underlying filesystem that took us several days to fix, the server is back online with the dumps.

The pagecounts aren't /quite/ done copying yet, but they're up to 2014 and should be done soon.

For reference, the canonical location is:

/public/dumps/pagecounts-raw/