Page MenuHomePhabricator

DBQ-129 create a list of total individual user edits per month for English wikipedia
Closed, ResolvedPublic

Description

This issue was converted from https://jira.toolserver.org/browse/DBQ-129.
Summary: create a list of total individual user edits per month for English wikipedia
Issue type: Task - A task that needs to be done.
Priority: Major
Status: Done
Assignee: Hoo man <hoo@online.de>


From: Chris55 <moschris@googlemail.com>

Date: Mon, 21 Mar 2011 23:38:58

Output 3 colums: userid, month, totaledits_this_month for English wikipedia from the start with users with non-zero edits. IP edits can be disregarded.
Having now looked at the database definition I believe that the sql statement I need executed is:
select rev_user, left(rev_timestamp,6), count![][1] from revision where rev_user!=0 group by rev_user,left(rev_timestamp,6);

If doing this in parts an extra condition such as left(rev_timestamp,4)=='2005' is needed.
Thanks

[1]: https://jira.toolserver.org/images/icons/emoticons/star_yellow.gif

Version: unspecified
Severity: major

Details

Reference
bz59384

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 2:27 AM
bzimport set Reference to bz59384.

From: Chris55 <moschris@googlemail.com>

Date: Tue, 22 Mar 2011 23:35:23

distinguish IP & registered users


From: Hoo man <hoo@online.de>

Date: Wed, 23 Mar 2011 19:08:33

So I'm understanding you right, you want it this way:
user_id, user_name, edits
for every month from the start of enwiki on?
(user_name is required if you want IP edits as well, cause IPs don't have a user_id)

If so, you have to be patient, cause this will take quite a while (I have to run it per month to not kill the server ![][1])
Further I will have a separate file for every month, if that's ok

[1]: https://jira.toolserver.org/images/icons/emoticons/tongue.gif

From: Chris55 <moschris@googlemail.com>

Date: Wed, 23 Mar 2011 21:16:03

Not interested in IP edits, so rev_user!=0 and so I don't need user names - but I do need the month in the output.
There are about 650,000 current users, growing approximately linearly since 2005, so that makes 23m possible entries; but probably most ids only appear in a single month, so the actual output will be much smaller, say 2m. *20 char/record=40mb output, compressed.
Can we compromise on yearly batches? I'm trying to nail down the problems with the growth of users mainly in the period 2005-8, so maybe yearly from 2005 with 2001-4 in one batch? But if you have to take smaller bites, so be it.


From: Hoo man <hoo@online.de>

Date: Thu, 24 Mar 2011 15:03:26

I just wanted to let you know, that I've started the script ![][1]

I'll let you know when it's finished ![][2]

[1]: https://jira.toolserver.org/images/icons/emoticons/wink.gif
[2]: https://jira.toolserver.org/images/icons/emoticons/smile.gif

From: Chris55 <moschris@googlemail.com>

Date: Tue, 29 Mar 2011 12:20:20

Have you any idea how long this might take? I presume no partial results will be available.
I wonder slightly at the tactic of splitting the task up too finely: it's a straight query without joins and doing it in too small chunks (ie 120 cf 10) might lead to a significant increase in overall runtime.


From: Hoo man <hoo@online.de>

Date: Tue, 29 Mar 2011 12:47:27

Be sure, I know what I'm doing (I'm not using SQL the first time ![][1])... running it in one would take days and we have an active query killer, so that queries aren't usually able to run for more than 5-6 hours, further that would eat a lot of ram ![][2]

I'm currently at 2008-09 and I'm confident that I can give you the results tomorrow (in one big file and in smaller per month files)

Sorry, that it took so long, but I first choose the wrong (a slower) DB server, so I lost a lot of time ![][3]

[1]: https://jira.toolserver.org/images/icons/emoticons/tongue.gif
[2]: https://jira.toolserver.org/images/icons/emoticons/wink.gif
[3]: https://jira.toolserver.org/images/icons/emoticons/sad.gif

From: Chris55 <moschris@googlemail.com>

Date: Tue, 29 Mar 2011 13:15:38

Thanks - wasn't doubting your competence! I'm aware that estimating resource requirements in large databases is a fine art.


From: Hoo man <hoo@online.de>

Date: Thu, 31 Mar 2011 12:32:28

I'm sorry, but I've calculated the ETA wrong (I have thought that the number of contributors would at maximum double after late 2005, but it grew even further), so my script is only at 2010-02 atm.

Therefore I can't give you the results today ![][1]

I've recalculated the ETA now and my result is over 40 hours, therefore you will have to wait till Saturday (April 2)

[1]: https://jira.toolserver.org/images/icons/emoticons/sad.gif

From: Chris55 <moschris@googlemail.com>

Date: Thu, 31 Mar 2011 14:29:40

Is it possible to release the early years now? I had thought of putting in a report for the UK Wiki conference and the deadline is 2 Apr so I need to do at least the 2007 analysis soon.


From: Hoo man <hoo@online.de>

Date: Thu, 31 Mar 2011 14:39:24

Sure, I can give you the results up to 2010-02 in a few hours (still have to re run some sub queries because the query killer killed some of the earlier queries) ![][1]

Do you still have a need for the rest of the data then (after April 2)?

[1]: https://jira.toolserver.org/images/icons/emoticons/wink.gif

From: Chris55 <moschris@googlemail.com>

Date: Thu, 31 Mar 2011 15:34:24

Hmm, it's obviously getting harder ![][1]
It seems that the whole db is getting too large to get proper stats these days. It's impossible to get any detail beyond the start of 2010. So - yes, why don't we call it a day at 2010/02?

[1]: https://jira.toolserver.org/images/icons/emoticons/sad.gif

From: Hoo man <hoo@online.de>

Date: Fri, 01 Apr 2011 11:45:03

A few more days and it would been finished...

Code (wrote it in php cause I was to lazy for bash ![][1], but that doesn't matter):

#!/usr/bin/php -q
<?php
$current = '201103';
$month =  '05';
$year = '2007';
$log_file = '/mnt/user-store/hoo/dbq-129.log.txt';
$save_file_pre = '/mnt/user-store/hoo/dbq/dbq-129-';
$db = 'enwiki_p';
$server = '10.24.1.33';
while(true) {
        $month++;
        if($month > 12) {
                $month = '01';
                $year++;
        }
        if(strlen($month) == 1) {
                $month = '0' . $month;
        }
        if($current < $year . $month) {
                break;
        }
        while(true) {
                $query = "SELECT /* SLOW_OK */ rev_user, COUNT(*) as edits, LEFT(rev_timestamp, 6) as month FROM revision WHERE rev_timestamp LIKE '" . $year . $month . "%' AND rev_user != 0 GROUP BY rev_user;";
                $file = $save_file_pre . $year . '-' . $month . '.txt';
                exec('mysql --host=' . $server . ' --database=' . $db . ' -e"' . $query . '" | cat > ' . $file);
                if(filesize($file) != 0) {
                        break;
                }
        }
}
?>

Result:
http://toolserver.org/~hoo/dbq/dbq-129/ (dir with the per month files)
http://toolserver.org/~hoo/dbq/dbq-129/full.txt (everything in one file (plain text ~125MB), I haven't tested it, but I hope it's going to import fine ![][2] )

[1]: https://jira.toolserver.org/images/icons/emoticons/tongue.gif
[2]: https://jira.toolserver.org/images/icons/emoticons/wink.gif

From: Chris55 <moschris@googlemail.com>

Date: Fri, 01 Apr 2011 13:04:32

Thanks, I've downloaded it and it looks fine.
However I asked that it wouldn't include IP edits but it appears to and I have no way of eliminating them.
The stats say that by Feb 2010 there were 573,891 Wikipedians but the file includes 3,328,228 user ids for 7,704,721 monthly records. Looking at the id's, there are a lot of 8 digit ids even in 2001 which most appear for very short periods.
Since I'm particularly interested in people who've created ids but never made more than a few edits it's crucial that I can eliminate IP edits. Is there any way to do it with the file you've given me?


From: Hoo man <hoo@online.de>

Date: Fri, 01 Apr 2011 13:17:10

The files seem to be correct, all IP edits have rev_user = 0. The strange user_id thingy seems fine as well eg. user "11419925" who made 2 edits in 2001-02:

mysql> SELECT * FROM user WHERE user_id = '11419925';
+----------+-------------------+-------------------+----------------+
| user_id  | user_name         | user_registration | user_editcount |
+----------+-------------------+-------------------+----------------+
| 11419925 | Cobrand.bomis.com | 20100112120114    |              0 |
+----------+-------------------+-------------------+----------------+
1 row in set (0.02 sec)

And on wiki you can see (two edits in ZeuS):
http://en.wikipedia.org/wiki/Special:Contributions/Cobrand.bomis.com

Things like this are caused by imports from the "old" wikipedia eg. for the example above the edits are from: http://nostalgia.wikipedia.org/w/index.php?title=ZeuS&action=history
such edits are linked to users with the same name, even if they registered years later ![][1]

[1]: https://jira.toolserver.org/images/icons/emoticons/wink.gif

From: Chris55 <moschris@googlemail.com>

Date: Fri, 01 Apr 2011 13:56:59

Hmm, the data is a lot more noisy than I'd hoped. <a href="http://en.wikipedia.org/wiki/Special:Statistics">This page</a> says there are 14m registered users and the number can't surely have quadrupled in the last year. It says 453m edits whereas the sum in the file you've given me is 230m which is nearer to being likely but still not reassuring. The edits by top users are nothing like the published figures (starting at 2.8m) but I'll guess those don't include bots.
Oh well, I'll continue to see if I can get anything meaningful out...


From: Hoo man <hoo@online.de>

Date: Fri, 01 Apr 2011 14:06:12

Well, of course the revision number differs because of three reasons: it ends in 2010-02, we excluded IPs and the archive table (which includes the revision of deleted pages) hasn't been included.
And yes, bots are included (this is only avoidable with a join against the user_groups table, which would be quite slow)

If you need it, I can give you a list of all bots, so that you can easily exclude them out of your data...

I'm now going to import the data myself to my workstation just to take a short look at the data


From: Hoo man <hoo@online.de>

Date: Fri, 01 Apr 2011 14:38:35

I've just noticed the following on the imported data: the full.txt file didn't include the two 2010 months ![][1]
I've updated it now ![][2]

(The script I used to merge them http://toolserver.org/~hoo/dbq/dbq-129/merge.sh used 200 as pattern to find the files, but the 2010 files aren't including 200 in their names. I've now changed the script)

[1]: https://jira.toolserver.org/images/icons/emoticons/sad.gif
[2]: https://jira.toolserver.org/images/icons/emoticons/wink.gif

This bug was imported as RESOLVED. The original assignee has therefore not been
set, and the original reporters/responders have not been added as CC, to
prevent bugspam.

If you re-open this bug, please consider adding these people to the CC list:
Original assignee: hoo@online.de
CC list: hoo@online.de