
DBQ-151 Need the revision histories from editors who made at least one revision or edit to (only) namespaces 0-5 during (only) the time period of April 1st, 2009 to March 31st 2010.
Closed, ResolvedPublic

Description

This issue was converted from https://jira.toolserver.org/browse/DBQ-151.
Summary: Need the revision histories from editors who made at least one revision or edit to (only) namespaces 0-5 during (only) the time period of April 1st, 2009 to March 31st 2010.
Issue type: Task - A task that needs to be done.
Priority: Major
Status: Done
Assignee: Hoo man <hoo@online.de>


From: Jmmalo04 <jmmalo03@gmail.com>

Date: Wed, 17 Aug 2011 01:11:00

It is easiest to describe this request in two steps:

(1) I need a list of the editors who made one or more revisions to namespaces 0-5 during the period of April 1st, 2009 to March 31st 2010. If an editor did not make at least one edit to namespaces 0-5 during the aforementioned time period, they should not be included.

(2) From this list of editors, please select (as randomly as possible) 100,000 editors. For each of these editors I need a history of all their revisions. For each revision I need the following information:
(a) Timestamp from each revision made by the editor
(b) The increase / decrease in number of characters compared with the previous revision of the article
(c) Was the editor's revision reverted? (yes/no)
(d) The namespace of the revision
(e) The current size in number of characters of the revision
(f) If it is available, the category of the article in the Wikimedia Taxonomy Project.

I'm new at this so please forgive me if my request is missing some details. I will be watching this closely so feel free to ask me questions if anything is unclear, I will answer promptly. Thanks in advance for your help!


Version: unspecified
Severity: major

Details

Reference
bz59413

Event Timeline

bzimport raised the priority of this task to Needs Triage. (Nov 22 2014, 2:29 AM)
bzimport set Reference to bz59413.

From: Hoo man <hoo@online.de>

Date: Wed, 17 Aug 2011 16:46:25

I don't think that this is feasible the way you request it. The enwiki revision table is really big, so doing this for more than one month at a time might not be doable. Furthermore, selecting user contribs for over 100k users would be too much data as well.


From: Jmmalo04 <jmmalo03@gmail.com>

Date: Fri, 19 Aug 2011 07:46:29

Thanks for your quick response. I've been learning to use the wiki API and that has helped a lot. However, I still need to get a list of all the users who made one or more edits to namespaces 0-5 during the time period of April 1st, 2009 to March 31st, 2010. Is this request possible? I only need a list of user names now; I should be able to do the rest myself with the API.

Thanks, Jordan
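
For readers following along, here is a minimal sketch of how the per-user part of the request might be started with the API. It assumes "the wiki API" means the MediaWiki action API on English Wikipedia and its `list=usercontribs` module; the endpoint constant and the helper name `usercontribs_url` are illustrative and do not come from this thread. The `timestamp`, `sizediff`, and `size` properties correspond to items (a), (b), and (e) of the original request, and `title` also returns the namespace number for item (d). Whether an edit was reverted, item (c), is not covered by this sketch.

```python
from urllib.parse import urlencode

# Assumption: English Wikipedia's action API endpoint.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def usercontribs_url(user_name: str, limit: int = 500) -> str:
    """Build a request URL for one page of a user's contributions.

    Hypothetical helper: it only assembles the URL and does not send it.
    """
    params = {
        "action": "query",
        "list": "usercontribs",
        "ucuser": user_name,
        # title (includes namespace), timestamp, revision size, size change
        "ucprop": "title|timestamp|size|sizediff",
        "uclimit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(usercontribs_url("Example User"))
```

A single request returns only one page of results, so a full history would need the continuation values from each response to be sent back in follow-up requests.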


From: Jmmalo04 <jmmalo03@gmail.com>

Date: Fri, 19 Aug 2011 15:52:59

Also, I do not need all the names. I could accept a random (or as random as possible) subset, preferably 100,000 names, or even just 50,000 if that's easier.


From: Hoo man <hoo@online.de>

Date: Mon, 22 Aug 2011 13:13:45

SQL:

INSERT /* SLOW_OK */ INTO u_hoo.dbq151 (user_name)
SELECT DISTINCT rev_user_text
FROM revision
INNER JOIN page ON rev_page = page_id
WHERE rev_user != 0
  AND LEFT(rev_timestamp, 6) = 200904
  AND page_namespace < 6
LIMIT 10000;

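This selects 10,000 registered users who edited namespaces 0-5 in one month (April 2009 in the statement above) and saves them to a temporary table. I ran it once for every month of the period. Afterwards I just needed to get all entries of that user list: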

SELECT DISTINCT * FROM u_hoo.dbq151;

Many users are in the temporary table multiple times, so out of the 120,000 rows (12 months of 10,000 each) only 57,825 unique users have been selected.

Result:
http://toolserver.org/~hoo/dbq/dbq-151.txt (plain text)
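
The month-by-month approach described above can be sketched as a small script that generates one statement per month of the requested period. This is only an illustration: the table name and SQL text are copied from the comment, while the helper names `month_prefixes` and `monthly_queries` are hypothetical, and the script prints the statements rather than running them against a database.

```python
from datetime import date

# SQL text taken from the comment above; only the YYYYMM prefix varies.
QUERY_TEMPLATE = (
    "INSERT /* SLOW_OK */ INTO u_hoo.dbq151 (user_name) "
    "SELECT DISTINCT rev_user_text FROM revision "
    "INNER JOIN page ON rev_page = page_id "
    "WHERE rev_user != 0 AND LEFT(rev_timestamp, 6) = {prefix} "
    "AND page_namespace < 6 LIMIT 10000;"
)


def month_prefixes(start: date, end: date) -> list[str]:
    """Return YYYYMM prefixes for every month from start to end, inclusive."""
    prefixes = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        prefixes.append(f"{year:04d}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return prefixes


def monthly_queries(start: date, end: date) -> list[str]:
    """Return one INSERT statement per month in the period."""
    return [QUERY_TEMPLATE.format(prefix=p) for p in month_prefixes(start, end)]


if __name__ == "__main__":
    for query in monthly_queries(date(2009, 4, 1), date(2010, 3, 31)):
        print(query)
```

Because each statement uses LIMIT without an ORDER BY, which 10,000 users are picked each month depends on how the database happens to read the rows, so the result is best treated as an arbitrary subset and not as a uniform random sample.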


From: Jmmalo04 <jmmalo03@gmail.com>

Date: Mon, 22 Aug 2011 16:23:45

This will work! Thanks a lot, Hoo man, I appreciate it!
