
Create SQL dumps for certain database tables, partial if necessary (tracking)
Closed, DeclinedPublic

Description

Currently the database dumps are configured to skip certain tables altogether due to privacy concerns. For example, the user table is marked private entirely.

The user table contains a number of columns that are not private (e.g., user_id, user_name, user_registration, user_editcount). These should be publicly dumped.

The same is true for the ipblocks table, which has a lot of columns that can be safely exposed.

There may be other tables that can be safely exposed; I haven't done a full audit.

Main page on the topic, with use cases: https://www.mediawiki.org/wiki/Research_Data_Proposals


Version: unspecified
Severity: enhancement
URL: http://dumps.wikimedia.org/backup-index.html

Details

Reference
bz25602

Event Timeline

bzimport set the priority of this task to Lowest. Nov 21 2014, 11:11 PM
bzimport set Reference to bz25602.

matthew.britton wrote:

(In reply to comment #0)

The user table contains a number of columns that are not private (e.g.,
user_id, user_name, user_registration, user_editcount). These should be
publicly dumped.

You'd think so, but given the amount of effort that goes into various mechanisms for hiding abusive user names from public lists, I can't see this being implemented. ID and registration aren't much use without the name, and even edit count has prompted privacy complaints and removal requests when it has appeared on public lists before...

The useful info in those columns (e.g. editcount) can be derived in other ways without those columns.

(In reply to comment #2)

The useful info in those columns (e.g. editcount) can be derived in other ways
without those columns.

I think user.user_name and user.user_id are fairly important pieces of information (among others). Re-opening this.

(In reply to comment #3)

I think user.user_name and user.user_id are fairly important pieces of
information (among others). Re-opening this.

You are welcome to submit a patch to do this.

(In reply to comment #4)

(In reply to comment #3)

I think user.user_name and user.user_id are fairly important pieces of
information (among others). Re-opening this.

You are welcome to submit a patch to do this.

Yes. So is anyone else. That misses the larger point: don't close bugs as "wontfix" when the bug is valid. It doesn't matter if it's a low priority for the Wikimedia Foundation (or anyone else); there is no deadline.

Moving bug to Datasets product namespace.

Folks using the dumps are probably only interested in the names of users that have made at least one (non-hidden) edit. That information is available by grabbing user names and ids from the stubs dumps; while it might take a little work to grep it out, reformat it, and pass it to uniq, it's not prohibitive. Alternatively you could get it from the pages-logging xml file. You could also get the names from the api in a very straightforward way, whether or not the users have ever made an edit: http://en.wikipedia.org/w/api.php?action=query&list=allusers
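For illustration, a minimal sketch of the API route mentioned above, assuming the standard list=allusers response shape. The endpoint, batch size, and maxlag value are placeholders, and a real job would live in the dumps infrastructure rather than in a standalone script:

```
#!/usr/bin/env python3
"""Rough sketch only: page through list=allusers to pull the public
user fields (id, name, editcount, registration) for one wiki."""
import time
import requests

API = "https://en.wikipedia.org/w/api.php"  # placeholder endpoint

def all_users(batch_size=500):
    """Yield (user_id, user_name, editcount, registration) tuples."""
    session = requests.Session()
    params = {
        "action": "query",
        "list": "allusers",
        "auprop": "editcount|registration",
        "aulimit": batch_size,  # 500 anonymously, 5000 with apihighlimits
        "maxlag": 5,            # ask the API to refuse work when replicas lag
        "format": "json",
    }
    while True:
        reply = session.get(API, params=params, timeout=60).json()
        if reply.get("error", {}).get("code") == "maxlag":
            time.sleep(5)       # replicas are lagged; wait and retry
            continue
        for user in reply["query"]["allusers"]:
            yield (user["userid"], user["name"],
                   user.get("editcount"), user.get("registration"))
        if "continue" not in reply:
            break
        params.update(reply["continue"])  # carries aufrom for the next batch

if __name__ == "__main__":
    for row in all_users():
        print("\t".join("" if field is None else str(field) for field in row))
```

Pointed at each wiki's api.php in turn and left running, the maxlag parameter plus the retry makes this safe to run unattended, which is the point made further down the thread.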

If I were going to produce these dumps, it wouldn't be a straight select, because we'd have to check whether the user name is hidden or whether the public is permitted to see it. This means a PHP script and xml output and probably batched queries, for a measly two fields that are already available through other means. So I'm really leaning towards WONTFIX.
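To make the objection concrete, here is a rough sketch (not the actual dumps code) of the kind of batched query being described, assuming the MediaWiki schema of the time, in which a name hidden via 'hideuser' is marked by an ipblocks row with ipb_deleted = 1; the driver, connection handling, and plain TSV output are placeholders, and a real job would emit the dumps' usual format:

```
"""Sketch of a batched dump of the two public user fields, skipping
names hidden via a 'hideuser' block (assumed: ipblocks.ipb_deleted = 1
marks them, as in the MediaWiki schema of this era)."""
import pymysql  # assumed driver; the real job would use the dumps tooling

BATCH = 1000

def dump_public_user_rows(conn, out):
    """Write user_id<TAB>user_name lines for every non-hidden account."""
    last_id = 0
    with conn.cursor() as cur:
        while True:
            # Keyset pagination instead of one enormous SELECT.
            cur.execute(
                "SELECT u.user_id, u.user_name"
                "  FROM user u"
                "  LEFT JOIN ipblocks b"
                "    ON b.ipb_user = u.user_id AND b.ipb_deleted = 1"
                " WHERE u.user_id > %s AND b.ipb_user IS NULL"
                " ORDER BY u.user_id LIMIT %s",
                (last_id, BATCH),
            )
            rows = cur.fetchall()
            if not rows:
                break
            for user_id, user_name in rows:
                name = user_name.decode("utf-8") if isinstance(user_name, bytes) else user_name
                out.write("%d\t%s\n" % (user_id, name))
            last_id = rows[-1][0]
```

Even in sketch form, the comment's point stands: this is extra machinery for two fields that are already obtainable elsewhere.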

(In reply to comment #7)

If I were going to produce these dumps, it wouldn't be a straight select, because we'd have to check whether the user name is hidden or whether the public is permitted to see it. This means a PHP script and xml output and probably batched queries, for a measly two fields that are already available through other means. So I'm really leaning towards WONTFIX.

This bug covers far more than two fields. If you'd like a full list of missing tables and fields from the public dumps, I can put one together.

The English Wikipedia has over 19,000,000 registered users. While the MediaWiki Web API can be used to retrieve some of this information, are we really suggesting that polling the API 3,800 times (this assumes batches of 5,000) is the best way to dump the user table? That seems kind of insane.

(In reply to comment #8)

This bug covers far more than two fields.

For the user table we are talking about two fields.

If you'd like a full list of missing tables and fields from the public dumps, I can put one together.

If folks want other partial tables, they should request them in separate bugs. The discussion around privacy and/or necessity for each will likely be different.

The English Wikipedia has over 19,000,000 registered users. While the MediaWiki Web API can be used to retrieve some of this information, are we really suggesting that polling the API 3,800 times (this assumes batches of 5,000) is the best way to dump the user table? That seems kind of insane.

With a script doing the work, as long as it respects maxlag, who cares? Just fire it up, go do other things, and come back at some point to see when it's done.

If you are concerned about multiple users doing this same set of requests instead of a single producer providing the list for download, I can see that as a legitimate complaint. But that too would be a different bug: it would be nice if users had a space that they could put data sets that they generate, for re-use by others. I think that could be managed by interested users off-site though, without WMF intervention.

(In reply to comment #9)

(In reply to comment #8)

This bug covers far more than two fields.

For the user table we are talking about two fields.

No, as I understand it, we operate on a principle of least restriction: every field that can be safely exposed should be. In the user table, there's no reason not to expose user_id, user_name, user_editcount, and user_registration, as far as I can tell.

If folks want other partial tables, they should request them in separate bugs.

Okay, I'll re-purpose this bug to focus specifically on the user table and will file separate bugs for the other tables.

With a script doing the work, as long as it respects maxlag, who cares? Just
fire it up, go do other things, and come back at some point to see when it's
done.

Respectfully, this doesn't make any sense. User IDs and user names are not some minor detail. Why would we provide dumps of every other table (many of which are far less useful) and then leave out the user table, one of the most important tables?

If you are concerned about multiple users doing this same set of requests
instead of a single producer providing the list for download, I can see that
as a legitimate complaint. But that too would be a different bug: it would
be nice if users had a space that they could put data sets that they
generate, for re-use by others. I think that could be managed by interested
users off-site though, without WMF intervention.

Again, this misses the point. Who or what is served by requiring every individual re-user to generate (or re-generate) this data themselves through time-consuming API queries?

The issue surrounding hidden user names is a red herring, in my mind. It makes absolutely no sense to keep data in the database that shouldn't be there. If there are inappropriate user names, they should be changed. Leaving them in the user table indefinitely _guarantees_ that they'll eventually be exposed. (Surely there's a law that covers this.) We should take out the trash.

We have hidden user names, hidden revisions, etc. That's not something that's likely to change, and it's certainly not something that's going to get addressed by me in the dumps. But again, if you feel strongly that this data should be removed permanently, that could be discussed on wikitech-l or bugzilla.

Given that this is how things are right now, and given that someone can run a script to grab the user names, ids, editcounts and any other publicly exposed fields about as efficiently as we could, this is one of those cases where the community, if it had a place to put community-generated datasets for download by others, could take care of it themselves. This is what I suggest.
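For completeness, a sketch of that "grab it yourself" route: stream a stub dump and collect the contributor names and ids (edit counts could be tallied the same way). The namespace handling and file path are assumptions, and compressed dumps would need to be piped through bzcat or opened with bz2 first:

```
"""Sketch: stream a stub-meta-history XML dump and print each distinct
(user id, user name) pair found in <contributor> elements."""
import sys
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the MediaWiki export namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]

def contributors(path):
    """Yield distinct (user_id, user_name) pairs from registered editors."""
    seen = set()
    for _, elem in ET.iterparse(path, events=("end",)):
        tag = local(elem.tag)
        if tag == "contributor":
            user_id = user_name = None
            for child in elem:
                if local(child.tag) == "id":
                    user_id = child.text
                elif local(child.tag) == "username":
                    user_name = child.text
            # IP edits and suppressed contributors have no <username>.
            if user_name is not None and (user_id, user_name) not in seen:
                seen.add((user_id, user_name))
                yield user_id, user_name
        elif tag == "page":
            elem.clear()  # discard finished pages so memory stays flat

if __name__ == "__main__":
    for user_id, user_name in contributors(sys.argv[1]):
        print("%s\t%s" % (user_id, user_name))
```

Usage would be along the lines of `bzcat enwiki-latest-stub-meta-history.xml.bz2 | python3 grab_users.py /dev/stdin` (file name illustrative); the in-memory `seen` set grows with the number of distinct users, so piping unfiltered output through `sort -u` is an alternative.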

(In reply to comment #10)

Okay, I'll re-purpose this bug to focus specifically on the user table and
will file separate bugs for the other tables.

I decided it made more sense to make this bug a tracking bug with dependencies for each of the private tables.

For reference: https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=blob;f=xmldumps-backup/worker.py;h=a6256263e5c7d80ee662e75bd67436adee72b3df;hb=ariel#l600.

Considering that https://www.mediawiki.org/wiki/Research_Data_Proposals was created in 2010 and never went anywhere, probably in part because researchers never agreed on what was highest priority (except one bolded item, which was done by [[m:WikiTeam]] instead), how can we proceed here? We need an approach that is easy and also tracked long-term (nothing is going to happen overnight anyway).

The easiest solution is probably to just create one bug for each item and then ask researchers on wiki-research-l and elsewhere to vote (in bugzilla) on those they'd need/use.

Nemo_bis renamed this task from "Create partial SQL dumps for certain database tables (tracking)" to "Create SQL dumps for certain database tables, partial if necessary (tracking)". Jan 26 2015, 1:57 PM
Nemo_bis raised the priority of this task from Lowest to Medium.
Nemo_bis set Security to None.

Can a shell user make a complete list of all the existing tables in all DBs? Looking at one wiki is not enough, because some extensions or features aren't enabled everywhere, but even a single-wiki list would help.
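For what it's worth, a sketch of how such a list might be assembled, assuming a dblist-style file of wiki database names and a single database host (in production the wikis are spread over several sections, so the loop would need to pick the right host per wiki); the driver, credential file, and paths are placeholders:

```
"""Sketch: union of SHOW TABLES across every wiki named in a dblist file."""
import sys
import pymysql  # assumed driver

def all_tables(dblist_path, host):
    """Return {table name: set of wikis that have it}."""
    with open(dblist_path) as dblist:
        wikis = [line.strip() for line in dblist if line.strip()]
    tables = {}
    for wiki in wikis:
        conn = pymysql.connect(host=host, db=wiki,
                               read_default_file="/etc/mysql/my.cnf")  # placeholder credentials
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW TABLES")
                for (name,) in cur.fetchall():
                    tables.setdefault(name, set()).add(wiki)
        finally:
            conn.close()
    return tables

if __name__ == "__main__":
    for name, wikis in sorted(all_tables(sys.argv[1], sys.argv[2]).items()):
        print("%s\t%d wikis" % (name, len(wikis)))
```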

Aklapper lowered the priority of this task from Medium to Low. Jan 26 2015, 9:46 PM
Aklapper subscribed.
Nemo_bis raised the priority of this task from Low to Medium. Apr 9 2015, 7:07 AM

All pending tasks are in the Dumps-Generation project; this task is going away.

Where specifically is the progress against https://www.mediawiki.org/wiki/Research_Data_Proposals and https://meta.wikimedia.org/wiki/Right_to_fork tracked?

Right to fork: tasks that have been opened for partial dumps of the relevant tables are in the backlog on the Dumps-Generation workboard. When they are moved to the 'Up Next' column, that means they'll be worked on soon (within 1-2 weeks unless something intervenes).
Research Data Proposals should have one or more tasks associated with it in the Dumps-Rewrite project. If you see that this is not the case, please correct.

Given that subtasks such as T51135: Create partial SQL dump of ipblocks table and T51132: Create partial SQL dump of user table have not been closed, I guess I'm fine with the recent changes to this task. Honestly, I'm confused about how the board visible at Dumps-Generation is supposed to be better than, or replace, a parent task with child tasks/subtasks. There are several tasks specifically about creating partial SQL dumps (T51132, T51133, T51134, T51135, and T65274 from a quick glance) that now seem more difficult to find as a unit because they're not linked together using Phabricator Maniphest's relationships feature. This unit (creating partial SQL dumps) is worthy of distinction from the broader category of tasks tagged with Dumps-Generation, in my opinion. It's not clear to me what was wrong with having this task be a tracking/parent task.

I should have just rescued this task and retitled it to be something specific and completable. We don't label any tasks as tracking any more; projects are used for that. Let me create a new task that does what I should have fixed this one up to be, and I'll add the subscribers of this one to it.

Let me create a new task that does what I should have fixed this one up to be, and I'll add the subscribers of this one to it.

This is T140977: Dump public data from all sql tables that have mixed public/private data.