
Update fixUserRegistration.php to use newuserlog (where available, prior to r12207), and Gaussian estimates for the fossils
Open, Low, Public

Description

Author: herd

Description:
Accounts which were created before the user.user_registration field existed (before rSVN12207) do not have that data in the database. Accounts without the data had the first edit time filled in as an approximation, but this is sometimes wildly inaccurate. In addition, the field remains NULL for accounts without any edits.

There are several months' worth of data on Wikimedia wikis in the new user log from the extension (see rSVN10573) that could populate this data. Also, for users prior to even the extension, a Gaussian curve could be plotted from the data of available edits and log entries (all of which would be after the creation date) and normalized to a curve or wave of user creation date/ID.

Details

Reference
bz18638

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 10:35 PM
bzimport set Reference to bz18638.
bzimport added a subscriber: Unknown Object (MLST).

happy.melon.wiki wrote:

Go on, then, let's see this gaussian curve of yours :D Might as well work for your wontfix!!

The other suggestion, however, is good: that extension provided accurate log data. A quick check on the toolserver suggests that there are at least 290,000 entries in the relevant period, and a substantial fraction of these could be recovered in this fashion. It should probably be a separate script, though; there's no guarantee that wikis needing to populate the column would have had the extension installed, and no point in the script trying to use that data if it's not present.
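To make the fallback order concrete, here is a minimal sketch in Python rather than the PHP of the actual maintenance script. The names `pick_registration`, `backfill`, `log_times` and `first_edits` are hypothetical and stand in for whatever the real script would read from the new user log and the revision table; nothing here reflects the existing fixUserRegistration.php code.

```python
from datetime import datetime
from typing import Dict, Iterable, Optional


def pick_registration(
    newuserlog_ts: Optional[datetime],
    first_edit_ts: Optional[datetime],
) -> Optional[datetime]:
    """Prefer the logged creation time; otherwise fall back to the first edit."""
    if newuserlog_ts is not None:
        return newuserlog_ts
    # Still None for accounts that have neither a log entry nor an edit.
    return first_edit_ts


def backfill(
    user_ids: Iterable[int],
    log_times: Dict[int, datetime],    # hypothetical: user ID -> new user log time
    first_edits: Dict[int, datetime],  # hypothetical: user ID -> first edit time
) -> Dict[int, Optional[datetime]]:
    """Return a registration estimate for each user ID."""
    # If the wiki never had the extension, the log mapping is empty and
    # the log lookup is skipped for every user.
    use_log = bool(log_times)
    return {
        uid: pick_registration(
            log_times.get(uid) if use_log else None,
            first_edits.get(uid),
        )
        for uid in user_ids
    }
```

On a wiki without the extension, `log_times` is simply empty and every account falls through to its first edit, which matches the current behaviour of the approximation described in the task.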

herd wrote:

Go on, then, let's see this gaussian curve of yours :D

Too slow of a query to do it for everyone without actually, y'know, DOING it, as in populating the data. But here is 5000 from en.wp. Note there isn't much curve to it, and it skips all users with double NULLs, but there is definitely a trend line:
http://test.wikipedia.org/wiki/File:Example_of_user_first_actions_for_en.wp_400000-405000.gif

herd wrote:

Sampling of normalizable user first-contribution curve

Here is a more distributed sampling, of all users from 1k-750k (1:1000).

Copied from http://test.wikipedia.org/wiki/File:Example_of_user_first_actions_for_en.wp_1-750000_(by_thousand).gif

Attached:

Example_of_user_first_actions_for_en.wp_1-750000_(by_thousand).gif (600×800 px, 9 KB)

happy.melon.wiki wrote:

Wow, that's a much better fit than I was expecting, TBH. And the outliers tell their own story; particularly interesting the ones on the second graph that were registered in 2001-03, but not used until around 2008... More ammunition (as if it were needed) against deleting old accounts.

Still not entirely sure how you'd convert that data into registration timestamps, or are you going to assume that the curve approximately follows the registration time; that is, the average delay between registering and editing is zero? Seems a justifiable assumption, but I notice the curve gets a bit wobbly at the top; lots of double NULLs in the data...
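One way to turn such data into timestamps, sketched here under stated assumptions, is a lower envelope rather than a fitted Gaussian. It assumes user IDs were handed out in registration order and that every known action happened after registration, so an account cannot have registered later than the earliest first action of any account with the same or a higher ID. The function name and the use of Unix timestamps are illustrative only.

```python
from typing import List, Optional


def upper_bound_registration(
    first_action: List[Optional[int]],
) -> List[Optional[int]]:
    """Estimate registration times from first actions.

    first_action[i] is the earliest known action (Unix time) of the i-th
    account in user-ID order, or None for accounts with no edits or log
    entries (the "double NULLs"). The result for each account is the
    earliest first action among that account and all later accounts.
    """
    estimates: List[Optional[int]] = [None] * len(first_action)
    running_min: Optional[int] = None
    # Walk from the newest account to the oldest, keeping a running minimum.
    for i in range(len(first_action) - 1, -1, -1):
        ts = first_action[i]
        if ts is not None and (running_min is None or ts < running_min):
            running_min = ts
        estimates[i] = running_min
    return estimates


# The second account first acted long after its neighbours (a late outlier),
# and the third has no actions at all; both inherit the bound from later IDs.
print(upper_bound_registration([100, 900, None, 130, 150, None]))
# -> [100, 130, 130, 130, 150, None]
```

This gives an upper bound, not the true registration time. It equals the true value only where the zero-delay assumption holds for the account that sets the bound, and accounts after the last known action stay NULL, which is where the curve gets wobbly.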

*** Bug 22097 has been marked as a duplicate of this bug. ***