
Restore missing CheckUser logs
Closed, DeclinedPublic

Description

Author: mcdevitd

Description:
Apologies in advance if this was already reported/responded to (I thought I had reported it long ago).

The CheckUser logs currently date back to December 2006, while CheckUser as a user right in its current logged form dates back to June 2005. Prior to December 2006, the CheckUser log broke entirely, and every entry up to that point went with it. This is true across all projects. Ideally, we should restore the missing log entries from that first year and a half of CheckUser. The record of these log entries still exists; they just need to be added to the log visible on the projects. I have the missing logs in a text file on my computer, but since that was sent to me by Tim, I assume it can be retrieved by developers somehow.

See also: T10710 and T15789

Details

Reference
bz27807

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 11:29 PM
bzimport set Reference to bz27807.
bzimport added a subscriber: Unknown Object (MLST).

Not shelling yet; this probably needs a maintenance script or something written first.

What's the format of the file?

I'm guessing it's going to be very simple: split each line on commas, and then just do a database insert

Depending of course on how Tim generated that before

mcdevitd wrote:

I have a .log file (i.e., plain text in a text editor) with hundreds of lines like:

<li>23:35, June 13, 2006 Dmcdevit got IPs for Dmcdevit on enwiki</li>

The main complication may be that this is from the days of the single global log, so there are also entries like:

<li>20:46, 1 lip 2005 Taw got IPs for [user]</li>

Or it may be that however the local logs were originally created can also be applied to these log entries.

Thehelpfulonewiki wrote:

Has this been completed or does it still need to be completed?

(In reply to comment #4)

Has this been completed or does it still need to be completed?

It still needs to be done. I always wondered why these logs were missing.

(In reply to comment #3)

I have a .log file (i.e., plain text in a text editor) with hundreds of lines
like:

<li>23:35, June 13, 2006 Dmcdevit got IPs for Dmcdevit on enwiki</li>

The main complication may be that this is from the days of the single global
log, so there are also entries like:

<li>20:46, 1 lip 2005 Taw got IPs for [user]</li>

Or it may be that however the local logs were originally created can also be
applied to these log entries.

Do all the log entries state which wiki the check was run on? Your first example does and the second doesn't.

I don't know the history behind the CU extension, so how did the global log work? Where should we restore the log entries to?

(In reply to comment #6)

I don't know the history behind the CU extension, so how did the global log
work? Where should we restore the log entries to?

You can find some information on these old bugs:

(In reply to comment #7)

You can find some information on these old bugs:

Thanks. Turns out the code has already been written: https://github.com/wikimedia/mediawiki-extensions-CheckUser/blob/master/importLog.php

A shell user will need to get the log file, and then run the import script.

CCd Tim Starling, see https://en.wikipedia.org/w/index.php?title=User_talk:Dominic&oldid=587005235#Old_CU_logs

Tim, do you remember this? Do you know how to obtain the missing logs?

The log was in /home/wikipedia/logs. That directory was repurposed for MW UDP logs with automatic rotation, so it's possible that the files were lost by the automatic rotation script at around that time. I couldn't find any backup on the server. However, I happen to have the relevant files on my hard drive, for June 2005 to May 2007. Note that this range overlaps with the range that is said to be in the database already, so duplicates will have to be removed somehow.
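Because the recovered files (June 2005 to May 2007) overlap with what is said to be in the database already, any import needs a duplicate check. A minimal Python sketch of one way to do that, assuming each entry has already been parsed into normalised fields. The Entry type and its field names are hypothetical and are not the extension's schema:

```python
from typing import Iterable, List, NamedTuple


class Entry(NamedTuple):
    """Hypothetical normalised log entry; field names are illustrative only."""
    timestamp: str  # normalised timestamp string, same precision on both sides
    checker: str    # account that ran the check
    action: str     # e.g. "userips", "ipedits", "ipusers"
    target: str     # user name or IP that was checked
    wiki: str       # wiki the check was run on, "" if unknown


def drop_duplicates(existing: Iterable[Entry], candidates: Iterable[Entry]) -> List[Entry]:
    """Return candidates that are neither in `existing` nor repeated, in order."""
    seen = set(existing)
    fresh = []
    for entry in candidates:
        if entry not in seen:
            seen.add(entry)
            fresh.append(entry)
    return fresh
```

This only works if both sides are normalised to the same precision; the sample log lines in this task carry hours and minutes only.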

I copied them up to /home/wikipedia/logs/norotate/checkuser

So, Legoktm and I were just looking at it. There are a few broken entries that can be easily fixed with common sense (newlines in the middle and such).

The date regex is far too naive to cater for all the localised date formats:

$rxTimestamp = '(?P<timestamp>\d+:\d+, \d+ \w+ \d+)';

We tried using '(?P<timestamp>.*?)' instead. Combined with the optional comma after it, that is a bit better, but it then breaks on dates that contain an earlier comma:

[bad timestamp] <li>۲۱:۲۰, ۲۰ اکتبر ۲۰۰۶ Jon Harald Søby got edits XXX.XXX.XXX.XXX on fawiki</li>

And there are others in ISO 8601 form, such as 2006-10-25T20:29:01

	$regexes = array(
		'ipedits-xff' => "!^<li>$rxTimestamp,? $rxUser got edits for XFF $rxTarget on $rxWiki$rxReason</li>!",
		'ipedits'     => "!^<li>$rxTimestamp,? $rxUser got edits for" ." $rxTarget on $rxWiki$rxReason</li>!",
		'ipusers-xff' => "!^<li>$rxTimestamp,? $rxUser got users for XFF $rxTarget on $rxWiki$rxReason</li>!",
		'ipusers'     => "!^<li>$rxTimestamp,? $rxUser got users for" ." $rxTarget on $rxWiki$rxReason</li>!",
		'userips'     => "!^<li>$rxTimestamp,? $rxUser got IPs for".   " $rxTarget on $rxWiki$rxReason</li>!"
	);

The comma after the timestamp seems to be optional in some formats, so that part was easy to improve on.

The code also uses strtotime(), which is a poor fit for these localised formats. Per the PHP manual, it will "Parse about any English textual datetime description into a Unix timestamp" - http://us1.php.net/strtotime

I'm guessing that the timestamp is in whatever format the person who did the action has set in their preferences. Awesome, no?
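Two of the formats mentioned above can be handled without any language knowledge: non-ASCII digits can be mapped to ASCII, and ISO 8601 strings can be parsed directly. A rough Python sketch, with helper names made up for illustration (this is not what importLog.php does):

```python
import unicodedata
from datetime import datetime
from typing import Optional


def ascii_digits(text: str) -> str:
    """Replace every Unicode decimal digit (e.g. Persian digits) with ASCII 0-9."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )


def parse_iso(text: str) -> Optional[datetime]:
    """Parse strings like 2006-10-25T20:29:01; return None for anything else."""
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None
```

Localised month names still need their own handling; digit normalisation only removes one source of variation.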

About 10-20% of the rows seem to be unprocessable without at least some changes to the code as it currently is

tomasz set Security to None.

@tstarling: Are these files on the cluster still? I guess the above path was on fenari... could you push them to fluorine or tin or so?

I might be willing to take a look at this sometime... the whole import script is so old by now, that it probably needs a rewrite :S

@hoo, @tstarling, @Reedy: And it's a year later... Any update here?

I have a .log file (i.e., plain text in a text editor) with hundreds of lines like:

@Dominicbm do you still have these logs?

I think the files are on tin somewhere; I forget exactly where they were when Reedy and I last looked at this. Anyway, the problem in T29807#326028 still stands: the log is localized to different users' languages, which means the current parser can't handle it.

We could prepare an array of strings by language to help parsing.

Then, we'll have to see if timestamps are coherent.
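A small Python sketch of what such a lookup could look like. The table holds only the three month strings that appear in this task; reading "lip" as Polish for July and "اکتبر" as Persian for October is an assumption, and MONTHS, DMY, MDY and parse_localised are hypothetical names:

```python
import re
from datetime import datetime
from typing import Optional

# Assumed month strings -> month number; a real table would need every language.
MONTHS = {"june": 6, "lip": 7, "اکتبر": 10}

# "20:46, 1 lip 2005" style (day month year); \d also matches non-ASCII digits.
DMY = re.compile(r"^(?P<h>\d+):(?P<mi>\d+),? (?P<d>\d+) (?P<mon>\S+) (?P<y>\d+)$")
# "23:35, June 13, 2006" style (month day, year).
MDY = re.compile(r"^(?P<h>\d+):(?P<mi>\d+),? (?P<mon>\S+) (?P<d>\d+), (?P<y>\d+)$")


def parse_localised(text: str) -> Optional[datetime]:
    """Return a naive datetime, or None if the layout or month is unknown."""
    for rx in (DMY, MDY):
        m = rx.match(text)
        if not m:
            continue
        month = MONTHS.get(m["mon"].lower())
        if month is None:
            return None
        # int() accepts Unicode decimal digits, so Persian digits work here too.
        return datetime(int(m["y"]), month, int(m["d"]), int(m["h"]), int(m["mi"]))
    return None
```

The result carries no timezone, so checking whether the parsed timestamps are coherent (for example, roughly increasing through the file) would be a separate step.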

@Legoktm To parse the localised entries we'd have to create a different pattern for each language and just try each one in turn; presumably an entry is long enough not to be ambiguous in terms of language (the patterns will need to be ^ and $ bound to be safe).
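A sketch of that try-each-pattern loop in Python, using only the English action phrases quoted earlier in this task; patterns for other languages would be appended to the same list. Everything here is hypothetical: the names, the decision to capture timestamp and user together as "head", and the omission of the reason field, whose format is not shown in this task:

```python
import re
from typing import Dict, Optional, Tuple


def _rx(phrase: str) -> "re.Pattern[str]":
    """Build a pattern anchored at both ends around one action phrase."""
    return re.compile(
        r"^<li>(?P<head>.+) " + re.escape(phrase)
        + r" (?P<target>.+?)(?: on (?P<wiki>\w+))?</li>$"
    )


# Order matters: the XFF variants must be tried before the plain ones.
PATTERNS = [
    ("ipedits-xff", _rx("got edits for XFF")),
    ("ipedits", _rx("got edits for")),
    ("ipusers-xff", _rx("got users for XFF")),
    ("ipusers", _rx("got users for")),
    ("userips", _rx("got IPs for")),
]


def classify(line: str) -> Optional[Tuple[str, Dict[str, Optional[str]]]]:
    """Return (action, captured fields) for the first pattern that matches."""
    for action, rx in PATTERNS:
        m = rx.match(line)
        if m:
            return action, m.groupdict()
    return None
```

Lines that match nothing come back as None, which makes it easy to collect them for manual inspection rather than guessing.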

However, I'm not sure how anyone can help with this without access to the logs. I suppose we could anonymise the entries and publish the unique findings, but that would require knowing which part is the variable part, and if we knew that, we would presumably already know the pattern.

I'd recommend we try to find the patterns manually by looking at similar lines, which shouldn't be that hard even without understanding the language. After that, all we need to know is which part is the timestamp, user, target and reason. If that isn't obvious from looking at the entries, we can always publish the extracted patterns for someone to help us (especially with differentiating between user and target).
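One possible way to group similar lines without exposing names or IPs is to reduce each line to a "shape" and count the shapes. A Python sketch under that assumption; the KEEP list contains only words visible in the sample lines of this task, and shape is a made-up helper:

```python
import re
from collections import Counter

# Words kept verbatim because they are part of the known message text.
KEEP = {"li", "got", "IPs", "edits", "users", "for", "XFF", "on"}

# A token is a run of digits or a run of letters (any script).
TOKEN = re.compile(r"\d+|[^\W\d_]+")


def shape(line: str) -> str:
    """Replace digit runs with 9 and unknown words with a, keep punctuation."""
    def repl(m: "re.Match[str]") -> str:
        tok = m.group(0)
        if tok in KEEP:
            return tok
        return "9" if tok[0].isdigit() else "a"
    return TOKEN.sub(repl, line)


def count_shapes(lines):
    """Count how often each shape occurs, most common first via most_common()."""
    return Counter(shape(line) for line in lines)
```

Shapes that occur many times are likely to be one log format each, and only the shapes, not the entries, would need to be shared.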

Seven years since this bug was first filed, for a loss of CU logs in 2006, and still nothing has been done.

@Krinkle @hoo @Legoktm @Reedy @tstarling @Dereckson I know it's probably difficult, but don't tell me you can't do it...

Not nothing. It's not an easy problem to solve

Do we know where these logs are currently?

Dreamy_Jazz subscribed.

Tagging with the CheckUser project for better visibility. I'm happy to help with the parsing of the logs as needed, but wouldn't be able to directly do the inserts by myself.

I think the files are on tin somewhere; I forget exactly where they were when Reedy and I last looked at this. Anyway, the problem in T29807#326028 still stands: the log is localized to different users' languages, which means the current parser can't handle it.

@Legoktm do you think "tin" will still have these files? If provided the file I could take a crack at writing something to import them.

If you don't know or they are now gone, then it seems the logs are missing forever.

I think there weren't so many people with CheckUser powers in the first year and a half. With that list, you would have:

<timestamp><checkuser account>"got"

Validating the second field first should be easier, which would then leave us with just the timestamps to parse somehow.
What concerns me most is that, just as the date format followed the user's preferences, the timezone probably did as well...
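A Python sketch of that idea: given the text that precedes "got" in an entry and a list of known CheckUser accounts, peel the account name off the end and treat the rest as the timestamp. The function name is hypothetical, and the accounts in the usage note are only the ones that appear in sample lines in this task:

```python
from typing import Iterable, Optional, Tuple


def split_head(head: str, checkusers: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Split '<timestamp> <account>' using a list of known account names.

    Returns (timestamp_text, account), or None if no known account matches.
    """
    # Try longer names first so a short name cannot shadow a longer one.
    for name in sorted(checkusers, key=len, reverse=True):
        if head.endswith(" " + name):
            timestamp_text = head[: -len(name) - 1].rstrip(",")
            return timestamp_text, name
    return None
```

For example, split_head("23:35, June 13, 2006 Dmcdevit", ["Dmcdevit", "Taw", "Jon Harald Søby"]) leaves "23:35, June 13, 2006" as the timestamp text. It does nothing about the timezone concern, which would still need a per-user decision.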

Reedy changed the task status from Open to Stalled. Jan 31 2023, 2:09 AM

I've got a feeling these logs may be gone, unless someone has them squirreled away in a home directory somewhere...

Closing this as declined, as I feel the likelihood of finding these entries is so slim that this can just be closed. If they are found, feel free to re-open and I'd be happy to take a look at parsing them.