Page MenuHomePhabricator

DBQ-65 Find most common hashes for Commons' deleted images
Closed, ResolvedPublic

Description

This issue was converted from https://jira.toolserver.org/browse/DBQ-65.
Summary: Find most common hashes for Commons' deleted images
Issue type: Task - A task that needs to be done.
Priority: Major
Status: Done
Assignee: Mr.Z-man <mrzmanwikimail@gmail.com>


From: Mike.lifeguard <mike.lifeguard+ts@gmail.com>

Date: Tue, 16 Jun 2009 18:31:09

I'd like to have a list of the most common sha1 hashes for Commons' deleted images. Specifically, I'm interested in vandalistic images, so the query could be limited to
a) those with \b(?:vandal(?:ism)?|goatse|shock image|...)\b in the delete reason and/or
b) the last ~10,000 deletions or something (if that'd actually make a difference)

For output, I'd like a list of the top 100 sha1 hashes and a file name for each hash so I can find the file.

So far as I know, the hashes aren't stored for deleted files.


Version: unspecified
Severity: major

Details

Reference
bz59320

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 2:22 AM
bzimport set Reference to bz59320.

From: DaB. <dab@ts.wikimedia.org>

Date: Fri, 19 Jun 2009 20:49:33

You can find a list at http://toolserver.org/~dab/queries/zman1.txt, the query was

SELECT CONCAT("# ", fa_storage_key, "[[:image:",fa_name,"]] "), COUNT(fa_storage_key) as anzahl FROM filearchive WHERE fa_archive_name IS NULL AND fa_deleted_timestamp>"20080000000000" AND fa_storage_key!="" GROUP BY fa_storage_key ORDER BY anzahl DESC LIMIT 100;

It listed the storage_key (that's a sha1-hash in a mediawikiformat), a image-name of a deleted picture with this key, and the count for all pictures that was delete after 1.1.2008.

I hope it help you.


From: Mr.Z-man <mrzmanwikimail@gmail.com>

Date: Fri, 19 Jun 2009 20:55:35

The hashes are stored for deleted files, concatenated with the file extension as fa_storage_key in the filearchive table.

Most common of all deleted files:

Query:

SELECT SUBSTRING(fa_storage_key, 1, 31) AS hash, COUNT(*), fa_name FROM filearchive 
WHERE fa_storage_key IS NOT NULL AND fa_storage_key != '' 
GROUP BY hash ORDER BY COUNT(*) DESC LIMIT 100;

Results: http://toolserver.org/~alexz/DBQ-65/all.txt

From the last 6 months:

Query:

SELECT SUBSTRING(fa_storage_key, 1, 31) AS hash, COUNT(*), fa_name FROM filearchive 
WHERE fa_storage_key IS NOT NULL AND fa_storage_key != '' AND fa_deleted_timestamp > 20090119000000 
GROUP BY hash ORDER BY COUNT(*) DESC LIMIT 100;

Results: http://toolserver.org/~alexz/DBQ-65/recent.txt

Vandalism noted in deletion reason:

Query:

SELECT SUBSTRING(fa_storage_key, 1, 31) AS hash, COUNT(*), fa_name FROM filearchive 
WHERE fa_storage_key IS NOT NULL AND fa_storage_key != '' AND fa_deleted_reason RLIKE "[[:<:]](vandal(ism)?|goatse|shock)[[:>:]]"
GROUP BY hash ORDER BY COUNT(*) DESC LIMIT 100;

Resuts: http://toolserver.org/~alexz/DBQ-65/vandal.txt

Recent vandalism deletions:

Query:

SELECT SUBSTRING(fa_storage_key, 1, 31) AS hash, COUNT(*), fa_name FROM filearchive 
WHERE fa_storage_key IS NOT NULL AND fa_storage_key != '' AND fa_deleted_timestamp > 20090119000000 AND fa_deleted_reason RLIKE "[[:<:]](vandal(ism)?|goatse|shock)[[:>:]]" 
GROUP BY hash ORDER BY COUNT(*) DESC LIMIT 100;

Results: http://toolserver.org/~alexz/DBQ-65/recentvandal.txt

This bug was imported as RESOLVED. The original assignee has therefore not been
set, and the original reporters/responders have not been added as CC, to
prevent bugspam.

If you re-open this bug, please consider adding these people to the CC list:
Original assignee: (none)
CC list: mike.lifeguard+bugs@gmail.com, wikimedia-bugzilla@dabpunkt.eu