
Make old images searchable by hash
Open, Medium, Public, Feature

Description

I use the search-by-hash option to prevent my bots from uploading duplicate images. If an image has since been changed (for example, rotated), this does not work because the hash of the current version no longer matches. The hash of the old version is now available in the oldimage table. It would be very nice if the oldimage table were searchable in the API, just like the image table.
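For reference, a minimal sketch of the kind of lookup described above, using the existing list=allimages module with its aisha1 parameter. The endpoint URL and the helper name build_hash_query are assumptions for illustration. The sketch only builds the request URL and does not perform the request. As noted above, this query only matches the current version of each file.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration; any MediaWiki api.php URL works the same way.
API_URL = "https://commons.wikimedia.org/w/api.php"


def build_hash_query(sha1_hex: str) -> str:
    """Build a list=allimages URL that looks up current file versions by SHA-1.

    build_hash_query is a hypothetical helper name, not part of any library.
    """
    sha1_hex = sha1_hex.lower()
    if len(sha1_hex) != 40 or any(c not in "0123456789abcdef" for c in sha1_hex):
        raise ValueError("expected a 40-character hexadecimal SHA-1")
    params = {
        "action": "query",
        "list": "allimages",
        "aisha1": sha1_hex,           # hash filter; matches the image table only
        "aiprop": "sha1|timestamp",   # properties to return for each hit
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)
```

A bot would fetch this URL and skip the upload if the result list is non-empty.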


Version: unspecified
Severity: enhancement

Details

Reference
bz21345

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:58 PM
bzimport set Reference to bz21345.
bzimport added a subscriber: Unknown Object (MLST).

This should be relatively simple (comment so we get some input from Roan)

There are a few ways this could be implemented...

We could have an option to search only oldimage, or something similar that searches oldimage for a hash only if the image isn't found in image. We can easily set an attribute of "old" or similar on such results.

We could have one version of the code for oldimage only and one for the current image (via inheritance, etc.), which might not actually be a bad idea in the long run. People can get access to the old images via the file/image page.

Certainly, searching oldimage when there is no need to is a bad idea.

Maarten, do you have any intended behaviour in mind? (Or a preference one way or the other?)

Probably the nicest way is to have one search which only searches the image table by default, but with an option to also search oldimage or only search oldimage. Searching oldimage when an image is not available in the image table sure would be nice too. In the future the searching of the filearchive can be added in a similar way (but that's another bug).
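The behaviour proposed here can be sketched roughly as follows. This is only an illustration under stated assumptions: the two tables are modelled as in-memory dicts mapping a SHA-1 to file names, and the function name search_by_hash and the mode names are hypothetical, not an existing API. Each hit is tagged with the table it came from, so a caller can tell a match on an old version from a match on the current one.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-ins for the image and oldimage tables:
# each maps a SHA-1 string to the file names that carry that hash.
Table = Dict[str, List[str]]

MODES = ("current", "old", "both", "fallback")


def search_by_hash(sha1: str, image: Table, oldimage: Table,
                   mode: str = "current") -> List[Tuple[str, str]]:
    """Return (file name, source table) pairs for the given hash.

    current  - search image only (the default)
    old      - search oldimage only
    both     - search image, then oldimage
    fallback - search oldimage only if image has no match
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    hits: List[Tuple[str, str]] = []
    if mode in ("current", "both", "fallback"):
        hits += [(name, "image") for name in image.get(sha1, [])]
    # oldimage is consulted only when asked for, or when the fallback is needed.
    if mode in ("old", "both") or (mode == "fallback" and not hits):
        hits += [(name, "oldimage") for name in oldimage.get(sha1, [])]
    return hits
```

Keeping "current" as the default means oldimage is never searched unless the caller opts in, which matches the concern about not searching it needlessly.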

Isn't filearchive for deleted files? If so, wouldn't searching it have to be restricted by user rights?

[13:13:57] <RoanKattouw> Currently prop=imageinfo returns information from both image and oldimage, but its hash search only searches image. It could be made to search oldimage as well but that'd probably produce weird results à la using &revids= with prop=templates (see bug 22079)
[13:14:25] <RoanKattouw> i.e. the search would hit an old version of the image but you'd get imageinfo for the current version; that'd be weird and probably needs to be addressed
[13:15:03] <RoanKattouw> Which may require quite a bit of redesign of the imageinfo module, it's a bit of a mess right now

<RoanKattouw> Yeah, I guess an action=findfilebyhash or something that searches all 3 tables for a user-provided hash makes sense
<multichill> That would be nice, yes

By "all 3 tables" we mean image, oldimage and filearchive. Current solutions such as prop=duplicatefiles and aisha1 only check image and are badly suited to checking the others.

*** Bug 37376 has been marked as a duplicate of this bug. ***

Possible use cases:

  • Bots that check SHA1 *before* uploading
  • UploadWizard could compute SHA1 with FileReader in compatible browsers *before* uploading (or while it uploads other files)
  • Investigation - Which user previously uploaded the same file (to find socks of copyvio uploaders)

image and filearchive are searched by the servers anyway when an image is uploaded (in order to show a warning).
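For the first use case (bots checking SHA1 before uploading), the client-side half could look like the sketch below. It uses only Python's standard hashlib. The helper name sha1_of_stream and the chunk size are assumptions for illustration. Reading in chunks keeps memory use flat for large media files.

```python
import hashlib
from typing import BinaryIO


def sha1_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 of a binary stream, read in fixed-size chunks.

    sha1_of_stream is a hypothetical helper name; chunk_size defaults to 1 MiB.
    """
    digest = hashlib.sha1()
    # iter() with a sentinel keeps reading until read() returns empty bytes.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()
```

A bot would open the local file in binary mode, pass it to this function, and use the resulting hash in a hash lookup before deciding whether to upload.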

*** Bug 58992 has been marked as a duplicate of this bug. ***

(In reply to Rainer Rillke @commons.wikimedia from comment #10)

For the time being:
https://tools.wmflabs.org/rillke/jsonapi.php?action=sha1lookup&sha1=<SHA1>

@Maarten Dammers, should we use this tool in pywikibot until a MediaWiki solution is deployed?

We might as well do it, but we should probably keep a pywikibot bug open to keep track of this.

Mixed up some tables in T230196 (oldimage and filearchive). I wonder if the MCR is going to solve this.

The script has moved to the expose-data tool:

https://tools.wmflabs.org/expose-data/jsonapi.php?action=sha1lookup&sha1=4eee44b18576e84de7b163142b537d2fe6231845

Append &showdeleted=1 to include deleted images (very slow and subject to quotas).

Mixed up some tables in T230196 (oldimage and filearchive). I wonder if the MCR is going to solve this.

No, it won't. T28741: Migrate file tables to a modern layout (image/oldimage; file/file_revision; add primary keys) would.

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 12:24 PM