
search for images/files by hash
Closed, ResolvedPublic

Description

Author: plugwash

Description:
It would be very useful to be able to search for images by a hash (the exact
type of hash doesn't bother me too much; MD5 or SHA-1 would be fine).

This hash should also be displayed somewhere on the image description page.

The point of this is that if I see an image on Commons that says "from German
Wikipedia" and the uploader has renamed it, I want to be able to find the image
on the German Wikipedia.


Version: unspecified
Severity: enhancement

Details

Reference
bz1459

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 8:10 PM
bzimport added a project: MediaWiki-Search.
bzimport set Reference to bz1459.
bzimport added a subscriber: Unknown Object (MLST).

nen wrote:

This feature would also help with duplicate files under different names, if
extended a bit. People upload a file not knowing that it's already there,
because the first one wasn't categorized very well or the duplicate uploader
just doesn't look thoroughly enough. There's no reason, however, that people
should have to do this searching manually.

On each upload of a file MediaWiki could:

  1. Generate a hash of the uploaded file
  2. Check whether the generated hash is already known, i.e. whether the file is a duplicate
    • This lookup would be the only database query a hash search feature needs.

Then, depending on configuration (based on analysis of possible hash
collisions and such), it could:

3a) Display a warning to the user that the file already exists
or
3b) Display an error to the user that the file already exists, and reject the file
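
The upload steps above could be sketched roughly as follows. This is a minimal illustration in Python, not MediaWiki code: `file_hash`, `check_upload`, `DuplicatePolicy`, and the in-memory `known_hashes` dict (standing in for an indexed hash column in the database) are all hypothetical names chosen for the example.

```python
import hashlib
from enum import Enum


class DuplicatePolicy(Enum):
    WARN = "warn"      # option 3a: accept the file but warn the uploader
    REJECT = "reject"  # option 3b: refuse the upload


def file_hash(data: bytes) -> str:
    """Step 1: hash the uploaded contents (SHA-1 here; MD5 would work the same way)."""
    return hashlib.sha1(data).hexdigest()


def check_upload(name, data, known_hashes, policy):
    """Steps 2 and 3: look the hash up and apply the configured policy.

    known_hashes is a hypothetical index mapping hex digest -> name of the
    file already stored. Returns (accepted, message).
    """
    digest = file_hash(data)
    existing = known_hashes.get(digest)  # the single lookup a hash search needs
    if existing is None:
        known_hashes[digest] = name
        return True, "stored"
    if policy is DuplicatePolicy.WARN:
        return True, f"warning: identical to existing file {existing}"
    return False, f"error: identical to existing file {existing}"
```

With an empty index, the first upload of some bytes is stored; a second upload of the same bytes under a different name is either accepted with a warning or rejected, depending on the policy passed in.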

This would require computing a hash, or even multiple hashes with different
methods, for all revisions of all existing files. Duplicate detection would not
work properly while hashes are still being generated and added to the database.
Hashes for deleted files or revisions would also be useful for generating
different warnings when someone uploads a file that was deleted before, but
implementing that might be more complicated.
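
If several hash methods were wanted when backfilling existing files, each file would only need to be read once. A rough Python sketch, assuming the files are reachable by path on disk (`multi_hash` and its parameters are hypothetical names, not anything in MediaWiki):

```python
import hashlib


def multi_hash(path, algorithms=("md5", "sha1"), chunk_size=65536):
    """Return {algorithm: hex digest} for one file, reading it only once."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

A backfill job could call this for every stored revision and write the results to the database; until it finishes, lookups would miss files that have not been hashed yet.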

plugwash wrote:

What would also be useful is to generate hashes for all thumbnails that are
generated, as the people who copy images without proper attribution are often
the kind of people who copy a thumbnail rather than the full-res image.

robchur wrote:

*** This bug has been marked as a duplicate of 5763 ***

Note there is now a properly-indexed SHA-1 hash field on the image table in recent versions. I have the vague recollection that there's a way to do lookups by hash in the API, but not in the UI at present.

Dupe file warnings are also not currently made.

(In reply to comment #4)

> I have the vague recollection that there's a way to do lookups
> by hash in the API, but not in the UI at present.

api.php?action=query&list=allimages&aisha1=123abc
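
For scripted lookups, the query string above could be assembled along these lines. This is only a sketch: `sha1_lookup_url` is a hypothetical helper, the host in the usage note is a placeholder, and `format=json` is an addition not present in the comment above.

```python
from urllib.parse import urlencode


def sha1_lookup_url(api_base, sha1_hex):
    """Build an allimages query URL that filters by SHA-1 hash."""
    params = {
        "action": "query",
        "list": "allimages",
        "aisha1": sha1_hex,
        "format": "json",  # assumption: ask for JSON output
    }
    return f"{api_base}?{urlencode(params)}"
```

For example, `sha1_lookup_url("https://example.org/w/api.php", "123abc")` yields a URL with the same `action`, `list`, and `aisha1` parameters as the one quoted above.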

[[Special:FileDuplicateSearch]] was introduced with r32180. A link on the image description page to Special:FileDuplicateSearch/filename.ext was added too.

Bug 11984 filed for dupe file warning at time of upload.

And bug 13434 is also filed :)