
Embed image author, description, and copyright data in file metadata fields
Open, Low, Public

Description

Image files downloaded from Wikimedia sites do not contain any information about the images (author, license, description page URL etc). This makes it hard to identify the source or the author in certain contexts (e.g. an image reused on the web without proper attribution); arguably it causes certain ways in which Wikimedia publishes these files (such as image tarballs) to be license violations. The situation could be improved by embedding this information into the file as metadata (e.g. EXIF fields).

This is tricky as it would mean that the image needs to change on upload and potentially every time someone makes an edit to the file page; or we would have to make the original image hard to access and instead offer a modified version (a kind of full-size thumbnail - see T67383) for download/view/export.


See also:

  • T2657 - using metadata in the opposite direction
  • T20871 - the same issue for thumbnails
  • T95217 - same issue for audio files

(Original reporter: reflection)

Details

Reference
bz3361

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 8:49 PM
bzimport set Reference to bz3361.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to alterego from comment #0)

an image dump contains no metadata concerning any of the images

What are ways to reproduce the problem nowadays? How to get an "image dump" in 2014, so to say? Is this still a problem?

reflection wrote:

Do the EXIF data about images contain copyright information etc? If not, the bug should be left open, and probably elevated in importance.

Errm, I'm a bit confused by the counter questions.
Could you answer comment 2, please?

Plus this is de-facto low priority and not planned for a future release, until somebody provides a patch. Resetting Target Milestone and priority to previous values which seem more realistic.

Ok to clarify:
* files are not stored by their md5sum; it's the md5sum of the file *name*. Deleted files do use their sha1 sum as the file name.
* however, we still make the assumption pretty much everywhere that each version of the file has a constant sha1 sum/is bit-for-bit identical. So any change must be a reupload.
* the file versioning code is not well adapted to having an excessively large number of versions of a file. (If every edit triggered a pseudo new upload, it would probably explode if someone made 5000 edits, especially to a large file.)
* to do this automatically (or perhaps to have malleable metadata included with the dump), it might be easier once Wikidata hits Commons.
* the most likely solution, at least in the meantime, I think would be to have an extension hook up to exiftool, which allows people to modify EXIF on the server side, triggering an upload. (Perhaps with a button to import data from the wiki page.) This wouldn't be as quick as total automation, but would be something, and more easily turned off if there is an issue.
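
To make the exiftool idea in the last point concrete, here is a minimal sketch in Python of how a server-side hook might assemble the exiftool call. The function name and the example values are hypothetical and not part of any existing extension. The `-Artist`, `-Copyright` and `-ImageDescription` tag names and the `-overwrite_original` option are standard exiftool usage. The sketch only builds the argument list and does not run exiftool or trigger a reupload.

```python
import shlex


def build_exiftool_command(path, artist, copyright_notice, description=None):
    """Return the argv list for an exiftool call that writes attribution tags.

    Hypothetical helper: a real extension would also have to validate the
    path (for example, reject names starting with "-") and store the result
    as a new file version.
    """
    argv = [
        "exiftool",
        "-overwrite_original",  # do not leave a "<file>_original" backup copy
        f"-Artist={artist}",
        f"-Copyright={copyright_notice}",
    ]
    if description is not None:
        argv.append(f"-ImageDescription={description}")
    argv.append(path)
    return argv


# Example values are made up for illustration.
cmd = build_exiftool_command("Example.jpg", "Jane Doe", "CC BY-SA 4.0")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it without going through a shell. Because the rewritten file has a different sha1 sum, the result would still have to be stored as a new version, as noted above.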


Re Andre: well, we don't really have image dumps anymore (afaik, which is sad), but the bug equally applies to people reusing our images in any form, or just wget'ing them off the server. The original poster wants the data from the image's wiki page to be directly embedded in the file so that the data cannot be separated from the file (without malicious intent), whereas currently it's common for reusers to lose this data if they don't care.

I agree this would be nice, think it may be difficult to do (fully) given our current infrastructure, and ultimately is a low priority compared to other more pressing issues we have with media files.

(In reply to Bawolff (Brian Wolff) from comment #5)

I agree this would be nice, think it may be difficult to do (fully)

Difficult as in time-consuming or as in really complex? I'm thinking about whether this could become a GSoC project idea one day (not in the current round).

Very complex to do it fully (The original request of auto recording edits into image metadata). Doing it in a somewhat superficial manner (Just having an on-wiki interface to edit metadata) might potentially be gsoc worthy (Kind of like a continuation of my gsoc project from 2010)

Tgr renamed this task from Image author, description, and copyright data saved in EXIF fields to Embed image author, description, and copyright data in file metadata fields. Jul 31 2015, 7:17 PM
Tgr updated the task description.
Tgr set Security to None.

in germany it is common to send cease and desist letters which cost 500-1000 euro each. a couple of contributors showed up like this, one discussion here: https://commons.wikimedia.org/wiki/Commons:Administrators%27_noticeboard/Archive_53#Legal_action_resulting_from_photographs_by_Haraldbischoff

interesting is the numbers the club against cease and desist letters (interessensgemeinschaft gegen den abmahnwahn, http://www.iggdaw.de/) presents: 200'000 cases a year with a value requested 165 million euro. cc-by cases are only a low percentage.

putting the license in the metadata would allow to adjust the toolchain afterwards. e.g. make an announcement so image programs can keep this information, or web browsers get an option to display the data, print programs get an option to include it automatically, etc.

Doing it in a somewhat superficial manner (Just having an on-wiki interface to edit metadata) might potentially be gsoc worthy (Kind of like a continuation of my gsoc project from 2010)

Adding Possible-Tech-Projects, then.

in germany it is common to send cease and desist letters which cost 500-1000 euro each. a couple of contributors showed up like this, one discussion here: https://commons.wikimedia.org/wiki/Commons:Administrators%27_noticeboard/Archive_53#Legal_action_resulting_from_photographs_by_Haraldbischoff

interesting is the numbers the club against cease and desist letters presents: 200'000 cases a year with a value requested 165 million euro. cc-by cases are only a low percentage.

putting the license in the metadata would allow to adjust the toolchain afterwards. e.g. make an announcement so image programs can keep this information, or web browsers get an option to display the data, print programs get an option to include it automatically, etc.

Effectively leveraging metadata for use with non-specialist users is hard. Lots of people have tried to adjust the /general tool chain/ and usually meet with only limited success (see semantic web, microformats, etc. For a more relevant example to this bug, see http://commonsmachinery.se/ ).

That's not to say we shouldn't be doing much better on embedding data; we should be better. Our entire approach to media metadata on Commons and in MediaWiki generally is extremely haphazard and, well, sucks. But I just want to caution you about being too optimistic. Even if we fix this bug, there is a long long long way to go to solving the types of problems you want to solve.

This is a message posted to all tasks under "Backlog" at Possible-Tech-Projects. Outreachy-Round-11 is around the corner. If you want to propose this task as a featured project idea, we need a clear plan with community support, and two mentors willing to support it.

This is a message sent to all Possible-Tech-Projects. The new round of Wikimedia Individual Engagement Grants is open until 29 Sep. For the first time, technical projects are within scope, thanks to the feedback received at Wikimania 2015, before, and after (T105414). If someone is interested in obtaining funds to push this task, this might be a good way.

IMPORTANT: This is a message posted to all tasks under "Need Discussion" at Possible-Tech-Projects. Wikimedia has been accepted as a mentor organization for GSoC '16. If you want to propose this task as a featured project idea, we need a clear plan with community support, and two mentors willing to support it.

why not starting with something easy? if making a thumbnail for wikipedia, leave exif in place?

why not starting with something easy? if making a thumbnail for wikipedia, leave exif in place?

If you leave the entire exif in place you can get very large files. Sometimes exif metadata can be larger than the entire rest of the thumbnail (especially when it starts to have embedded thumbnails inside it). Leaving exif would actually cause a significant increase in file size.
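
One way to check the size claim on a given file is to add up the Exif APP1 segments in the JPEG header. The following is a rough Python sketch, not production code: it assumes a well-formed JPEG whose header segments all carry a length field, and it stops at the start-of-scan marker. The function name and the synthetic test bytes are made up for illustration.

```python
import struct


def exif_bytes(jpeg):
    """Return the total size in bytes of the Exif APP1 segments in a JPEG."""
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    pos, total = 2, 0
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            break  # lost marker sync; give up rather than guess
        marker = jpeg[pos + 1]
        if marker == 0xDA:
            break  # SOS: compressed image data follows, header is over
        # The 2-byte big-endian length counts itself plus the payload.
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        if marker == 0xE1 and jpeg[pos + 4:pos + 10] == b"Exif\x00\x00":
            total += 2 + length  # marker bytes + length field + payload
        pos += 2 + length
    return total


# Synthetic example: SOI, one Exif APP1 segment with 100 filler bytes, SOS, EOI.
payload = b"Exif\x00\x00" + b"\x00" * 100
app1 = b"\xff\xe1" + struct.pack(">H", 2 + len(payload)) + payload
sample = b"\xff\xd8" + app1 + b"\xff\xda\x00\x02" + b"\xff\xd9"
print(exif_bytes(sample), "of", len(sample), "bytes are Exif")
```

Comparing `exif_bytes(data)` with `len(data)` for a small thumbnail gives a quick feel for how much of the file the metadata would take up.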

interesting point, what exif fields would be necessary to get the copyright ok? or add the copyright related fields of xmp?

Could this project be a good candidate for the current Outreachy-13 internship ( Dec 6 to March 6 )?
I had recently seen an RFC - T589 related to images. Are the two related?

Could this project be a good candidate for the current Outreachy-13 internship ( Dec 6 to March 6 )?

If someone was willing to mentor it, probably. You would need to clarify what the current status is of thumbor and how it fits into this.

I had recently seen an RFC - T589 related to images. Are the two related?

Nope.

interesting point, what exif fields would be necessary to get the copyright ok? or add the copyright related fields of xmp?

Bare minimum, that would probably be the "Artist" field and the "Copyright" field of Exif. (ImageDescription sometimes has some copyright-related info too, but it is not as critical.) [Part of the problem here is that ImageMagick doesn't really have fine-grained options for what fields to keep, if I remember correctly, although that's not something you should take my word for.]

However, most guides for how to mark your image as Creative Commons strongly suggest adding XMP metadata, so for proper maintaining of copyright info for freely-licensed works, it's definitely a plus to keep at least those XMP fields.
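
The "keep only the copyright-relevant fields" idea can be sketched as a whitelist over already-parsed tags. This Python sketch assumes the EXIF data has already been read into a dict keyed by numeric tag ID (how that happens is out of scope here), and the function name is hypothetical. The three tag IDs are the standard TIFF/EXIF ones. XMP fields would need a separate whitelist, which is not shown.

```python
# Standard TIFF/EXIF tag IDs for the fields named above.
KEEP_TAGS = {
    0x010E: "ImageDescription",
    0x013B: "Artist",
    0x8298: "Copyright",
}


def filter_exif(tags):
    """Return only the whitelisted entries of a {tag_id: value} dict."""
    return {tag: value for tag, value in tags.items() if tag in KEEP_TAGS}


# Example input with made-up values; 0x010F is the camera "Make" tag.
original = {0x010F: "ExampleCam", 0x013B: "Jane Doe", 0x8298: "CC BY-SA 4.0"}
kept = filter_exif(original)
print({KEEP_TAGS[tag]: value for tag, value in kept.items()})
```

Dropping everything outside the whitelist also drops embedded thumbnails and maker notes, which are the usual sources of the size problem mentioned earlier.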

CCing @Gilles to weigh in ref Thumbor.

The current Thumbor implementation, due to roll out for all thumbnail traffic by the end of the year, has selective EXIF filtering for thumbnails. While implementing the same thing in MediaWiki is worthwhile for MediaWiki itself, it soon won't be of any use for Wikimedia.

The idea of populating EXIF fields based on wiki metadata is still something worth looking into, but it's much more challenging as a project, imho.

The idea of populating EXIF fields based on wiki metadata is still something worth looking into, but it's much more challenging as a project, imho.

As long as you don't want to change the original image, just the thumbnails, it doesn't seem that bad. We have an API for getting the wiki metadata; formatting, filtering out HTML, length limiting etc. is nontrivial but not particularly hard.
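
As a rough illustration of the "filtering out HTML, length limiting" step, here is a Python sketch using only the standard library. The function name and the 200-character default are assumptions, and fetching the value from the API is not shown. It counts characters, not bytes, so a real implementation would still need to decide how to handle the encoding limits of the target metadata field.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text content of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_plain_text(html_value, max_chars=200):
    """Strip tags, collapse whitespace and truncate to max_chars characters."""
    parser = _TextExtractor()
    parser.feed(html_value)
    parser.close()  # flush any text still buffered after the last tag
    text = " ".join("".join(parser.parts).split())
    if len(text) > max_chars:
        text = text[:max_chars - 3].rstrip() + "..."
    return text


# Made-up input resembling a linked author name in wiki-rendered HTML.
print(to_plain_text('<a href="//example.org/User:Jane">Jane <b>Doe</b></a> &amp; co'))
```

This deliberately keeps the text inside every element, so content of `script` or `style` elements would need extra handling if it could appear in the input.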

I don't think that making thumbnail generation dependent on a mediawiki API is a great idea, the whole point of decoupling the thumbnailing infrastructure is to avoid having mediawiki in the mix of actual thumbnail generation. Having an API as a dependency would be a step back in terms of performance and availability of thumbnailing.

I would be more leaning towards storing that data as extra headers for the original in swift. This way thumbor thumbnail generation remains as mediawiki-agnostic as it currently is, it just applies extra data when it finds some in the existing request to read the original from swift.

Mediawiki would be responsible for keeping the extra headers up to date, and filtering the intake of metadata. Which is more in line with the soon-to-be new status quo where mediawiki is still responsible for all the metadata wrangling. The separation of concerns stays the same.
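
A minimal Python sketch of that split of responsibilities, under stated assumptions: Swift exposes custom object metadata as `X-Object-Meta-*` headers, while the extra `Exif-` sub-prefix, the function names and the percent-encoding of values are choices made up for this example, not an existing MediaWiki or Thumbor convention.

```python
from urllib.parse import quote, unquote

# "X-Object-Meta-" is Swift's prefix for custom object metadata;
# the "Exif-" part is a hypothetical namespace for this sketch.
PREFIX = "X-Object-Meta-Exif-"


def to_swift_headers(fields):
    """MediaWiki side: encode {'Artist': ...} as object metadata headers."""
    return {PREFIX + name: quote(value, safe="") for name, value in fields.items()}


def from_swift_headers(headers):
    """Thumbnailer side: pick the fields back out of the response headers."""
    prefix = PREFIX.lower()
    fields = {}
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if name.lower().startswith(prefix):
            fields[name[len(prefix):]] = unquote(value)
    return fields


# Made-up example values; percent-encoding keeps non-ASCII names header-safe.
headers = to_swift_headers({"Artist": "Jürgen Müller", "Copyright": "CC BY-SA 4.0"})
print(headers)
print(from_swift_headers(headers))
```

With this shape the thumbnailer never calls a MediaWiki API: it reads the headers that already come back with the original and applies whatever it finds. Since a proxy may change the case of header names, the field names recovered on the thumbnailer side should be matched case-insensitively as well.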

@Gilles can we have this or some part of it for an Outreachy-13 internship, and would you be willing to mentor?

I don't have time to mentor this at the moment, sorry, but I'm happy to keep commenting and providing feedback if some of it gets picked up for outreachy.

We're having a new round of GSoC, Outreachy and RGSoC internship. @Gilles do you have time to mentor this, in this term?

the upload wizard could guide the user in adding the information. or require the user to add it herself? that way we don't have the problem of "original photo does not contain the license information, but then it is added". what do you think?

srishakatux subscribed.

Removing the Possible-Tech-Projects tag as we are planning to kill it soon! This project does not seem to fit in the Outreach-Programs-Projects category in its current state, so I'm not adding this tag right now!

a couple of reports of copyright / copyleft trolling, which could be avoided by such metadata: