
Ask additional copyright questions of users claiming "own work"
Open, Low, Public, Feature

Description

Author: rd232

Description:
A lot of the bad uploads Commons gets are from users not understanding what "own work" means. Special:UploadWizard seems a prime target to try harder to educate users, yet once we get to the "own work" rights page the user is simply asked to confirm "This file is my own work." I suggest incorporating more of the issues mentioned in the UploadWizard tutorial and at Special:Upload/ownwork. If the uploader claims "own work", we should have them clarify what they understand by that, and ask them to confirm (using checkboxes, so they actually have to answer the questions!) things like

  1. this is a photo, and I took the photo myself
  2. or it's an original digital work I created myself, without using or relying on any files created by other people
  3. the photo doesn't include creative objects or images created by other people (eg paintings, statues, etc)

     3a. except where the photo was taken in a public place, in any of these countries where Freedom of Panorama applies (...list...)

This list isn't exhaustive; it's just a first idea of the sort of thing to ask, and the final list can't cover every issue either. But we can try to ask the most common questions where the wrong answer makes it clear that it's not "own work" and that either the file shouldn't be uploaded, or it should be uploaded and tagged as immediately needing additional information and/or help from more experienced users.

Since these questions are likely to need tweaking and testing, it would be really helpful if it was possible for admins to edit them.
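
As a rough illustration of the proposal above, the questions could live in plain data that admins edit, with each unconfirmed statement mapped to an outcome. Everything in this sketch (the `OwnWorkQuestion` type, the question keys, the three outcome names) is hypothetical; it is not UploadWizard code or its configuration format, and the Freedom of Panorama exception (3a) is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OwnWorkQuestion:
    key: str             # hypothetical identifier for the checkbox
    text: str            # wording shown to the uploader (admin-editable)
    on_unconfirmed: str  # "block" or "tag_for_review"


# Hypothetical question list; the wording follows the proposal in this task.
QUESTIONS = [
    OwnWorkQuestion(
        "created_by_me",
        "I took this photo myself, or it is an original digital work I "
        "created without relying on files created by other people.",
        "block",
    ),
    OwnWorkQuestion(
        "no_third_party_art",
        "The photo doesn't include creative objects or images created by "
        "other people (e.g. paintings, statues).",
        "tag_for_review",
    ),
]


def evaluate(answers):
    """Return "accept", "tag_for_review" or "block" for a dict of key -> bool.

    A missing answer is treated the same as an unticked checkbox.
    """
    outcomes = {
        q.on_unconfirmed for q in QUESTIONS if not answers.get(q.key, False)
    }
    if "block" in outcomes:
        return "block"
    if "tag_for_review" in outcomes:
        return "tag_for_review"
    return "accept"
```

Keeping the wording and the outcome in data rather than in code is what would let admins tweak and test the questions without a software change.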


Version: unspecified
Severity: enhancement

Details

Reference
bz40255

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 12:53 AM
bzimport added a project: UploadWizard.
bzimport set Reference to bz40255.
bzimport added a subscriber: Unknown Object (MLST).

potato.olivier wrote:

I second this proposal.
Dealing with outrageous copyvios is a snap, but dealing with "improper understanding of copyright ownership" cases is really time-expensive. I think the proposed scheme, at least for first-time uploads, would greatly help.

rd232 wrote:

I appreciate this is not an easy request to fulfil, so allow me to add an easy way to slightly improve matters: point people to https://commons.wikimedia.org/wiki/Commons:Own_work on the relevant UploadWizard page, with an appropriate "are you sure it's 'own work'? people often misunderstand the concept, please read this page". Could even be a checkbox for "yes I've read Commons:Own work".

NB I appreciate Commons:own work is currently a crappy draft, but that's fairly easily and quickly fixed: just give a few days' warning at COM:VP that the UploadWizard will soon reference it and I'm sure it'll be up to scratch. And frankly even that crappy draft is better than allowing people to just rely on their own (often mistaken) assumptions without any hint that those assumptions may be wrong.

FoddyZip348 wrote:

This topic, along with a few potential solutions, is also discussed at http://commons.wikimedia.org/wiki/Commons_talk:Upload_Wizard#.22own_work.22

With the amount of copyvio getting uploaded to Commons on a daily basis, I'm "upping" the priority on this one. Any step that gets the number of copyvio files on Commons down is a good step to take.

MarkTraceur lowered the priority of this task from Medium to Low. Dec 3 2015, 6:11 PM

I'm not convinced.

I mean, I think it's important to protect Commons from copyright violation. I do. But all the questions in the world won't stop bad actors, and ignorant actors, by this point, are very used to the idea that they need to click on a bunch of checkboxes to get anything done in this world. They don't really care what the checkboxes say.

Also, I'm not convinced UploadWizard is the best place to solve this problem - I'd imagine there are ways to detect copyvios with a strong level of certainty upon upload - if an image's description is one word, for example, or only barely meets the 5 character requirement. Maybe if it lacks categories, that could be a clue.

I'm going to lower the priority here. If anyone has a better idea than the above, open a new bug, but this isn't going to happen for some time.

The absence of categorization on a new image is a good hint of a possible copyvio attempt: uploaders who use Commons as a repository for stolen files want to keep those files "hidden" from ordinary viewers.

But we have tools to track all new uploads. New uploads with no (or almost no) description and no categories can be moved higher in the list of images to inspect first. The same goes for uploads whose only categories are not very relevant, such as a heavily populated category for unsorted images, or tracking categories for images that need some complex fixing or that have minor bugs tracked in their description.
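
A minimal sketch of such a prioritized list, assuming uploads are available as simple records. The `Upload` type, the category names in `GENERIC_CATEGORIES` and the score weights are all hypothetical placeholders; only the signals themselves (very short or one-word description, missing or over-broad categories) come from this thread.

```python
from dataclasses import dataclass, field

# Hypothetical names standing in for over-populated or tracking categories.
GENERIC_CATEGORIES = {"Unsorted images", "Files needing fixes"}
MIN_DESCRIPTION_CHARS = 5  # the 5-character requirement mentioned above


@dataclass
class Upload:
    title: str
    description: str = ""
    categories: list = field(default_factory=list)


def suspicion_score(upload):
    """Higher score means the upload should be inspected sooner.

    The weights are arbitrary illustration values, not tuned thresholds.
    """
    score = 0
    text = upload.description.strip()
    if len(text) <= MIN_DESCRIPTION_CHARS:
        score += 2  # empty, or barely meets the minimum length
    elif len(text.split()) <= 1:
        score += 1  # a single word
    useful = [c for c in upload.categories if c not in GENERIC_CATEGORIES]
    if not upload.categories:
        score += 2  # no categories at all
    elif not useful:
        score += 1  # only generic or tracking categories
    return score


def review_queue(uploads):
    """Return uploads ordered from most to least suspicious."""
    return sorted(uploads, key=suspicion_score, reverse=True)
```

A score like this only orders the inspection list; it does not decide anything on its own, which matches the point that a human still reviews the flagged files.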

If new images with obvious copyvios are detected early through these prioritized lists, they'll be removed very fast. The uploader will then not be able to use Commons as a stable repository for distributing stolen content to other people, whether by email, on social networks, in chats, or in external forums where they publish the URL of the file hosted on Commons.

If we can detect these copyvios fast, uploaders will retry their upload under a new deceptive name, but we can also compare file fingerprints to detect obvious copies of files that were already banned and removed (Commons can keep a database of SHA1 fingerprints for files that have already been removed). Uploaders will then try to upload slightly modified files (e.g. changing some internal tag filled only with randomized data, so that the quality of the file is not really altered). Commons could in turn compute several fingerprints for different parts of the file, or ignore all unimportant tags added to the file.
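
The exact-match lookup described here can be sketched as follows. The `RemovedFileIndex` class is hypothetical, and an in-memory set stands in for whatever database Commons would actually use.

```python
import hashlib


class RemovedFileIndex:
    """Hypothetical index of SHA-1 fingerprints of already-removed files."""

    def __init__(self):
        self._digests = set()  # stand-in for a real database table

    def record_removal(self, data: bytes) -> None:
        """Remember the fingerprint of a file that was deleted."""
        self._digests.add(hashlib.sha1(data).hexdigest())

    def is_known_copy(self, data: bytes) -> bool:
        """True only if the bytes are identical to a removed file."""
        return hashlib.sha1(data).hexdigest() in self._digests


index = RemovedFileIndex()
original = b"bytes of a removed file"
index.record_removal(original)

reupload_is_caught = index.is_known_copy(original)
# Changing a single byte (here: appending one) yields a different digest.
tweaked_is_caught = index.is_known_copy(original + b"\x00")
```

In this example `reupload_is_caught` is true and `tweaked_is_caught` is false: a whole-file hash catches byte-identical re-uploads only, which is the weakness the rest of this comment is concerned with.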

But then uploaders will alter the file itself: inserting a random number of initial or final frames in their video or audio files, adding some "noise" to images in places where it is not visible (such as along the border of the image), or randomizing only some of the lightness in the dark areas. This is easy to do with JPEG and MPEG by slightly randomizing only the "high frequency components" of the DCT transforms of 8x8 blocks, because those components are almost invisible and most original JPEG/MPEG compressed images (or compressed sounds) already have some noise in them. Randomization is more noticeable when it affects the lightness component of an image than when it affects the saturation or hue component; MPEG formats using YCbCr components can be more easily randomized without noticeable effect in the Cr component than in the Cb component...

As well, images may be slightly altered in proportions, possibly using non-linear transforms or rotation, affecting only part of the image (generally the left or right part of landscape images), which will also go unnoticed by most users. There may also be randomized rotations of less than 2 degrees. These transforms are not necessarily very complex to create, given the hardware acceleration available today in many graphics cards.

In audio and video, you can easily speed up or slow down playback by inserting or removing intermediate interpolated frames. This change of speed can also vary over time, at a rate small enough to go unnoticed in terms of perceived quality. It's also easy to change the duration of pauses where there are already pauses, or to make cuts where the content already has "breaks". You can easily insert blank frames at these breaks or duplicate the static frames, and change the geometric transforms of images more radically at those breaks (e.g. shifting the image up or down by several pixels, given that the video will already have its top and bottom cut by some amount, or shifting the images by one or two 8x8 blocks to the left or right, given that the images are frequently cropped to remove logos or other identification normally visible in the original).

In all cases, uploaders of copyvio contents will always find a trick to bypass the simple fingerprints of contents (or of some isolated data streams within a packaged content).

So ultimately, the most efficient way to detect these repeated copyvios is to use priority lists based on deceptive descriptions and categorization of the uploaded content. Other hints may also be used to prioritize the list, such as suspect IPv4 or IPv6 address ranges (or domain names), or lists of "open proxies", from which there has been more frequent or more recent abuse by uploaders.
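
The address-range hint could be checked with Python's standard `ipaddress` module, as in this small sketch. The ranges below are reserved documentation ranges used purely as placeholders, and `from_suspect_range` is a hypothetical helper name.

```python
import ipaddress

# Placeholder ranges (reserved for documentation); a real list would be
# maintained from observed abuse and open-proxy lists.
SUSPECT_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def from_suspect_range(address: str) -> bool:
    """True if the uploader's address falls inside any listed range.

    Works for both IPv4 and IPv6; an address never matches a network
    of the other IP version.
    """
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in SUSPECT_RANGES)
```

A match would only raise the upload's position in the inspection list, in the same way as a missing description or missing categories.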

Having such a priority list that is easy to browse, with a small preview of the content (not just the start frame of a video, but a frame or frame range selected randomly from the middle of the video) and of the metadata inserted in descriptions or categories, will allow these copyvios to be detected faster and their uploaders to be acted against rapidly.
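
Picking the preview frame could look like the sketch below. Treating "the middle of the video" as the middle half of the frames is an assumption made for illustration, and `preview_frame_index` is a hypothetical helper.

```python
import random


def preview_frame_index(total_frames, rng=random):
    """Pick a random frame index from the middle half of a video.

    "Middle half" (frames from 25% to 75% of the length) is an assumed
    reading of "the middle of the video", not a value from this thread.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    start = total_frames // 4
    # Guarantee a non-empty range even for very short videos.
    end = max(start + 1, (3 * total_frames) // 4)
    return rng.randrange(start, end)
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which can help when two reviewers need to look at the same preview.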

For videos and photos, another good detection hint would be an unusual display format that does not follow the common standards (signaling that part of the content has been cut along the borders). But beware that many valid images on Commons are cropped deliberately by legitimate authors after they've drawn them or worked on their own photographs: to recenter the focus, hide something, correct light conditions along the borders, cut noisy borders created by their cameras, or remove overexposed areas they consider unnecessary. Such cropping is rarer for videos but may happen legitimately too (e.g. a video of someone talking in a static position, where the left or right side of the original showed unnecessary, distracting artefacts).
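
A simple version of this format check compares the aspect ratio with a few common ones. The list of ratios and the 2% tolerance are assumptions chosen for illustration, and, as noted above, a positive result is only a weak hint because legitimate crops are common.

```python
# Assumed set of "standard" aspect ratios; extend as needed.
COMMON_RATIOS = {"1:1": 1.0, "4:3": 4 / 3, "3:2": 3 / 2, "16:9": 16 / 9}


def has_unusual_format(width, height, tolerance=0.02):
    """True if the aspect ratio is not within `tolerance` of a common one.

    Orientation is ignored, so 1080x1920 is treated like 1920x1080.
    """
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    ratio = max(width, height) / min(width, height)
    return all(
        abs(ratio - common) / common > tolerance
        for common in COMMON_RATIOS.values()
    )
```

For instance, a 1920x1080 frame is not flagged, while a 1920x960 frame is, since its 2:1 ratio is far from every entry in the list.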

Most copyvios in general are "one-time" shots, and eliminating them has no severe consequence. Some users may have uploaded them accidentally, through lack of knowledge and in good faith, or the "copyvio" may not have been easy to demonstrate. So we have a reasonable time to act, and most often no reason to block an uploader for such an accident, only a reason to explain to them what was missing: a reliably verifiable permission, possibly via OTRS, within a reasonable time that lets uploaders prove their legitimate right and understand the process.

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:14 AM
Aklapper removed a subscriber: wikibugs-l-list.