
Refuse uploading JPEG files with extra junk at the end.
Open, Medium, Public, Feature

Description

Original title: Refuse uploading files that contain huge data of other file types, especially if this data is encrypted


Version: 1.22.0
Severity: enhancement

Details

Reference
bz46921

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:15 AM
bzimport set Reference to bz46921.
bzimport added a subscriber: Unknown Object (MLST).

Wondering at which exact state this refusal would take place.
This request might be "UploadWizard" (or "Special:Upload") territory instead of "File management".

Probably the same stage we do other file type checks. (On the backend after the upload)

Given that these files have been deleted, could an example be attached to this bug so we can see what the file actually looks like?

I'm given to understand these were valid JPEGs with extra junk in metadata segments? I'm not sure we would be able to strip that without worrying about damaging real metadata.

If the images just had extra data embedded into the image data using steganography, it would be pretty difficult to detect in general.

Created attachment 12039
sample: https://commons.wikimedia.org/w/index.php?title=File:Fresh_Relic_4531.JPG

My computer just crashed (step by step human input devices stopped working) after viewing one of them so I really hope they do not contain evil code.

Attached:

File.Fresh_Relic_4531.JPG (492×489 px, 9 MB)

(In reply to comment #3)

Created attachment 12039 [details]
sample:
https://commons.wikimedia.org/w/index.php?title=File:Fresh_Relic_4531.JPG

My computer just crashed (step by step human input devices stopped working)
after viewing one of them so I really hope they do not contain evil code.

This one seems to contain a password protected file. Opening it with 7z prompts for a password. The second one (12040) should contain something also, although 7z was unable to detect an archive there. As one can see, both images are displayed fast (while downloading) and then the browser keeps downloading data even if the image is already displayed.

As I've read, it's extremely easy to add any file inside a jpeg and yet have an absolutely valid image that displays perfectly. It can be done by just concatenating the contents of a file to an existing jpeg image.

Hmm, if it's just stuff concatenated at the end, it would probably be possible to detect (look for the \xFF\xD9 marker and see if anything follows it). [From a security-paranoia standpoint, doing this would probably not be a bad idea. GIFAR and all.]
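A minimal sketch of that check, in Python, not an existing MediaWiki routine. Instead of searching for the last \xFF\xD9 (which could be fooled by a payload that happens to contain those bytes), it walks the marker segments and the entropy-coded scan data until it reaches EOI. The function names `jpeg_end_offset` and `trailing_bytes` are made up for illustration, and the walker assumes a baseline or progressive JPEG with 16-bit segment lengths; it returns None when it cannot follow the structure (for example, when EOI is missing).

```python
from typing import Optional


def jpeg_end_offset(data: bytes) -> Optional[int]:
    """Return the offset just past the EOI marker, or None if not found."""
    if data[:2] != b"\xff\xd8":              # must start with SOI
        return None
    pos, n = 2, len(data)
    while pos + 1 < n:
        if data[pos] != 0xFF:                # every segment starts with FF
            return None
        marker = data[pos + 1]
        if marker == 0xFF:                   # fill byte before a marker
            pos += 1
            continue
        pos += 2
        if marker == 0xD9:                   # EOI: image ends here
            return pos
        if marker == 0x01 or 0xD0 <= marker <= 0xD7:
            continue                         # standalone markers, no length
        if pos + 2 > n:
            return None
        length = int.from_bytes(data[pos:pos + 2], "big")
        if length < 2:
            return None
        pos += length                        # skip the segment payload
        if marker == 0xDA:                   # SOS: entropy-coded data follows
            while True:
                pos = data.find(b"\xff", pos)
                if pos < 0 or pos + 1 >= n:
                    return None
                nxt = data[pos + 1]
                if nxt == 0x00 or 0xD0 <= nxt <= 0xD7:
                    pos += 2                 # stuffed FF00 or restart marker
                    continue
                break                        # real marker; outer loop handles it
    return None


def trailing_bytes(data: bytes) -> Optional[int]:
    """Number of bytes after EOI, or None if the JPEG structure is broken."""
    end = jpeg_end_offset(data)
    return None if end is None else len(data) - end
```

Any non-zero result from `trailing_bytes` would be a candidate for rejection or a warning; how many trailing bytes to tolerate is a policy choice this sketch does not make.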

Looking at these files, they are indeed just stuff stuffed at the end.

For 12039:

00011d40 e6 93 34 a7 ad 25 0b 61 85 14 51 4c 0f ff d9 37 |..4..%.a..QL...7|
00011d50 7a bc af 27 1c 00 03 d8 f3 90 3d 40 84 9c 00 00 |z..'......=@....|

Note that ff d9 denotes end of image (EOI). After that, 37 7A BC AF 27 1C is the magic number for a 7z archive.
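As a rough illustration, the bytes following EOI could be compared against a small table of archive signatures. The 7z signature is the one quoted above; the RAR and ZIP entries are common signatures added here as assumptions, and the name `identify_payload` is hypothetical.

```python
from typing import Optional

# Signature table: only the 7z entry comes from the dump above;
# the RAR and ZIP entries are assumptions added for illustration.
SIGNATURES = {
    b"7z\xbc\xaf\x27\x1c": "7z",
    b"Rar!\x1a\x07": "rar",
    b"PK\x03\x04": "zip",
}


def identify_payload(payload: bytes) -> Optional[str]:
    """Return a short name for a recognised archive signature, else None."""
    for magic, name in SIGNATURES.items():
        if payload.startswith(magic):
            return name
    return None


# Bytes that follow "ff d9" in the first hex dump.
first = bytes.fromhex("377abcaf271c0003d8f3903d40849c0000")
print(identify_payload(first))
```

A miss in this table says nothing about whether the trailing data is harmless; encrypted or split payloads carry no recognisable signature.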

For the second image (12040) we have:

0000dc80 dd cf a1 f5 a6 9e b4 87 a9 a1 6b a8 92 3f ff d9 |..........k..?..|
0000dc90 43 d6 cd 64 8a dc f7 24 57 18 a8 2f e3 dd 38 34 |C..d...$W../..84|

Which doesn't have any magic numbers that I could see. However, it definitely doesn't appear to be JPEG data, as later on we have ff sequences that aren't escaped. Maybe it's the second part of some file split up over multiple JPEGs, or maybe encrypted, or something else.

MarkTraceur raised the priority of this task from Medium to High. Nov 28 2016, 6:19 PM
MarkTraceur moved this task from Backlog to Triaged on the Multimedia board.
MarkTraceur subscribed.

I don't think we can detect this with 100% accuracy (we'd essentially have to write a JPEG decoder, then use it to decode the file, and see if there's anything left over), but we could probably reject the uploads based on some crude heuristic (e.g. if the data is longer than it would be for an uncompressed file of these dimensions, something is clearly fishy). But I'm afraid this will end up in an arms race (if we plug this for JPG files, these folks will just switch to another file format that is more difficult to evaluate).
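The crude heuristic can be sketched as a one-line comparison. This assumes 8-bit samples and three colour channels; `exceeds_raw_size` is a hypothetical name, and the 9 MB figure is the rounded size shown for the attached sample, taken here as 9,000,000 bytes.

```python
def exceeds_raw_size(width: int, height: int, file_size: int,
                     channels: int = 3) -> bool:
    """True if the file is larger than the uncompressed pixel data would be.

    Assumes one byte per sample; `channels` defaults to 3 (e.g. RGB/YCbCr).
    """
    return file_size > width * height * channels


# Attached sample: 492x489 px, about 9 MB.
# Raw pixel data would be 492 * 489 * 3 = 721,764 bytes.
print(exceeds_raw_size(492, 489, 9_000_000))
```

Large legitimate metadata would count against a file here as well, so the threshold may need some slack.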

Run something like jpegtran -o -copy none on it and discard it if the size reduction is significant? (Although jpegtran is designed to be lossless so probably there are better choices.)

Unless jpegtran is smarter than it should be, that won't help for files without the end-of-image marker (where essentially the extra junk data is part of the image scan data).

I would suggest that this is a case where the perfect is the enemy of the good. It would be impossible to defend against a knowledgeable adversary using good steganographic techniques, who is trying to upload an extra ~1% payload. I think it should be relatively easy to prevent ~800MB .iso images and movies, etc. from piggybacking on pretty much any file. On the other hand, it has been pointed out that we can catch a lot of this stuff with AbuseFilters.

Neat, I actually forgot that abusefilter can do this now. For future reference: https://commons.wikimedia.org/wiki/Special:AbuseFilter/160. We'll have to be careful tuning this to avoid blocking legitimate uploads, though. The current rule is a bit more rigorous than I would've recommended, but probably fine. I wonder if this means we can close this task?

On the other hand, it does nothing against someone uploading a suitably big JPG file that has no image data, but a whole movie tacked on at the end… I feel like T12847: Detect RAR concatenation in jpeg images would be a better way to discourage this. Embedding RAR files is popular because they can be extracted by just renaming the file to .rar, without having to edit binary files (or use dedicated software).

@matmarex are there any samples available?

I downloaded the first deleted file from the link in the summary:

$ cat WriterWavePlot.JPG | wc -c
24608472
$ jpegtran -o -copy none WriterWavePlot.JPG | wc -c
30888
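A hedged sketch of how that comparison might be automated. The helper names and the 0.5 threshold are assumptions, not anything agreed in this task; the jpegtran invocation spells out `-optimize` instead of `-o` and relies on jpegtran writing to standard output when no output file is given.

```python
import subprocess


def recompressed_size(path: str) -> int:
    """Size in bytes of `jpegtran -optimize -copy none` output for `path`."""
    result = subprocess.run(
        ["jpegtran", "-optimize", "-copy", "none", path],
        stdout=subprocess.PIPE,
        check=True,
    )
    return len(result.stdout)


def reduction_fraction(original_size: int, new_size: int) -> float:
    """Fraction of the original file that disappears after recompression."""
    if original_size <= 0:
        return 0.0
    return 1.0 - new_size / original_size


def looks_padded(original_size: int, new_size: int,
                 threshold: float = 0.5) -> bool:
    """True if recompression removed more than `threshold` of the file.

    The default threshold is an arbitrary placeholder for illustration.
    """
    return reduction_fraction(original_size, new_size) > threshold


# Sizes from the shell session above: over 99% of the file goes away.
print(looks_padded(24608472, 30888))
```

Because `-copy none` also drops legitimate metadata, a file with large but genuine EXIF or ICC data would shrink too, so the threshold would need tuning against real uploads.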

Or you can just reuse the existing thumbnailing system to generate a thumbnail that's 1px smaller, and see if there is more than say 10% size difference.

In T48921#2829328, @Tgr wrote:

@matmarex are there any samples available?

I don't know if we had any uploaded, but you just need to take any JPG file, remove last two bytes, append garbage. I could create and upload an example.

In T48921#2829346, @Tgr wrote:

Or you can just reuse the existing thumbnailing system to generate a thumbnail that's 1px smaller, and see if there is more than say 10% size difference.

Thumbnailing strips all metadata, and legitimate metadata can be pretty large (e.g. https://commons.wikimedia.org/wiki/File:Profilfoto_FB.jpg, some more examples can be found in https://commons.wikimedia.org/wiki/User:Dispenser/Absurd_overhead).

FYI: I have coded (or I am coding) a bot to automatically detect such files. However, JPGs are weird in that many JPG files seem to contain useless extra junk, and I have yet to understand what this extra junk actually contains and whether it is legitimate.

MarkTraceur lowered the priority of this task from High to Medium. Jun 5 2017, 3:10 PM

I think we can move on this by adding a warning for files where 2/3 of the file is metadata, possibly only for files above a certain size threshold (500 KB or so may do the trick).
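One way such a warning could be computed, as a sketch under stated assumptions: "metadata" is taken to mean APPn and COM segments that appear before the first scan, 500 KB is read as 500 * 1024 bytes, and the names `metadata_bytes` and `should_warn` are hypothetical. It is not the existing MediaWiki metadata code.

```python
from typing import Optional


def metadata_bytes(data: bytes) -> Optional[int]:
    """Total size of APPn/COM segments before the first SOS, or None."""
    if data[:2] != b"\xff\xd8":               # not a JPEG (no SOI)
        return None
    pos, total = 2, 0
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        if marker == 0xFF:                    # fill byte
            pos += 1
            continue
        if marker in (0xDA, 0xD9):            # SOS or EOI: headers are over
            break
        if marker == 0x01 or 0xD0 <= marker <= 0xD7:
            pos += 2                          # standalone marker
            continue
        length = int.from_bytes(data[pos + 2:pos + 4], "big")
        if length < 2:
            return None
        if 0xE0 <= marker <= 0xEF or marker == 0xFE:
            total += 2 + length               # marker bytes + segment
        pos += 2 + length
    return total


def should_warn(data: bytes, min_size: int = 500 * 1024,
                max_fraction: float = 2 / 3) -> bool:
    """Warn when a large enough file is mostly metadata segments."""
    if len(data) < min_size:
        return False
    meta = metadata_bytes(data)
    return meta is not None and meta > max_fraction * len(data)
```

Data appended after EOI is not counted as metadata by this sketch, so it would complement a trailing-data check, not replace one.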

Note that "metadata" and "extra junk at the end" are different things. The first is about using fields provided by the file type spec to store arbitrary data; we already have code for detecting most of these (since we want to index metadata and whatnot) so we just need to measure it and decide what's a reasonable size limit. (See also T170251.) The second is about violating the file type spec in ways that are ignored by most tools (e.g. the file is supposed to consist of width then height then width x height bytes of pixel color; adding more bytes at the end will be ignored by the viewer but data put there can be recovered by a custom-made tool, or just splitting the file). Bawolff shared a real example for JPEG in T48921#484617. These are probably going to be harder to detect (but could be easily handled by something like T67383).

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 12:24 PM
Aklapper removed subscribers: Tbayer, wikibugs-l-list.