
Detect RAR concatenation in jpeg images
Closed, ResolvedPublic

Description

Author: lilewyn

Description:
HOW TO: Download the linked file (req. admin access on enwiki), rename to .rar, extract.
PROBLEM: Users are using Wikipedia as a RapidShare replacement by appending compressed files to legitimate graphics uploaded to our servers.
POSSIBLE SOLUTION: Add code to detect RAR compression appended to valid graphics files and fail the upload.


Version: unspecified
Severity: enhancement
URL: http://en.wikipedia.org/w/index.php?title=Special:Undelete&target=Image%3AStar_Wars_Republic_Commando_Triple_Zero.jpg&file=0a52grdaxrtrnm5wuk90bemab69hsnrs.jpg

Details

Reference
bz10847

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 9:48 PM
bzimport set Reference to bz10847.
bzimport added a subscriber: Unknown Object (MLST).

Why look for RAR and not five million other archive formats? What about trivially obfuscated files? Encrypted files? etc.

Alkivar wrote:

(In reply to comment #1)

Why look for RAR and not five million other archive formats? What about
trivially obfuscated files? Encrypted files? etc.

It's simple, really: your average JPG viewer stops reading the file after the end tag, and RAR ignores anything before the RAR header, so JPG and RAR make the perfect combination. A few other archive and image formats could potentially work too. There are tutorials all over the internet, including the EN WP article on RAR, showing how to do the JPG/RAR combination.

Convenient. :)

Greg's putting together a list of files with known issues, so we'll have a good test set for this and other formats.

Note that Commons uploads are being checked (by a third party) for embedded RARs.

We could just search the files for the string "Rar!" (file header for RAR archives). But I'm not sure how often this could just randomly occur in the image data.

(Well, assuming random data and 4 MB photos, it's about 1 in 1000 files, which is unacceptably high. But perhaps JPEG data is not so randomly distributed and the chance is much smaller. With T151821, we could see how often it occurs in our existing files.)
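The 1-in-1000 estimate can be reproduced with a quick calculation. This is only a sketch under the same assumption as above (uniformly random, independent bytes, which JPEG data may not be); the function name is made up for illustration.

```python
import math


def chance_of_random_match(file_size_bytes: int, signature_len: int) -> float:
    """Chance that a fixed signature appears at least once in random bytes.

    Assumption: every byte is uniform and independent, and overlapping
    positions are treated as independent trials (fine as an approximation).
    """
    positions = max(file_size_bytes - signature_len + 1, 0)
    p_single = 256.0 ** -signature_len  # chance of a match at one position
    # 1 - (1 - p)^positions, written with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(positions * math.log1p(-p_single))


if __name__ == "__main__":
    p = chance_of_random_match(4 * 1024 * 1024, len(b"Rar!"))
    print(f"4-byte signature in a 4 MB file: about 1 in {1 / p:.0f}")  # roughly 1 in a thousand
```

Passing a longer `signature_len` shows how quickly the chance of an accidental match drops as the signature grows.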

Analyzing a large JPEG I uploaded a while ago with strings + grep, I got a few near misses for the Rar! signature:

$ strings in.jpg | grep 'Rar'
MRar
sYRar5
RarU
Rar	
Rare
Rar6Y
+%hRar
Rar1z
$ strings in.jpg | grep 'ar!'
ar!	JQ
{ar!	
ar!K]&F@
Iar!
ar!W
ar!*L
%ar!

Also, playing with a visual representation of the first few megabytes of the file's binary contents, using $ < in.jpg rawtoppm -rgb 1024 1024 | pnmtopng > out.png, with the JPEG I got:

Elisa_Bonaparte_with_her_daughter_Napoleona_Baciocchi_-_François_Gérard_-_Google_Cultural_Institute.jpg.vis.png (1×1 px, 3 MB)

And with /dev/urandom:
urandom.png (1×1 px, 3 MB)

I'd say the JPEG data is quite random.
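One way to put a number on "quite random", as a complement to the visual comparison, is the Shannon entropy of the byte values. The sketch below is an illustration only: `in.jpg` is the file name from the commands above, the function name is hypothetical, and byte entropy ignores ordering, so it is a rough indicator rather than a proof of randomness.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    import os

    with open("in.jpg", "rb") as f:
        jpeg = f.read()
    # Compare against the same amount of data from the OS random source.
    print("JPEG:  ", byte_entropy(jpeg))
    print("random:", byte_entropy(os.urandom(len(jpeg))))
```

Compressed image data would be expected to score close to the 8 bits per byte of the random sample, which is consistent with the two pictures looking alike.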

matmarex's method, with a bit more work parsing the image, would work. See for example: http://stackoverflow.com/a/4614629/342196 Rather than detecting a specific file format, once you reach the first FFD9 (the real JPEG EOF), if you are not at the end of the file, then you have detected a problem.

Look, FFD9 is not a mandatory marker. See this comment: "The end-of-file marker in JPEG files is optional, so this doesn't really help." Matma Rex (talk) 19:49, 28 November 2016 (UTC)

User:Embedded Data Bot currently parses the JPEG with Pillow to find the EOF, but there can be false positives / negatives sometimes.

Then finding FFDA by parsing the segments, skipping the scan data from that offset, [optionally] checking for FFD9 and, in addition, applying some heuristics on file size would be more reliable.
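The marker-walking idea discussed above could look roughly like the following. This is a minimal sketch, not what Embedded Data Bot does: it skips each segment by its length field (so an FFD9 inside an embedded thumbnail is not mistaken for the end), scans the data after FFDA for the next real marker, and reports how many bytes follow FFD9. The function names are hypothetical, and files that legitimately carry data after the end marker would show up as false positives.

```python
from typing import Optional

# Markers without a length field: TEM (0x01) and RST0-RST7 (0xD0-0xD7).
STANDALONE = {0x01} | set(range(0xD0, 0xD8))


def find_jpeg_end(data: bytes) -> Optional[int]:
    """Return the offset just past the EOI marker (FFD9), or None if none is found."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    pos, n = 2, len(data)
    while pos + 1 < n:
        if data[pos] != 0xFF:
            return None  # lost sync with the marker structure
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        pos += 2
        if marker == 0xD9:  # EOI
            return pos
        if marker in STANDALONE:
            continue
        if pos + 2 > n:
            return None
        length = int.from_bytes(data[pos:pos + 2], "big")
        if length < 2:
            return None  # corrupt length field
        pos += length  # skip the whole segment, including any FFD9 inside it
        if marker == 0xDA:  # SOS: entropy-coded data follows the header
            while True:
                pos = data.find(b"\xff", pos)
                if pos == -1 or pos + 1 >= n:
                    return None
                nxt = data[pos + 1]
                if nxt == 0x00 or 0xD0 <= nxt <= 0xD7:
                    pos += 2  # stuffed byte or restart marker, still scan data
                    continue
                break  # pos is at the 0xFF of the next real marker
    return None


def trailing_bytes(data: bytes) -> Optional[int]:
    """Number of bytes after EOI, or None when no EOI could be located."""
    end = find_jpeg_end(data)
    return None if end is None else len(data) - end
```

A non-zero result from `trailing_bytes` flags the file for a closer look; a `None` result covers the missing-FFD9 case, where a file-size heuristic would have to take over.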

The only other 100% secure option would be to losslessly convert the files before making them public to remove unknown blobs (or detect the size).

BTW, the RAR signature is longer than 4 bytes. From the technote: "RAR 5.0 signature consists of 8 bytes: 0x52 0x61 0x72 0x21 0x1A 0x07 0x01 0x00. You need to search for this signature in supposed archive from beginning and up to maximum SFX module size." That is very, very unlikely to happen by accident in the first megabyte (the SFX zone), so there is no need for a full RAR parser to detect it. http://www.rarlab.com/technote.htm#arcblocks

RAR 5.0 is apparently a completely different format from the previous versions. I think I saw it once, among a dozen or so funky files I examined some time ago.

From the same page: RAR 4.x 7 byte length signature: 0x52 0x61 0x72 0x21 0x1A 0x07 0x00
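Searching for the two full signatures quoted above needs only a byte search. The following is a sketch with hypothetical names; the optional `limit` parameter is an assumption standing in for a "maximum SFX module size" cut-off, and by default the whole file is scanned because an appended archive starts after the image data.

```python
from typing import List, Optional, Tuple

# Signatures as quoted from the RAR technote.
RAR4_SIGNATURE = bytes([0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x00])
RAR5_SIGNATURE = bytes([0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x01, 0x00])


def find_rar_signatures(data: bytes, limit: Optional[int] = None) -> List[Tuple[int, str]]:
    """Return (offset, format name) for every full RAR signature found in data."""
    window = data if limit is None else data[:limit]
    hits = []
    for name, signature in (("RAR 4.x", RAR4_SIGNATURE), ("RAR 5.0", RAR5_SIGNATURE)):
        start = 0
        while True:
            idx = window.find(signature, start)
            if idx == -1:
                break
            hits.append((idx, name))
            start = idx + 1
    return sorted(hits)


if __name__ == "__main__":
    with open("in.jpg", "rb") as f:  # file name taken from the commands earlier in the thread
        for offset, name in find_rar_signatures(f.read()):
            print(f"{name} signature at offset {offset}")
```

Because the signatures are 7 and 8 bytes long rather than 4, a hit is far less likely to be a coincidence than a bare "Rar!" match.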

Yeah, @valhallasw also found some docs on the structure of the 4.0 format. Thanks!

It is not my intention to tell you how to do this. I was just trying to help get it done without installing proprietary software on Wikimedia servers, given that T151794 was rejected.

Closing this as resolved, as this has been done with a bot, which is currently approved and active on Commons only. If anyone wants to implement it in MediaWiki core so that all MediaWiki installs could have automated detection, feel free to reopen.