Page MenuHomePhabricator

MIME type detection of "application/x-php" gives false positives on any file with "<?" in it
Closed, ResolvedPublic

Description

Apparently, any file uploaded to Commons that has the character pair "<?" anywhere within the first 1024 bytes gets detected as having the MIME type "application/x-php". This seems a bit excessive. Some examples include:

http://commons.wikimedia.org/wiki/Image:Bundesarchiv_Bild_137-002552,_Türkei,_Anatolien,_Taurus,_Feldbahn.jpg
http://commons.wikimedia.org/wiki/Commons:Village_pump/Archive/2008Nov#Uploading-Error

Note that I don't believe there should be any security issues with serving such files to users: I'm not aware of any user agent that would execute downloaded PHP code, and certainly not one that would use such a hair-trigger check for detecting it.

(Ps. This might be a Wikimedia configuration issue: I haven't yet looked at the MIME type detection code closely enough to tell. Filing this for now as a MediaWiki bug, feel free to reclassify.)


Version: unspecified
Severity: normal
URL: http://commons.wikimedia.org/wiki/Image:Bundesarchiv_Bild_137-002552,_Türkei,_Anatolien,_Taurus,_Feldbahn.jpg

Details

Reference
bz16583

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 10:23 PM
bzimport set Reference to bz16583.
bzimport added a subscriber: Unknown Object (MLST).

On second thought, there might be an issue for wikis running on poorly configured webservers that might try to execute PHP scripts from within the "images" directory. I wouldn't expect that to happen unless the file had the extension ".php" (or possibly ".phpN", where N is a number), though.

ayg wrote:

Or .phtml or a couple of others. In that case, though, you'd have to check the whole file for "<?", not just the first 1024 bytes.

The overzealous check appears to be in MimeMagic::doGuessMimeType(). It appears that the check, as currently coded, will have a false positive rate of slightly over 1 in 4096 files, assuming a random distribution of octets (which should be reasonable for compressed file formats like JPEG, PNG or Ogg).

Probably needs some adjustment, yes.

What I'm wondering is whether the check really serves any purpose at all. Certainly it can't provide more than "security by obscurity", since any attacker who knows about it can circumvent it trivially.

If we simply want to filter out "normal" PHP files, I suppose we could do an octet-frequency check to filter out files that appear to be binary data. Even just requiring that the "<?" be followed by a couple of bytes of printable ASCII ought to reduce the false positive rate noticeably.

Anyway, anybody want to grep the logs for "doGuessMimeType: recognized .* as application/x-php" and see how often it's actually triggered (legitimately or by accident)? If my math is right, about one in every four thousand uploads we get should be failing with a mysterious error saying "Files of the MIME type "application/x-php" are not allowed to be uploaded."

ayg wrote:

I vote to just drop it. It seems to serve no useful purpose.

(In reply to comment #6)

I vote to just drop it. It seems to serve no useful purpose.

It is a real bug, triggered in real use cases. It should be fixed.

(In reply to comment #7)

It is a real bug, triggered in real use cases. It should be fixed.

I am an idiot.

+1 vote to fix MimeMagic::doGuessMimeType().

matthew.britton wrote:

(In reply to comment #1)

On second thought, there might be an issue for wikis running on poorly
configured webservers that might try to execute PHP scripts from within the
"images" directory. I wouldn't expect that to happen unless the file had the
extension ".php" (or possibly ".phpN", where N is a number), though.

In the default configuration, and on all Wikimedia configurations as far as I am aware, .php and the others cannot be uploaded anyway. How about removing this check and adding a note in the docs to the effect that anyone who enables upload of such files should be careful to configure their webserver not to execute them?

nephele wrote:

patch for MimeMagic.php, r55559: check whether file is binary

The attached patch fixes this issue for several known image files that were falsely identified, but still successfully detects typical php files being uploaded with an (incorrect) image extension.

The patch adds a check to see whether the file header contains three null characters in a row. It's a string that should be present in nearly all binary files, but shouldn't normally be found in text files. It's imperfect and kludge-like -- but so is checking for php files based on the presence of '<?'. And there's no real difference security-wise -- if someone wants to intentionally create a php file that is not recognized by doGuessMimeType, that's already easily possible.

This at least resolves the bug until someone wants to do a more thorough re-write of the code.

Attached:

azliq7 wrote:

(In reply to comment #3)

The overzealous check appears to be in MimeMagic::doGuessMimeType(). It
appears that the check, as currently coded, will have a false positive rate of
slightly over 1 in 4096 files, assuming a random distribution of octets (which
should be reasonable for compressed file formats like JPEG, PNG or Ogg).

I encountered this bug twice when I was uploading about 3,145 PNG files.
I think this is a quite a serious bug that needs to be looked into.

ayg wrote:

Problem mitigated in r58682: I just removed the check for three-byte strings ('<? ', "<?\n", "<?\t", '<?='). The shortest string it's checking for is now five bytes, so random false positives should be significantly less common -- I estimate on the order of 1/100,000,000. Is there any reason to keep the check at all, though?

Probably not a huge benefit from the check since it's only a portion of the file anyway. :)

wolfram.schmied wrote:

(In reply to comment #3)

It appears that the check, as currently coded, will have
a false positive rate of slightly over 1 in 4096 files,
assuming a random distribution of octets

16 bits, not 12, and you have to multiply by 1024, which gives us a false positive rate for random files on the order of 2^-6 ~= 1.7 %. That was corroborated by a little experiment I did. I measured 1.8 % for a set of metadata-free PNGs created by resizing 1498 JPEGs from my photo collection to 160x120 and converting to PNG.

The very low rate seen by azliq7@yahoo.com (comment #11) was apparently the result of large metadata headers. The original JPEG photos I used in my sample all have decidely non-random headers of ~8k.

(In reply to comment #14)

(In reply to comment #3)

It appears that the check, as currently coded, will have
a false positive rate of slightly over 1 in 4096 files,
assuming a random distribution of octets

16 bits, not 12, and you have to multiply by 1024, which gives us a false
positive rate for random files on the order of 2^-6 ~= 1.7 %.

The check which Simetrical removed in r58682 matched if the first 1024 bytes of the file contained "<?" followed by one of four possible bytes (' ', '\n', '\t' or '='). Thus, the probability of three random bytes matching this check is 4/(2^8)^3 = 1/2^22, and the probability of 1024 random bytes matching it is approximately 1024/2^22 = 1/2^12 = 1/4096.

(Taking into account the possibility of multiple matches and the fact that the last 2 out of 1024 positions can't match makes the probability about 1/4104.5. Most of the difference is due to the latter, since multiple matches are very unlikely events, occurring only for about one in every 2^24 files.)

Anyway, marking the bug as fixed: r58682 should reduce the false positive rate enough that what's left (like removing the check entirely?) is mainly just code cleanup.

wolfram.schmied wrote:

(In reply to comment #15)

The check which Simetrical removed in r58682 matched if the first 1024 bytes of
the file contained "<?" followed by one of four possible bytes (' ', '\n', '\t'
or '=').

Thanks for the clarification, that hadn't been clear from the discussion so far. Considering that I managed to catch that one with just a few dozen uploads, I better stay clear from games of chance. ;)

wolfram.schmied wrote:

It seems that [http://commons.wikimedia.org/w/index.php?title=Commons:Village_pump&oldid=32560832#Odd_bug ZIP detection has a similar problem]. Would it be possible to amend error messages relating to heuristic file detection with a pointer to a manual page with information about the possibility of false positives, and how to work around them?

(In reply to comment #17)

It seems that
[http://commons.wikimedia.org/w/index.php?title=Commons:Village_pump&oldid=32560832#Odd_bug
ZIP detection has a similar problem].

Yes, it seems that will happen if the last 65558 bytes of the file contain the 4-byte string "PK\x05\x06", which should happen with probability ~1/65514 assuming random data. Unfortunately, making this check significantly more specific (or moving it until later in the identification process) risks allowing malicious hybrid files, such as the well known "GIFAR" exploit, to pass it. It _might_ be possible to tighten it a bit somehow, but doing so safely would require knowledge of not only the ZIP file format but also of the ways in which various common ZIP implementations parse it. (In other words, we want our check to be broad enough to catch anything that some other program might mistake for a ZIP file, even if it doesn't exactly conform to the ZIP spec.)

Would it be possible to amend error
messages relating to heuristic file detection with a pointer to a manual page
with information about the possibility of false positives, and how to work
around them?

That would certainly be possible, either globally through translatewiki or locally by wiki sysops. The relevant system message seems to be [[MediaWiki:filetype-badmime]].

azliq7 wrote:

(In reply to comment #15)

(In reply to comment #14)

(In reply to comment #3)

It appears that the check, as currently coded, will have
a false positive rate of slightly over 1 in 4096 files,
assuming a random distribution of octets

16 bits, not 12, and you have to multiply by 1024, which gives us a false
positive rate for random files on the order of 2^-6 ~= 1.7 %.

The check which Simetrical removed in r58682 matched if the first 1024 bytes of
the file contained "<?" followed by one of four possible bytes (' ', '\n', '\t'
or '='). Thus, the probability of three random bytes matching this check is
4/(2^8)^3 = 1/2^22, and the probability of 1024 random bytes matching it is
approximately 1024/2^22 = 1/2^12 = 1/4096.

(Taking into account the possibility of multiple matches and the fact that the
last 2 out of 1024 positions can't match makes the probability about 1/4104.5.
Most of the difference is due to the latter, since multiple matches are very
unlikely events, occurring only for about one in every 2^24 files.)

Anyway, marking the bug as fixed: r58682 should reduce the false positive rate
enough that what's left (like removing the check entirely?) is mainly just code
cleanup.

I've reopened the bug since the previous bug fix did not solve the problem completely.
I encountered this bug again when I was uploading 1,000 PNG files.

Perhaps we should consider removing the check?

(In reply to comment #19)

I've reopened the bug since the previous bug fix did not solve the problem
completely.
I encountered this bug again when I was uploading 1,000 PNG files.

That's probably because the bugfix isn't live on Wikipedia yet. Re-closing.

azliq7 wrote:

Well, obviously. I used my own wiki to test the patch.

ayg wrote:

What string did the file have in it that triggered the false positive?

Ps. Discussion about ZIP file detection should properly go under bug 20924.