Support proper mime type detection of Office Open XML
Closed, ResolvedPublic

Description

Author: anon.hui

Description:
According to includes/MimeMagic.php:

// Check for ZIP (before getimagesize)
if ( strpos( $tail, "PK\x05\x06" ) !== false ) {
        wfDebug( __METHOD__.": ZIP header present at end of $file\n" );
        return $this->detectZipType( $head );
}

Some xls (MS Excel) files contain "PK\x05\x06", so they are incorrectly detected as ZIP files and fail the file type check.

Here is the debugging message shown with $wgDebugComments=true:

mime: <application/zip> extension: <xls>

UploadForm::verifyExtension: mime type application/zip mismatches file extension xls, rejecting file
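
To illustrate why the check above fires, here is a minimal Python sketch (not MediaWiki's PHP code). It looks for the ZIP end-of-central-directory signature in the tail of a file, the same test as the strpos() call. The helper names, the tail size, and the sample bytes are assumptions made up for the example; the real uploaded file is not available in this report.

```python
import io
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"  # ZIP end-of-central-directory signature


def tail_has_zip_marker(data: bytes, tail_size: int = 65558) -> bool:
    """Mimic the strpos() test: is the EOCD signature in the file's tail?

    tail_size is an assumed value, not the one MediaWiki uses.
    """
    return EOCD_SIGNATURE in data[-tail_size:]


def make_zip(entries: dict) -> bytes:
    """Build a small in-memory ZIP archive (helper for the example)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in entries.items():
            zf.writestr(name, content)
    return buf.getvalue()


# Hypothetical stand-ins: arbitrary bytes, with and without a ZIP appended.
plain = b"\x00" * 128
prefixed = plain + make_zip({"a.txt": "hello"})

print(tail_has_zip_marker(prefixed))  # True
print(tail_has_zip_marker(plain))     # False
```

Any file whose tail contains that signature is routed to ZIP detection, whatever its extension says.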

Version: 1.14.x
Severity: normal

Details

Reference
bz23642

Event Timeline

bzimport raised the priority of this task from to Medium. (Nov 21 2014, 11:10 PM)
bzimport set Reference to bz23642.
bzimport added a subscriber: Unknown Object (MLST).

anon.hui wrote:

MimeMagic::doGuessMimeType: ZIP header present at end of /tmp/php4korsl
MimeMagic::detectZipType: /^mimetype(application\/vnd\.oasis\.opendocument\.(?:chart|chart-template|formula|formula-template|graphics|graphics-template|image|image-template|presentation|presentation-template|spreadsheet|spreadsheet-template|text|text-template|text-master|text-web))/
MimeMagic::detectZipType: unable to identify type of ZIP archive
MimeMagic::guessMimeType: final mime type of /tmp/php4korsl: application/zip
MediaHandler::getHandler: no handler found for application/zip.
File::getPropsFromPath: /tmp/php4korsl loaded, 43008 bytes, application/zip.
MacBinary::loadHeader: header bytes 0 and 74 not null
MimeMagic::doGuessMimeType: ZIP header present at end of /tmp/php4korsl
MimeMagic::detectZipType: /^mimetype(application\/vnd\.oasis\.opendocument\.(?:chart|chart-template|formula|formula-template|graphics|graphics-template|image|image-template|presentation|presentation-template|spreadsheet|spreadsheet-template|text|text-template|text-master|text-web))/
MimeMagic::detectZipType: unable to identify type of ZIP archive
MimeMagic::guessMimeType: final mime type of /tmp/php4korsl: application/zip

mime: <application/zip> extension: <xls>

UploadForm::verifyExtension: mime type application/zip mismatches file extension xls, rejecting file
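
The log shows the OpenDocument pattern failing to match. A rough Python sketch of that kind of check follows. It assumes the pattern is applied just past the 30-byte fixed part of the first local file header, which works for OpenDocument because its "mimetype" entry is stored first and uncompressed. The regex is a simplified version of the one in the log, and all function names are hypothetical.

```python
import io
import re
import zipfile

# Simplified form of the pattern from the log (assumption: any lowercase subtype).
ODF_RE = re.compile(rb"^mimetype(application/vnd\.oasis\.opendocument\.[a-z-]+)")


def detect_odf(head: bytes):
    """Return the OpenDocument MIME type found at the start of the archive, or None."""
    # Bytes 0..29 are the fixed local file header; the entry name comes next,
    # followed (with no extra field, stored uncompressed) by the entry content.
    match = ODF_RE.match(head[30:])
    return match.group(1).decode("ascii") if match else None


def make_zip(entries: list) -> bytes:
    """Build an in-memory ZIP with stored (uncompressed) entries, in order."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for name, content in entries:
            zf.writestr(name, content)
    return buf.getvalue()


odf_like = make_zip([
    ("mimetype", "application/vnd.oasis.opendocument.spreadsheet"),
    ("content.xml", "<doc/>"),
])
ooxml_like = make_zip([("[Content_Types].xml", "<Types/>")])

print(detect_odf(odf_like))    # application/vnd.oasis.opendocument.spreadsheet
print(detect_odf(ooxml_like))  # None
```

An Office Open XML package has no such "mimetype" entry, so a check of this kind falls through to plain application/zip.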

Are you sure it isn't a zip? From what I understand, Excel 2007's default format (.xlsx, but I don't think Excel cares what the extension is) is actually a zip archive.

Indeed, they are

And Version 1.14?

anon.hui wrote:

I'm not sure what format, but I can unzip it.

  1. The xml file inside the zip contains xmlns="http://schemas.openxmlformats.org/package/2006/content-types"
  2. The file can be opened by openoffice.org 2.4

anon.hui wrote:

  1. When I run unzip (on GNU/Linux), it says:

    warning [filename.xls]: 22287 extra bytes at beginning or within zipfile
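
For reference, a figure like this can be derived from the end-of-central-directory record alone. The Python sketch below is an illustration, not unzip's actual implementation: it compares where the record really sits with where the recorded central directory offset and size say it should sit. The sample archive is synthetic and the function name is made up.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"


def leading_extra_bytes(data: bytes) -> int:
    """Estimate how many bytes precede the archive proper."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    # EOCD fields: signature, disk number, central directory start disk,
    # entries on this disk, total entries, central directory size,
    # central directory offset, comment length.
    (_sig, _disk, _cd_disk, _n_disk, _n_total,
     cd_size, cd_offset, _comment_len) = struct.unpack_from("<4sHHHHIIH", data, pos)
    # Without extra data, the record starts right after the central directory.
    return pos - (cd_offset + cd_size)


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
archive = buf.getvalue()

print(leading_extra_bytes(archive))                      # 0
print(leading_extra_bytes(b"\x00" * 22287 + archive))    # 22287
```

This simple version ignores ZIP64 archives and a signature that happens to appear inside an archive comment.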

anon.hui wrote:

  1. The file can be opened by openoffice.org 2.0 (ubuntu 6.06)

Can this be duplicated on 1.15/1.16?

If not, it can be closed

anon.hui wrote:

I think it can be duplicated on the latest version, since the logic in MimeMagic::doGuessMimeType() looks the same (when comparing 1.14 to svn).

According to comments in the code, MimeMagic::detectZipType only supports OpenDocument files, so it's not surprising that Office Open XML can't be detected. (As a sidenote: why must all these formats be named so similarly?)

This is a difficult one to work around. I guess we could scan for [Content_Types].xml in the file, which would identify it as an Open Packaging Conventions ZIP file. The only way to then set the mime type correctly is by relying on the filename extension as far as I can see, because it is just a zip archive and can contain anything basically.

I'm not sure whether [Content_Types].xml is always the first file in the zip, however, so we may have to read the entire directory listing of the zip archive.
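
A minimal Python sketch of the directory-listing approach, for illustration only: it reads the archive's central directory and checks for a [Content_Types].xml entry anywhere in it, so the entry's position does not matter. The function name is hypothetical.

```python
import io
import zipfile

OPC_MARKER = "[Content_Types].xml"


def is_opc_package(data: bytes) -> bool:
    """True if the data is a ZIP archive that lists [Content_Types].xml."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return OPC_MARKER in zf.namelist()
    except zipfile.BadZipFile:
        return False


def make_zip(names: list) -> bytes:
    """Build an in-memory ZIP with empty-ish entries in the given order."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, "<x/>")
    return buf.getvalue()


# The marker is deliberately not the first entry here.
print(is_opc_package(make_zip(["xl/workbook.xml", OPC_MARKER])))  # True
print(is_opc_package(make_zip(["readme.txt"])))                   # False
print(is_opc_package(b"not a zip"))                               # False
```

As the comment above notes, this only identifies the container; it says nothing about which Office format is inside.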

christian wrote:

Added myself to CC list.

Bryan.TongMinh wrote:

Tweaking summary accordingly

Created attachment 7484
Detect mime types for openxml files

This is my idea for fixing this problem.

I introduce a new mime type, application/x-opc+zip.
This mimetype basically means "Open Packaging Conventions" archive and is a private mimetype that I came up with.

When initially checking for mimetype, we detect that this is an OPC file, and we use the extension to guess what type of OPC file. Then on the verify pass (where guessing based on extension is not allowed), we detect that the file is an OPC archive. We then check if the file extension (docx for instance) is an allowed file extension for this filetype, and we check if opc files are on the mime blacklist.

File entries are stored into the database with their 'proper' MS Office mimetype. Normally the OPC filetype should not ever be served, unless people disable mimeverification.
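
The two passes described above could be sketched roughly as follows. This is a Python illustration of the idea, not the code from the attachment or from MediaWiki. The function names and the extension table are assumptions; the table is a small subset chosen for the example.

```python
OPC_MIME = "application/x-opc+zip"  # the private type proposed above

# Illustrative subset of extensions and their Office Open XML MIME types.
OPC_EXTENSION_TYPES = {
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def guess_mime(is_opc: bool, extension: str) -> str:
    """Initial pass: the extension may be used to refine an OPC archive."""
    if not is_opc:
        return "application/zip"
    return OPC_EXTENSION_TYPES.get(extension.lower(), OPC_MIME)


def verify_upload(is_opc: bool, extension: str, mime_blacklist: set) -> bool:
    """Verify pass: content only tells us "OPC"; check extension and blacklist."""
    if not is_opc:
        return False
    if OPC_MIME in mime_blacklist:
        return False
    return extension.lower() in OPC_EXTENSION_TYPES


print(guess_mime(True, "docx"))
print(verify_upload(True, "docx", set()))        # True
print(verify_upload(True, "docx", {OPC_MIME}))   # False
print(verify_upload(True, "exe", set()))         # False
```

The value returned by guess_mime is what would be stored, so the private OPC type only appears when the extension is not recognised.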

Done in r68279

Note that, as with any zip files, if you allow these files on your server, you potentially allow GIFAR-like attacks on clients who do not have up-to-date JVMs.

We faced the same issue (T213841) with CBZ (Comic Book Archive files) and ePub.

So a solution that allows additional types to be detected is favourable.

You've left a comment on a ticket that has been closed for 10 years. I would highly suggest creating a new ticket.