
cannot upload ms word 2007 files
Closed, ResolvedPublic

Description

Author: gsa

Description:
Trying to upload a .doc file generated with Microsoft Word 2007 results in:
"The file is corrupt or has an incorrect extension"

The Logfile:
MimeMagic::doGuessMimeType: ZIP header present at end of /tmp/phpq31oe6
MimeMagic::detectZipType: /^mimetype(application\/vnd\.oasis\.opendocument\.(?:chart-template|chart|formula-template|formula|graphics-template|graphics|image-template|image|presentation-template|presentation|spreadsheet-template|spreadsheet|text-template|text-master|text-web|text))/
MimeMagic::detectZipType: unable to identify type of ZIP archive
MimeMagic::guessMimeType: final mime type of /tmp/phpq31oe6: application/zip

mime: <application/zip> extension: <doc>

UploadForm::verifyExtension: mime type application/zip mismatches file extension doc, rejecting file
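
The log above shows the file being classified as a ZIP because a ZIP header was found at the end of it. A minimal Python sketch of that kind of tail check follows. It is not MediaWiki's actual MimeMagic code, and the name `has_zip_trailer` is made up for illustration. It only looks for the ZIP end-of-central-directory signature in the last part of the data.

```python
EOCD_SIGNATURE = b"PK\x05\x06"  # ZIP end-of-central-directory signature
EOCD_MIN_SIZE = 22              # fixed-size part of the record, without comment
MAX_COMMENT = 0xFFFF            # the ZIP comment length field is 16 bits wide


def has_zip_trailer(data: bytes) -> bool:
    """Return True if a ZIP end-of-central-directory record sits in the tail.

    Hypothetical helper: it inspects only the last 22 + 65535 bytes, which is
    as far back as the record can start in a well-formed archive.
    """
    tail = data[-(EOCD_MIN_SIZE + MAX_COMMENT):]
    pos = tail.rfind(EOCD_SIGNATURE)
    # The signature must be followed by at least the fixed-size record.
    return pos != -1 and len(tail) - pos >= EOCD_MIN_SIZE
```

A check like this says nothing about what precedes the ZIP structure, so a legacy .doc with a ZIP appended looks the same as a plain archive.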

This seems to be known, as http://www.mediawiki.org/wiki/Manual:$wgMimeDetectorCommand states "For example, 1.15.3 may misdetect .doc-files from MS Word 2007 as ZIP files", but I cannot find a corresponding bug.

The fixes for bugs 23688, 23642, and 18684 do not solve the problem.


Version: 1.15.x
Severity: minor

Details

Reference
bz24073

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:00 PM
bzimport set Reference to bz24073.
bzimport added a subscriber: Unknown Object (MLST).

Bryan.TongMinh wrote:

Can you try the current 1.17alpha SVN version?

cc. TheDJ

gsa wrote:

Behaviour does not change with 1.17alpha:

FileCache negative MISS for Testbericht_V02.doc
File::getPropsFromPath: Getting file info for /tmp/phpW9YCXV
MimeMagic::construct: loading mime types from /magwien/var/gondor-phpserver/html/wiki-ma48/includes/mime.types
MimeMagic::construct: loading mime info from /magwien/var/gondor-phpserver/html/wiki-ma48/includes/mime.info
MimeMagic::doGuessMimeType: ZIP header present at end of /tmp/phpW9YCXV
MimeMagic::detectZipType: /^mimetype(application\/vnd\.oasis\.opendocument\.(?:chart-template|chart|formula-template|formula|graphics-template|graphics|image-template|image|presentation-template|presentation|spreadsheet-template|spreadsheet|text-template|text-master|text-web|text))/
MimeMagic::detectZipType: unable to identify type of ZIP archive
MimeMagic::guessMimeType: final mime type of /tmp/phpW9YCXV: application/zip
MediaHandler::getHandler: no handler found for application/zip.
File::getPropsFromPath: /tmp/phpW9YCXV loaded, 453632 bytes, application/zip.
MacBinary::loadHeader: header bytes 0 and 74 not null
MimeMagic::doGuessMimeType: ZIP header present at end of /tmp/phpW9YCXV
MimeMagic::detectZipType: /^mimetype(application\/vnd\.oasis\.opendocument\.(?:chart-template|chart|formula-template|formula|graphics-template|graphics|image-template|image|presentation-template|presentation|spreadsheet-template|spreadsheet|text-template|text-master|text-web|text))/
MimeMagic::detectZipType: unable to identify type of ZIP archive
MimeMagic::guessMimeType: final mime type of /tmp/phpW9YCXV: application/zip

mime: <application/zip> extension: <doc>

UploadForm::verifyExtension: mime type application/zip mismatches file extension doc, rejecting file

The extension for MS Office 2007 OpenXML documents is .docx, not .doc.

For this to work:

  • Rename the file to its proper file extension.
  • Use a 1.17 checkout.
  • Override $wgMimeTypeBlacklist so that application/x-opc+zip is not in the list.
  • Add .docx to the list of allowed file extensions in $wgFileExtensions.

Although I have to say that I'm expecting to see "detected an Open Packaging Conventions archive:" for these types of files in debug.

gsa wrote:

word 2007 testdocument

attachment Testbericht_V02.docx ignored as obsolete

gsa wrote:

I did it exactly as you described. Here is the debug output:

File::getPropsFromPath: Getting file info for /tmp/phplRPqef
MimeMagic::construct: loading mime types from /magwien/var/gondor-phpserver/html/mwiki/includes/mime.types
MimeMagic::construct: loading mime info from /magwien/var/gondor-phpserver/html/mwiki/includes/mime.info
MimeMagic::doGuessMimeType: ZIP header present at end of /tmp/phplRPqef
MimeMagic::detectZipType: /^mimetype(application\/vnd\.oasis\.opendocument\.(?:chart-template|chart|formula-template|formula|graphics-template|graphics|image-template|image|presentation-template|presentation|spreadsheet-template|spreadsheet|text-template|text-master|text-web|text))/
MimeMagic::detectZipType: unable to identify type of ZIP archive
MimeMagic::guessMimeType: final mime type of /tmp/phplRPqef: application/zip
MediaHandler::getHandler: no handler found for application/zip.
File::getPropsFromPath: /tmp/phplRPqef loaded, 453632 bytes, application/zip.
MacBinary::loadHeader: header bytes 0 and 74 not null
MimeMagic::doGuessMimeType: ZIP header present at end of /tmp/phplRPqef
MimeMagic::detectZipType: /^mimetype(application\/vnd\.oasis\.opendocument\.(?:chart-template|chart|formula-template|formula|graphics-template|graphics|image-template|image|presentation-template|presentation|spreadsheet-template|spreadsheet|text-template|text-master|text-web|text))/
MimeMagic::detectZipType: unable to identify type of ZIP archive
MimeMagic::guessMimeType: final mime type of /tmp/phplRPqef: application/zip

mime: <application/zip> extension: <docx>

UploadBase::verifyExtension: mime type application/zip mismatches file extension docx, rejecting file

Perhaps you can take a look at the attached test document; maybe it is not in the format you expect.

overlordq wrote:

(In reply to comment #3)

Although I have to say, that i'm expecting to see "detected an Open Packaging
Conventions archive:" for these types of files in debug.

That'd be kinda hard to do since it's just a zip file, it'd have to look inside the file to determine if it's just a zip or if it's a 'special' zip. That just opens a whole 'nother can of worms.

I looked at this file, but it doesn't seem to be an OpenXML file to me. It will take some time to figure out what is going on. (A zipped .doc, perhaps?)

overlordq wrote:

Actual docx file

Testbericht_V02.docx: Microsoft Office Document

If you rename it to .doc, it opens fine in Word, so I'm thinking it's a normal Word document that was resaved as "Word Document" in Word 2007, and now it identifies as

Testbericht_V02.docx: Zip archive data, at least v2.0 to extract


I think that when saving in the old format, Word 2007 creates a kind of mixed format, by appending a zip structure to the .doc format.
warning [Testbericht_V02.docx]: 430308 extra bytes at beginning or within zipfile
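
The "extra bytes at beginning" figure in a warning like the one above can be derived from the ZIP end-of-central-directory record: whatever lies before the archive's own data is the prefix. The Python sketch below shows the arithmetic. The function name `zip_prefix_length` is hypothetical, and the sketch assumes a single-disk archive without ZIP64 extensions.

```python
import struct
from typing import Optional

EOCD_SIGNATURE = b"PK\x05\x06"
# signature, disk no., central-directory disk, entries on disk, total entries,
# central-directory size, central-directory offset, comment length
EOCD_FORMAT = "<4sHHHHIIH"
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)  # 22 bytes


def zip_prefix_length(data: bytes) -> Optional[int]:
    """Return the number of non-ZIP bytes in front of an appended archive.

    Returns None if no usable end-of-central-directory record is found.
    """
    pos = data.rfind(EOCD_SIGNATURE)
    if pos == -1 or pos + EOCD_SIZE > len(data):
        return None
    fields = struct.unpack_from(EOCD_FORMAT, data, pos)
    cd_size, cd_offset = fields[5], fields[6]
    # The central directory ends where the EOCD record starts. Offsets inside
    # the archive are relative to its own start, so the difference is the
    # length of whatever was placed in front of it.
    prefix = pos - cd_size - cd_offset
    return prefix if prefix >= 0 else None
```

For a file that is a legacy .doc followed by a ZIP, the returned value would be the size of the .doc part.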

Also see bug 23642 comment 5.

Platonides is right. Basically, 2007 saves a .doc file, but appends a .zip with OPC index to it.

I'll add a check for this, by scanning for the magic bytes of older MS Office documents in some way.

http://www.garykessler.net/library/file_sigs.html
MSOffice header: D0 CF 11 E0 A1 B1 1A E1

Office subheaders at bytepos 512

EC A5 C1 00 [512 byte offset]
DOC Word document subheader (MS Office)

FD FF FF FF nn 00 00 00 [512 byte offset]
PPT PowerPoint presentation subheader (MS Office)
(where nn has been seen with values 0x0E, 0x1C, and 0x43)

FD FF FF FF nn 00 [512 byte offset] or
FD FF FF FF nn 02 [512 byte offset]
XLS Excel spreadsheet subheader (MS Office)
(where nn = 0x10, 0x1F, 0x22, 0x23, 0x28, or 0x29)
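
As a rough illustration of how the signatures listed above could be checked, here is a small Python sketch. It is not the check that ended up in MediaWiki. The function name `guess_legacy_office_type` and its return values are invented for this example, and the `nn` values are only the ones quoted above, so real files may carry others.

```python
from typing import Optional

OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # MS Office header
SUBHEADER_OFFSET = 512                          # subheaders sit at byte 512

DOC_SUBHEADER = bytes.fromhex("ECA5C100")
FD_SUBHEADER = bytes.fromhex("FDFFFFFF")
PPT_NN = {0x0E, 0x1C, 0x43}                     # values "seen" for PowerPoint
XLS_NN = {0x10, 0x1F, 0x22, 0x23, 0x28, 0x29}   # values listed for Excel


def guess_legacy_office_type(data: bytes) -> Optional[str]:
    """Return 'doc', 'xls', 'ppt', 'unknown', or None if there is no MS Office header."""
    if not data.startswith(OLE_MAGIC):
        return None
    sub = data[SUBHEADER_OFFSET:SUBHEADER_OFFSET + 8]
    if sub[:4] == DOC_SUBHEADER:
        return "doc"
    if sub[:4] == FD_SUBHEADER and len(sub) >= 6:
        nn = sub[4]
        # XLS: FD FF FF FF nn 00 or FD FF FF FF nn 02
        if nn in XLS_NN and sub[5] in (0x00, 0x02):
            return "xls"
        # PPT: FD FF FF FF nn 00 00 00
        if nn in PPT_NN and sub[5:8] == b"\x00\x00\x00":
            return "ppt"
    return "unknown"
```

Because the check reads only the start of the file, a ZIP structure appended at the end would not change its result.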

Should we really be doing this? We don't allow OpenOffice files, which are also zips, because of security vulnerabilities, so it would be a bit weird if we preferred Word over OO.

Bryan.TongMinh wrote:

(In reply to comment #11)

Should we really be doing this? we don't allow openoffice files which are also
zips because of security vulnerabilities which would be a bit weird if we
preferred Word over OO.

Users who wish to enable OpenXML files, should be able to do so, just like with OpenOffice now.

I got this working, but it is starting to become a bit of a mess. I'm considering introducing a new configuration variable to allow/disallow all zip types, because I already have:

ODF, OpenXML, and MS Office + OPC zip trailer, and setting all that up will become more difficult with each additional zip type. With a separate option, we could just remove the zip and the fake OPC mime from the mime blacklist. A separate config option would also make documenting and explaining the risks of zip-based file formats on open websites a lot easier, I think.

$wgAllowZipFilesWhichCouldCompromiseMyUsers ?

I'd like to have Special:Upload ask to remove the (apparently useless) zip trailer.

(In reply to comment #14)

$wgAllowZipFilesWhichCouldCompromiseMyUsers ?

I'd like to have Special:Upload ask to remove the (apparently useless) zip
trailer.

Would that not damage the files if people wanted to download and reopen them, some systems are very pedantic about the formatting of their files?

gsa wrote:

Microsoft seems to create different .doc formats (2003, 2003 from 2007). Should this not simply be seen as a Microsoft bug, and no longer as a MediaWiki issue?

(In reply to comment #15)

(In reply to comment #14)

$wgAllowZipFilesWhichCouldCompromiseMyUsers ?

I'd like to have Special:Upload ask to remove the (apparently useless) zip
trailer.

Would that not damage the files if people wanted to download and reopen them,
some systems are very pedantic about the formatting of their files?

If I understand correctly, the OPC trailer stores information that can not be saved in the 2003 format. So it is a method of creating a 2003 compatible file that still has all the 2007 and later features of the original file when opened in 2007 or later. Actually kinda handy I have to say.

But yes, the idea would be $wgAllowUploadsOfZipFilesBecauseItrustMyUploaders or something.

(In reply to comment #14)
Would that not damage the files if people wanted to download and reopen them,
some systems are very pedantic about the formatting of their files?

Newer versions of Word still need to open pre-2007 files, which don't have the trailer, so there are no backwards compatibility issues there.
The provided trailer contains a "font Theme". That won't be a fundamental feature in most cases, but some users might need it.

Note that while I support file stripping in certain cases, it should always happen with the uploader's consent.
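
To make the stripping idea concrete, here is a hedged Python sketch of what removing the trailer might look like. Nothing like this exists in MediaWiki as far as this thread shows. The function `strip_zip_trailer` and its `uploader_consents` flag are hypothetical, and the sketch assumes the ZIP structure is the last thing in the file, with no ZIP64 extensions.

```python
import struct

OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # MS Office header
EOCD_SIGNATURE = b"PK\x05\x06"                 # ZIP end-of-central-directory
EOCD_FORMAT = "<4sHHHHIIH"
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)       # 22 bytes


def strip_zip_trailer(data: bytes, uploader_consents: bool) -> bytes:
    """Return the data without an appended ZIP, or unchanged if not applicable."""
    # Never touch the file without consent, and only consider legacy Office files.
    if not uploader_consents or not data.startswith(OLE_MAGIC):
        return data
    pos = data.rfind(EOCD_SIGNATURE)
    if pos == -1 or pos + EOCD_SIZE > len(data):
        return data
    fields = struct.unpack_from(EOCD_FORMAT, data, pos)
    cd_size, cd_offset = fields[5], fields[6]
    # Start of the appended archive: end of central directory minus its size
    # and minus the offset it records relative to the archive's own start.
    zip_start = pos - cd_size - cd_offset
    if zip_start <= 0:
        return data
    return data[:zip_start]
```

Returning the input unchanged in every doubtful case keeps the unstripped upload path available.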

Bryan.TongMinh wrote:

(In reply to comment #18)

Note that while I support file stripping in certain cases, it should always
happen with the uploader consent.

And the user should have the possibility to upload the unstripped file (if allowed by the site administrator).

A generic upload post-processing API would be nice; other things like image rotation from EXIF info fall in that category as well.

Created attachment 7534
gifar cleanup

A patch of what I am proposing:

1: Move zip and virus checks before mime checks
2: ZIP gifar check is now separate from mime checks
3: Added $wgAllowGIFARVulnerableFiles global variable
4: Add zip mime detection support for openxml trailers on 2003 Office files.

This will let site administrators allow zip file uploads if they want to. They would still need to whitelist file types and, in the case of actual zip files, change the mime blacklist. But by setting $wgAllowGIFARVulnerableFiles=true and adding .doc, .docx, and .odt to their whitelist, they will be able to upload such files nonetheless (as well as actual GIFAR files).

We could consider expanding on this to add a "best-effort" mode to detectGIFAR(), where it will only allow opendocument/openxml files, and disallow the rest, though that is somewhat of a fake security model in my opinion.


Went with the original solution after all.

Fixed in r68873

Can people there have a look at Bug 34797 - Cannot upload Office 97-2003 DOC and XLS files?

Seems a related issue :-) Thanks!

Gilles raised the priority of this task from Medium to Unbreak Now!. Dec 4 2014, 10:27 AM
Gilles added a project: Multimedia.
Gilles moved this task from Untriaged to Done on the Multimedia board.
Gilles lowered the priority of this task from Unbreak Now! to Medium. Dec 4 2014, 11:21 AM