Page MenuHomePhabricator

Uploading MS Word files doesn't work ("File extension does not match the detected MIME type of the file")
Closed, DeclinedPublic

Description

Uploading a .doc file to MW 1.19.1 (which, incidentally, isn't listed in the versions box here), I get the following error:

File extension ".doc" does not match the detected MIME type of the file (application/zip).

I tried removing "doc" from application/msword in /includes/mime.types and then also adding it to application/zip (recommended at [1]). The former edit did nothing; the latter resulted in:

The file is a corrupt or otherwise unreadable ZIP file. It cannot be properly checked for security.

I tried keeping the above modifications to /includes/mime.types and adding $wgAllowJavaUploads = true; to LocalSettings.php, as recommended at [2], and uploading of these files works correctly.

However, this seems pretty hacky. Surely allowing .doc files in $wgFileExtensions should be enough, and they shouldn't be treated as zip files or jars or anything?

[1] http://www.mediawiki.org/wiki/Manual_talk:Mime_type_detection#Fix_for_Uploading_MS_Word_2007_.28and_greater.29_Files
[2] http://www.mediawiki.org/wiki/Thread:Talk:MediaWiki_1.18/file_upload_error/reply


Version: 1.19.1
Severity: normal
See Also:

Details

Event Timeline

bzimport raised the priority of this task from to Low.Nov 22 2014, 12:45 AM
bzimport set Reference to bz38432.

The problem is that modern .doc files ARE actually zip files. Can you attach a specific file that has this behavior and detail the environment that you are running Mediawiki on (OS, PHP version etc ?)

It's probably one of the external tools that is incorrectly identifying this as a zip file, so you will need to tweak something in the environment if you want Mediawiki to be able to properly identify the file.

(In reply to comment #1)

The problem is that modern .doc files ARE actually zip files. Can you attach a
specific file that has this behavior and detail the environment that you are
running Mediawiki on (OS, PHP version etc ?)

Attached is a file that gives the "The file is a corrupt or otherwise unreadable ZIP file. It cannot be properly checked for security" error under the following configuration: 'doc' is listed as an extension for both application/msword and application/zip and $wgAllowJavaUploads is false (well, not set in LocalSettings.php, to be precise).

Environment is: Windows Server 2008, 64-bit; PHP 5.3.10; MediaWiki 1.19.1.

It's probably one of the external tools that is incorrectly identifying this as
a zip file, so you will need to tweak something in the environment if you want
Mediawiki to be able to properly identify the file.

Let me know if I can provide any more information. And thank you for your help!

A sample MS Word document that is giving the described error on upload.

Attached:

sumanah wrote:

Does this also happen on the currently running version of MW on WMF sites? And does it happen on an install of master?

None of the WMF sites permit the upload of MS Word files (so far as I know).

I'm installing from Git now, to let you know if it works on master.

Okay, so uploading the above file to MW 1.20alpha (7ab935b) on PHP 5.3.8 and Apache still gives:

The file is a corrupt or otherwise unreadable ZIP file. It cannot be
properly checked for security.

$ unzip test.doc

Archive: test.doc
warning [test.doc]: 6034 extra bytes at beginning or within zipfile

(attempting to process anyway)

I'm guessing it's these extra bytes, that are confusing our parser.

The file starts with d0 cf 11 e0 a1 b1 1a e1

Which is the header for the old .doc 2003 filetype. This file is probably saved in compatibility mode as a 2003 .doc file, with an internal 2010 .docx file. Apparently our code finds the .zip header before it finds the .doc header.

For future reference, info with even per version signature regexps of .doc files

http://beta.domd.info/category/mime-types/applicationmsword

Created attachment 10936
Patch to bypass xml document parser

A patch that bypasses the zip format detector in case the file starts with an office Compound Document Format header.

This isn't a working patch yet, because the upload subsequently fails in the JAVA detector, with:
ZipDirectoryReader: Fatal error: trailing bytes after the end of the file comment

I'm not sure if this error is required to be fatal, it will have to be checked with Tim Starling.

Attached:

Thanks for the patch, Derk-Jan.

You can use Developer access

https://www.mediawiki.org/wiki/Developer_access

to submit this as a Git branch directly into Gerrit:

https://www.mediawiki.org/wiki/Git/Tutorial

Putting your branch in Git makes it easier for us to review it quickly.
Thanks again for your contribution!

  • Bug 34797 has been marked as a duplicate of this bug. ***

Has this patch been submitted in gerrit yet? It looks about right, so I don't think there should be a problem getting it merged. Feel free to add me as a reviewer.

I've got a small patch that you can look at: gerrit change If30b53dd

This WFM, but needs work.

Created attachment 12192
An example java applet that would get through with this patch

These tests are important to prevent uploading of java applets. Its fairly easy to make a java applet with the msword header, I've attached a "hello world" example. If you have a jar file handy, just prepend "\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1" to the beginning and run zip -A foo.jar (Using the standard zip utility that is usually on linux computers). If you really wanted you could probably even make a java applet that is a valid ms word doc.


To fix this we would probably need to validate both parts of the file independantly (?). This is very similar to the issue with mixed pdf and odf files.

Attached:

https://gerrit.wikimedia.org/r/44379 (Gerrit Change If30b53dd9c05d92e64b893471b881ee34590ee5d) | change ABANDONED [by TheDJ]

Patch abandoned over a year ago.

fwiw, bug 31930 and bug 54105 suggest there are other false positives in the 'Prevent Java' detection algorithm.

From logs while uploading

MimeMagic::doGuessMimeType: analyzing head and tail of \tmp\php2FEC.tmp for magic numbers.
MimeMagic::doGuessMimeType: ZIP header present in \tmp\php2FEC.tmp
MimeMagic::detectZipType: detected a MS Office document with OPC trailer
MimeMagic::guessMimeType: guessed mime type of \tmp\php2FEC.tmp: application/msword
MimeMagic::improveTypeFromExtension: improved mime type for .doc: application/msword
MediaHandler::getHandler: no handler found for application/msword.
FSFile::getProps: \tmp\php2FEC.tmp loaded, 22016 bytes, application/msword.
mime: <application/msword> extension: <doc>
UploadBase::detectScript: checking for embedded scripts and HTML stuff
UploadBase::detectScript: no scripts found
ZipDirectoryReader: Fatal error: trailing bytes after the end of the file comment

It is possible to open the file with 7z, so there seems not be a problem with the file

Maybe there is a wrong asset in ZipDirectoryReader

oh guys, this bug from 2012 is still sitting in there even in 2019 in the latest production version
what a shame! :(
one really cannot upload xls files to mediawiki? wtf?
what about the java applets? nobody is ever using them and no browser supports them today.

@Pavel.petrovic: Welcome to Wikimedia Phabricator. No need for swearing. :) Different people and different teams may have different priorities given their limited resources. Please feel free to help by testing and improving the proposed patches linked above.

@Aklapper: it's neither helpful nor constructive to call out "wtf" as swearing. You're quite aware of this.

Change 484347 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] ZipDirectoryReader: Do not generate an error if the EOCDR unexpectedly long

https://gerrit.wikimedia.org/r/484347

I can't recall having seen this bug before. It is the simplest thing to fix. The patch is untested but probably works. Sorry @Pavel.petrovic, I don't know why it has taken so long.

Most comments seem to miss the point of ZipDirectoryReader. @TheDJ's patch is insecure, since Java will load classes from a file even if it has an MS Word header. The issue is that it is possible for a file to simultaneously be a valid MS Word file and a valid JAR file. That is to say, a maliciously uploaded JAR file could easily be made to look like an MS Word file. Skipping JAR detection when the file appears to be an MS Word file would defeat the whole purpose of ZipDirectoryReader.

Thank you people for fast reactions. But I am a bit confused by this conversation.

To summarize:

  • TheDJ proposed some patch, but at the same time said "This isn't a working patch yet",
  • despite of that Aklapper proposed to push it
  • MarkAHershberger came up with some other patch which he did not mention what sort of patch it is in this conversation, whether it addresses those problems mentioned or not, but yet, he said that it still needs work
  • other people mentioned various issues
  • yet, Aklapper is proposing me to test those patches without mentioning whether those other issues were addressed or not
  • tstarling says that it is the simplest thing to fix, and the bot reports there is some related patch somewhere... but it is not mentioned whether the issues were addressed and it is safe to apply this patch

I'd like to help, but having read this, I do not feel like testing anything unless there is someone here who is in full control of the situation.

You can apply the patch and test it. ZipDirectoryReader has changed very little since I introduced it in February 2011 to fix T26230. Based on the quoted debug log snippets, the issue is that Microsoft has extended the end of central directory record, adding new fields to the end of it. When I wrote ZipDirectoryReader, I had it flag any deviation from the ZIP specification as an error, since that seemed to be the safest way to flag potentially malicious files. For example, it is possible to deviate from the spec in ways that make files visible to one ZIP library but invisible to another. But extending the EOCDR does not help an attacker in any obvious way, so this particular error is needlessly paranoid. That's why I propose removing it in my patch.

Thanks. I see that the automated testing generated a failure. Is that because the patch modifies a place in the code that the tests used to verify and now those tests fail because they were not modified yet and you want somebody to test the patch first before you go ahead with also modifying the corresponding automated tests?
(I am sorry for a newbie question...)

I must be missing something, but after removing those three lines from the ZipDirectoryReader.php on my private wiki, attempt to upload .xls file
(this one: http://www.nczisk.sk/Documents/nzr/2018/ICD_3_1.xls) results in:

Upload warning
The file is a corrupt or otherwise unreadable ZIP file. It cannot be properly checked for security.

(ticking the checkbox

  • Ignore any warnings

has no outcome on this result).

I have cleared the cache of the browser, getting the same error.

OK, now that I have looked at the test cases, I see that I misdiagnosed the issue. The test cases are not ODF/ZIP files deviating from the spec, they are both OLE compound binary files with embedded ZIP files. The ZIP signature triggers ZIP detection when it is seen less than 64KB from the end. Since the only thing that really identifies a ZIP file is the EOCDR occurring exactly at the end of the file, we could instead treat the "trailing bytes" error as indicating that the file is not a ZIP file.

To make things extra complicated, we do actually have some MS Office file type detection code in MimeAnalyzer, but it is only triggered when there is an EOCDR in the last 64KB of the file. It fails to detect the type of the ICD_3_1.xls file, so that file is detected as application/zip, based on the misplaced EOCDR. It identifies Test.doc correctly, so it gives a type of application/msword for that file. But if we fix the ZIP detection code to only detect actual ZIP files, then it fails for both test cases.

https://www.mediawiki.org/wiki/Manual_talk:MIME_type_detection#Fix_for_MS_Excel_2003_file_saved_from_MS_Office_2007 has a list of stupid hacks to work around this broken file type detection.

A fairly simple approach to detect the type of one of these files would be to read the CLSID on the root directory entry. I have some draft code which does this. Archive Team have listed a few CLSIDs by application. The only problem is that this CLSID is mostly unused and may not be written correctly by non-Microsoft applications.

A more robust approach would be to read the whole CFB directory and to detect presence/absence of identifying storage names, such as "Workbook" and "WordDocument". That would require more code.

I added a CFB directory reader as a new patchset to the existing change. It should hopefully fix the issue, and any other failures to identify old MS Office files.

Change 484347 merged by jenkins-bot:
[mediawiki/core@master] Better detection for old MS Office files

https://gerrit.wikimedia.org/r/484347

For some reason, the attached test docx file (I obtained from the vendor/ruflin/elastica tree in MediaWiki 1.28) causes the error: "File extension ".docx" does not match the detected MIME type of the file (application/zip)." on my REL1_34 MediaWiki

Meanwhile, another testing.docx file (from the Querypath project) is uploaded successfully and identified by MIME type: application/vnd.openxmlformats-officedocument.wordprocessingml.document

I can verify that the code changes are present

[root@wiki public]# grep -r detectMicrosoftBinaryType /opt/htdocs/mediawiki/includes/libs/ 
/opt/htdocs/mediawiki/includes/libs/mime/MimeAnalyzer.php:                      return $this->detectMicrosoftBinaryType( $f );
/opt/htdocs/mediawiki/includes/libs/mime/MimeAnalyzer.php:      function detectMicrosoftBinaryType( $handle ) {

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/484347/ has been merged, hence removing Patch-For-Review.

@MarkAHershberger: This task has been assigned to you a while ago. Is there still more work to do? If yes, do you still plan to work on this task, or do you need any help? If this task is not resolved and only if you do not plan to work on this task: Please consider removing yourself as assignee (via Add Action...Assign / Claim in the dropdown menu). Thanks! :)

Un-assigning myself. @freephile, since you are the most recent to comment an the issue, can you tell us what issues, if any, still remain in REL1_36?

I'll try to reproduce this issue on a current MediaWiki and report back.

I setup a default MediaWiki-Docker environment:

Product	Version
MediaWiki	1.37.0-alpha (d1219fa)
17:56, 2 June 2021
PHP	7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b+wmf1 (fpm-fcgi)
SQLite	3.16.2
ICU	57.1

And modified the upload settings

$wgEnableUploads = true;

$wgFileExtensions = array_merge( $wgFileExtensions,
    array( 'doc', 'docx', 'xls', 'mpp', 'pdf', 'ppt', 'xlsx', 'jpg', 
        'tiff', 'odt', 'odg', 'ods', 'odp'
    )
);

I still ran into the mime-type error for the 'test.docx' file above. However, after opening the file in Microsoft Word for Mac version 16.49 (2021) and saving it (no edits), the file uploaded into MediaWiki with no errors and the mime-type was correctly identified.

From MediaWiki:

Test.docx ‎(file size: 12 KB, MIME type: application/vnd.openxmlformats-officedocument.wordprocessingml.document)

I conclude that the problem is with the original file itself. The file format for docx is basically a zip file and somehow the container for the 'test.docx' file is funky. Using the command-line file utility, the mime-type is correctly identified both before and after re-saving it in Word. But only after re-saving it, would it upload into MediaWiki.

file --version
file-5.37
magic file from /usr/share/file/magic

file --mime test.docx 
test.docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document; charset=binary

I would close this issue as "Won't fix" because I believe it has more to do with the old Microsoft Word than mime-detection in MediaWiki.

Thanks for looking into this. I agree that it sounds likely to be related to an ancient Word version.

Luckily, I no longer have any need to upload .doc files to MediaWiki, so don't really mind if it doesn't work. I have a vague memory that I did in the end manage to convince the people using that wiki that it was a good idea to edit wiki pages directly, instead of just attaching text documents to them. :-)

i am facing the same problem. an issue seems to be the order of the files present in the zip archive (see https://stackoverflow.com/questions/7274030/detect-excel-xlsx-file-mimetype-via-php).

with the test.docx above i get the order below. a docx that passes the mime type check with the openXMLRegex in MimeAnalyzer.php has the [Content_Types].xml reference in the first 1024 bytes of the file. the test.docx file does not. its reference is at the end of the file (shortly after 1bc0 hex bytes).

hexdump -C test.docx | grep "Content" 
00001bc0  00 5b 43 6f 6e 74 65 6e  74 5f 54 79 70 65 73 5d  |.[Content_Types]|
unzip -l test.docx
Archive: test.docx
  Length      Date    Time    Name
---------  ---------- -----   ----
      573  07-28-2015 12:01   _rels/.rels
      306  07-28-2015 12:01   word/settings.xml
      663  07-28-2015 12:01   word/_rels/document.xml.rels
      962  07-28-2015 12:01   word/fontTable.xml
     7542  07-28-2015 12:01   word/theme/theme1.xml
    16563  07-28-2015 12:01   word/styles.xml
     1743  07-28-2015 12:01   word/document.xml
      447  07-28-2015 12:01   docProps/app.xml
      588  07-28-2015 12:01   docProps/core.xml
     1233  07-28-2015 12:01   [Content_Types].xml
---------                     -------
    30620                     10 files