
Problem uploading compressed Dia file
Open, Low, Public, Feature

Description

Author: filbranden

Description:
Hello,

I've been working with the Dia extension and uploading Dia files. When saving a file in Dia, a checkbox asks whether you want to save it compressed or not. If you do not save it compressed, Dia writes an XML file with xmlns:dia="http://www.lysator.liu.se/~alla/dia/". If you save it compressed, Dia writes the same XML file compressed with gzip. Both files get the .dia extension, and Dia itself detects whether the file is compressed when it opens it.

This is the output of "file" for an uncompressed and a compressed file:

$ file *.dia
test-unc.dia: XML document text
test-cmp.dia: gzip compressed data, from Unix
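
The distinction that "file" draws here comes from the first bytes of each file: a gzip stream starts with the two magic bytes 0x1f 0x8b, while the uncompressed file starts with XML markup. A minimal sketch of that check in Python follows; the function name sniff_container is made up for illustration and is not part of MediaWiki or Dia.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"      # first two bytes of every gzip stream
UTF8_BOM = b"\xef\xbb\xbf"    # optional byte-order mark before XML text


def sniff_container(head: bytes) -> str:
    """Classify the leading bytes of a file as 'gzip', 'xml' or 'unknown'."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    # Very rough XML test: the first non-blank byte opens a tag or declaration.
    if head.lstrip().startswith(b"<"):
        return "xml"
    return "unknown"


# Illustrative sample, not a complete Dia document.
sample = (b'<?xml version="1.0" encoding="UTF-8"?>'
          b'<dia:diagram xmlns:dia="http://www.lysator.liu.se/~alla/dia/"/>')
print(sniff_container(sample))                 # xml
print(sniff_container(gzip.compress(sample)))  # gzip
```

Both results are correct as far as they go, which is exactly the problem: the magic bytes alone say nothing about what is inside the gzip stream.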

Recently MediaWiki added support for recognizing Dia files. As I understand it, if the file is XML, MediaWiki will parse the file and look for the namespace(?), and it will recognize it as a Dia file if it finds this URL: http://www.lysator.liu.se/~alla/dia/.

The problem is that it does not recognize compressed Dia files. First, it will (as expected) assign the file the MIME type application/x-gzip, and then "verifyExtension()" will not match ".dia" to application/x-gzip, so the upload is rejected with a message stating that the file is corrupted.

I've been thinking about how to solve this problem. One way would be to recognize that the file is gzipped, look inside and, if the contents look like XML, apply the existing logic that guesses the type from the XML namespace. However, that seems too complex and too much overhead for this task.

I still think that, in that particular case, the extension is the easiest way to reliably recognize a Dia file.

So I was thinking about patching MediaWiki to include a new table (like MM_WELL_KNOWN_MIME_TYPES or MM_WELL_KNOWN_MIME_INFO) with information on how to override a MIME type based on the extension of the file. The entry for Dia in this table would be something like this (not exact PHP syntax here, I'm not good at PHP):

array(
    'extension' => '.dia',
    'detected_mime' => array( 'application/x-dia-diagram', 'application/xml', 'application/x-gzip' ),
    'override_mime' => 'application/x-dia-diagram',
)

What this means is: if on a file upload MediaWiki detects that the extension is ".dia" (or, more generally, that the extension is in this table), it will check that the detected MIME type of the contents matches one of the items of the array (in the case of Dia, either an XML or a gzip-compressed file), and if it does, it will override the MIME type with application/x-dia-diagram.
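
The lookup described above can be sketched as follows. This is written in Python rather than PHP purely as an illustration of the proposed logic; the names MIME_OVERRIDES_BY_EXTENSION and resolve_mime are hypothetical and do not exist in MediaWiki.

```python
# Hypothetical override table, keyed by lower-case extension without the dot.
MIME_OVERRIDES_BY_EXTENSION = {
    "dia": {
        # MIME types content sniffing may report for a genuine Dia file.
        "detected_mime": {
            "application/x-dia-diagram",
            "application/xml",
            "application/x-gzip",
        },
        # MIME type to use when the extension and detected type agree.
        "override_mime": "application/x-dia-diagram",
    },
}


def resolve_mime(filename: str, detected_mime: str) -> str:
    """Return the overriding MIME type if the table allows it, else the detected one."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    entry = MIME_OVERRIDES_BY_EXTENSION.get(ext)
    if entry is not None and detected_mime in entry["detected_mime"]:
        return entry["override_mime"]
    return detected_mime


print(resolve_mime("test-cmp.dia", "application/x-gzip"))  # application/x-dia-diagram
print(resolve_mime("test-unc.dia", "application/xml"))     # application/x-dia-diagram
print(resolve_mime("fake.dia", "image/png"))               # image/png
```

The override only fires when the detected type is in the per-extension list, so a PNG renamed to .dia keeps its detected type and would still fail the usual extension check. The sketch does not verify that a gzipped .dia file actually contains a Dia diagram.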

Now, I know that MediaWiki has tried to move away from detecting the type of the content based on the extension, but I really do not know what else to do with Dia files. Of course I blame the problem on the Dia developers; after all, they should probably not use a bare gzipped file and should use something with a specific header instead. But we already have a legacy of many Dia files and we will have to handle them one way or another...

So, do you think my idea for a way to solve this is OK? If you do think so, I will work on a patch to do it and submit it to this bug.

Thanks!
Filipe


Version: 1.13.x
Severity: enhancement

Details

Reference
bz15538

Event Timeline

bzimport raised the priority of this task from to Low. Nov 21 2014, 10:17 PM
bzimport set Reference to bz15538.
bzimport added a subscriber: Unknown Object (MLST).

Hmm, a good question. Support for gzipped SVG would be useful too, with similar pressures (though in that case a separate extension, .svgz, is used -- see bug 4947).

It probably wouldn't be that hard to detect that the file looks like gzip and dive in for content checks with gzopen() etc. But we may then have to distinguish between xml types we know we can take gzipped and those we don't, so it could complicate things a little.
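
A rough sketch of that "dive in" approach, in Python rather than PHP's gzopen(), might look like the following. The allowlist GZIP_OK_NAMESPACES, the size cap and the function names are assumptions made for illustration; only the Dia namespace URL comes from this report.

```python
import gzip
import io
import xml.etree.ElementTree as ET
import zlib

DIA_NS = "http://www.lysator.liu.se/~alla/dia/"
# Hypothetical allowlist: root namespaces we accept in gzipped form.
GZIP_OK_NAMESPACES = {DIA_NS: "application/x-dia-diagram"}
# Assumed cap on decompressed bytes, to limit decompression bombs.
MAX_DECOMPRESSED = 1 << 20


def root_namespace(xml_bytes: bytes):
    """Return the root element's namespace URI ('' if none), or None if unparsable."""
    try:
        for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start",)):
            tag = elem.tag  # '{namespace}local' for namespaced elements
            return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
    except ET.ParseError:
        return None
    return None


def detect_gzipped_xml(data: bytes):
    """Return a MIME type for allowlisted gzipped XML, else None."""
    if not data.startswith(b"\x1f\x8b"):
        return None  # not gzip at all
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as fh:
            head = fh.read(MAX_DECOMPRESSED)
    except (OSError, EOFError, zlib.error):
        return None  # corrupt or truncated gzip stream
    return GZIP_OK_NAMESPACES.get(root_namespace(head))


# Illustrative sample, not a complete Dia document.
dia_xml = (b'<?xml version="1.0" encoding="UTF-8"?>'
           b'<dia:diagram xmlns:dia="' + DIA_NS.encode() + b'"></dia:diagram>')
print(detect_gzipped_xml(gzip.compress(dia_xml)))
```

Only the root element is inspected, so parsing stops early. Gzipped XML with any other root namespace (SVG, for instance) returns None until it is added to the allowlist, which is one way to keep apart the XML types that may be gzipped and those that may not.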

filbranden wrote:

The problem of .svgz as I see it is that it requires browser and tool support for it to work, and probably needs Apache configuration in order to work properly.

The idea of using gzopen and parsing the XML inside is really good. I will try to come up with some code to do it for the specific case of Dia files.

I will try to make it configurable and extensible; then maybe someone will come up with a way to build support for .svgz on top of it.

I should have a patch in a few days. I'll post it here for your evaluation.

This bug applies to all compressed file formats, including MS Office files.

Change 106657 had a related patch set uploaded by Pastakhov:
fix bug 15538

https://gerrit.wikimedia.org/r/106657

I cannot upload doc and xls files from MS Office to my wiki.
I get the error 'The file is a corrupt or otherwise unreadable ZIP file. It cannot be properly checked for security.'
Perhaps it is caused by this same bug.

Created attachment 14278
For example, I cannot upload this file.


Change 106657 abandoned by Pastakhov:
fix bug 15538

Reason:
Excuse me, this is a very old patch and it is not suitable here, although the problem remains. I will describe it in Bugzilla.
Thanks for links.

https://gerrit.wikimedia.org/r/106657

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:01 AM
Aklapper removed a subscriber: wikibugs-l-list.