File extensions for the same file type should not allow variations of a file name (File:X.jpg, File:X.jpeg, File:X.JPG should all refer to the same file)
Open, Medium, Public

Description

Author: rd232

Description:
Please see http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)/Archive_74#Several_changes_to_file_naming - a proposal to fix certain consistency issues with file naming. The main points which cause unnecessary confusion are filenames which vary solely in the file extension, but are of the same file type: File:X.jpg, File:X.jpeg, File:X.JPG should all refer to the same file, but currently do not.

a. Multiple filetype extensions for the same filetype: As it stands, two
separate users could upload two separate images of two separate subjects as
File:TestImage.jpg and File:TestImage.jpeg. There is no reason for this.

b. Case sensitivity in filetype extensions: As it stands, and as does happen, two separate images can be uploaded as File:TestImage.jpg and File:TestImage.JPG. This has the potential to cause even more problems than the situation above. There is no reason why filetype extensions should be case sensitive.
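
To make points (a) and (b) concrete, here is a rough Python sketch (not MediaWiki code) that groups titles differing only in the spelling or case of the extension. The alias table and the function names `normalize_title` and `find_collisions` are assumptions made up for illustration; a real alias list would need to be agreed on separately.

```python
from collections import defaultdict

# Hypothetical alias table: lower-cased extension variant -> canonical extension.
EXTENSION_ALIASES = {"jpeg": "jpg", "jpe": "jpg", "tiff": "tif"}


def normalize_title(title: str) -> str:
    """Lower-case the extension and collapse known aliases; the base name is untouched."""
    base, dot, ext = title.rpartition(".")
    if not dot or not base:
        return title  # nothing that looks like an extension
    ext = ext.lower()
    return f"{base}.{EXTENSION_ALIASES.get(ext, ext)}"


def find_collisions(titles):
    """Return {normalized title: [original titles]} for every group with more than one member."""
    groups = defaultdict(list)
    for title in titles:
        groups[normalize_title(title)].append(title)
    return {key: members for key, members in groups.items() if len(members) > 1}


titles = [
    "File:TestImage.jpg",
    "File:TestImage.jpeg",
    "File:TestImage.JPG",
    "File:Other.png",
]
for canonical, members in find_collisions(titles).items():
    print(canonical, "<-", members)
```

With the sample list, the three TestImage titles fall into one group and File:Other.png is left alone.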

Note: this is split off from Comment 71 of task T6421.

See Also:
T6421: Image file extension should not be part of the name
T42479: File extensions should be automatically decided by MIME type at upload
T213484: Normalize file extensions (capital vs small letters; jpg vs jpeg) for new uploads on Commons
T34660: File extensions for the same file type should not allow variations of a file name (File:X.jpg, File:X.jpeg, File:X.JPG should all refer to the same file)
T31284: Upload form should change file extensions to the canonical form automatically (lowercase, jpeg→jpg etc.)
T144593: File extension changes automatically while moving ogg audio file on Commons, caused by a gadget

Details

Reference
bz32660

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 12:01 AM
bzimport set Reference to bz32660.
bzimport added a subscriber: Unknown Object (MLST).

rd232 wrote:

Just found Bug 12992 - Add new setting option switch whether case-sensitive or case-insensitive upload file extension name checking. That has a patch from 2008, but no recent activity.

svenmanguard wrote:

Thanks. A few technical issues to resolve:

  1. If this is implemented, will it be implemented for all projects or just for English Wikipedia at first?
  2. Will there be a grandfather clause built into the change, so that only files uploaded after the change is made will be affected, or will there be no grandfather clause?

My preference would be to start with only English Wikipedia, because I have no way of getting the message to the other projects, and to grandfather old files in, because while they'll get taken care of anyways (see below), I don't want to risk things breaking.

I'll note that a big push happened a few months back to try to knock out, on the English Wikipedia at least, all instances of 'largely duplicative file names' (LDFNs), which is what this issue has become known by. It's been moderately successful, but a few hundred remain, in part because there was no pressing need to do it, and there are so very many other things that need doing in the namespace. If there's any chance that this is going to get worked on soon, please post it here, and I'll spread the word to the people that worked on the first effort. We can get the number of instances down to zero in a few days or weeks. (Regardless of whether it's needed or not, we'll still do it once this thread starts seeing action from devs, because it's the excuse we need to fix the damned problem.)

(In reply to comment #2)

Thanks. A few technical issues to resolve:

  1. If this is implemented, will it be implemented for all projects or just for English Wikipedia at first?

Depends on the implementation, but I would imagine this would apply to all projects. This doesn't sound like something that a config option would be appropriate for.

  2. Will there be a grandfather clause built into the change, so that only files uploaded after the change is made will be affected, or will there be no grandfather clause?

It's unlikely we would fix this in a non-backwards compatible way. MediaWiki is not just used by Wikimedia folks. We also have to make sure this doesn't break files on non-Wikimedia wikis where they wouldn't be aware of the change.


My idea on how to implement this would be the following. (This is off the top of my head; I haven't really looked into the issues involved.)

*When uploading new files, the file extension is normalized (if you try to upload Foo.JPeG, it gets saved as Foo.jpg). [This part should be easy]
*When linking/looking at/whatever for a file named Bar.JpEG, MediaWiki first checks to see if Bar.JpEG exists, and if not, assumes you meant Bar.jpg and uses that instead. [This part might be a little more tricky. Have to check to make sure that code doesn't assume that if a file is returned it matches the requested name]
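
A minimal Python sketch of the two steps above, assuming an in-memory store rather than anything in MediaWiki itself. `FileStore`, `canonical_name`, `add_legacy`, and the one-entry alias table are hypothetical names invented for this illustration; the sketch also ignores what should happen when an upload would overwrite an existing file.

```python
# Hypothetical alias table; only jpeg -> jpg is taken from the discussion.
ALIASES = {"jpeg": "jpg"}


def canonical_name(name: str) -> str:
    """Lower-case the extension and map known aliases to one canonical form."""
    base, dot, ext = name.rpartition(".")
    if not dot or not base:
        return name
    ext = ext.lower()
    return f"{base}.{ALIASES.get(ext, ext)}"


class FileStore:
    """Toy stand-in for a file repository, keyed by stored file name."""

    def __init__(self):
        self._files = {}

    def add_legacy(self, name, data):
        """Simulate a file uploaded before the change, kept under its exact name."""
        self._files[name] = data

    def upload(self, name, data):
        """Step 1: new uploads are saved under the normalized name."""
        stored = canonical_name(name)
        self._files[stored] = data
        return stored

    def resolve(self, requested):
        """Step 2: try the exact name first, then fall back to the normalized one."""
        if requested in self._files:
            return requested
        fallback = canonical_name(requested)
        return fallback if fallback in self._files else None


store = FileStore()
store.add_legacy("Bar.JPG", b"old")        # grandfathered file keeps its name
stored_as = store.upload("Foo.JPeG", b"new")
print(stored_as)                           # Foo.jpg
print(store.resolve("Foo.JpEG"))           # Foo.jpg (fallback)
print(store.resolve("Bar.JPG"))            # Bar.JPG (exact match wins)
```

Note that `resolve` can return a name different from the one requested, which is exactly the case the bracketed caveat says callers would have to tolerate.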

Bryan.TongMinh wrote:

(In reply to comment #3)

*When linking/looking at/whatever for a file named Bar.JpEG, MediaWiki first
checks to see if Bar.JpEG exists, and if not assumes you meant Bar.jpg and uses
that instead. [This part might be a little more tricky. Have to check to make
sure that code doesn't assume that if a file is returned it matches the
requested name]

That would normally work; the same happens with redirects.

svenmanguard wrote:

(In reply to comment #3)

(In reply to comment #2)

Thanks. A few technical issues to resolve:

  1. If this is implemented, will it be implemented for all projects or just for English Wikipedia at first?

Depends on the implementation, but I would imagine this would apply to all
projects. This doesn't sound like something that a config option would be
appropriate for.

  2. Will there be a grandfather clause built into the change, so that only files uploaded after the change is made will be affected, or will there be no grandfather clause?

It's unlikely we would fix this in a non-backwards compatible way. MediaWiki is
not just used by Wikimedia folks. We also have to make sure this doesn't break
files on non-Wikimedia wikis where they wouldn't be aware of the change.

In all honesty, as long as the answer to 2 was "yes, there will be a grandfather clause", the answer to 1 could be anything at all and it would be fine. (The converse is true as well: as long as it only applies to English Wikipedia, we don't need to grandfather things in, because I can get the old instances fixed, and other projects can do the same when they fix their issues.)

I'm going to start working on the English Wikipedia LDFNs again. If you give me an expected date for this task to be completed, I can rouse others right before that comes up.

See also bug 29284, which has an old patch; it looks like it only covers one tiny case and would need more thorough work.

Although I would prefer going fully to bug 4421 (don't include extensions on the user-accessible filenames at all; probably normalize extensions on the actual files), normalizing on upload when not grandfathering an update to an existing file would be a step in the right direction.

svenmanguard wrote:

The reason why I personally like the extensions, even though this isn't really much of a reason all things considered, is that it's the fastest way to tell what type of file I've got on my hands when I'm working. When I've got a bulleted list of files and no thumbnails, it's really the only way to tell, save clicking on each individual item. (This comes to mind since I work in the non-free file size reduction requests queue when the bot goes down, and the extensions allow me to pick up which ones are sound files very quickly.)

Yes, leaving them allows for the problem of Foo.jpg and Foo.png, which sucks, but happens so comparatively rarely to the problem in this bugzilla as to almost not be an issue. You probably could code this problem away too, but I don't want to ask too much.

I guess what I'm saying (yes, I know I'm very TLDR) is that there are valid reasons, at least for the file maintenance people, to keep the extensions in. Not great reasons, but reasons nonetheless.

It's your call. Either fix will be an improvement, and if we eliminate file extensions from public view, I can always ask a local to create a user script that makes it easier for me to tell the file type myself.

*bump*

There seems to be consensus on enwiki for this. I guess it would be best if it were implemented as a core feature.

Perhaps something like an option for strict extensions (case-sensitive file extensions plus a check for redundancy of JPEG, JPG, etc.) and an option for redundant files (files with the same name but a different extension; before that could be enabled, however, existing files would need to be cleaned up). What do you think? Is it a good idea or not?

Another idea:
Add options for the upload form so that existing files are kept, but the upload form does not allow uploads of files that do not match the new criteria.

johnnymrninja wrote:

(In reply to comment #7)

Yes, leaving them allows for the problem of Foo.jpg and Foo.png, which sucks,
but happens so comparatively rarely to the problem in this bugzilla as to
almost not be an issue. You probably could code this problem away too, but I
don't want to ask too much.

Even if file extensions are kept in the page titles, it would be desirable to disallow Foo.jpg/Foo.png duplication.

Also, when standardizing extensions, is the software simply going to be replacing .Jpeg -> .jpg, or is it going to ignore the uploader and base the extension on the data in the file? The second would be preferable, and it seems that MW already checks that the file has the correct extension when it is uploaded.

svenmanguard wrote:

I was thinking the second one too. The issue with the first one is if the software goes around replacing all cases of .jpeg with .jpg, it is going to have to find some way to deal with the existing naming conflicts, and I don't see how the software would do that without creating a mess.

Sven

johnnymrninja wrote:

(In reply to comment #10)

I was thinking the second one too. The issue with the first one is if the
software goes around replacing all cases of .jpeg with .jpg, it is going to
have to find some way to deal with the existing naming conflicts, and I don't
see how the software would do that without creating a mess.

Sven

The easiest fix I could see is that the software wouldn't need to replace anything except at the time of upload. If something called Foo.PIC is uploaded and the software sees that it is a JPEG, it is named Foo.jpg. Then no other image is allowed to be uploaded at Foo.***. So the software is actually renaming the images to the normalized titles before they are fully uploaded. The software wouldn't touch existing files, and existing conflicts would be allowed. The only thing changed would be the file uploader and file moving.

This would NOT fix the problem, but it would be an easier stop-gap. .JpeG and .JPG would still be allowed by the architecture, just no more would be created.
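
As a rough illustration of this stop-gap, here is a Python sketch that picks the extension from the file's leading bytes and refuses an upload when any file with the same base name already exists. This is not how MediaWiki does its type detection; `sniff_extension`, `plan_upload`, and the short signature table are assumptions for the example, and only a handful of formats are covered.

```python
# Well-known magic-byte prefixes for a few image formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpg"),        # JPEG
    (b"\x89PNG\r\n\x1a\n", "png"),   # PNG
    (b"GIF87a", "gif"),              # GIF
    (b"GIF89a", "gif"),
]


def sniff_extension(data: bytes):
    """Return a canonical extension based on the file content, or None if unknown."""
    for magic, ext in SIGNATURES:
        if data.startswith(magic):
            return ext
    return None


def base_name(title: str) -> str:
    """Strip the last extension, if there is one."""
    base, dot, _ = title.rpartition(".")
    return base if dot and base else title


def plan_upload(requested: str, data: bytes, existing) -> str:
    """Decide the stored name for a new upload; existing files are never touched."""
    ext = sniff_extension(data)
    if ext is None:
        raise ValueError("unrecognized file type")
    base = base_name(requested)
    # "No other image is allowed to be uploaded at Foo.***"
    if any(base_name(title) == base for title in existing):
        raise FileExistsError(f"a file named {base}.* already exists")
    return f"{base}.{ext}"


jpeg_bytes = b"\xff\xd8\xff\xe0" + b"\x00" * 16   # fake JPEG header for the demo
print(plan_upload("Foo.PIC", jpeg_bytes, existing=set()))  # Foo.jpg
```

Because the check only runs at upload time, existing pairs such as Foo.JpeG and Foo.JPG stay as they are, which matches the "no more would be created" behaviour described above. File moves would need the same check.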

johnnymrninja wrote:

I'm breaking the upload fix out into a separate bug, Bug 40479 ("File extensions should be automatically decided by MIME type at upload"). This will help with future issues, but will not fix any existing ones.