
Filenames in the HTML snapshot by extension dumpHTML
Closed, Invalid (Public)

Description

dumpHTML.php generates filenames which

  • are Unicode encoded, which not every tool supports well
  • can be very long

This is not critical as long as the dump stays on a hard-disk filesystem, but it becomes very problematic if you want to put the dump on a CD-ROM or DVD-ROM, which is exactly what we are trying to do now.

In my opinion the generated filenames should be ISO 9660 Level 3 compliant, since that is the most widely supported filesystem.

To solve the problem I propose saving each article under a filename derived from a truncated MD5 hash of its title.

Almost the same problem exists with pictures/media files.
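
A minimal sketch of the proposed scheme (illustrative only; the function name and the 16-character truncation are example choices, not part of any actual patch):

  <?php
  // Illustrative sketch: map an article title to an ISO 9660 friendly
  // filename by truncating the MD5 hash of the title. md5() always
  // yields lowercase hex [0-9a-f], which is safe on any filesystem.
  function hashedFilename( $title, $extension = 'html' ) {
      $hash = substr( md5( $title ), 0, 16 ); // 16 chars is an example length
      return $hash . '.' . $extension;
  }

  // A title with umlauts maps to a short, plain-ASCII name:
  echo hashedFilename( "Begrüßungsbox" ), "\n"; // prints 16 hex chars + ".html"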


Version: unspecified
Severity: enhancement
URL: http://www.mediawiki.org/wiki/Extension:DumpHTML

Details

Reference
bz8147

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 9:30 PM)
bzimport set Reference to bz8147.

(In reply to comment #0)

The dumpHTML.php generates filenames which are Unicode encoded, which is not critical as long as the dump is on a hard disk, but is problematic if you want to put the dump on a CD-ROM or DVD-ROM. To solve the problem I propose to save the article in a truncated version of the MD5 hash of each title. Almost the same problem exists with pictures/media files.

I developed such a modified version of DumpHTML: it creates snapshots in which articles and picture/media files use MD5-hashed filenames only. All links and URLs are MD5-hashed versions of the original (Unicode) filenames.

Snapshots were burnt onto DVDs and tested successfully on different operating systems (Windows 2000, Windows XP, Linux SUSE 11.0).

The diff to the current DumpHTML checkout will be posted soon.

Created attachment 5248
difference to dumpHTML-MW1.12-r30339.inc

The attachment solves that problem: it lets the dumpHTML.inc module encode links and local filenames of articles, images, thumbnail images and media files with the MD5 hash of the original filename. This allows snapshots to be stored on CD/DVD filesystems. Resulting snapshots on DVD have been successfully checked on Windows 2000, Windows XP and Linux SUSE 11.0 systems.

I can post the whole dumpHTML extension on request.

attachment dumpHTML.diff ignored as obsolete

Created attachment 5278
new version 2.11 (diff to dumpHTML-MW1.12-r30339.inc)

New version 2.11 (diff to dumpHTML-MW1.12-r30339.inc) fixes a small problem for articles and/or image/media files with slashes in their names.

Attached:

(In reply to comment #3)

New version 2.11 (diff to dumpHTML-MW1.12-r30339.inc) fixes a small problem for articles and/or image/media files with slashes in their names.

I prepared a tgz of the original and modified dumpHTML files, including the diff; it is available at http://www.tgries.de/mediawiki/dumpHTML-v2.11.tgz

Is this patch integrated into trunk?

(In reply to comment #5)

Is this patch integrated into trunk?

No. I am not a developer and do not have SVN access. Since publishing https://bugzilla.wikimedia.org/attachment.cgi?id=5278 I haven't noticed any problems using it for several different dumps. It looks stable.

dasch wrote:

The patch is buggy; I can't apply it to my files.

dasch wrote:

Clean patch for r47214


Well, I made a new patch for the trunk version; maybe somebody could commit it to SVN.

Attached:

In my wiki, people use the name of the dumped file to figure out what page the file corresponds to, so using hashed filenames would be bad. Since we generate files on Windows, though, we do end up filtering out characters that aren't appropriate for that OS with a regex. ASCII transliteration would probably work too. Regardless, if this is included, please make it optional.
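
Something along these lines, presumably (the exact regex used in that setup is not shown in this thread, so this is only an approximation):

  <?php
  // Rough approximation of the approach described above: replace characters
  // that are invalid in Windows filenames with underscores.
  function windowsSafeFilename( $name ) {
      // <>:"/\|?* and control characters are not allowed on Windows
      return preg_replace( '/[<>:"\/\\\\|?*\x00-\x1F]/', '_', $name );
  }

  echo windowsSafeFilename( 'Begrüßungsbox: "Test"?' ), "\n";
  // Begrüßungsbox_ _Test__  (non-ASCII characters are left untouched,
  // which is why this alone does not solve the ISO 9660 problem)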

(In reply to comment #9)

In my wiki, people use the name of the dumped file to figure out what page the file corresponds to, so using hashed filenames would be bad. Since we generate files on Windows, though, we do end up filtering out characters that aren't appropriate for that OS with a regex. ASCII transliteration would probably work too. Regardless, if this is included, please make it optional.

Please feel free to present a better solution. "Filtering out" non-ASCII may not be the best approach, as it introduces at least some irregularities; I admit it helps with guessing filenames, but that was not a requirement in the first place (how often do your users access your MediaWiki articles by modifying the URL?).

Working with many different systems (Windows, Linux, ISO filesystems on CD/DVD), I found the "hash" solution a robust one (programmed in reasonable time) for storing all pages and files reliably on different media.

The original (official) DumpHTML by Tim appeared not to work across different filesystems (it works fine on Linux servers): when you copy the created dumps between Linux, DVD and Windows, for example, you quickly encounter problems with non-ASCII page and image filenames, such as the umlauts in "Begrüßungsbox".

Perhaps Tim can be motivated to present a robust solution which fits all needs.

Our users actually find the dumped files via a search engine (which I have no control over) that displays the file name to users as the page title. Our page titles are also all in English, which helps. Regardless, I'm not saying you should change everything to suit my edge case; I'm just saying that if you implement your hashed solution, make it something that can be turned off via a configuration option, so you can still get today's functionality.

(In reply to comment #11)

Our users actually find the dumped files via a search engine (which I have no control over) that displays the file name to users as the page title. Our page titles are also all in English, which helps.

The _page titles_ are preserved: the "hash" solution does not touch the page titles, and the <title> tag content is always preserved. Only the last part of the URL (the file _name_ part) is changed; file extensions are also preserved (html, jpg, png, gif, doc and so on).
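
As a sketch of the behaviour described here (not the actual patch code): only the file name part is replaced by its MD5 hash, while the extension is kept.

  <?php
  // Sketch only: hash the base name, keep the extension.
  function mungeFilename( $filename ) {
      $dot = strrpos( $filename, '.' );
      $ext = ( $dot !== false ) ? substr( $filename, $dot ) : '';
      $base = ( $dot !== false ) ? substr( $filename, 0, $dot ) : $filename;
      return md5( $base ) . $ext;
  }

  echo mungeFilename( 'Begrüßungsbox.html' ), "\n"; // 32 hex chars + ".html"
  echo mungeFilename( 'Jr01.gif' ), "\n";           // 32 hex chars + ".gif"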

stfnmstr wrote:

Does the patch work for anyone?

I tried the current dumpHTML version from SVN (30 Nov 2010) and also r47214 with the patch applied, but I always end up like this:

  • without the patch, I have problems with pages that have umlauts in the title, e.g. "Zuständigkeiten"; everything else seems fine
  • with the patch applied, I get "PHP Warning: urldecode() expects parameter 1 to be string, object given in [...]/dumpHTML.inc on line 18" and the dump is completely broken.

I generate the dump on CentOS 5 with PHP 5.1.6 and MediaWiki 1.13.2, zip it and send it to my Windows 7 / Windows 2003 boxes.

I didn't try the first patch because the revision number seems wrong.

*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*

sumanah wrote:

DaSch, if you have time to update your patch to work with current trunk, that would be neat.

sumanah wrote:

DaSch, I'm sorry for the delay in responding! Thank you for the patch.

If this issue is still something that you'd like to follow up on, take a look at our current codebase and consider updating and submitting your patch directly into our new Git source control system.

https://www.mediawiki.org/wiki/Git/Workflow

You can do this by getting and using "developer access"

https://www.mediawiki.org/wiki/Developer_access

Thanks again, and I apologize for the wait.

http://www.mediawiki.org/wiki/Special:Code/MediaWiki/115597

The munging strategy can be configured with a new --munge-title argument. I tried not to fix any bugs with this patch ;) so the default munge algorithm should be the same as previous behavior. The "md5" munge uses T. Gries's patch above, and the "windows" munge exposes some inaccessible code from the "getFriendly..." method.
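
As a rough illustration of what a pluggable munging strategy can look like (a sketch only, not the code committed in r115597; the real MungeTitle class may be organised differently):

  <?php
  // Illustrative sketch of a title munger selected via --munge-title.
  class TitleMungerSketch {
      private $strategy;

      public function __construct( $strategy = 'none' ) {
          if ( !in_array( $strategy, array( 'none', 'md5', 'windows' ), true ) ) {
              throw new Exception( "no such titlemunger exists: $strategy" );
          }
          $this->strategy = $strategy;
      }

      public function munge( $name ) {
          switch ( $this->strategy ) {
              case 'md5':     // T. Gries's scheme: hash the name
                  return md5( $name );
              case 'windows': // strip characters Windows cannot handle
                  return preg_replace( '/[<>:"\/\\\\|?*\x00-\x1F]/', '_', $name );
              default:        // 'none': previous behaviour, unchanged
                  return $name;
          }
      }
  }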

aditaa05 wrote:

(In reply to comment #17)

http://www.mediawiki.org/wiki/Special:Code/MediaWiki/115597

The munging strategy can be configured with a new --munge-title argument. I tried not to fix any bugs with this patch ;) so the default munge algorithm should be the same as previous behavior. The "md5" munge uses T. Gries's patch above, and the "windows" munge exposes some inaccessible code from the "getFriendly..." method.

Using --munge-title windows (or any other option) I get this error:

Unexpected non-MediaWiki exception encountered, of type "Exception"
exception 'Exception' with message 'no such titlemunger exists: 1' in /dir/w/extensions/DumpHTML/MungeTitle.inc:18
Stack trace:
#0 /dir/w/extensions/DumpHTML/dumpHTML.inc(92): MungeTitle->__construct(1)
#1 /dir/w/extensions/DumpHTML/dumpHTML.php(132): DumpHTML->__construct(Array)
#2 {main}

Thanks for the report! The argument processing should be fixed in r115629.
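
For context, the reported failure looks like a flag-style option being passed through as the value 1 instead of the chosen strategy name. A hypothetical reconstruction (not the actual dumpHTML.php code):

  <?php
  // Hypothetical reconstruction of the reported bug. If the option parser
  // records --munge-title as a boolean flag, the value seen downstream is
  // 1 ("option present") instead of the string "windows".
  function checkMunger( $strategy ) {
      if ( !in_array( $strategy, array( 'none', 'md5', 'windows' ), true ) ) {
          throw new Exception( "no such titlemunger exists: $strategy" );
      }
      return $strategy;
  }

  try {
      checkMunger( 1 );                  // flag-style parsing: value lost
  } catch ( Exception $e ) {
      echo $e->getMessage(), "\n";       // no such titlemunger exists: 1
  }

  echo checkMunger( 'windows' ), "\n";   // value-style parsing: ok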

jason wrote:

With the Git head of dumpHTML and MediaWiki 1.19.2 on an EXT4 filesystem on Ubuntu 12.10, there is some encoding issue that sprinkles "2F" (the hex code for a forward slash) into my image src URLs and filenames. This is without using the munge parameter, as I want to use an existing local image mirror.

sudo /usr/bin/php /var/lib/mediawiki/extensions/DumpHTML/dumpHTML.php -d /s/wikidumptest --image-snapshot

results in links like:

file:///s/wikidumptest/images/thumb2F/d/2F//d/d7/Lager_beer_in_glass.jpg/180px-Lager_beer_in_glass.jpg

With the last commit before the munge parameter everything is fine.
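
The stray "2F" fragments look like percent-encoded slashes ("%2F") that lost their percent sign somewhere along the way. A small illustration of where such sequences come from (not a diagnosis of the actual dumpHTML bug):

  <?php
  // "%2F" is the percent-encoding of "/". If a whole path is encoded and
  // the "%" characters are later stripped, bare "2F" fragments remain.
  $path = 'thumb/d/d7/Lager_beer_in_glass.jpg';

  echo rawurlencode( $path ), "\n";
  // thumb%2Fd%2Fd7%2FLager_beer_in_glass.jpg

  echo str_replace( '%', '', rawurlencode( $path ) ), "\n";
  // thumb2Fd2Fd7Lager_beer_in_glass.jpg   <- stray "2F" instead of "/"

  // Encoding each segment separately keeps the directory separators intact:
  echo implode( '/', array_map( 'rawurlencode', explode( '/', $path ) ) ), "\n";
  // thumb/d/d7/Lager_beer_in_glass.jpg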

ktrader wrote:

I downloaded DumpHTML on a Chinese Windows OS and ran

php D:\A\extensions\DumpHTML\dumpHTML.php -d d:\wikidump -k monobook --image-snapshot --force-copy --munge-title windows

but the images are not in the proper folder.

D:\wikidump2\articles\文\件\7E\文件~Jr01.gif.html can be opened, but the picture is not visible. The picture URL is D:\wikidump2\images\4\42\Jr01.gif, which cannot be opened; when I search for Jr01.gif, it turns up in the folder D:\wikidump2\images\4\_\4.

What is wrong?
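
For reference, MediaWiki's standard hashed image layout derives the directory from the MD5 of the file name, roughly as below (a simplified sketch of the usual two-level layout, as in images/4/42/Jr01.gif above; it is not DumpHTML-specific code):

  <?php
  // Simplified sketch of MediaWiki's hashed upload directory layout.
  function hashedImagePath( $filename, $levels = 2 ) {
      $hash = md5( $filename );
      $path = '';
      for ( $i = 1; $i <= $levels; $i++ ) {
          $path .= substr( $hash, 0, $i ) . '/';
      }
      return $path . $filename;
  }

  echo hashedImagePath( 'Jr01.gif' ), "\n";
  // prints "4/42/Jr01.gif" if md5("Jr01.gif") happens to start with "42"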

[ASSIGNED status since comment 8 in 2009; obviously not the case. Resetting.]

Nemo_bis claimed this task.

The original report is superseded. There is a separate report for MD5 and another for a src issue, probably related to the one above. Most advanced filesystem solutions should just adopt the ZIM format, IMHO.