
Tool to check ZIM integrity needed
Closed, Invalid, Public

Description

We need a way to check the quality of a ZIM file.

At least the following should be checked:

  • (WARNING) has a welcome page
  • (ERROR) broken local HTML links
  • (WARNING) redundant content
  • ...
------- Comment #1 From Tommi Mäkitalo 2010-03-27 10:35:25 -------

I will make a zimlint for that.

------- Comment #2 From Emmanuel Engelhart 2010-04-16 10:52:30 -------

It should also be checked that the HTML content does not have any online dependencies.
For example <img src=http://....
The same applies to the CSS.
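
A minimal sketch of such a check, assuming a plain regex scan is good enough for a lint pass. The function name `find_online_dependencies` is made up for illustration and is not part of zimlib. It only looks at `src` attributes and CSS `url(...)` values; other cases, such as `<link href=...>` stylesheets, would need extra patterns.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect external URLs that a page or stylesheet would fetch when rendered.
// Hypothetical helper: a real tool might prefer an HTML/CSS parser.
std::vector<std::string> find_online_dependencies(const std::string& content) {
    // Matches src=http://..., src="https://...", src='//host/...'
    static const std::regex src_re(
        R"re(\bsrc\s*=\s*["']?((?:https?:)?//[^"'\s>]+))re",
        std::regex::icase);
    // Matches CSS url(http://...), url("https://..."), url('//host/...')
    static const std::regex css_re(
        R"re(url\(\s*["']?((?:https?:)?//[^"')\s]+))re",
        std::regex::icase);

    std::vector<std::string> found;
    for (const std::regex* re : {&src_re, &css_re}) {
        for (std::sregex_iterator it(content.begin(), content.end(), *re), end;
             it != end; ++it) {
            found.push_back((*it)[1].str());  // capture group 1 is the URL
        }
    }
    return found;
}
```

Ordinary `<a href="http://...">` links are deliberately not reported, since following a link is not a dependency of the page.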

/* This bug was migrated from the previous openZIM bug tracker */


Version: unspecified
Severity: enhancement

Details

Reference
bz47407

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 22 2014, 1:19 AM
bzimport added a project: openZIM-zimlib.
bzimport set Reference to bz47407.

Here is a detailed list of things which should be checked:

1 - Internal checksum: launch internal checksum verification

2 - Dead internal URLs: check all ZIM internal URLs and verify that the target exists. That means CSS/JavaScript loading URLs, image src and link href... and probably a few others

3 - Check that URLs in CSS files are not external, and that internal URLs are valid

4 - Verify that there are no online dependencies (images, JavaScript/CSS loading, ...) in the HTML code

5 - Check that the following metadata entries are there: title, creator, publisher, date, description, language. Check that date and language are in the correct format. http://openzim.org/wiki/Metadata

6 - Verify that the favicon is there

7 - Verify that the main page header entry is defined and points to valid content.

8 - Check for duplicate content: be sure that the same content is not available under two different URLs. For example, the same picture stored twice.

9 - Verify that internal urls are not absolute
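
As an illustration of item 5, here is a rough sketch of a metadata check. It assumes the metadata has already been read into a `std::map`; the function name, the capitalised key names, and the exact formats (dates as YYYY-MM-DD, languages as comma-separated three-letter codes) are my assumptions and should be verified against the Metadata page linked above.

```cpp
#include <map>
#include <regex>
#include <string>
#include <vector>

// Returns a list of problems; an empty list means the metadata passed.
// Hypothetical helper, not part of zimlib.
std::vector<std::string> check_metadata(
        const std::map<std::string, std::string>& meta) {
    // Entries listed in item 5; the capitalisation is an assumption.
    static const std::vector<std::string> required = {
        "Title", "Creator", "Publisher", "Date", "Description", "Language"};
    // Assumed formats: YYYY-MM-DD and three-letter language codes.
    static const std::regex date_re(R"(\d{4}-\d{2}-\d{2})");
    static const std::regex lang_re(R"([a-z]{3}(,[a-z]{3})*)");

    std::vector<std::string> problems;
    for (const std::string& key : required) {
        if (meta.find(key) == meta.end())
            problems.push_back("missing metadata entry: " + key);
    }
    auto date = meta.find("Date");
    if (date != meta.end() && !std::regex_match(date->second, date_re))
        problems.push_back("Date has an unexpected format: " + date->second);
    auto lang = meta.find("Language");
    if (lang != meta.end() && !std::regex_match(lang->second, lang_re))
        problems.push_back("Language has an unexpected format: " + lang->second);
    return problems;
}
```

The format check only looks at the shape of the value, so a date like 2010-13-40 would still pass.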

kiranmathewkoshy wrote:

I have implemented a primitive version of the above tool...

https://github.com/kiranmathewkoshy/zimcheck/

It implements the following checks:
1- Internal checkSum
2- Verify that there are no online dependencies
3- Check for all metadata entries
4- Verify favicon.png
5- Main Page Header.
6- Duplicate content.

Although the search for duplicate content was initially slow on large files, I have managed to speed it up to run in less than 2 minutes on the 2.6 GB Wikipedia ZIM file.

However, checking internal URLs is still slow. Since it is a CPU-intensive process, I have decided to divide the work across a few threads.

Also note that the regex library used is part of C++11, and I don't know whether the rest of zimlib is compatible with C++11.

My feedback, sorry if I only speak about things which do not work ;)

  • It would be good to have help text describing the available options and their purpose, printed by calling "zimcheck", "zimcheck --help" or "zimcheck -h"
  • When running it against ICD-10, it reports a problem with the favicon... but the favicon is OK AFAIK
  • It reports an "unknown mime type code 65535" (with the same file)... it is unclear to me what this means.
  • In the code, add the license at the top of the file (GPL2 for the openZIM project)
  • Regarding the redundancy check, computing a hash for all the contents in every case seems a little bit slow to me. I propose a way to get it done faster:

1 - go through all articles and save the size of each
2 - for all articles with the same size, make a hash comparison

  • I think the check of internal URLs will always return false, for the simple reason that external links ("href") are allowed in the pages (external dependencies are not)... and external links are what you check.
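
The size-first redundancy check proposed above could look roughly like this. It is only a sketch: `Entry` is a hypothetical stand-in for a ZIM article (URL plus raw content), not a zimlib type, and `std::hash` stands in for whatever hash the tool really uses.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a ZIM article.
struct Entry {
    std::string url;
    std::string data;
};

// Returns pairs (first URL, duplicate URL) of entries with identical content.
std::vector<std::pair<std::string, std::string>> find_duplicates(
        const std::vector<Entry>& entries) {
    // Step 1: group entries by content size (cheap, no hashing yet).
    std::map<std::size_t, std::vector<const Entry*>> by_size;
    for (const Entry& e : entries) by_size[e.data.size()].push_back(&e);

    std::vector<std::pair<std::string, std::string>> dups;
    std::hash<std::string> hasher;
    for (const auto& bucket : by_size) {
        const std::vector<const Entry*>& group = bucket.second;
        if (group.size() < 2) continue;  // unique size, cannot be a duplicate
        // Step 2: hash only entries that share their size with another one.
        std::map<std::size_t, const Entry*> seen;
        for (const Entry* e : group) {
            auto result = seen.emplace(hasher(e->data), e);
            // Same hash already seen: confirm with a full comparison.
            if (!result.second && result.first->second->data == e->data)
                dups.emplace_back(result.first->second->url, e->url);
        }
    }
    return dups;
}
```

Entries with a unique size are never hashed, which is where the time is saved. In the rare case of a hash collision between different contents of the same size, this sketch simply skips the second entry.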

kiranmathewkoshy wrote:

A more polished version of the program, which fixes a few of the bugs mentioned above, has been implemented.

https://github.com/kiranmathewkoshy/zimcheck/

The internal URL check and the redundancy check have not been modified yet. They will be modified in the next version.

  • Usage is more or less OK (a few visual things to fix, have a look at the help of tools like "grep" or "perl"). The option parsing should use pre-existing code rather than reinvent the wheel, have a look at "getopt"
  • "./zimcheck ICD10-fr.zim" prints the usage(), but it should run the checks.
  • Code style should be clean and similar to the rest of zimlib.
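
To make the getopt suggestion concrete, here is a small sketch of option parsing with `getopt_long`. The `Options` struct, the `parse_args` name and the `-c`/`--checksum` flag are invented for illustration; only `-h`/`--help` and the bare file argument come from the comments above.

```cpp
#include <getopt.h>
#include <string>

// Hypothetical result of command-line parsing.
struct Options {
    bool show_help = false;
    bool checksum_only = false;  // invented example flag
    std::string zim_path;
};

Options parse_args(int argc, char** argv) {
    static const option long_opts[] = {
        {"help", no_argument, nullptr, 'h'},
        {"checksum", no_argument, nullptr, 'c'},
        {nullptr, 0, nullptr, 0},
    };
    Options opts;
    optind = 1;  // restart scanning so the function can be called again
    int c;
    while ((c = getopt_long(argc, argv, "hc", long_opts, nullptr)) != -1) {
        switch (c) {
            case 'c': opts.checksum_only = true; break;
            case 'h':
            default:  opts.show_help = true; break;  // -h or unknown option
        }
    }
    // The first non-option argument is the ZIM file. With a file and no
    // flags, the caller should run all checks rather than print the usage.
    if (optind < argc) opts.zim_path = argv[optind];
    else opts.show_help = true;
    return opts;
}
```

With this, "zimcheck ICD10-fr.zim" yields a path and no help request, while "zimcheck" alone or "zimcheck --help" asks for the usage text.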

My general advice is: take your time and try to code as perfectly as you can. Don't ask for a review if you can still see a better way to do it yourself, or something you could still improve. What matters is the quality, not the quantity. Don't try to do everything/all features; focus on a few features, but try to implement them in the most intelligent and beautiful manner. And most important: test your own code as much as possible.

kiranmathewkoshy wrote:

Everything except the MIME checks has been implemented. The MIME checks can be implemented once functions to return the MIME types are available in zimlib.

https://github.com/kiranmathewkoshy/zimcheck