Page MenuHomePhabricator

Incremental update: zimdiff & zimpatch
Closed, ResolvedPublic

Description

We need such tools to provide incremental updates.

/* This bug was migrated from the previous openZIM bug tracker */

See also https://www.mediawiki.org/wiki/Kiwix/ZIM_incremental_updates


Version: unspecified
Severity: enhancement

Details

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 1:19 AM
bzimport added a project: openZIM-zimlib.
bzimport set Reference to bz47406.

kiranmathewkoshy wrote:

I can add the zimdiff and zimpatch functionality as a part of my GSoC 13 project.
Please assign this to me.

Just a note to say that Kiran Mathew Koshy has submitted a GSoC proposal related to this report: https://www.mediawiki.org/wiki/User:Kiran_mathew_1993/ZIM_incremental_updates_for_Kiwix

kiranmathewkoshy wrote:

Regarding zimdiff, which is the first phase of my GSoC project, here is the file format I'm proposing:

The zimdiff tool will take two zim files as input. Lets call them start_file and end_file. The purpose is to generate a third file, diff_file, which can be used to update the contents of start_file to obtain end_file.

Files:
1.start_file
2.end_file
3.diff_file

File format for diff_file:

The diff_file can be a ZIM file itself, with a few additional metadata entries.
The articles(non-metadata) inside the diff_file will be:

1.the articles that have been added to the end_file, which were not present in start_file. These will share a common namespace. There will be a list of new articles in the metadata and their corresponding namespaces, so that the original namespace information is not lost.

  1. Articles that have been modified.(the new version of the article).A list of articles that have been modified will be kept among metadata, and using this list, the old articles can be removed and the new articles inserted when the update is done.

There is an alternate approach, using the patience-diff algorithm (used by git), which will take a much longer time to complete, given the huge number of articles.
The alternate approach is as follows:
For all modified articles, apply a diff(patience-diff preferred, since it will b) algorithm between the original and final article and store the output instead of the entire new article.
Advantages:
(i) Most of the updates in Wikipedia are tiny ones, so this approach will free a lot of space, and the zimdiff file would be considerably smaller.
(ii) Computation is done in servers, and not very often(once in two weeks or so), so time isn't exactly a constraint.

Disadvantages:
(i) More complex algorithm, thereby prone
(ii) More time required to compute.

New articles in Metadata:

  1. Hashes of the start_file and end_file.
  2. List of new articles, and their namespaces
  3. List of modified articles, and their corresponding old articles.
  4. List of deleted articles.

Please add your comments, especially on the method for storing updated articles.

I also think we should use the ZIM format for the ZIM diff format. I'm not sure to fully understand the proposition of Kiran, but here is the actual status of my thoughts about this topic.

ZIM files contain:

  • articles: text, metadata, pictures, sounds, videos... Multimedia contents often compose the majority of the size of a ZIM file
  • A header with a few technical data
  • A sorted title list and a sorted url list

The ZIM diff file must contain the necessary data to allow to make:
start_file + diff_file = end_file

So you need to be able to:

  • delete/replace/add articles
  • replace/update headers
  • Rewrite url/title lists

... and everything in a way you will get exactly the same end_file at the end of the patch process.

To add articles, the easier way is simply to add all article which are available in the diff_file and not available in the start_file.

To replace article, follow the same process like to add article. I'm not a fan of having text article diffs in the diff_file because (1) This makes everything more complicated to code (2) I'm not convinced at all that this saves a lot of mass storage space because Wikipedia articles are pretty heavily modified. This is true especially if you have many multimedia contents in the diff file (3) This will really slow-down the patch process... and this process will already be pretty slow.

To delete articles, we need a list somewhere as a metadata, see my paragraph about additional metadata at the end of this comment.

  • To replace/update headers, I propose simply to overwrite the old start_file ones with the diff_files ones (except diff specific metadata).
  • Both url/title list must be recomputed during the patch process... but we don't need IMO to store anything related to that here.

So, we could achieve to have a diff file with only the new/modified articles, new metadata values and a few aditional diff infos which could be stored in a special metadata entry "M/Diff" with couple of values. I see at least two mandatory values:

  • "originuuid", containing the uuid of the start_file
  • "deleteUrls", with the list of urls to be deleted.

kiranmathewkoshy wrote:

Yes, a diff algorithm would make the code quite complex.

However, I don't think the patch process would be slowed down if a diff algorithm is implemented, since the most time consuming portion involving file diff is the creation of the diff file.

How about if we leave a provision for the diff algorithm ?
i.e, for an updated article, either the entire new article can be stored in the diff_file, or the output of the diff algorithm can be stored.
Once the rest of the code is implemented, the diff algorithm can be done, as an alternative to storing the entire article.

Once completed, the final version of the program will compute the diff of the articles, and use the smaller of the two(diff and new article).

A list of articles whose diff is given will be maintained in metadata.

@ Kiran

You proposition makes sense to me but I would propose to make the zimdiff tool working fine without this before working on that. So, if everything works well and you have time left, then this would be an interesting improvement.

My last concern is realated to the ZIM format internals: the ability to produce exactly the same file end_file at the end of the patch process. That means that the articles are exactly stored in the same clusters, in the same order, etc... Do you have invested time on that to see how this could be guaranteed?

cip wrote:

I agree with that it is a good idea to support both replacement of entire articles and diffs.
This is because I believe that omission of diff may have a huge impact on file size: In particular articles which are updated infrequently, and there are a lot of them, often have changes which only affect a single line, such as spelling fixes or interwiki link updates. Thus not supporting diffs may lead to very large diff file.
On the other hand, when diff-support is available, it still may make sense
to store complete articles instead of diffs for some articles, for example
to reduce processing effort in diff creation, but also merge.

One other point which is worth considering is disk space usage during merge. In particular on mobile devices, when the planned download manager is implemented, the incremental update capability would be very useful. However, if the update process requires free space for the old, the diff, and the new file, it won't be useable in many cases because there is not enough freespace on mobile devices.
Therefore something like in-place updating the old zim file, and merging while dowloading the diff would make sense.
For sure this is not trivial, and hardly possible to be implemented in the GSoC, but it would make sense to keep this in mind when defining the diff and merge processes and the diff file format, so that this feature could be added in the future. For example, with such an approach it is important that the zim file stays consistend during the update, and later resuming is possible.
Note: If actual in-place update is not feasible, the zim-split feature could be used, e.g. on download of full zim file is split in let's say 64 MB chunks. On update new chunk (size may change) is written completely before old is deleted. Thus during update only 64 MB additional storage is required, while without in-place update tens of GB could be necessary.

And other interesting feature could be to support on-the-fly merge: When this feature is used the diff zim file is not merged during update. Instead it is just stored besides the old zim file, and the zimlib merges the articles from both (or more) files on access. Benefit is that the end user does not need to run the probably pretty long running merge task. In addition it is faster, because no recompression needs to be done. Furthermore, no additional storage during update is required. Drawback is that is uses more storage, thus it depends on the size of the diff-files, whether this approach is actually feasible.
Anyway, I think its worth considering this feature, as it should not require much additional effort, in fact it could even be an intermediate step for the currently planned approach.
For sure for this on-the-fly approach supporting multiple deltas would make sense, but this could be implemented later.

kiranmathewkoshy wrote:

http://openzim.org/wiki/Zimdiff
Page has been created for zimdiff.
I have added the pseudo code there.

kiranmathewkoshy wrote:

@Christian Puehringer
Although I don't fully understand the in-place update, I think it is a good idea to support the on-the-fly merge- This can be quite useful on mobile devices where the zimpatch would take quite a long time.

kiranmathewkoshy wrote:

Zimdiff and zimpatch has been completed, and currently we are in the testing phase.

I've fixed most of the bugs, but i'm sure there are some left. For instance, we haven't achieved a checksum match between the generated zim file and the original zim file. (The checksum matches in smaller zim files,but fails in larger files.)
One probable reason is because of different compression algorithms used in older files.

Everyone is invited to check out the code and post their comments.

zimdiff:
https://github.com/kiranmathewkoshy/zimdiff

zimpatch:
https://github.com/kiranmathewkoshy/zimpatch

Change 82449 had a related patch set uploaded by Kiran mathew koshy 1993:
Bug 47406 Incremental update: zimdiff and zimpatch.

https://gerrit.wikimedia.org/r/82449

GSoC "soft pencils down" date was yesterday and all coding must stop on 23 September. Has this project been completed?

@Quim

The project is for now not completed. Kiran is still working on it and will continue until the deadline: monday. What is actually done represents around 70% of the whole project and I have some expectations that Kiran could reach 80%. Kiran is more optimistic.

kiranmathewkoshy wrote:

GSoC 2013 Summary:

Thanks to everyone, I managed to pass GSoC 2013. Its been a great summer, and I'll be sticking around for more.

Regarding zimdiff and zimpatch- Both tools are ready for action-
https://github.com/wikimedia/openzim

The Kiwix integration is not complete yet, there is some UI stuff left, and some work in linking the zimpatch class with javascript. I will be working with these in the near future.

https://github.com/kiwix/kiwix_mirror

If you have open tasks or bugs left, one possibility is to list them at https://www.mediawiki.org/wiki/Google_Code-In and volunteer yourself as mentor.

We have heard from Google and free software projects participating in Code-in that students participating in this programs have done a great work finishing and polishing GSoC projects, many times mentores by the former GSoC student. The key is to be able to split the pending work in little tasks.

More information in the wiki page. If you have questions you can ask there or you can contact me directly.

Change 82449 had a related patch set uploaded by Nemo bis:
Incremental update: zimdiff and zimpatch

https://gerrit.wikimedia.org/r/82449

(In reply to Gerrit Notification Bot from comment #18)

Change 82449 had a related patch set uploaded by Nemo bis:
Incremental update: zimdiff and zimpatch

https://gerrit.wikimedia.org/r/82449

Has anyone ever used that code?

We have currently have a bottleneck on side "code reviewing" at openzim.
I have to work to integrate this, it's not forgotten.
I use this code yes and it works, although I still want to fix a few details before releasing.

Nemo_bis set Security to None.

Change 82449 had a related patch set uploaded (by Qgil):
Incremental update: zimdiff and zimpatch

https://gerrit.wikimedia.org/r/82449

@Kelson: Any chance to give https://gerrit.wikimedia.org/r/82449 a review? Patch applies cleanly, I just checked...

Would be great if you found some time, or at least some initial review of https://gerrit.wikimedia.org/r/82449 ...

Kelson claimed this task.

The review process is finally almost done and this will be merge to git master within a week. Sorry for the so long delay.