Zimdiff

From openZIM

zimdiff is a proposed tool to facilitate incremental updates of large ZIM files. It will be written using the zimlib library and is released under the terms of the GPLv2 license. Note that zimdiff is currently under development. The project also has a Bugzilla page.

This page discusses the details of the zimdiff tool.

The zimdiff tool will be used to generate a diff_file between two normal ZIM files, referred to below as start_file and end_file.

diff_file format

A diff_file will be a normal ZIM file with some additional data. It must contain all the data needed to satisfy: start_file + diff_file = end_file

Actions that need to be performed using the diff_file:
1. add
2. remove
3. update

The diff_file will store all articles that have to be added to the start_file. A list of these articles will be maintained in a metadata article. A second metadata article will contain a list of the articles to be removed from the start_file. For updating an article, there will be two options:
1. Store the new article among the list of articles to be added.
2. Store the diff (generated by a diff algorithm) between the old article and the new article as a separate article in the diff_file. A list of such diff articles will be maintained in metadata.

With this format, the diff_file stores the difference between the start_file and the end_file, and the zimpatch tool can use it to update the start_file into the end_file.
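The relation start_file + diff_file = end_file can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the zimpatch implementation: a std::map stands in for a ZIM file, updates use option 1 (the full new article is stored), and the names Archive, Diff and applyDiff are hypothetical and not part of zimlib.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for a ZIM file: article key ("namespace/title") -> content.
// Real ZIM files would be read through zimlib; this map is an illustration.
using Archive = std::map<std::string, std::string>;

// Hypothetical in-memory view of a diff_file (update option 1).
struct Diff {
    Archive added;                     // articles named in add_list
    Archive updated;                   // articles named in update_list
    std::vector<std::string> deleted;  // keys named in delete_list
};

// Returns the archive obtained by applying diff to start.
Archive applyDiff(const Archive& start, const Diff& diff) {
    Archive result = start;
    for (const auto& key : diff.deleted) {
        result.erase(key);  // remove
    }
    for (const auto& entry : diff.updated) {
        // An updated article is expected to exist in the start archive.
        assert(result.count(entry.first) == 1);
        result[entry.first] = entry.second;  // update
    }
    for (const auto& entry : diff.added) {
        result[entry.first] = entry.second;  // add
    }
    return result;
}
```

Applying the three actions in the order remove, update, add keeps each step independent, because a well-formed diff never lists the same article in more than one list.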

Pseudocode for zimdiff

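#include <string>

// One record per article, collected while parsing a ZIM file.
// Declared as a struct so that the fields are publicly accessible.
struct articleInfo
{
    std::string Title; // article title, used to match articles across files
    int hash;          // hash of the article content, used to detect changes
    int index;         // index of the article in the file it came from
};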

1. Start. Open start_file and end_file.
2. Parse start_file and end_file, adding each article's data to a linked list of articleInfo objects (start_list and end_list).
3. Sort the lists (for faster searches later on).
4. For each articleInfo object in start_list, search end_list for an article with the same title.
5.   If no article is found, move the object to delete_list (the list of articles to be deleted).
6.   If an article is found, compare the hashes. If the hashes are the same, the article is unchanged: delete the entry from both lists.
7.   If the hashes are different, the article has changed. Move the articleInfo object (from end_list) to update_list.
8. Once all articles in start_list have been processed, all objects remaining in end_list are newly added articles. Add them directly to add_list.
9. Start writing the diff_file.
10. Add all articles in add_list to the diff_file. Create a list of these article titles (in XML format) and store it as metadata.
11. Add all articles in update_list to the diff_file. Create a separate article containing a list of the titles of these articles (in XML format) and store it as metadata.
12. Create a list of the titles of the articles in delete_list (in XML format). Store this article as metadata.
13. End.
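Steps 2 to 8 above can be sketched in C++ roughly as follows. This is a minimal sketch, not the zimdiff implementation: it works on in-memory lists rather than on zimlib objects, uses std::vector instead of a linked list, and replaces the per-article search of step 4 with a single merge pass over the two sorted lists. DiffLists, computeDiffLists, byTitle and sameTitle are hypothetical names, and titles are assumed to be unique within each file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Same fields as the articleInfo record described above.
struct articleInfo {
    std::string Title;
    int hash;
    int index;
};

// Hypothetical container for the three result lists.
struct DiffLists {
    std::vector<articleInfo> add_list;
    std::vector<articleInfo> update_list;
    std::vector<articleInfo> delete_list;
};

static bool byTitle(const articleInfo& a, const articleInfo& b) {
    return a.Title < b.Title;
}

static bool sameTitle(const articleInfo& a, const articleInfo& b) {
    return a.Title == b.Title;
}

// Lists are taken by value so the caller's copies are not reordered.
DiffLists computeDiffLists(std::vector<articleInfo> start_list,
                           std::vector<articleInfo> end_list) {
    // Step 3: sort both lists by title.
    std::sort(start_list.begin(), start_list.end(), byTitle);
    std::sort(end_list.begin(), end_list.end(), byTitle);

    // Assumption of this sketch: no duplicate titles within one file.
    assert(std::adjacent_find(start_list.begin(), start_list.end(),
                              sameTitle) == start_list.end());
    assert(std::adjacent_find(end_list.begin(), end_list.end(),
                              sameTitle) == end_list.end());

    DiffLists out;
    std::size_t s = 0, e = 0;
    while (s < start_list.size() && e < end_list.size()) {
        const articleInfo& a = start_list[s];
        const articleInfo& b = end_list[e];
        if (a.Title < b.Title) {         // step 5: only in start_file
            out.delete_list.push_back(a);
            ++s;
        } else if (b.Title < a.Title) {  // step 8: only in end_file
            out.add_list.push_back(b);
            ++e;
        } else {                         // steps 6 and 7: compare hashes
            if (a.hash != b.hash) {
                out.update_list.push_back(b);
            }
            ++s;
            ++e;
        }
    }
    // Leftovers on either side are deletions or additions respectively.
    for (; s < start_list.size(); ++s) out.delete_list.push_back(start_list[s]);
    for (; e < end_list.size(); ++e) out.add_list.push_back(end_list[e]);
    return out;
}
```

The merge pass produces the same three lists as the nested search, but visits each entry once after sorting. The entries placed in update_list come from end_list, so their index fields refer to the end_file, as in step 7.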

Additional metadata entries

file_details

Article Name: 'file_details'
This article will contain the UIDs of both the start_file and the end_file, to prevent wrong updates.
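As a rough illustration of how the UIDs could guard against wrong updates, a patching tool might compare the UID recorded for the start_file with the UID of the file it is about to patch. The layout of 'file_details' is not specified here, so the FileDetails struct, its string-valued fields and the function name diffMatchesArchive are all assumptions made for this sketch.

```cpp
#include <cassert>
#include <string>

// Hypothetical parsed form of the 'file_details' article.
// UIDs are modelled as strings; the real encoding is an assumption.
struct FileDetails {
    std::string start_uid;  // UID of the start_file the diff was built from
    std::string end_uid;    // UID of the end_file the diff leads to
};

// True when the diff was generated against the archive with archive_uid.
// An empty UID is treated as "unknown" and never matches.
bool diffMatchesArchive(const FileDetails& details,
                        const std::string& archive_uid) {
    return !archive_uid.empty() && details.start_uid == archive_uid;
}
```

A tool would refuse to apply the diff when this check fails, since the add, remove and update lists are only meaningful relative to the exact start_file they were computed from.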

add_list

Article Name: 'add_list'
This article will contain the list of articles to be added to start_file. The articles themselves can then be obtained from the contents of the diff_file.
The Namespace and Title of the articles will be stored.

delete_list

Article Name: 'delete_list'
This article will contain the list of articles to be deleted from the start_file. The Namespace and Title of the articles will be stored.

update_list

Article Name: 'update_list'
This article will contain the list of articles to be updated. The new versions of the articles can be found in the contents of the diff_file.
The Namespace and Title of the articles will be stored.
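The list articles are described as XML lists of Namespace and Title, but the exact schema is not defined on this page. The sketch below shows one possible serialization. The element and attribute names (article, namespace, title), the use of the list name as root element, and the names ArticleRef, escapeXml and toXmlList are all assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical reference to one article: ZIM namespace plus title.
struct ArticleRef {
    char ns;            // namespace character, e.g. 'A'
    std::string title;  // article title
};

// Escapes the five characters that are special in XML attribute values.
std::string escapeXml(const std::string& text) {
    std::string out;
    for (char c : text) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += c;        break;
        }
    }
    return out;
}

// Builds the body of a list article such as 'add_list' or 'delete_list'.
std::string toXmlList(const std::string& list_name,
                      const std::vector<ArticleRef>& refs) {
    std::string xml = "<" + list_name + ">\n";
    for (const ArticleRef& ref : refs) {
        xml += "  <article namespace=\"";
        xml += escapeXml(std::string(1, ref.ns));
        xml += "\" title=\"" + escapeXml(ref.title) + "\"/>\n";
    }
    xml += "</" + list_name + ">\n";
    return xml;
}
```

Escaping matters because article titles can contain characters such as & or quotes, which would otherwise produce a list article that cannot be parsed back when the diff is applied.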