https://wiki.openzim.org/w/api.php?action=feedcontributions&user=Kelson&feedformat=atomopenZIM - User contributions [en]2024-03-29T09:54:57ZUser contributionsMediaWiki 1.36.1https://wiki.openzim.org/w/index.php?title=Content_team&diff=31348Content team2024-02-09T12:04:33Z<p>Kelson: /* Goals */</p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format ("books" being understood here as web content stored as single web archives). <br />
== Purpose ==<br />
<br />
Provide web-based educational content to people without internet access, and make the experience as seamless as possible. Access and discovery must be user-friendly and market-ready, and the content must be up to date and as portable as technically possible.<br />
<br />
== Goals ==<br />
* Book curation must remain focused on educational material, broadly construed;<br />
* Books should have proper visual formatting;<br />
* Books should be up-to-date like custom apps;<br />
* The Kiwix Library should allow easy and friendly discovery of content.<br />
<br />
== Responsibilities ==<br />
* Content Requests<br />
** Collaborate with requesters to qualify requests properly, and keep them informed<br />
** Ensure we are allowed and able to fulfill requests<br />
** Initiate new recipes and manage the first publication of new books<br />
** Collaborate with the scraper dev. team if necessary<br />
** Keep the tickets up to date<br />
<br />
* Scraping<br />
** Ensure Zimfarm works fine and contribute to its improvement with the dev. team<br />
** Analyze failures or unexpected behaviors<br />
** Ensure recipes run properly, fix configuration when necessary and contribute to scraper improvements with the dev. team<br />
** Ensure workers are online and properly configured<br />
** Ensure the scrape lifecycle is correct (reasonable pipeline size, running scrapes progressing appropriately, not too many failures)<br />
<br />
* Library management<br />
** Ensure ZIM filenames and locations (paths) are correct<br />
** Ensure ZIM metadata is correct<br />
** Ensure ZIM files are recent and kept up to date (AFAP)<br />
** Ensure the library is coherent and user-friendly<br />
<br />
== Policies ==<br />
<br />
=== Publishing ===<br />
* Content has to be legal in Switzerland<br />
* Content should not advertise [https://en.wikipedia.org/wiki/Fringe_theory fringe theories]<br />
* Content should preferably be [https://en.wikipedia.org/wiki/Free_content free content]<br />
* If not free, content should be:<br />
** Open content, OR<br />
** Educational content, OR<br />
** Covered by an authorization of reproduction<br />
* Any content we publish should<br />
** have (almost) no user-visible errors<br />
** have proper/correct metadata<br />
** be easily discoverable in the public library<br />
<br />
=== Content Requests ===<br />
* Allow everybody to request new content, or changes to or deletion of existing content<br />
* Track the lifecycle of our content portfolio in full transparency<br />
* New content should be assessed and vetted against the publishing policy (see above)<br />
* Content requests should be closed:<br />
** when fully implemented (user visible)<br />
** when refused or impossible to implement<br />
* ZIM Metadata should be given for new content<br />
* Start scraping only once all prerequisites are satisfied<br />
<br />
=== Scraping ===<br />
* Scraping leadership means the initiative should come from the content team<br />
* The first analysis of an error should be done by the content team<br />
* If an error in the scraper is suspected<br />
** An issue should be opened in the corresponding scraper code repository<br />
** Scraper problem analysis does not supersede the content request in any manner<br />
* ZIM quality should be vetted against the publishing policy<br />
* Any recipe should first run successfully in dev before being put in production<br />
* Hardware resources should be saved<br />
<br />
=== Library Management ===<br />
<br />
=== Custom Apps ===<br />
<br />
== Processes ==<br />
<br />
=== Content Requests ===<br />
<br />
=== Scraping ===<br />
<br />
=== Library Management ===<br />
<br />
=== Custom Apps ===<br />
<br />
== Workflows ==<br />
<br />
<br />
=== To create a new recipe for YouTube files ===<br />
<br />
'''It is recommended to clone an existing YouTube recipe.'''<br />
<br />
* Create the recipe name according to the [https://github.com/openzim/overview/wiki/Naming-Convention naming conventions].<br />
* In the Language field, choose the language of the website you are creating the recipe for.<br />
* In the Category field, choose (other).<br />
* In the Warehouse path field, always choose (/.hidden/.dev) at first so that the resulting file can be tested. Once the file has been tested and everything is correct, update the recipe with the proper path (videos).<br />
* Make sure the Status is set to Enabled.<br />
* You can set Periodicity to monthly or quarterly.<br />
* In the Offliner field, choose Youtube.<br />
* In the Platform field, choose Youtube.<br />
* Keep the rest unchanged.<br />
<br />
'''In the YouTube command flags:'''<br />
<br />
* In Playlist mode: choose (Not Set) if you are doing the recipe for a whole channel.<br />
* If you are doing the recipe for a playlist, choose (Set).<br />
* In Type: choose (Channel) or (Playlist) as per your required file.<br />
* In Youtube ID: type the ID of the channel or the playlist.<br />
* For the API Key: there is a list of keys, mostly organized by channel or playlist size; ask for the list and choose the appropriate API key.<br />
* In Zim Name: the recipe name according to the [https://github.com/openzim/overview/wiki/Naming-Convention naming conventions].<br />
* In Title: type the name you want for the output file.<br />
* Description: type a short description of your required zim file.<br />
* Leave Optimisation Cache URL as it is (cloned from old recipe).<br />
* Leave the rest of the fields empty or as per the cloned recipe.<br />
* Finally, click (Update offliner details) at the bottom.<br />
* Review all your entries once again, then go back to the top of the page and click (Request).<br />
* After about an hour, check whether the recipe failed or succeeded (or check the next day if the source website is large).<br />
* If successful, go to [https://dev.library.kiwix.org/ dev.library.kiwix.org] and check the created file: check its size and check that it works properly. If the file does not appear, wait a bit, as updates are made every 15 minutes.<br />
* If the file looks good and complete, go back to your recipe and, in the Warehouse path field, change (/.hidden/.dev) to the proper category for your file content (Wikipedia, Wikihow, … etc).<br />
* Click (Update offliner details) and then click (Request) again.<br />
* Finally, check the file at [https://library.kiwix.org/ library.kiwix.org]. If all is good, do not forget to go back to the initial ticket (most likely at zim-requests), add the link to the output file and close the ticket.<br />
<br />
== Members ==<br />
* [https://github.com/Popolechien Popolechien], manager in line<br />
* [https://github.com/RavanJAltaie Ravan], content manager<br />
* [https://github.com/benoit74 Benoit74], scrapers lead dev<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=ZIM_file_format&diff=31346ZIM file format2024-02-03T14:48:49Z<p>Kelson: /* Major & Minor versions */</p>
<hr />
<div>At the beginning of 2021, we changed the way namespaces are handled in the ZIM file format.<br />
<br />
This is a major change in the way entries are handled in libzim and in the associated semantics, but it does not break the binary ZIM format. Old libraries and readers remain compatible with new ZIM files.<br />
<br />
This page describes the new format. The old format can be found here: [[ZIM file format old namespace]].[[Image:Schema File Format.png|500px|right]]<br />
== Header ==<br />
A ZIM archive starts with a header :<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !! Offset !! Length !! Description <br />
|-<br />
| magicNumber || integer || 0 || 4 || Magic number to recognise the file format, must be 72173914 (0x44D495A)<br />
|-<br />
| [[#Major_.26_Minor_versions|majorVersion]]<br />
| integer<br />
| 4<br />
| 2<br />
| Major version of the ZIM archive format. Major version is updated when an incompatible change is integrated in the format (a lib made for a version N will probably not be able to read a version N+1)<br />
|-<br />
| [[#Major_.26_Minor_versions|minorVersion]] || integer || 6 || 2 || Minor version of the ZIM archive format. Minor version is updated when a compatible change is integrated (a lib made for a minor version n will be able to read a version n+1) <br />
|-<br />
| uuid || integer || 8 || 16 || unique id of this zim archive <br />
|-<br />
| entryCount || integer || 24 || 4 || total number of entries <br />
|-<br />
| clusterCount || integer || 28 || 4 || total number of clusters <br />
|-<br />
| urlPtrPos || integer || 32 || 8 || position of the directory pointerlist ordered by URL <br />
|-<br />
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title<br />
This is considered obsolete: readers should use <code>[[Search indexes#Title index v0|X/listing/titleordered/v0]]</code> instead and fall back to <code>titlePtrPos</code> if that entry is not present.<br />
|-<br />
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list <br />
|-<br />
| mimeListPos || integer || 56 || 8 || position of the MIME type list (also header size) <br />
|-<br />
| mainPage || integer || 64 || 4 || main page or 0xffffffff if no main page <br />
|-<br />
| layoutPage || integer || 68 || 4 || [[Layout Page|layout page]] or 0xffffffff if no layout page (deprecated, always 0xffffffff) <br />
|-<br />
| checksumPos || integer || 72 || 8 || pointer to the MD5 checksum of this archive, computed without the checksum itself. This always points 16 bytes before the end of the archive.<br />
|}<br />
<br />
A ZIM archive may be embedded in another file at a specific offset. In the context of the ZIM format, the start of the ZIM header is offset 0. Readers that support embedded archives must adjust offsets accordingly.<br />
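The header table above maps directly onto a fixed 80-byte little-endian structure. The following is a minimal sketch of reading it in Python, not libzim's implementation; the names <code>ZimHeader</code>, <code>parse_header</code> and <code>base_offset</code> are illustrative assumptions, with <code>base_offset</code> standing for the position of an embedded archive inside its host file.

```python
import struct
from typing import NamedTuple

ZIM_MAGIC = 72173914  # 0x44D495A, stored little-endian as b"ZIM\x04"
# One struct code per header field, in table order; 80 bytes in total.
HEADER_FORMAT = "<IHH16sIIQQQQIIQ"


class ZimHeader(NamedTuple):
    magic_number: int
    major_version: int
    minor_version: int
    uuid: bytes
    entry_count: int
    cluster_count: int
    url_ptr_pos: int
    title_ptr_pos: int
    cluster_ptr_pos: int
    mime_list_pos: int
    main_page: int
    layout_page: int
    checksum_pos: int


def parse_header(data: bytes, base_offset: int = 0) -> ZimHeader:
    """Decode the header found at base_offset (0 for a standalone archive)."""
    header = ZimHeader(*struct.unpack_from(HEADER_FORMAT, data, base_offset))
    if header.magic_number != ZIM_MAGIC:
        raise ValueError("not a ZIM archive: bad magic number")
    return header
```

All positions stored in the header are relative to the start of the ZIM header, so a reader of an embedded archive would add <code>base_offset</code> to each of them before seeking.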
<br />
=== Major & Minor versions ===<br />
<br />
Versioning of the file format specification has not been done [https://semver.org/ rigorously] until version 6.<br />
<br />
{| class="wikitable"<br />
|+ ZIM format versions<br />
|-<br />
! Major !! Minor !! Backward compatible !! Description <br />
|-<br />
| 0 || 0 || no || ''The features of this version have not been tracked properly''<br />
|-<br />
| 1 || 0 || no || ''The features of this version have not been tracked properly'' <br />
|-<br />
| 2 || 0 || no || ''The features of this version have not been tracked properly'' <br />
|-<br />
| 3 || 0 || no || ''The features of this version have not been tracked properly'' <br />
|-<br />
| 4 || 0 || yes || Introduces title index<br />
|-<br />
| 5 || 0 || yes || Introduces extended clusters <br />
|-<br />
| rowspan="3" | 6 || 0 || no || Still uses [[ZIM file format old namespace|"old" namespaces]]<br />
|-<br />
| 1 || no || Introduces [[#Namespaces|"new" namespaces]]<br />
|-<br />
| 2 || yes || Explicitly allows alias entries (several entries pointing to the same cluster/blob)<br />
|}<br />
<br />
== MIME Type List (mimeListPos) ==<br />
The MIME type list always follows directly after the header, so the ''mimeListPos'' also defines the end and size of the ZIM file header.<br />
<br />
The MIME types in this list are zero terminated strings. An empty string marks the end of the MIME type list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st MIME Type> || string || 0 ||zero terminated|| declaration of the <1st MIME Type> <br />
|-<br />
| <2nd MIME Type> || string || n/a ||zero terminated|| declaration of the <2nd MIME Type> <br />
|-<br />
| ... || string || ... ||zero terminated|| ... <br />
|-<br />
| <last entry / end> || string || n/a ||zero terminated|| empty string - end of MIME type list <br />
|}<br />
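As an illustration of the layout above, here is a small sketch that walks the zero-terminated strings until it meets the empty one. The function name <code>parse_mime_list</code> is an assumption made for this example.

```python
def parse_mime_list(data: bytes, mime_list_pos: int) -> list[str]:
    """Collect MIME type strings until the empty string that ends the list."""
    mime_types = []
    pos = mime_list_pos
    while True:
        end = data.index(b"\x00", pos)  # position of the terminating zero byte
        if end == pos:  # empty string: end of the MIME type list
            return mime_types
        mime_types.append(data[pos:end].decode("utf-8"))
        pos = end + 1
```

The index of a string in the returned list is the MIME type number stored in the directory entries.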
<br />
== URL Pointer List (urlPtrPos) ==<br />
The URL pointer list is a list of 8 byte offsets to the directory entries.<br />
<br />
The directory entries are always ordered by "full" URL (<code><namespace><path></code>). Ordering is simply done by comparing the URL strings.<br />
<br />
Since directory entries have variable sizes, this pointer list is needed for random access.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st URL> || integer || 0 || 8 || pointer to the directory entry of <1st URL> <br />
|-<br />
| <2nd URL> || integer || 8 || 8 || pointer to the directory entry of <2nd URL> <br />
|-<br />
| <nth URL> || integer ||(n-1)*8|| 8 || pointer to the directory entry of <nth URL> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
Libzim caches directory entries and references the cached entries via the URL pointers.<br />
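Random access through the URL pointer list amounts to one fixed-size read. A minimal sketch follows, with the hypothetical helper name <code>dirent_offset</code> and a zero-based <code>index</code> (the n-th URL of the table is <code>index = n - 1</code>).

```python
import struct


def dirent_offset(data: bytes, url_ptr_pos: int, index: int) -> int:
    """Return the offset of the directory entry with the given entry index."""
    # Each pointer is an unsigned 64-bit little-endian integer.
    (offset,) = struct.unpack_from("<Q", data, url_ptr_pos + index * 8)
    return offset
```

Because the entries are ordered by full URL, a reader can binary-search this list by comparing the URL of the entry found at each probed offset.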
<br />
== Title Pointer List (titlePtrPos) ==<br />
The title pointer list is a list of entry indices ordered by title (<code><namespace><title></code>). The title pointer list actually points to entries in the URL pointer list.<br />
<br />
Note that the title pointers are only 4 bytes. They are not offsets in the file but entry numbers.<br />
<br />
To get the offset of an entry from the title pointer list, you have to look it up in the URL pointer list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Title> || integer || 0 || 4 || pointer to the URL pointer of <1st Title> <br />
|-<br />
| <2nd Title> || integer || 4 || 4 || pointer to the URL pointer of <2nd Title> <br />
|-<br />
| <nth Title> || integer ||(n-1)*4|| 4 || pointer to the URL pointer of <nth Title> <br />
|-<br />
| ... || integer || ... || 4 || ... <br />
|}<br />
<br />
There are two reasons for the indirection from titles via URLs to directory entries:<br />
* the title pointer list is only half the size, as 4 bytes are enough for each entry<br />
* accessing directory entries by title also makes use of the cached directory entries, which are referenced by the URL pointers, as implemented in libzim.<br />
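The two-step lookup described in this section can be sketched as follows. The function name and the zero-based <code>rank</code> parameter (position in title order) are assumptions made for the example.

```python
import struct


def dirent_offset_by_title_rank(
    data: bytes, url_ptr_pos: int, title_ptr_pos: int, rank: int
) -> int:
    """Resolve the rank-th title to the file offset of its directory entry."""
    # Step 1: the title pointer is a 4-byte entry number, not a file offset.
    (entry_index,) = struct.unpack_from("<I", data, title_ptr_pos + rank * 4)
    # Step 2: the URL pointer list turns that entry number into an offset.
    (offset,) = struct.unpack_from("<Q", data, url_ptr_pos + entry_index * 8)
    return offset
```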
<br />
== Directory Entries ==<br />
Directory entries hold the meta information about all entries, images and other objects in a ZIM archive.<br />
<br />
There are different types of directory entries:<br />
<br />
=== Content Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || MIME type number as defined in the MIME type list <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters (must be 0) <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| cluster number || integer || 8 || 4 || cluster number in which the data of this directory entry is stored <br />
|-<br />
| blob number || integer || 12 || 4 || blob number inside the compressed cluster where the contents are stored <br />
|-<br />
| url || string || 16 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list, or empty; if it is empty, the URL is used as the title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
=== Redirect Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || 0xffff for redirect <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| redirect index || integer || 8 || 4 || pointer to the directory entry of the redirect target <br />
|-<br />
| url || string || 12 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list, or empty; if it is empty, the URL is used as the title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
None of the strings should have control characters from U+0000 through U+001F.<br />
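The two entry layouts share their first 8 bytes and differ only in what follows. Below is a hedged sketch of a parser for both; <code>Dirent</code>, <code>read_cstring</code> and <code>parse_dirent</code> are names invented for this example, the unused extra parameters are ignored, and the deprecated mimetype values are not handled.

```python
import struct
from typing import NamedTuple, Optional

REDIRECT_MIMETYPE = 0xFFFF


class Dirent(NamedTuple):
    mimetype: int
    namespace: str
    url: str
    title: str
    cluster_number: Optional[int] = None
    blob_number: Optional[int] = None
    redirect_index: Optional[int] = None


def read_cstring(data: bytes, pos: int) -> tuple[str, int]:
    """Read a zero-terminated UTF-8 string; return it and the position after it."""
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("utf-8"), end + 1


def parse_dirent(data: bytes, offset: int) -> Dirent:
    # Common prefix: mimetype (2), parameter len (1), namespace (1), revision (4).
    mimetype, _param_len, namespace, _revision = struct.unpack_from("<HBcI", data, offset)
    fields = {}
    if mimetype == REDIRECT_MIMETYPE:
        (fields["redirect_index"],) = struct.unpack_from("<I", data, offset + 8)
        pos = offset + 12
    else:
        fields["cluster_number"], fields["blob_number"] = struct.unpack_from(
            "<II", data, offset + 8
        )
        pos = offset + 16
    url, pos = read_cstring(data, pos)
    title, _ = read_cstring(data, pos)
    # An empty title means the URL is used as the title.
    return Dirent(mimetype, namespace.decode("ascii"), url, title or url, **fields)
```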
<br />
=== Linktarget or deleted Entry (DEPRECATED) ===<br />
There are two kinds of deprecated entries that may be found in very old ZIM files (I, the main developer of libzim, have never seen one).<br />
<br />
Their mimetype is equal to 0xfffe or 0xfffd. Reader implementations may check for those values and ignore the whole dirent.<br />
<br />
== Cluster Pointer List (clusterPtrPos) ==<br />
The cluster pointer list is a list of 8 byte offsets which point to all data clusters in a ZIM archive.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Cluster> || integer || 0 || 8 || pointer to the <1st Cluster> <br />
|-<br />
| <2nd Cluster> || integer || 8 || 8 || pointer to the <2nd Cluster> <br />
|-<br />
| <nth Cluster> || integer ||(n-1)*8|| 8 || pointer to the <nth Cluster> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
== Clusters ==<br />
The clusters contain the actual data of the directory entries. Clusters can be compressed or uncompressed. The purpose of clusters is that the data of more than one directory entry can be compressed inside one cluster, making the compression much more efficient. Typically clusters have a size of about 1 MB.<br />
<br />
The first byte of the cluster identifies some information about the cluster.<br />
<br />
The four low bits identify the cluster compression type:<br />
* No compression is indicated by a value of 1<br />
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]], or more precisely XZ, since there is an XZ header) or 5 (Zstandard compression).<br />
* Other compression algorithms were used in the past and have since been removed: 2 for zlib and 3 for bzip2.<br />
* 0 is an obsolete code for no compression (inherited from Zeno)<br />
<br />
The fifth bit identifies whether the cluster is extended or not:<br />
* By default (5th bit == 0) the cluster is not extended. This means that the offsets are stored as 4-byte integers, so the contents stored in the cluster cannot exceed 4 GB.<br />
* If the cluster is extended (5th bit == 1), the offsets are stored as 8-byte integers, so the contents stored in the cluster can exceed 4 GB.<br />
A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5) the cluster is never extended.<br />
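The bit layout of the first cluster byte can be decoded with two masks. This is a small sketch under the rules listed above; <code>decode_cluster_info</code> and the label strings are illustrative choices, not names taken from libzim.

```python
COMPRESSION_NAMES = {
    0: "none",  # obsolete code inherited from Zeno
    1: "none",
    2: "zlib",  # removed
    3: "bzip2",  # removed
    4: "lzma2/xz",
    5: "zstd",
}


def decode_cluster_info(info_byte: int) -> tuple[str, int]:
    """Return (compression label, OFFSET_SIZE in bytes) for a cluster info byte."""
    code = info_byte & 0x0F  # four low bits: compression type
    if code not in COMPRESSION_NAMES:
        raise ValueError(f"unknown compression code {code}")
    offset_size = 8 if info_byte & 0x10 else 4  # fifth bit: extended cluster
    return COMPRESSION_NAMES[code], offset_size
```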
<br />
libzim uses [http://tukaani.org/xz/ xz-utils] as its LZMA2 implementation; for Java, see [http://tukaani.org/xz/java.html XZ-Java].<br />
<br />
To find the data of a specific directory entry within a cluster, the uncompressed cluster contains, right after the first byte, a list of pointers to the blobs within the uncompressed cluster.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| cluster information || integer || 0 || 1 || Four low bits: 1: no compression, 4: LZMA2 compressed, 5: zstd compressed<br />
Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8) <br />
|-<br />
| colspan="5" | The following data bytes may be compressed and must be decompressed before being read!<br />
|-<br />
| <1st Blob> || integer || 1 || OFFSET_SIZE || offset to the <1st Blob> <br />
|-<br />
| <2nd Blob> || integer || 1+OFFSET_SIZE || OFFSET_SIZE || offset to the <2nd Blob> <br />
|-<br />
| <nth Blob> || integer ||(n-1)*OFFSET_SIZE+1|| OFFSET_SIZE || offset to the <nth Blob> <br />
|-<br />
| ... || integer || ... || OFFSET_SIZE || ... <br />
|-<br />
| <last blob / end> || integer || n/a || OFFSET_SIZE || offset to the end of the cluster <br />
|-<br />
| <1st Blob> || data || n/a || n/a || data of the <1st Blob> <br />
|-<br />
| <2nd Blob> || data || n/a || n/a || data of the <2nd Blob> <br />
|-<br />
| ... || data || ... || n/a || ... <br />
|}<br />
<br />
The offset addresses uncompressed data. The last pointer points to the end of the data area. So there is always one more offset than blobs. Since the first offset points to the start of the first data, the number of offsets can be determined by dividing this offset by OFFSET_SIZE. The size of one blob is calculated by the difference of two consecutive offsets.<br />
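The offset arithmetic above can be sketched as follows. This is an illustration only: <code>read_blobs</code> is a hypothetical name, the caller is assumed to pass exactly the bytes of one cluster, and Zstandard clusters are left out because the sketch sticks to modules from the Python standard library.

```python
import lzma
import struct


def read_blobs(cluster: bytes) -> list[bytes]:
    """Split one cluster into its blobs using the offset list."""
    info = cluster[0]
    compression = info & 0x0F
    offset_size = 8 if info & 0x10 else 4
    if compression in (0, 1):
        body = cluster[1:]
    elif compression == 4:
        body = lzma.decompress(cluster[1:])  # XZ container around LZMA2
    else:
        raise NotImplementedError(f"compression code {compression} not handled here")
    fmt = "<Q" if offset_size == 8 else "<I"
    # The first offset points just past the offset list, which gives its length.
    (first,) = struct.unpack_from(fmt, body, 0)
    count = first // offset_size
    offsets = [struct.unpack_from(fmt, body, i * offset_size)[0] for i in range(count)]
    # There is one more offset than blobs; consecutive pairs delimit each blob.
    return [body[start:end] for start, end in zip(offsets, offsets[1:])]
```

A directory entry's blob number is then simply an index into the returned list.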
<br />
== Namespaces ==<br />
Namespaces separate different types of directory entries, which might have the same title or URL, stored in the ZIM archive.<br />
<br />
The new namespace usage gives namespaces strong semantics. libzim relies on these semantics and provides different APIs to access the different kinds of entries.<br />
<br />
libzim hides the namespace from the user, so an entry <code>foo.html</code> in namespace <code>C</code> is accessible as <code>foo.html</code>. libzim provides a specific API to access metadata.<br />
<br />
Other implementations are free to expose the namespace or not (but need to provide some fallback for all namespace usage).<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Namespace !! Description <br />
|-<br />
| C || User content entries - see [[Article Format]] <br />
|-<br />
| M || ZIM metadata - see [[Metadata]] <br />
|-<br />
| W || Well known entries (MainPage, Favicon) - see [[Well known entries]] <br />
|-<br />
| X || search indexes - see [[Search indexes]]<br />
|}<br />
<br />
== URLs ==<br />
<br />
=== URL Encoding ===<br />
The URLs in the URL pointer list are UTF-8 and are not URL-encoded (https://www.ietf.org/rfc/rfc1738.txt).<br />
<br />
Some readers process requests through a layer that already does the URL decoding internally, whereas most readers handle the URLs directly. In the latter case, you have to do the decoding before you pass the parameter to libzim.<br />
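For a reader that receives percent-encoded request paths, the decoding step might look like the sketch below. The function name <code>to_zim_path</code> is an assumption; <code>urllib.parse.unquote</code> decodes percent-escapes as UTF-8 by default, which matches the encoding of the stored URLs.

```python
from urllib.parse import unquote


def to_zim_path(request_path: str) -> str:
    """Turn a percent-encoded request path into the UTF-8 form stored in the archive."""
    return unquote(request_path)
```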
<br />
=== Local Anchors ===<br />
Many articles - especially when a table of contents is used - use local anchors to jump within an article. <br />
<br />
<pre><br />
<a href="foo#headline1">jump to article foo, headline 1</a><br />
</pre><br />
<br />
The browser handles these local anchors by itself. It will determine if another article has to be loaded (local anchor inside another article than the currently shown) and will send a request only with the article URL without the local anchor - in our example "foo". After the article has been loaded the browser will then search for the local anchor tag and jump to the right location.<br />
<br />
If you use a common rendering engine or HTML widget, you don't have to care about these cases; you can just use the requests as they are submitted by the engine / widget.<br />
<br />
Should you render the article contents by yourself you have to consider this and take care of it before you hand requests to libzim.<br />
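A reader that renders content itself could separate the local anchor from the article URL before the lookup, for instance as in this sketch. The helper name <code>split_local_anchor</code> is illustrative; <code>urllib.parse.urldefrag</code> is part of the Python standard library.

```python
from urllib.parse import urldefrag


def split_local_anchor(href: str) -> tuple[str, str]:
    """Return (article URL to look up, local anchor to scroll to afterwards)."""
    url, fragment = urldefrag(href)
    return url, fragment
```

Only the first element would be handed to the ZIM lookup; the second is kept on the rendering side to jump to the right location once the article is loaded.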
<br />
== Encodings ==<br />
=== Character Encoding ===<br />
The standard encoding for ZIM archive content is UTF-8. So both article data and URLs should be handled accordingly.<br />
<br />
=== Integer Encoding ===<br />
All types are little-endian.<br />
<br />
All integers are unsigned integers (uint_16, uint_32, uint_64).<br />
<br />
All lengths are bytes.<br />
<br />
== Split ZIM files ==<br />
ZIM archives can be split into multiple files. This is necessary to store big ZIM archives (over 4 GB, for example) on limited file systems (like FAT32). The files can be of any size, but the naming is really important. The ZIM files should be named as follows (the file name extensions matter): ''foobar.zimaa, foobar.zimab, foobar.zimac''...<br />
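The naming scheme can be generated mechanically. The sketch below assumes that the two-letter suffix keeps counting alphabetically after ''az'' (''ba'', ''bb'', and so on); the page only shows the first three names, so that continuation, like the function name <code>split_part_names</code>, is an assumption.

```python
from itertools import islice, product
from string import ascii_lowercase


def split_part_names(base: str, count: int) -> list[str]:
    """Return the file names of the first `count` parts of a split archive."""
    if not 0 <= count <= 26 * 26:
        raise ValueError("a two-letter suffix allows at most 676 parts")
    # aa, ab, ..., az, ba, ... in alphabetical order.
    suffixes = ("".join(pair) for pair in product(ascii_lowercase, repeat=2))
    return [f"{base}.zim{suffix}" for suffix in islice(suffixes, count)]
```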
<br />
== See also ==<br />
* [[Zeno file format]] (deprecated)<br />
* [[ZIM File Format/4]] (deprecated)<br />
* [[ZIM File Example]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=ZIM_file_format&diff=31345ZIM file format2024-02-03T14:48:23Z<p>Kelson: /* Major & Minor versions */</p>
<hr />
<div>At the beginning of 2021, we changed the way namespaces are handled in the ZIM file format.<br />
<br />
This is a major change in the way entries are handled in libzim and in the associated semantics, but it does not break the binary ZIM format. Old libraries and readers remain compatible with new ZIM files.<br />
<br />
This page describes the new format. The old format can be found here: [[ZIM file format old namespace]].[[Image:Schema File Format.png|500px|right]]<br />
== Header ==<br />
A ZIM archive starts with a header :<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !! Offset !! Length !! Description <br />
|-<br />
| magicNumber || integer || 0 || 4 || Magic number to recognise the file format, must be 72173914 (0x44D495A)<br />
|-<br />
| [[#Major_.26_Minor_versions|majorVersion]]<br />
| integer<br />
| 4<br />
| 2<br />
| Major version of the ZIM archive format. Major version is updated when an incompatible change is integrated in the format (a lib made for a version N will probably not be able to read a version N+1)<br />
|-<br />
| [[#Major_.26_Minor_versions|minorVersion]] || integer || 6 || 2 || Minor version of the ZIM archive format. Minor version is updated when a compatible change is integrated (a lib made for a minor version n will be able to read a version n+1) <br />
|-<br />
| uuid || integer || 8 || 16 || unique id of this zim archive <br />
|-<br />
| entryCount || integer || 24 || 4 || total number of entries <br />
|-<br />
| clusterCount || integer || 28 || 4 || total number of clusters <br />
|-<br />
| urlPtrPos || integer || 32 || 8 || position of the directory pointerlist ordered by URL <br />
|-<br />
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title<br />
This is considered obsolete: readers should use <code>[[Search indexes#Title index v0|X/listing/titleordered/v0]]</code> instead and fall back to <code>titlePtrPos</code> if that entry is not present.<br />
|-<br />
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list <br />
|-<br />
| mimeListPos || integer || 56 || 8 || position of the MIME type list (also header size) <br />
|-<br />
| mainPage || integer || 64 || 4 || main page or 0xffffffff if no main page <br />
|-<br />
| layoutPage || integer || 68 || 4 || [[Layout Page|layout page]] or 0xffffffff if no layout page (deprecated, always 0xffffffff) <br />
|-<br />
| checksumPos || integer || 72 || 8 || pointer to the MD5 checksum of this archive, computed without the checksum itself. This always points 16 bytes before the end of the archive.<br />
|}<br />
<br />
A ZIM archive may be embedded in another file at a specific offset. In the context of the ZIM format, the start of the ZIM header is offset 0. Readers that support embedded archives must adjust offsets accordingly.<br />
<br />
=== Major & Minor versions ===<br />
<br />
Versioning of the file format specification has not been done rigorously until version 6.<br />
<br />
{| class="wikitable"<br />
|+ ZIM format versions<br />
|-<br />
! Major !! Minor !! Backward compatible !! Description <br />
|-<br />
| 0 || 0 || no || ''The features of this version have not been tracked properly''<br />
|-<br />
| 1 || 0 || no || ''The features of this version have not been tracked properly'' <br />
|-<br />
| 2 || 0 || no || ''The features of this version have not been tracked properly'' <br />
|-<br />
| 3 || 0 || no || ''The features of this version have not been tracked properly'' <br />
|-<br />
| 4 || 0 || yes || Introduces title index<br />
|-<br />
| 5 || 0 || yes || Introduces extended clusters <br />
|-<br />
| rowspan="3" | 6 || 0 || no || Still uses [[ZIM file format old namespace|"old" namespaces]]<br />
|-<br />
| 1 || no || Introduces [[#Namespaces|"new" namespaces]]<br />
|-<br />
| 2 || yes || Explicitly allows alias entries (several entries pointing to the same cluster/blob)<br />
|}<br />
<br />
== MIME Type List (mimeListPos) ==<br />
The MIME type list always follows directly after the header, so the ''mimeListPos'' also defines the end and size of the ZIM file header.<br />
<br />
The MIME types in this list are zero terminated strings. An empty string marks the end of the MIME type list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st MIME Type> || string || 0 ||zero terminated|| declaration of the <1st MIME Type> <br />
|-<br />
| <2nd MIME Type> || string || n/a ||zero terminated|| declaration of the <2nd MIME Type> <br />
|-<br />
| ... || string || ... ||zero terminated|| ... <br />
|-<br />
| <last entry / end> || string || n/a ||zero terminated|| empty string - end of MIME type list <br />
|}<br />
<br />
== URL Pointer List (urlPtrPos) ==<br />
The URL pointer list is a list of 8 byte offsets to the directory entries.<br />
<br />
The directory entries are always ordered by "full" URL (<code><namespace><path></code>). Ordering is simply done by comparing the URL strings.<br />
<br />
Since directory entries have variable sizes, this pointer list is needed for random access.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st URL> || integer || 0 || 8 || pointer to the directory entry of <1st URL> <br />
|-<br />
| <2nd URL> || integer || 8 || 8 || pointer to the directory entry of <2nd URL> <br />
|-<br />
| <nth URL> || integer ||(n-1)*8|| 8 || pointer to the directory entry of <nth URL> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
Libzim caches directory entries and references the cached entries via the URL pointers.<br />
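Because the pointers are sorted by full URL, an entry can be located with a binary search. The sketch below is one possible approach, not libzim's implementation: <code>dirent_offset</code> and <code>find_by_url</code> are hypothetical names, <code>full_url_at</code> is a caller-supplied function returning the full URL of entry <code>i</code>, and comparing the URLs as raw UTF-8 bytes is an assumption:<br />

```python
import struct
from typing import Callable, Optional


def dirent_offset(data: bytes, url_ptr_pos: int, index: int) -> int:
    """Return the file offset of directory entry `index` (8-byte little-endian pointer)."""
    return struct.unpack_from("<Q", data, url_ptr_pos + 8 * index)[0]


def find_by_url(entry_count: int,
                full_url_at: Callable[[int], bytes],
                target: bytes) -> Optional[int]:
    """Binary search for `target` (<namespace><path>) among sorted entries.

    Assumption: ordering is a plain bytewise comparison of the URL strings.
    """
    lo, hi = 0, entry_count
    while lo < hi:
        mid = (lo + hi) // 2
        candidate = full_url_at(mid)
        if candidate == target:
            return mid
        if candidate < target:
            lo = mid + 1
        else:
            hi = mid
    return None
```

In a real reader, <code>full_url_at</code> would read the directory entry found at <code>dirent_offset(...)</code> and return its namespace followed by its URL.<br />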
<br />
== Title Pointer List (titlePtrPos) ==<br />
The title pointer list is a list of entry indices ordered by title (<code><namespace><title></code>). The title pointer list actually points to entries in the URL pointer list.<br />
<br />
Note that the title pointers are only 4 bytes. They are not offsets in the file but entry numbers.<br />
<br />
To get the offset of an entry from the title pointer list, you have to look it up in the URL pointer list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Title> || integer || 0 || 4 || pointer to the URL pointer of <1st Title> <br />
|-<br />
| <2nd Title> || integer || 4 || 4 || pointer to the URL pointer of <2nd Title> <br />
|-<br />
| <nth Title> || integer ||(n-1)*4|| 4 || pointer to the URL pointer of <nth Title> <br />
|-<br />
| ... || integer || ... || 4 || ... <br />
|}<br />
<br />
There are two reasons for the indirection from titles via URLs to directory entries:<br />
* the title pointer list is only half the size, as 4 bytes are enough for each entry<br />
* accessing directory entries by title also benefits from the cached directory entries, which libzim references via the URL pointers.<br />
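The two-step lookup can be sketched in Python as follows, assuming the archive is held in a <code>bytes</code> object. The function name <code>dirent_offset_for_title</code> is hypothetical:<br />

```python
import struct


def dirent_offset_for_title(data: bytes, title_ptr_pos: int,
                            url_ptr_pos: int, title_index: int) -> int:
    """Resolve the n-th title pointer to a directory entry offset.

    Hypothetical helper: step 1 reads a 4-byte entry number from the title
    pointer list, step 2 reads the 8-byte offset from the URL pointer list.
    """
    (entry_index,) = struct.unpack_from("<I", data, title_ptr_pos + 4 * title_index)
    (offset,) = struct.unpack_from("<Q", data, url_ptr_pos + 8 * entry_index)
    return offset
```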
<br />
== Directory Entries ==<br />
Directory entries hold the meta information about all entries, images and other objects in a ZIM archive.<br />
<br />
There are different types of directory entries:<br />
<br />
=== Content Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || MIME type number as defined in the MIME type list <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters (must be 0) <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| cluster number || integer || 8 || 4 || cluster number in which the data of this directory entry is stored <br />
|-<br />
| blob number || integer || 12 || 4 || blob number inside the compressed cluster where the contents are stored <br />
|-<br />
| url || string || 16 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list, or empty; if it is empty, the URL is used as the title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
=== Redirect Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || 0xffff for redirect <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| redirect index || integer || 8 || 4 || pointer to the directory entry of the redirect target <br />
|-<br />
| url || string || 12 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list, or empty; if it is empty, the URL is used as the title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
None of the strings should have control characters from U+0000 through U+001F.<br />
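The two layouts above differ only in the fields between the revision and the URL. A minimal Python sketch of a parser, assuming the archive is held in a <code>bytes</code> object and strings are UTF-8; the <code>Dirent</code> class and <code>parse_dirent</code> function are illustrative names, and the unused trailing parameter data is not read:<br />

```python
import struct
from dataclasses import dataclass
from typing import Optional

REDIRECT_MIMETYPE = 0xFFFF


@dataclass
class Dirent:  # hypothetical container, not a libzim type
    mimetype: int
    namespace: str
    url: str
    title: str
    cluster: Optional[int] = None         # content entries only
    blob: Optional[int] = None            # content entries only
    redirect_index: Optional[int] = None  # redirect entries only


def _read_cstring(data: bytes, pos: int) -> tuple[str, int]:
    """Read a zero-terminated UTF-8 string; return it and the position after it."""
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("utf-8"), end + 1


def parse_dirent(data: bytes, offset: int) -> Dirent:
    # Common prefix: mimetype (2), parameter len (1), namespace (1), revision (4).
    mimetype, _param_len, namespace, _revision = struct.unpack_from("<HBcI", data, offset)
    entry = Dirent(mimetype=mimetype, namespace=namespace.decode("ascii"), url="", title="")
    if mimetype == REDIRECT_MIMETYPE:
        (entry.redirect_index,) = struct.unpack_from("<I", data, offset + 8)
        pos = offset + 12
    else:
        entry.cluster, entry.blob = struct.unpack_from("<II", data, offset + 8)
        pos = offset + 16
    entry.url, pos = _read_cstring(data, pos)
    title, pos = _read_cstring(data, pos)
    entry.title = title or entry.url  # an empty title falls back to the URL
    return entry
```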
<br />
=== Linktarget or deleted Entry (DEPRECATED) ===<br />
There are two kinds of deprecated entries that could be found in very old ZIM files (I, the main developer of libzim, have never seen one).<br />
<br />
They have a mimetype equal to 0xfffe or 0xfffd. Reader implementations may check for those values and ignore the whole dirent.<br />
<br />
== Cluster Pointer List (clusterPtrPos) ==<br />
The cluster pointer list is a list of 8 byte offsets which point to all data clusters in a ZIM archive.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Cluster> || integer || 0 || 8 || pointer to the <1st Cluster> <br />
|-<br />
| <2nd Cluster> || integer || 8 || 8 || pointer to the <2nd Cluster> <br />
|-<br />
| <nth Cluster> || integer ||(n-1)*8|| 8 || pointer to the <nth Cluster> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
== Clusters ==<br />
The clusters contain the actual data of the directory entries. Clusters can be compressed or uncompressed. The purpose of clusters is to allow the data of more than one directory entry to be compressed together, which makes compression much more efficient. Typically, clusters have a size of about 1 MB.<br />
<br />
The first byte of the cluster holds information about the cluster.<br />
<br />
The four low bits identify the cluster compression type:<br />
* No compression is indicated by a value of 1<br />
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]], or more precisely XZ, since there is an XZ header) or 5 (Zstandard compression).<br />
* Other compression algorithms were used in the past and have since been removed: 2 for zlib and 3 for bzip2.<br />
* 0 is an obsolete code for no compression (inherited from Zeno)<br />
<br />
The fifth bit indicates whether the cluster is extended:<br />
* By default (5th bit == 0), the cluster is not extended: offsets are stored as 4-byte integers, so the contents stored in the cluster cannot exceed 4 GB.<br />
* If the cluster is extended (5th bit == 1), offsets are stored as 8-byte integers, so the contents stored in the cluster can exceed 4 GB.<br />
A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5), clusters are never extended.<br />
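Decoding the first byte can be sketched in Python as follows. The function name <code>decode_cluster_info</code> is illustrative; the bit positions follow the description above (four low bits for the compression type, fifth bit for the extended flag):<br />

```python
def decode_cluster_info(info_byte: int) -> tuple[int, int]:
    """Return (compression_type, offset_size) for a cluster's first byte.

    Hypothetical helper for illustration; not a libzim API.
    """
    compression_type = info_byte & 0x0F    # four low bits
    extended = (info_byte >> 4) & 0x01     # fifth bit
    offset_size = 8 if extended else 4
    return compression_type, offset_size


for value in (0x01, 0x04, 0x15):
    print(hex(value), decode_cluster_info(value))
```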
<br />
libzim uses [http://tukaani.org/xz/ xz-utils] as its C++ implementation of LZMA2; for Java, see [http://tukaani.org/xz/java.html XZ-Java].<br />
<br />
To locate the data of a specific directory entry, the uncompressed cluster contains, right after the first byte, a list of offsets to the blobs within the uncompressed cluster.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| cluster information || integer || 0 || 1 || Four low bits: 1: no compression, 4: LZMA2 compressed, 5: zstd compressed<br />
Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8) <br />
|-<br />
| colspan="5" | The following bytes must be decompressed first if the cluster is compressed.<br />
|-<br />
| <1st Blob> || integer || 1 || OFFSET_SIZE || offset to the <1st Blob> <br />
|-<br />
| <2nd Blob> || integer || 1+OFFSET_SIZE || OFFSET_SIZE || offset to the <2nd Blob> <br />
|-<br />
| <nth Blob> || integer ||(n-1)*OFFSET_SIZE+1|| OFFSET_SIZE || offset to the <nth Blob> <br />
|-<br />
| ... || integer || ... || OFFSET_SIZE || ... <br />
|-<br />
| <last blob / end> || integer || n/a || OFFSET_SIZE || offset to the end of the cluster <br />
|-<br />
| <1st Blob> || data || n/a || n/a || data of the <1st Blob> <br />
|-<br />
| <2nd Blob> || data || n/a || n/a || data of the <2nd Blob> <br />
|-<br />
| ... || data || ... || n/a || ... <br />
|}<br />
<br />
The offset addresses uncompressed data. The last pointer points to the end of the data area. So there is always one more offset than blobs. Since the first offset points to the start of the first data, the number of offsets can be determined by dividing this offset by OFFSET_SIZE. The size of one blob is calculated by the difference of two consecutive offsets.<br />
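A Python sketch of splitting an already decompressed cluster body into blobs. It assumes <code>body</code> starts right after the cluster information byte and that offsets are counted from that same position, which is what the division by OFFSET_SIZE described above implies. The function name <code>split_blobs</code> is hypothetical:<br />

```python
import struct


def split_blobs(body: bytes, offset_size: int = 4) -> list[bytes]:
    """Split an uncompressed cluster body into its blobs.

    Assumption: `body` excludes the info byte and offsets are relative to body[0].
    """
    fmt = "<I" if offset_size == 4 else "<Q"
    (first_offset,) = struct.unpack_from(fmt, body, 0)
    offset_count = first_offset // offset_size  # always one more offset than blobs
    offsets = [struct.unpack_from(fmt, body, i * offset_size)[0]
               for i in range(offset_count)]
    # Each blob spans two consecutive offsets.
    return [body[start:end] for start, end in zip(offsets, offsets[1:])]


body = struct.pack("<4I", 16, 21, 21, 24) + b"hellozim"
print(split_blobs(body))
```

The blob number stored in a content entry is then an index into the returned list.<br />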
<br />
== Namespaces ==<br />
Namespaces separate different types of directory entries, which might have the same title or URL, stored in a ZIM archive.<br />
<br />
The new namespace usage puts strong semantics on the namespaces. libzim uses these semantics and provides different APIs to access the different kinds of entries.<br />
<br />
libzim hides the namespace from the user, so an entry <code>foo.html</code> in namespace <code>C</code> is accessible as <code>foo.html</code>. libzim provides a specific API to access metadata.<br />
<br />
Other implementations are free to expose the namespace or not (but need to provide some fallback for all namespace usages).<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Namespace !! Description <br />
|-<br />
| C || User content entries - see [[Article Format]] <br />
|-<br />
| M || ZIM metadata - see [[Metadata]] <br />
|-<br />
| W || Well-known entries (MainPage, Favicon) - see [[Well known entries]] <br />
|-<br />
| X || search indexes - see [[Search indexes]]<br />
|}<br />
<br />
== URLs ==<br />
<br />
=== URL Encoding ===<br />
The URLs referenced by the URL pointer list are UTF-8 encoded and are not URL-encoded (https://www.ietf.org/rfc/rfc1738.txt).<br />
<br />
Some readers receive requests that have already been URL-decoded internally, whereas most readers handle the raw URLs directly. In the latter case, you have to decode the URL before you pass it to libzim.<br />
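As an illustration, a reader written in Python could decode a percent-encoded request path with the standard library before looking it up. The function name <code>decode_request_path</code> is hypothetical:<br />

```python
from urllib.parse import unquote


def decode_request_path(request_path: str) -> str:
    """Turn a percent-encoded request path into the UTF-8 URL stored in the archive."""
    return unquote(request_path, encoding="utf-8")


print(decode_request_path("Caf%C3%A9_au_lait"))
```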
<br />
=== Local Anchors ===<br />
Many articles - especially when a table of contents is used - use local anchors to jump within an article. <br />
<br />
<pre><br />
<a href="foo#headline1">jump to article foo, headline 1</a><br />
</pre><br />
<br />
The browser handles these local anchors by itself. It determines whether another article has to be loaded (when the local anchor is inside an article other than the one currently shown) and sends a request containing only the article URL, without the local anchor: in our example, "foo". After the article has been loaded, the browser searches for the local anchor and jumps to the right location.<br />
<br />
If you use a common rendering engine or HTML widget, you don't have to care about these cases; you can just use the requests as they are submitted by the engine / widget.<br />
<br />
Should you render the article contents yourself, you have to take care of this before you hand requests to libzim.<br />
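For a custom renderer written in Python, separating the local anchor from the article URL could look like the sketch below. <code>split_local_anchor</code> is an illustrative name; only the first element would be handed to the ZIM lookup:<br />

```python
from urllib.parse import urldefrag


def split_local_anchor(href: str) -> tuple[str, str]:
    """Return (article_url, anchor); the anchor is empty when the link has none."""
    url, fragment = urldefrag(href)
    return url, fragment


print(split_local_anchor("foo#headline1"))
```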
<br />
== Encodings ==<br />
=== Character Encoding ===<br />
The standard encoding for ZIM archive content is UTF-8. So both article data and URLs should be handled accordingly.<br />
<br />
=== Integer Encoding ===<br />
All types are little-endian.<br />
<br />
All integers are unsigned integers (uint_16, uint_32, uint_64).<br />
<br />
All lengths are bytes.<br />
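As a small illustration in Python, the <code>&lt;</code> prefix of the <code>struct</code> module selects little-endian decoding without padding. The eight sample bytes below are made up for the example and correspond to the first three header fields (magic number, major version 6, minor version 1):<br />

```python
import struct

raw = bytes([0x5A, 0x49, 0x4D, 0x04, 0x06, 0x00, 0x01, 0x00])

# "<" = little-endian, I = uint32, H = uint16
magic, major, minor = struct.unpack("<IHH", raw)
print(magic, major, minor)
```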
<br />
== Split ZIM files ==<br />
ZIM archives can be split into multiple files. This is necessary to store big ZIM archives (over 4GB, for example) on limited file systems (like FAT32). That said, the files can be of any size, but the naming is really important. The ZIM files should be named as follows (the file name extensions matter): ''foobar.zimaa, foobar.zimab, foobar.zimac''...<br />
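A Python sketch that generates part names following this pattern. It assumes the suffix keeps counting through two lowercase letters (aa, ab, ..., az, ba, ...), which the text above only shows for the first three parts; <code>split_part_names</code> is a hypothetical helper:<br />

```python
from itertools import islice, product
from string import ascii_lowercase


def split_part_names(base: str, count: int) -> list[str]:
    """Return the first `count` part names: base.zimaa, base.zimab, ...

    Assumption: two-letter lowercase suffixes in alphabetical order.
    """
    if not 0 <= count <= 26 * 26:
        raise ValueError("count must be between 0 and 676 for two-letter suffixes")
    suffixes = ("".join(pair) for pair in product(ascii_lowercase, repeat=2))
    return [f"{base}.zim{suffix}" for suffix in islice(suffixes, count)]


print(split_part_names("foobar", 3))
```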
<br />
== See also ==<br />
* [[Zeno file format]] (deprecated)<br />
* [[ZIM File Format/4]] (deprecated)<br />
* [[ZIM File Example]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=ZIM_file_format&diff=31344ZIM file format2024-02-03T14:46:51Z<p>Kelson: /* Major & Minor versions */</p>
<hr />
<div>Beginning in 2021, we changed the way we handle namespaces in the ZIM file format.<br />
<br />
This is a major change in the way we handle entries in libzim and in the surrounding semantics, but it is not a break in the binary ZIM format. Old libraries/readers are compatible with new ZIM files.<br />
<br />
This page describes the new format. The old format can be found here: [[ZIM file format old namespace]].[[Image:Schema File Format.png|500px|right]]<br />
== Header ==<br />
A ZIM archive starts with a header:<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !! Offset !! Length !! Description <br />
|-<br />
| magicNumber || integer || 0 || 4 || Magic number to recognise the file format, must be 72173914 (0x44D495A)<br />
|-<br />
| [[#Major_.26_Minor_versions|majorVersion]]<br />
| integer<br />
| 4<br />
| 2<br />
| Major version of the ZIM archive format. Major version is updated when an incompatible change is integrated in the format (a lib made for a version N will probably not be able to read a version N+1)<br />
|-<br />
| [[#Major_.26_Minor_versions|minorVersion]] || integer || 6 || 2 || Minor version of the ZIM archive format. Minor version is updated when a compatible change is integrated (a lib made for a minor version n will be able to read a version n+1) <br />
|-<br />
| uuid || integer || 8 || 16 || unique id of this zim archive <br />
|-<br />
| entryCount || integer || 24 || 4 || total number of entries <br />
|-<br />
| clusterCount || integer || 28 || 4 || total number of clusters <br />
|-<br />
| urlPtrPos || integer || 32 || 8 || position of the directory pointerlist ordered by URL <br />
|-<br />
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title<br />
This is considered obsolete; readers should use <code>[[Search indexes#Title index v0|X/listing/titleordered/v0]]</code> instead and fall back to <code>titlePtrPos</code> if that entry is not present.<br />
|-<br />
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list <br />
|-<br />
| mimeListPos || integer || 56 || 8 || position of the MIME type list (also header size) <br />
|-<br />
| mainPage || integer || 64 || 4 || main page or 0xffffffff if no main page <br />
|-<br />
| layoutPage || integer || 68 || 4 || [[Layout Page|layout page]] or 0xffffffff if no layout page (deprecated, always 0xffffffff) <br />
|-<br />
| checksumPos || integer || 72 || 8 || pointer to the MD5 checksum of this archive, computed without the checksum itself. This always points 16 bytes before the end of the archive.<br />
|}<br />
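The table above maps directly onto a fixed 80-byte little-endian record. A minimal Python sketch of decoding it and checking the trailing MD5 checksum, assuming the whole archive is held in a <code>bytes</code> object that starts at the ZIM header; <code>ZimHeader</code>, <code>parse_header</code> and <code>verify_checksum</code> are illustrative names and not libzim APIs:<br />

```python
import hashlib
import struct
from typing import NamedTuple

# One format character per header field, in table order; "<" = little-endian, no padding.
HEADER_FORMAT = "<IHH16sIIQQQQIIQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
ZIM_MAGIC = 72173914


class ZimHeader(NamedTuple):  # hypothetical container mirroring the table
    magic_number: int
    major_version: int
    minor_version: int
    uuid: bytes
    entry_count: int
    cluster_count: int
    url_ptr_pos: int
    title_ptr_pos: int
    cluster_ptr_pos: int
    mime_list_pos: int
    main_page: int
    layout_page: int
    checksum_pos: int


def parse_header(data: bytes) -> ZimHeader:
    """Decode the header found at offset 0 and check the magic number."""
    header = ZimHeader._make(struct.unpack_from(HEADER_FORMAT, data, 0))
    if header.magic_number != ZIM_MAGIC:
        raise ValueError("not a ZIM archive (bad magic number)")
    return header


def verify_checksum(data: bytes, checksum_pos: int) -> bool:
    """Compare the stored 16-byte MD5 with the MD5 of everything before it."""
    expected = data[checksum_pos:checksum_pos + 16]
    return hashlib.md5(data[:checksum_pos]).digest() == expected
```

For large archives a real reader would hash the file in chunks instead of loading it fully into memory; the in-memory form is kept here only for brevity.<br />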
<br />
A ZIM archive may be embedded in another file at a specific offset. In the context of the ZIM format, the start of the ZIM header is offset 0. Readers that support embedded archives must adjust offsets accordingly.<br />
<br />
=== Major & Minor versions ===<br />
<br />
{| class="wikitable"<br />
|+ ZIM format versions<br />
|-<br />
! Major !! Minor !! Backward compatible !! Description <br />
|-<br />
| 0 || 0 || no || ''The features of this version have not been tracked properly''<br />
|-<br />
| 1 || 0 || no || ''The features of this version have not been tracked properly'' <br />
|-<br />
| 2 || 0 || no || ''The features of this version have not been tracked properly'' <br />
|-<br />
| 3 || 0 || no || ''The features of this version have not been tracked properly'' <br />
|-<br />
| 4 || 0 || yes || Introduces title index<br />
|-<br />
| 5 || 0 || yes || Introduces extended clusters <br />
|-<br />
| rowspan="3" | 6 || 0 || no || Still uses [[ZIM file format old namespace|"old" namespaces]]<br />
|-<br />
| 1 || no || Introduces [[#Namespaces|"new" namespaces]]<br />
|-<br />
| 2 || yes || Explicitly allows alias entries (several entries pointing to the same cluster/blob)<br />
|}<br />
<br />
</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=ZIM_file_format&diff=31343ZIM file format2024-02-03T14:46:14Z<p>Kelson: /* Major & Minor versions */</p>
<hr />
<div>Beginning in 2021, we changed the way we handle namespaces in the ZIM file format.<br />
<br />
This is a major change in the way we handle entries in libzim and in the surrounding semantics, but it is not a break in the binary ZIM format. Old libraries/readers are compatible with new ZIM files.<br />
<br />
This page describes the new format. The old format can be found here: [[ZIM file format old namespace]].[[Image:Schema File Format.png|500px|right]]<br />
== Header ==<br />
A ZIM archive starts with a header :<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !! Offset !! Length !! Description <br />
|-<br />
| magicNumber || integer || 0 || 4 || Magic number to recognise the file format, must be 72173914 (0x044D495A)<br />
|-<br />
| [[#Major_.26_Minor_versions|majorVersion]]<br />
| integer<br />
| 4<br />
| 2<br />
| Major version of the ZIM archive format. Major version is updated when an incompatible change is integrated in the format (a lib made for a version N will probably not be able to read a version N+1)<br />
|-<br />
| [[#Major_.26_Minor_versions|minorVersion]] || integer || 6 || 2 || Minor version of the ZIM archive format. Minor version is updated when a compatible change is integrated (a lib made for a minor version n will be able to read a version n+1) <br />
|-<br />
| uuid || integer || 8 || 16 || unique id of this zim archive <br />
|-<br />
| entryCount || integer || 24 || 4 || total number of entries <br />
|-<br />
| clusterCount || integer || 28 || 4 || total number of clusters <br />
|-<br />
| urlPtrPos || integer || 32 || 8 || position of the directory pointerlist ordered by URL <br />
|-<br />
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title<br />
This is considered obsolete: readers should use <code>[[Search indexes#Title index v0|X/listing/titleordered/v0]]</code> instead and fall back to <code>titlePtrPos</code> if that entry is not present.<br />
|-<br />
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list <br />
|-<br />
| mimeListPos || integer || 56 || 8 || position of the MIME type list (also header size) <br />
|-<br />
| mainPage || integer || 64 || 4 || main page or 0xffffffff if no main page <br />
|-<br />
| layoutPage || integer || 68 || 4 || [[Layout Page|layout page]] or 0xffffffff if no layout page (deprecated, always 0xffffffff) <br />
|-<br />
| checksumPos || integer || 72 || 8 || pointer to the MD5 checksum of this archive, computed without the checksum itself. This always points 16 bytes before the end of the archive.<br />
|}<br />
<br />
A ZIM archive may be embedded in another file at a specific offset. In the context of the ZIM format, the start of the ZIM header is offset 0. Readers that support embedded archives must adjust offsets accordingly.<br />
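<br />
As an illustration, the header table above could be read in Python as sketched below. This is only a sketch, not libzim code: the names <code>HEADER_FORMAT</code>, <code>ZimHeader</code> and <code>parse_header</code>, and the <code>base_offset</code> parameter used for embedded archives, are assumptions made for this example.<br />
<br />
```python
import struct
from collections import namedtuple

# Little-endian, no padding: field order and sizes follow the header table.
HEADER_FORMAT = "<IHH16sIIQQQQIIQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 80 bytes
ZIM_MAGIC = 72173914

# Hypothetical container type; the field names mirror the table.
ZimHeader = namedtuple(
    "ZimHeader",
    "magic_number major_version minor_version uuid entry_count "
    "cluster_count url_ptr_pos title_ptr_pos cluster_ptr_pos "
    "mime_list_pos main_page layout_page checksum_pos",
)


def parse_header(data, base_offset=0):
    """Decode the ZIM header found at base_offset in data.

    base_offset is 0 for a plain archive, or the position of the
    archive inside its container file when it is embedded.
    """
    header = ZimHeader(*struct.unpack_from(HEADER_FORMAT, data, base_offset))
    if header.magic_number != ZIM_MAGIC:
        raise ValueError("not a ZIM archive: bad magic number")
    return header
```
<br />
All positions stored in the header are relative to the start of the ZIM header, so a reader of an embedded archive would add <code>base_offset</code> to each of them before seeking.<br />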
<br />
=== Major & Minor versions ===<br />
<br />
{| class="wikitable"<br />
|+ ZIM format versions<br />
|-<br />
! Major !! Minor !! Backward compatible !! Description <br />
|-<br />
| 0 || 0 || no || ''The features of this version have not been tracked properly'' <br />
|-<br />
| 1 || 0 || no || ''The features of this version have not been tracked properly'' <br />
|-<br />
| 2 || 0 || no || ''The features of this version have not been tracked properly'' <br />
|-<br />
| 3 || 0 || no || ''The features of this version have not been tracked properly'' <br />
|-<br />
| 4 || 0 || yes || Introduces title index<br />
|-<br />
| 5 || 0 || yes || Introduces extended clusters <br />
|-<br />
| rowspan="3" | 6 || 0 || no || Still uses [[ZIM file format old namespace|"old" namespaces]]<br />
|-<br />
| 1 || no || Introduces [[#Namespaces|"new" namespaces]]<br />
|-<br />
| 2 || yes || Explicitly allows alias entries (several entries pointing to the same cluster/blob)<br />
|}<br />
<br />
== MIME Type List (mimeListPos) ==<br />
The MIME type list always follows directly after the header, so the ''mimeListPos'' also defines the end and size of the ZIM file header.<br />
<br />
The MIME types in this list are zero terminated strings. An empty string marks the end of the MIME type list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st MIME Type> || string || 0 ||zero terminated|| declaration of the <1st MIME Type> <br />
|-<br />
| <2nd MIME Type> || string || n/a ||zero terminated|| declaration of the <2nd MIME Type> <br />
|-<br />
| ... || string || ... ||zero terminated|| ... <br />
|-<br />
| <last entry / end> || string || n/a ||zero terminated|| empty string - end of MIME type list <br />
|}<br />
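<br />
A minimal Python sketch of reading this list follows. The function name <code>parse_mime_list</code> is hypothetical, and the whole archive is assumed to be available as a <code>bytes</code> object.<br />
<br />
```python
def parse_mime_list(data, mime_list_pos):
    """Return the MIME types stored at mime_list_pos, in order.

    Each entry is a zero-terminated UTF-8 string; an empty string
    marks the end of the list.
    """
    mime_types = []
    pos = mime_list_pos
    while True:
        end = data.index(b"\x00", pos)  # position of the terminator
        if end == pos:                  # empty string: end of list
            break
        mime_types.append(data[pos:end].decode("utf-8"))
        pos = end + 1
    return mime_types
```
<br />
The index of a MIME type in the returned list is the number stored in the <code>mimetype</code> field of a directory entry.<br />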
<br />
== URL Pointer List (urlPtrPos) ==<br />
The URL pointer list is a list of 8 byte offsets to the directory entries.<br />
<br />
The directory entries are always ordered by "full" URL (<code><namespace><path></code>). Ordering is simply done by comparing the URL strings.<br />
<br />
Since directory entries have variable sizes, this pointer list is needed for random access.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st URL> || integer || 0 || 8 || pointer to the directory entry of <1st URL> <br />
|-<br />
| <2nd URL> || integer || 8 || 8 || pointer to the directory entry of <2nd URL> <br />
|-<br />
| <nth URL> || integer ||(n-1)*8|| 8 || pointer to the directory entry of <nth URL> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
Libzim caches directory entries and references the cached entries via the URL pointers.<br />
<br />
== Title Pointer List (titlePtrPos) ==<br />
The title pointer list is a list of entry indices ordered by title (<code><namespace><title></code>). The title pointer list actually points to entries in the URL pointer list.<br />
<br />
Note that the title pointers are only 4 bytes. They are not offsets in the file but entry numbers.<br />
<br />
To get the offset of an entry from the title pointer list, you have to look it up in the URL pointer list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Title> || integer || 0 || 4 || pointer to the URL pointer of <1st Title> <br />
|-<br />
| <2nd Title> || integer || 4 || 4 || pointer to the URL pointer of <2nd Title> <br />
|-<br />
| <nth Title> || integer ||(n-1)*4|| 4 || pointer to the URL pointer of <nth Title> <br />
|-<br />
| ... || integer || ... || 4 || ... <br />
|}<br />
<br />
The indirection from titles via URLs to directory entries exists for two reasons:<br />
* the title pointer list is only half the size of the URL pointer list, as 4 bytes are enough for each entry<br />
* accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in libzim.<br />
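<br />
The two-step lookup described above can be sketched in Python as follows. Both function names are hypothetical, and the archive is assumed to be held in memory as <code>bytes</code>.<br />
<br />
```python
import struct


def dirent_offset_by_url_index(data, url_ptr_pos, index):
    """Return the file offset of the index-th directory entry (URL order)."""
    (offset,) = struct.unpack_from("<Q", data, url_ptr_pos + index * 8)
    return offset


def dirent_offset_by_title_index(data, url_ptr_pos, title_ptr_pos, index):
    """Return the file offset of the index-th directory entry (title order).

    A title pointer is a 4-byte entry number, not a file offset, so it
    has to be resolved through the URL pointer list.
    """
    (url_index,) = struct.unpack_from("<I", data, title_ptr_pos + index * 4)
    return dirent_offset_by_url_index(data, url_ptr_pos, url_index)
```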
<br />
== Directory Entries ==<br />
Directory entries hold the meta information about all entries, images and other objects in a ZIM archive.<br />
<br />
There are different types of directory entries:<br />
<br />
=== Content Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || MIME type number as defined in the MIME type list <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters (must be 0) <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| cluster number || integer || 8 || 4 || cluster number in which the data of this directory entry is stored <br />
|-<br />
| blob number || integer || 12 || 4 || blob number inside the compressed cluster where the contents are stored <br />
|-<br />
| url || string || 16 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list or empty; in case it is empty, the URL is used as title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
=== Redirect Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || 0xffff for redirect <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| redirect index || integer || 8 || 4 || pointer to the directory entry of the redirect target <br />
|-<br />
| url || string || 12 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list or empty; in case it is empty, the URL is used as title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
None of the strings should have control characters from U+0000 through U+001F.<br />
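<br />
The two layouts above differ only after the <code>revision</code> field, so a single routine can decode both. The Python sketch below assumes <code>parameter len</code> is 0 and does not treat the deprecated mimetypes specially; <code>parse_dirent</code> and the returned dictionary keys are names chosen for this example only.<br />
<br />
```python
import struct

REDIRECT_MIMETYPE = 0xFFFF


def _read_cstring(data, pos):
    """Read a zero-terminated UTF-8 string; return (text, next position)."""
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("utf-8"), end + 1


def parse_dirent(data, offset):
    """Decode the content or redirect entry starting at offset."""
    mimetype, _param_len, namespace, revision = struct.unpack_from(
        "<HBcI", data, offset
    )
    entry = {
        "mimetype": mimetype,
        "namespace": namespace.decode("ascii"),
        "revision": revision,
    }
    if mimetype == REDIRECT_MIMETYPE:
        # Redirect entry: a 4-byte redirect index, then the URL at offset 12.
        (entry["redirect_index"],) = struct.unpack_from("<I", data, offset + 8)
        pos = offset + 12
    else:
        # Content entry: cluster and blob numbers, then the URL at offset 16.
        entry["cluster_number"], entry["blob_number"] = struct.unpack_from(
            "<II", data, offset + 8
        )
        pos = offset + 16
    entry["url"], pos = _read_cstring(data, pos)
    title, pos = _read_cstring(data, pos)
    entry["title"] = title or entry["url"]  # empty title: the URL is used
    return entry
```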
<br />
=== Linktarget or deleted Entry (DEPRECATED) ===<br />
There are two kinds of deprecated entries that could be found in very old ZIM files (I, the main developer of libzim, have never seen one).<br />
<br />
They have a mimetype equal to 0xfffe or 0xfffd. Reader implementations may check for those values and ignore the whole dirent.<br />
<br />
== Cluster Pointer List (clusterPtrPos) ==<br />
The cluster pointer list is a list of 8 byte offsets which point to all data clusters in a ZIM archive.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Cluster> || integer || 0 || 8 || pointer to the <1st Cluster> <br />
|-<br />
| <2nd Cluster> || integer || 8 || 8 || pointer to the <2nd Cluster> <br />
|-<br />
| <nth Cluster> || integer ||(n-1)*8|| 8 || pointer to the <nth Cluster> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
== Clusters ==<br />
The clusters contain the actual data of the directory entries. Clusters can be compressed or uncompressed. The purpose of clusters is that the data of more than one directory entry can be compressed inside one cluster, making the compression much more efficient. Typically clusters have a size of about 1 MB.<br />
<br />
The first byte of the cluster carries information about the cluster.<br />
<br />
The four low bits identify the cluster compression type:<br />
* No compression is indicated by a value of 1<br />
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]] (or more precisely XZ, since there is an XZ header)) or 5 (Zstandard compression).<br />
* There have been other compression algorithms used before which have been removed: 2 for zlib and 3 for bzip2.<br />
* 0 is an obsolete code for no compression (inherited from Zeno)<br />
<br />
The fifth bit indicates whether the cluster is extended:<br />
* By default (5th bit == 0) the cluster is not extended. The offsets are then stored as 4-byte integers, so the contents stored in the cluster cannot exceed 4 GiB.<br />
* If the cluster is extended (5th bit == 1), the offsets are stored as 8-byte integers, so the contents stored in the cluster can exceed 4 GiB.<br />
A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5) clusters are never extended.<br />
<br />
The libzim uses [http://tukaani.org/xz/ xz-utils] as its implementation of LZMA2; for Java see [http://tukaani.org/xz/java.html XZ-Java].<br />
<br />
To locate the data of a specific directory entry within a cluster, the uncompressed cluster has, right after the first byte, a list of pointers to the blobs it contains.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| cluster information || integer || 0 || 1 || Four low bits: 1: no compression, 4: LZMA2 compressed, 5: zstd compressed<br />
Fifth bit: 0: normal (OFFSET_SIZE=4) 1: extended (OFFSET_SIZE=8) <br />
|-<br />
| colspan="5" | The following bytes must be decompressed before they are read!<br />
|-<br />
| <1st Blob> || integer || 1 || OFFSET_SIZE || offset to the <1st Blob> <br />
|-<br />
| <2nd Blob> || integer || 1+OFFSET_SIZE || OFFSET_SIZE || offset to the <2nd Blob> <br />
|-<br />
| <nth Blob> || integer ||(n-1)*OFFSET_SIZE+1|| OFFSET_SIZE || offset to the <nth Blob> <br />
|-<br />
| ... || integer || ... || OFFSET_SIZE || ... <br />
|-<br />
| <last blob / end> || integer || n/a || OFFSET_SIZE || offset to the end of the cluster <br />
|-<br />
| <1st Blob> || data || n/a || n/a || data of the <1st Blob> <br />
|-<br />
| <2nd Blob> || data || n/a || n/a || data of the <2nd Blob> <br />
|-<br />
| ... || data || ... || n/a || ... <br />
|}<br />
<br />
The offsets address uncompressed data. The last offset points to the end of the data area, so there is always one more offset than blobs. Since the first offset points to the start of the first blob's data, the number of offsets can be determined by dividing this offset by OFFSET_SIZE. The size of a blob is the difference between two consecutive offsets.<br />
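<br />
The rules above can be sketched in Python as follows. The function names are hypothetical, and <code>read_blobs</code> assumes it receives the bytes that follow the cluster information byte, already decompressed, so that offsets are counted from the start of the offset list.<br />
<br />
```python
import struct


def decode_cluster_info(info_byte):
    """Split the first cluster byte into (compression type, OFFSET_SIZE)."""
    compression = info_byte & 0x0F        # four low bits
    extended = bool(info_byte & 0x10)     # fifth bit
    return compression, 8 if extended else 4


def read_blobs(uncompressed, offset_size):
    """Return all blobs of a cluster as a list of bytes objects."""
    fmt = "<Q" if offset_size == 8 else "<I"
    (first,) = struct.unpack_from(fmt, uncompressed, 0)
    # The first offset points just past the offset list, which gives
    # the number of offsets (always one more than the number of blobs).
    count = first // offset_size
    offsets = [
        struct.unpack_from(fmt, uncompressed, i * offset_size)[0]
        for i in range(count)
    ]
    # Each blob spans two consecutive offsets.
    return [uncompressed[start:end] for start, end in zip(offsets, offsets[1:])]
```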
<br />
== Namespaces ==<br />
Namespaces separate different types of directory entries - which might have the same title or URL - stored in a ZIM archive.<br />
<br />
The new namespace usage puts strong semantics on the namespaces. The libzim uses these semantics and provides different kinds of API to access the different kinds of entries.<br />
<br />
The libzim hides the namespace from the user, so an entry `foo.html` in namespace `C` will be accessible as `foo.html`. libzim provides a specific API to access metadata.<br />
<br />
Other implementations are free to expose the namespace or not (but need to provide some fallback for all namespace usage)<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Namespace !! Description <br />
|-<br />
| C || User content entries - see [[Article Format]] <br />
|-<br />
| M || ZIM metadata - see [[Metadata]] <br />
|-<br />
| W || Well known entries (MainPage, Favicon) - see [[Well known entries]] <br />
|-<br />
| X || search indexes - see [[Search indexes]]<br />
|}<br />
<br />
== URLs ==<br />
<br />
=== URL Encoding ===<br />
The URLs referenced by the URL pointer list are UTF-8 and are not URL-encoded (see https://www.ietf.org/rfc/rfc1738.txt)<br />
<br />
Some readers receive requests from a component that already does the URL decoding internally, whereas most readers handle the URLs directly. In the latter case you have to do the decoding before you pass the parameter to libzim.<br />
<br />
=== Local Anchors ===<br />
Many articles - especially when a table of contents is used - use local anchors to jump within an article. <br />
<br />
<pre><br />
<a href="foo#headline1">jump to article foo, headline 1</a><br />
</pre><br />
<br />
The browser handles these local anchors by itself. It will determine if another article has to be loaded (local anchor inside another article than the currently shown) and will send a request only with the article URL without the local anchor - in our example "foo". After the article has been loaded the browser will then search for the local anchor tag and jump to the right location.<br />
<br />
If you use a common rendering engine or HTML widget you don't have to handle these cases yourself; you can just use the requests as they are submitted by the engine / widget.<br />
<br />
Should you render the article contents yourself, you have to strip the local anchor before you hand requests to libzim.<br />
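<br />
For a reader that receives raw link targets, both steps (dropping the local anchor and URL decoding) can be sketched with the Python standard library. The function name <code>request_to_zim_path</code> is made up for this example.<br />
<br />
```python
from urllib.parse import unquote, urldefrag


def request_to_zim_path(request_url):
    """Turn a raw link target into the path to look up in the archive."""
    # Drop the local anchor first, so that an encoded "#" (%23) inside
    # the path is not mistaken for the start of an anchor.
    path, _fragment = urldefrag(request_url)
    # URLs stored in the archive are UTF-8 and not URL-encoded.
    return unquote(path)
```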
<br />
== Encodings ==<br />
=== Character Encoding ===<br />
The standard encoding for ZIM archive content is UTF-8. So both article data and URLs should be handled accordingly.<br />
<br />
=== Integer Encoding ===<br />
All integer types are little-endian.<br />
<br />
All integers are unsigned integers (16, 32 or 64 bits wide).<br />
<br />
All lengths are expressed in bytes.<br />
<br />
== Split ZIM files ==<br />
ZIM archives can be split into multiple files. This is necessary to store big ZIM archives (over 4 GB, for example) on limited file systems (like FAT32). The files can be of any size, but the naming is important. The ZIM files should be named as follows (the file name extensions matter): ''foobar.zimaa, foobar.zimab, foobar.zimac''...<br />
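<br />
The naming scheme can be generated with a few lines of Python. This sketch assumes that the two-letter suffix keeps counting alphabetically after ''az'' (''ba'', ''bb'', ...), which this page does not state explicitly; <code>split_part_names</code> is a hypothetical helper.<br />
<br />
```python
from itertools import islice, product
from string import ascii_lowercase


def split_part_names(base_name, count):
    """Return the file names of the first count parts of a split archive."""
    # Two-letter suffixes in alphabetical order: aa, ab, ..., az, ba, ...
    suffixes = ("".join(pair) for pair in product(ascii_lowercase, repeat=2))
    return [f"{base_name}.zim{suffix}" for suffix in islice(suffixes, count)]
```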
<br />
== See also ==<br />
* [[Zeno file format]] (deprecated)<br />
* [[ZIM File Format/4]] (deprecated)<br />
* [[ZIM File Example]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=ZIM_file_format&diff=31342ZIM file format2024-02-03T14:41:53Z<p>Kelson: /* Major & Minor versions */</p>
<hr />
<div>Since the beginning of 2021, namespaces are handled differently in the ZIM file format.<br />
<br />
This is a major change in the way entries are handled in libzim and in the semantics around them, but it is not a break in the binary ZIM format. Old libraries and readers are compatible with new ZIM files.<br />
<br />
This page describes the new format. The old format can be found here: [[ZIM file format old namespace]].[[Image:Schema File Format.png|500px|right]]<br />
== Header ==<br />
A ZIM archive starts with a header :<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !! Offset !! Length !! Description <br />
|-<br />
| magicNumber || integer || 0 || 4 || Magic number to recognise the file format, must be 72173914 (0x44D495A)<br />
|-<br />
| [[#Major_.26_Minor_versions|majorVersion]]<br />
| integer<br />
| 4<br />
| 2<br />
| Major version of the ZIM archive format. Major version is updated when an incompatible change is integrated in the format (a lib made for a version N will probably not be able to read a version N+1)<br />
|-<br />
| [[#Major_.26_Minor_versions|minorVersion]] || integer || 6 || 2 || Minor version of the ZIM archive format. Minor version is updated when a compatible change is integrated (a lib made for a minor version n will be able to read a version n+1) <br />
|-<br />
| uuid || integer || 8 || 16 || unique id of this zim archive <br />
|-<br />
| entryCount || integer || 24 || 4 || total number of entries <br />
|-<br />
| clusterCount || integer || 28 || 4 || total number of clusters <br />
|-<br />
| urlPtrPos || integer || 32 || 8 || position of the directory pointerlist ordered by URL <br />
|-<br />
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title<br />
This is considered obsolete: readers should use <code>[[Search indexes#Title index v0|X/listing/titleordered/v0]]</code> instead and fall back to <code>titlePtrPos</code> if that entry is not present.<br />
|-<br />
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list <br />
|-<br />
| mimeListPos || integer || 56 || 8 || position of the MIME type list (also header size) <br />
|-<br />
| mainPage || integer || 64 || 4 || main page or 0xffffffff if no main page <br />
|-<br />
| layoutPage || integer || 68 || 4 || [[Layout Page|layout page]] or 0xffffffff if no layout page (deprecated, always 0xffffffff) <br />
|-<br />
| checksumPos || integer || 72 || 8 || pointer to the MD5 checksum of this archive, computed without the checksum itself. This always points 16 bytes before the end of the archive.<br />
|}<br />
<br />
A ZIM archive may be embedded in another file at a specific offset. In the context of the ZIM format, the start of the ZIM header is offset 0. Readers that support embedded archives must adjust offsets accordingly.<br />
<br />
=== Major & Minor versions ===<br />
<br />
{| class="wikitable"<br />
|+ ZIM format versions<br />
|-<br />
! Major !! Minor !! Backward compatible !! Description <br />
|-<br />
| 0 || 0 || || <br />
|-<br />
| 1 || 0 || || <br />
|-<br />
| 2 || 0 || || <br />
|-<br />
| 3 || 0 || || <br />
|-<br />
| 4 || 0 || || <br />
|-<br />
| 5 || 0 || yes || Introduces extended clusters <br />
|-<br />
| rowspan="3" | 6 || 0 || no || Still uses [[ZIM file format old namespace|"old" namespaces]]<br />
|-<br />
| 1 || no || Introduces [[#Namespaces|"new" namespaces]]<br />
|-<br />
| 2 || yes || Explicitly allows alias entries (several entries pointing to the same cluster/blob)<br />
|}<br />
<br />
== MIME Type List (mimeListPos) ==<br />
The MIME type list always follows directly after the header, so the ''mimeListPos'' also defines the end and size of the ZIM file header.<br />
<br />
The MIME types in this list are zero terminated strings. An empty string marks the end of the MIME type list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st MIME Type> || string || 0 ||zero terminated|| declaration of the <1st MIME Type> <br />
|-<br />
| <2nd MIME Type> || string || n/a ||zero terminated|| declaration of the <2nd MIME Type> <br />
|-<br />
| ... || string || ... ||zero terminated|| ... <br />
|-<br />
| <last entry / end> || string || n/a ||zero terminated|| empty string - end of MIME type list <br />
|}<br />
<br />
== URL Pointer List (urlPtrPos) ==<br />
The URL pointer list is a list of 8 byte offsets to the directory entries.<br />
<br />
The directory entries are always ordered by "full" URL (<code><namespace><path></code>). Ordering is simply done by comparing the URL strings.<br />
<br />
Since directory entries have variable sizes this is needed for random access.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st URL> || integer || 0 || 8 || pointer to the directory entry of <1st URL> <br />
|-<br />
| <2nd URL> || integer || 8 || 8 || pointer to the directory entry of <2nd URL> <br />
|-<br />
| <nth URL> || integer ||(n-1)*8|| 8 || pointer to the directory entry of <nth URL> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
Libzim caches directory entries and references the cached entries via the URL pointers.<br />
<br />
== Title Pointer List (titlePtrPos) ==<br />
The title pointer list is a list of entry indices ordered by title (<code><namespace><title></code>). The title pointer list actually points to entries in the URL pointer list.<br />
<br />
Note that the title pointers are only 4 bytes. They are not offsets in the file but entry numbers.<br />
<br />
To get the offset of an entry from the title pointer list, you have to look it up in the URL pointer list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Title> || integer || 0 || 4 || pointer to the URL pointer of <1st Title> <br />
|-<br />
| <2nd Title> || integer || 4 || 4 || pointer to the URL pointer of <2nd Title> <br />
|-<br />
| <nth Title> || integer ||(n-1)*4|| 4 || pointer to the URL pointer of <nth Title> <br />
|-<br />
| ... || integer || ... || 4 || ... <br />
|}<br />
<br />
The indirection from titles via URLs to directory entries has two reasons:<br />
* the pointer list is only half in size as 4 bytes are enough for each entry<br />
* accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in libzim.<br />
<br />
== Directory Entries ==<br />
Directory entries hold the meta information about all entries, images and other objects in a ZIM archive.<br />
<br />
There are different types of directory entries:<br />
<br />
=== Content Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || MIME type number as defined in the MIME type list <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters (must be 0) <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| cluster number || integer || 8 || 4 || cluster number in which the data of this directory entry is stored <br />
|-<br />
| blob number || integer || 12 || 4 || blob number inside the compressed cluster where the contents are stored <br />
|-<br />
| url || string || 16 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list or empty; in case it is empty, the URL is used as title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
=== Redirect Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || 0xffff for redirect <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| redirect index || integer || 8 || 4 || pointer to the directory entry of the redirect target <br />
|-<br />
| url || string || 12 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list or empty; in case it is empty, the URL is used as title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
None of the strings should have control characters from U+0000 through U+001F.<br />
<br />
=== Linktarget or deleted Entry (DEPRECATED) ===<br />
There are two kinds of deprecated entries that could be found in very old ZIM files (I, the main developer of libzim, have never seen one).<br />
<br />
They have a mimetype equal to 0xfffe or 0xfffd. Reader implementations may check for those values and ignore the whole dirent.<br />
<br />
== Cluster Pointer List (clusterPtrPos) ==<br />
The cluster pointer list is a list of 8 byte offsets which point to all data clusters in a ZIM archive.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Cluster> || integer || 0 || 8 || pointer to the <1st Cluster> <br />
|-<br />
| <2nd Cluster> || integer || 8 || 8 || pointer to the <2nd Cluster> <br />
|-<br />
| <nth Cluster> || integer ||(n-1)*8|| 8 || pointer to the <nth Cluster> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
== Clusters ==<br />
The clusters contain the actual data of the directory entries. Clusters can be compressed or uncompressed. The purpose of clusters is that the data of more than one directory entry can be compressed inside one cluster, making the compression much more efficient. Typically clusters have a size of about 1 MB.<br />
<br />
The first byte of the cluster carries information about the cluster.<br />
<br />
The four low bits identify the cluster compression type:<br />
* No compression is indicated by a value of 1<br />
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]] (or more precisely XZ, since there is an XZ header)) or 5 (Zstandard compression).<br />
* There have been other compression algorithms used before which have been removed: 2 for zlib and 3 for bzip2.<br />
* 0 is an obsolete code for no compression (inherited from Zeno)<br />
<br />
The fifth bit indicates whether the cluster is extended:<br />
* By default (5th bit == 0) the cluster is not extended. The offsets are then stored as 4-byte integers, so the contents stored in the cluster cannot exceed 4 GiB.<br />
* If the cluster is extended (5th bit == 1), the offsets are stored as 8-byte integers, so the contents stored in the cluster can exceed 4 GiB.<br />
A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5) clusters are never extended.<br />
<br />
The libzim uses [http://tukaani.org/xz/ xz-utils] as its implementation of LZMA2; for Java see [http://tukaani.org/xz/java.html XZ-Java].<br />
<br />
To locate the data of a specific directory entry within a cluster, the uncompressed cluster has, right after the first byte, a list of pointers to the blobs it contains.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| cluster information || integer || 0 || 1 || Four low bits: 1: no compression, 4: LZMA2 compressed, 5: zstd compressed<br />
Fifth bit: 0: normal (OFFSET_SIZE=4) 1: extended (OFFSET_SIZE=8) <br />
|-<br />
| colspan="5" | The following bytes must be decompressed before they are read!<br />
|-<br />
| <1st Blob> || integer || 1 || OFFSET_SIZE || offset to the <1st Blob> <br />
|-<br />
| <2nd Blob> || integer || 1+OFFSET_SIZE || OFFSET_SIZE || offset to the <2nd Blob> <br />
|-<br />
| <nth Blob> || integer ||(n-1)*OFFSET_SIZE+1|| OFFSET_SIZE || offset to the <nth Blob> <br />
|-<br />
| ... || integer || ... || OFFSET_SIZE || ... <br />
|-<br />
| <last blob / end> || integer || n/a || OFFSET_SIZE || offset to the end of the cluster <br />
|-<br />
| <1st Blob> || data || n/a || n/a || data of the <1st Blob> <br />
|-<br />
| <2nd Blob> || data || n/a || n/a || data of the <2nd Blob> <br />
|-<br />
| ... || data || ... || n/a || ... <br />
|}<br />
<br />
The offsets address uncompressed data. The last offset points to the end of the data area, so there is always one more offset than blobs. Since the first offset points to the start of the first blob's data, the number of offsets can be determined by dividing this first offset by OFFSET_SIZE. The size of a blob is the difference between two consecutive offsets.<br />
<br />
== Namespaces ==<br />
Namespaces separate different types of directory entries, which might have the same title or URL, stored in a ZIM archive.<br />
<br />
The new namespace usage gives namespaces strong semantics. libzim relies on these semantics and provides different APIs to access the different kinds of entries.<br />
<br />
libzim hides the namespace from the user, so an entry `foo.html` in namespace `C` is accessible as `foo.html`. libzim provides a specific API to access metadata.<br />
<br />
Other implementations are free to expose the namespace or not (but need to provide some fallback for all namespace usages).<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Namespace !! Description <br />
|-<br />
| C || User content entries - see [[Article Format]] <br />
|-<br />
| M || ZIM metadata - see [[Metadata]] <br />
|-<br />
| W || Well-known entries (MainPage, Favicon) - see [[Well known entries]] <br />
|-<br />
| X || search indexes - see [[Search indexes]]<br />
|}<br />
<br />
== URLs ==<br />
<br />
=== URL Encoding ===<br />
The URLs in the URL pointer list are UTF-8 and are not URL-encoded (https://www.ietf.org/rfc/rfc1738.txt).<br />
<br />
Some readers receive requests that have already been URL-decoded, whereas most readers handle the request URLs directly. In the latter case you have to do the decoding before you pass the parameter to libzim.<br />
<br />
=== Local Anchors ===<br />
Many articles - especially when a table of contents is used - use local anchors to jump within an article. <br />
<br />
<pre><br />
<a href="foo#headline1">jump to article foo, headline 1</a><br />
</pre><br />
<br />
The browser handles these local anchors by itself. It will determine if another article has to be loaded (local anchor inside another article than the currently shown) and will send a request only with the article URL without the local anchor - in our example "foo". After the article has been loaded the browser will then search for the local anchor tag and jump to the right location.<br />
<br />
If you use a common rendering engine or HTML widget you don't have to care about these cases; you can just use the requests as they are submitted by the engine / widget.<br />
<br />
Should you render the article contents yourself, you have to take care of this before you hand requests to libzim.<br />
<br />
== Encodings ==<br />
=== Character Encoding ===<br />
The standard encoding for ZIM archive content is UTF-8. So both article data and URLs should be handled accordingly.<br />
<br />
=== Integer Encoding ===<br />
All types are little-endian.<br />
<br />
All integers are unsigned integers (uint_16, uint_32, uint_64).<br />
<br />
All lengths are bytes.<br />
<br />
== Split ZIM files ==<br />
ZIM archives can be split into multiple files. This is necessary to store big ZIM archives (over 4 GB, for example) on limited file systems (like FAT32). The files can be of any size, but the naming is important. The ZIM files should be named as follows (the file name extensions matter): ''foobar.zimaa, foobar.zimab, foobar.zimac''...<br />
<br />
== See also ==<br />
* [[Zeno file format]] (deprecated)<br />
* [[ZIM File Format/4]] (deprecated)<br />
* [[ZIM File Example]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=ZIM_file_format&diff=31341ZIM file format2024-02-03T14:28:38Z<p>Kelson: /* Header */</p>
<hr />
<div>Since the beginning of 2021, the way namespaces are handled in the ZIM file format has changed.<br />
<br />
This is a major change in the way entries are handled in libzim and in the semantics around them, but it is not a break in the binary ZIM format. Old libraries/readers are compatible with new ZIM files.<br />
<br />
This page describes the new format. The old format can be found here: [[ZIM file format old namespace]].[[Image:Schema File Format.png|500px|right]]<br />
== Header ==<br />
A ZIM archive starts with a header :<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !! Offset !! Length !! Description <br />
|-<br />
| magicNumber || integer || 0 || 4 || Magic number to recognise the file format, must be 72173914 (0x44D495A)<br />
|-<br />
| [[#Major_.26_Minor_versions|majorVersion]]<br />
| integer<br />
| 4<br />
| 2<br />
| Major version of the ZIM archive format. Major version is updated when an incompatible change is integrated in the format (a lib made for a version N will probably not be able to read a version N+1)<br />
|-<br />
| [[#Major_.26_Minor_versions|minorVersion]] || integer || 6 || 2 || Minor version of the ZIM archive format. Minor version is updated when a compatible change is integrated (a lib made for a minor version n will be able to read a version n+1) <br />
|-<br />
| uuid || integer || 8 || 16 || unique id of this zim archive <br />
|-<br />
| entryCount || integer || 24 || 4 || total number of entries <br />
|-<br />
| clusterCount || integer || 28 || 4 || total number of clusters <br />
|-<br />
| urlPtrPos || integer || 32 || 8 || position of the directory pointerlist ordered by URL <br />
|-<br />
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title<br />
This is considered obsolete; readers should use <code>[[Search indexes#Title index v0|X/listing/titleordered/v0]]</code> instead and fall back to <code>titlePtrPos</code> if that entry is not present.<br />
|-<br />
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list <br />
|-<br />
| mimeListPos || integer || 56 || 8 || position of the MIME type list (also header size) <br />
|-<br />
| mainPage || integer || 64 || 4 || main page or 0xffffffff if no main page <br />
|-<br />
| layoutPage || integer || 68 || 4 || [[Layout Page|layout page]] or 0xffffffff if no layout page (deprecated, always 0xffffffff) <br />
|-<br />
| checksumPos || integer || 72 || 8 || pointer to the MD5 checksum of this archive (computed over the archive without the checksum itself). This always points 16 bytes before the end of the archive.<br />
|}<br />
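<br />
As an illustration of the table above, here is a minimal Python sketch that decodes the 80-byte header with the standard <code>struct</code> module. The names <code>parse_header</code>, <code>HEADER</code> and <code>FIELDS</code> are hypothetical (chosen for this example, not part of libzim); the layout comes from the table, and the little-endian byte order from the Integer Encoding section.<br />

```python
import struct

# Layout from the header table: magicNumber, majorVersion, minorVersion, uuid,
# entryCount, clusterCount, urlPtrPos, titlePtrPos, clusterPtrPos, mimeListPos,
# mainPage, layoutPage, checksumPos. "<" means little-endian without padding.
HEADER = struct.Struct("<IHH16sIIQQQQIIQ")
ZIM_MAGIC = 72173914
NO_PAGE = 0xFFFFFFFF  # value of mainPage when the archive has no main page

FIELDS = (
    "magicNumber", "majorVersion", "minorVersion", "uuid",
    "entryCount", "clusterCount", "urlPtrPos", "titlePtrPos",
    "clusterPtrPos", "mimeListPos", "mainPage", "layoutPage", "checksumPos",
)


def parse_header(data):
    """Decode the fixed-size ZIM header found at the start of `data`."""
    if len(data) < HEADER.size:
        raise ValueError("truncated ZIM header")
    header = dict(zip(FIELDS, HEADER.unpack_from(data, 0)))
    if header["magicNumber"] != ZIM_MAGIC:
        raise ValueError("bad magic number, not a ZIM archive")
    return header
```

The struct size works out to 80 bytes, which matches the end of the <code>checksumPos</code> field (offset 72, length 8).<br />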
<br />
A ZIM archive may be embedded in another file at a specific offset. Within the ZIM format, offset 0 is the start of the ZIM header, so readers that support embedded archives must adjust offsets accordingly.<br />
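<br />
The <code>checksumPos</code> field of the header can be used to verify an archive. The following is a minimal sketch, assuming the whole (non-embedded) archive fits in memory and that the checksum covers every byte located before <code>checksumPos</code>; <code>verify_checksum</code> is a hypothetical name, not a libzim function.<br />

```python
import hashlib
import struct


def verify_checksum(archive):
    """Return True if the stored MD5 matches the archive content before it."""
    # checksumPos is the 8-byte little-endian integer at header offset 72.
    (checksum_pos,) = struct.unpack_from("<Q", archive, 72)
    stored = archive[checksum_pos:checksum_pos + 16]
    if len(stored) != 16:
        return False
    return hashlib.md5(archive[:checksum_pos]).digest() == stored
```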
<br />
=== Major & Minor versions ===<br />
<br />
{| class="wikitable"<br />
|+ ZIM format versions<br />
|-<br />
! Major !! Minor !! Backward compatible !! Description <br />
|-<br />
| 0 || Example || || Example <br />
|-<br />
| 1 || Example || || Example <br />
|-<br />
| 2 || Example || || Example <br />
|-<br />
| 3 || Example || || Example<br />
|-<br />
| 4 || Example || || Example<br />
|-<br />
| 5 || 0 || yes || Introduces extended clusters <br />
|-<br />
| rowspan="3" | 6 || 0 || no || Still uses [[ZIM file format old namespace|"old" namespaces]]<br />
|-<br />
| 1 || no || Introduces [[#Namespaces|"new" namespaces]]<br />
|-<br />
| 2 || yes || Explicitly allows alias entries (several entries pointing to the same cluster/blob)<br />
|}<br />
<br />
== MIME Type List (mimeListPos) ==<br />
The MIME type list always follows directly after the header, so the ''mimeListPos'' also defines the end and size of the ZIM file header.<br />
<br />
The MIME types in this list are zero terminated strings. An empty string marks the end of the MIME type list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st MIME Type> || string || 0 ||zero terminated|| declaration of the <1st MIME Type> <br />
|-<br />
| <2nd MIME Type> || string || n/a ||zero terminated|| declaration of the <2nd MIME Type> <br />
|-<br />
| ... || string || ... ||zero terminated|| ... <br />
|-<br />
| <last entry / end> || string || n/a ||zero terminated|| empty string - end of MIME type list <br />
|}<br />
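<br />
Reading the list described above can be sketched as follows in Python. <code>read_mime_types</code> is a hypothetical helper name; the sketch assumes the archive bytes are already in memory and that the strings are UTF-8, the standard encoding named in the Encodings section.<br />

```python
def read_mime_types(data, mime_list_pos):
    """Return the MIME types stored at `mime_list_pos`, in declaration order."""
    types = []
    pos = mime_list_pos
    while True:
        end = data.index(b"\x00", pos)  # every string is zero terminated
        if end == pos:                  # an empty string closes the list
            return types
        types.append(data[pos:end].decode("utf-8"))
        pos = end + 1
```

The position of a type in the returned list is the number stored in the <code>mimetype</code> field of a content entry.<br />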
<br />
== URL Pointer List (urlPtrPos) ==<br />
The URL pointer list is a list of 8 byte offsets to the directory entries.<br />
<br />
The directory entries are always ordered by "full" URL (<code><namespace><path></code>). Ordering is simply done by comparing the URL strings.<br />
<br />
Since directory entries have variable sizes this is needed for random access.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st URL> || integer || 0 || 8 || pointer to the directory entry of <1st URL> <br />
|-<br />
| <2nd URL> || integer || 8 || 8 || pointer to the directory entry of <2nd URL> <br />
|-<br />
| <nth URL> || integer ||(n-1)*8|| 8 || pointer to the directory entry of <nth URL> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
Libzim caches directory entries and references the cached entries via the URL pointers.<br />
<br />
== Title Pointer List (titlePtrPos) ==<br />
The title pointer list is a list of entry indices ordered by title (<code><namespace><title></code>). The title pointer list actually points to entries in the URL pointer list.<br />
<br />
Note that the title pointers are only 4 bytes. They are not offsets in the file but entry numbers.<br />
<br />
To get the offset of an entry from the title pointer list, you have to look it up in the URL pointer list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Title> || integer || 0 || 4 || pointer to the URL pointer of <1st Title> <br />
|-<br />
| <2nd Title> || integer || 4 || 4 || pointer to the URL pointer of <2nd Title> <br />
|-<br />
| <nth Title> || integer ||(n-1)*4|| 4 || pointer to the URL pointer of <nth Title> <br />
|-<br />
| ... || integer || ... || 4 || ... <br />
|}<br />
<br />
The indirection from titles via URLs to directory entries has two reasons:<br />
* the pointer list is only half in size as 4 bytes are enough for each entry<br />
* accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in libzim.<br />
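<br />
The two pointer lists and the indirection between them can be sketched in Python as below. The function names are hypothetical, and the indices are 0-based here, whereas the tables above count from 1.<br />

```python
import struct


def dirent_offset_by_url_index(data, url_ptr_pos, n):
    """File offset of the n-th (0-based) directory entry in URL order."""
    (offset,) = struct.unpack_from("<Q", data, url_ptr_pos + 8 * n)
    return offset


def dirent_offset_by_title_index(data, url_ptr_pos, title_ptr_pos, n):
    """File offset of the n-th (0-based) directory entry in title order."""
    # A title pointer is a 4-byte entry number, not a file offset ...
    (entry_index,) = struct.unpack_from("<I", data, title_ptr_pos + 4 * n)
    # ... so it has to be resolved through the URL pointer list.
    return dirent_offset_by_url_index(data, url_ptr_pos, entry_index)
```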
<br />
== Directory Entries ==<br />
Directory entries hold the meta information about all entries, images and other objects in a ZIM archive.<br />
<br />
There are different types of directory entries:<br />
<br />
=== Content Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || MIME type number as defined in the MIME type list <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters (must be 0) <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| cluster number || integer || 8 || 4 || cluster number in which the data of this directory entry is stored <br />
|-<br />
| blob number || integer || 12 || 4 || blob number inside the compressed cluster where the contents are stored <br />
|-<br />
| url || string || 16 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list, or empty; if it is empty, the URL is used as the title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
=== Redirect Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || 0xffff for redirect <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| redirect index || integer || 8 || 4 || pointer to the directory entry of the redirect target <br />
|-<br />
| url || string || 12 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list, or empty; if it is empty, the URL is used as the title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
None of the strings should have control characters from U+0000 through U+001F.<br />
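<br />
The two entry layouts above differ only after the first 8 bytes, which a reader can exploit. The following Python sketch shows one way to decode either kind; <code>parse_dirent</code>, the returned dictionary keys and the handling of the deprecated mimetype values 0xfffe and 0xfffd are choices made for this example, not the libzim implementation.<br />

```python
import struct

REDIRECT = 0xFFFF
DEPRECATED = (0xFFFE, 0xFFFD)  # linktarget / deleted entries, ignored here


def _cstring(data, pos):
    """Read a zero-terminated UTF-8 string; return (text, position after it)."""
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("utf-8"), end + 1


def parse_dirent(data, pos):
    """Decode the directory entry starting at file offset `pos`."""
    # Common prefix: mimetype (2), parameter len (1), namespace (1), revision (4).
    mimetype, _param_len, namespace, _revision = struct.unpack_from("<HBcI", data, pos)
    if mimetype in DEPRECATED:
        return {"kind": "deprecated"}
    entry = {"namespace": namespace.decode("ascii")}
    if mimetype == REDIRECT:
        entry["kind"] = "redirect"
        (entry["redirect_index"],) = struct.unpack_from("<I", data, pos + 8)
        strings_at = pos + 12
    else:
        entry["kind"] = "content"
        entry["mimetype"] = mimetype
        entry["cluster"], entry["blob"] = struct.unpack_from("<II", data, pos + 8)
        strings_at = pos + 16
    entry["url"], after_url = _cstring(data, strings_at)
    title, _ = _cstring(data, after_url)
    entry["title"] = title or entry["url"]  # an empty title means "use the URL"
    return entry
```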
<br />
=== Linktarget or deleted Entry (DEPRECATED) ===<br />
There are two kinds of deprecated entries that could be found in very old ZIM files (I, main developer of libzim, never saw one).<br />
<br />
They have a mimetype equal to 0xfffe or 0xfffd. Reader implementations may check for those values and ignore the whole dirent.<br />
<br />
== Cluster Pointer List (clusterPtrPos) ==<br />
The cluster pointer list is a list of 8 byte offsets which point to all data clusters in a ZIM archive.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Cluster> || integer || 0 || 8 || pointer to the <1st Cluster> <br />
|-<br />
| <2nd Cluster> || integer || 8 || 8 || pointer to the <2nd Cluster> <br />
|-<br />
| <nth Cluster> || integer ||(n-1)*8|| 8 || pointer to the <nth Cluster> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
== Clusters ==<br />
The clusters contain the actual data of the directory entries. Clusters can be compressed or uncompressed. The purpose of clusters is that the data of more than one directory entry can be compressed together inside one cluster, making the compression much more efficient. Typically clusters have a size of about 1 MB.<br />
<br />
The first byte of the cluster identifies some information about the cluster.<br />
<br />
The four low bits identify the cluster compression type:<br />
* No compression is indicated by a value of 1<br />
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]] (or more precisely XZ, since there is a XZ header)) or 5 (Zstandard compression).<br />
* There have been other compression algorithms used before which have been removed: 2 for zlib and 3 for bzip2.<br />
* 0 is an obsolete code for no compression (inherited from Zeno)<br />
<br />
The fifth bit identifies whether the cluster is extended:<br />
* By default (5th bit == 0) the cluster is not extended. The offsets are stored as 4-byte integers, so the contents stored in the cluster cannot exceed 4 GiB.<br />
* If the cluster is extended (5th bit == 1), the offsets are stored as 8-byte integers, so the contents stored in the cluster can exceed 4 GiB.<br />
A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5) the cluster is never extended.<br />
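<br />
Decoding the first byte of a cluster can be sketched as follows in Python. The function name and the returned labels (<code>"none"</code>, <code>"xz"</code>, <code>"zstd"</code>) are hypothetical; the numeric codes are the ones listed above.<br />

```python
COMPRESSION = {0: "none", 1: "none", 4: "xz", 5: "zstd"}
REMOVED = {2: "zlib", 3: "bzip2"}  # codes of algorithms no longer supported


def decode_cluster_info(info_byte):
    """Return (compression label, OFFSET_SIZE) for a cluster's first byte."""
    code = info_byte & 0x0F            # four low bits: compression type
    extended = bool(info_byte & 0x10)  # fifth bit: extended cluster flag
    if code in REMOVED:
        raise ValueError("removed compression type: " + REMOVED[code])
    if code not in COMPRESSION:
        raise ValueError("unknown compression type: %d" % code)
    return COMPRESSION[code], 8 if extended else 4
```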
<br />
The libzim uses [http://tukaani.org/xz/ xz-utils] as a C++ implementation of lzma2, for Java see [http://tukaani.org/xz/java.html XZ-Java].<br />
<br />
To locate the data of a specific directory entry, the uncompressed cluster contains, right after the first byte, a list of offsets to the blobs it holds.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| cluster information || integer || 0 || 1 || Four low bits: 1: no compression, 4: LZMA2 compressed, 5: zstd compressed<br />
Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8) <br />
|-<br />
| colspan="5" | The following data bytes have to be decompressed first!<br />
|-<br />
| <1st Blob> || integer || 1 || OFFSET_SIZE || offset to the <1st Blob> <br />
|-<br />
| <2nd Blob> || integer || 1+OFFSET_SIZE || OFFSET_SIZE || offset to the <2nd Blob> <br />
|-<br />
| <nth Blob> || integer ||(n-1)*OFFSET_SIZE+1|| OFFSET_SIZE || offset to the <nth Blob> <br />
|-<br />
| ... || integer || ... || OFFSET_SIZE || ... <br />
|-<br />
| <last blob / end> || integer || n/a || OFFSET_SIZE || offset to the end of the cluster <br />
|-<br />
| <1st Blob> || data || n/a || n/a || data of the <1st Blob> <br />
|-<br />
| <2nd Blob> || data || n/a || n/a || data of the <2nd Blob> <br />
|-<br />
| ... || data || ... || n/a || ... <br />
|}<br />
<br />
The offsets address uncompressed data. The last offset points to the end of the data area, so there is always one more offset than blobs. Since the first offset points to the start of the first blob's data, the number of offsets can be determined by dividing this first offset by OFFSET_SIZE. The size of a blob is the difference between two consecutive offsets.<br />
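<br />
The offset arithmetic described above can be sketched in Python as below. This is an illustration under one assumption: offsets are counted from the start of the offset list (the byte right after the cluster information byte), which is what the "divide the first offset by OFFSET_SIZE" rule implies. <code>split_blobs</code> is a hypothetical name, and <code>payload</code> is the already decompressed cluster content without its first byte.<br />

```python
import struct


def split_blobs(payload, offset_size=4):
    """Return the list of blobs stored in an uncompressed cluster payload."""
    fmt = "<I" if offset_size == 4 else "<Q"
    (first,) = struct.unpack_from(fmt, payload, 0)
    count = first // offset_size  # number of offsets, one more than blobs
    offsets = [
        struct.unpack_from(fmt, payload, i * offset_size)[0]
        for i in range(count)
    ]
    # Each blob spans two consecutive offsets; the last offset is the end.
    return [payload[start:end] for start, end in zip(offsets, offsets[1:])]
```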
<br />
== Namespaces ==<br />
Namespaces separate different types of directory entries, which might have the same title or URL, stored in a ZIM archive.<br />
<br />
The new namespace usage gives namespaces strong semantics. libzim relies on these semantics and provides different APIs to access the different kinds of entries.<br />
<br />
libzim hides the namespace from the user, so an entry `foo.html` in namespace `C` is accessible as `foo.html`. libzim provides a specific API to access metadata.<br />
<br />
Other implementations are free to expose the namespace or not (but need to provide some fallback for all namespace usages).<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Namespace !! Description <br />
|-<br />
| C || User content entries - see [[Article Format]] <br />
|-<br />
| M || ZIM metadata - see [[Metadata]] <br />
|-<br />
| W || Well-known entries (MainPage, Favicon) - see [[Well known entries]] <br />
|-<br />
| X || search indexes - see [[Search indexes]]<br />
|}<br />
<br />
== URLs ==<br />
<br />
=== URL Encoding ===<br />
The URLs in the URL pointer list are UTF-8 and are not URL-encoded (https://www.ietf.org/rfc/rfc1738.txt).<br />
<br />
Some readers receive requests that have already been URL-decoded, whereas most readers handle the request URLs directly. In the latter case you have to do the decoding before you pass the parameter to libzim.<br />
<br />
=== Local Anchors ===<br />
Many articles - especially when a table of contents is used - use local anchors to jump within an article. <br />
<br />
<pre><br />
<a href="foo#headline1">jump to article foo, headline 1</a><br />
</pre><br />
<br />
The browser handles these local anchors by itself. It will determine if another article has to be loaded (local anchor inside another article than the currently shown) and will send a request only with the article URL without the local anchor - in our example "foo". After the article has been loaded the browser will then search for the local anchor tag and jump to the right location.<br />
<br />
If you use a common rendering engine or HTML widget you don't have to care about these cases; you can just use the requests as they are submitted by the engine / widget.<br />
<br />
Should you render the article contents yourself, you have to take care of this before you hand requests to libzim.<br />
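<br />
For a reader that handles raw request URLs itself, the two steps of this section (dropping the local anchor, then URL decoding) can be sketched in Python as follows. <code>request_to_zim_path</code> is a hypothetical helper; the result is what would then be handed to libzim.<br />

```python
from urllib.parse import unquote


def request_to_zim_path(request_url):
    """Turn a raw, URL-encoded request into the UTF-8 path stored in the ZIM."""
    # Drop the local anchor first: a literal "#" always starts the anchor,
    # while a "#" belonging to the path arrives encoded as "%23".
    without_anchor = request_url.partition("#")[0]
    return unquote(without_anchor)
```

The order matters: decoding before splitting would turn an encoded <code>%23</code> in the path into a separator.<br />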
<br />
== Encodings ==<br />
=== Character Encoding ===<br />
The standard encoding for ZIM archive content is UTF-8. So both article data and URLs should be handled accordingly.<br />
<br />
=== Integer Encoding ===<br />
All types are little-endian.<br />
<br />
All integers are unsigned integers (uint_16, uint_32, uint_64).<br />
<br />
All lengths are bytes.<br />
<br />
== Split ZIM files ==<br />
ZIM archives can be split into multiple files. This is necessary to store big ZIM archives (over 4 GB, for example) on limited file systems (like FAT32). The files can be of any size, but the naming is important. The ZIM files should be named as follows (the file name extensions matter): ''foobar.zimaa, foobar.zimab, foobar.zimac''...<br />
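<br />
The naming scheme can be generated with a few lines of Python. This sketch assumes the two-letter suffix simply continues alphabetically after ''ac'' (''ad'', ... ''az'', ''ba'', ...), as the Unix <code>split</code> tool does; only the first three names are given by this page, and <code>split_part_names</code> is a hypothetical helper.<br />

```python
import string
from itertools import islice, product


def split_part_names(base, count):
    """Return the first `count` file names of a split archive called `base`."""
    # Assumption: suffixes run aa, ab, ..., az, ba, ... (at most 26 * 26 parts).
    suffixes = ("".join(pair) for pair in product(string.ascii_lowercase, repeat=2))
    return ["%s.zim%s" % (base, suffix) for suffix in islice(suffixes, count)]
```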
<br />
== See also ==<br />
* [[Zeno file format]] (deprecated)<br />
* [[ZIM File Format/4]] (deprecated)<br />
* [[ZIM File Example]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=ZIM_file_format&diff=31340ZIM file format2024-02-03T14:27:16Z<p>Kelson: /* Header */</p>
<hr />
<div>Since the beginning of 2021, the way namespaces are handled in the ZIM file format has changed.<br />
<br />
This is a major change in the way entries are handled in libzim and in the semantics around them, but it is not a break in the binary ZIM format. Old libraries/readers are compatible with new ZIM files.<br />
<br />
This page describes the new format. The old format can be found here: [[ZIM file format old namespace]].[[Image:Schema File Format.png|500px|right]]<br />
== Header ==<br />
A ZIM archive starts with a header :<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !! Offset !! Length !! Description <br />
|-<br />
| magicNumber || integer || 0 || 4 || Magic number to recognise the file format, must be 72173914 (0x44D495A)<br />
|-<br />
| [[#Major_.26_Minor_versions|majorVersion]]<br />
| integer<br />
| 4<br />
| 2<br />
| Major version of the ZIM archive format (6)<br />
|-<br />
| [[#Major_.26_Minor_versions|minorVersion]] || integer || 6 || 2 || Minor version of the ZIM archive format (1 for new namespace usage, 0 for old namespace usage) <br />
|-<br />
| uuid || integer || 8 || 16 || unique id of this zim archive <br />
|-<br />
| entryCount || integer || 24 || 4 || total number of entries <br />
|-<br />
| clusterCount || integer || 28 || 4 || total number of clusters <br />
|-<br />
| urlPtrPos || integer || 32 || 8 || position of the directory pointerlist ordered by URL <br />
|-<br />
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title<br />
This is considered obsolete; readers should use <code>[[Search indexes#Title index v0|X/listing/titleordered/v0]]</code> instead and fall back to <code>titlePtrPos</code> if that entry is not present.<br />
|-<br />
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list <br />
|-<br />
| mimeListPos || integer || 56 || 8 || position of the MIME type list (also header size) <br />
|-<br />
| mainPage || integer || 64 || 4 || main page or 0xffffffff if no main page <br />
|-<br />
| layoutPage || integer || 68 || 4 || [[Layout Page|layout page]] or 0xffffffff if no layout page (deprecated, always 0xffffffff) <br />
|-<br />
| checksumPos || integer || 72 || 8 || pointer to the MD5 checksum of this archive (computed over the archive without the checksum itself). This always points 16 bytes before the end of the archive.<br />
|}<br />
<br />
A ZIM archive may be embedded in another file at a specific offset. Within the ZIM format, offset 0 is the start of the ZIM header, so readers that support embedded archives must adjust offsets accordingly.<br />
<br />
=== Major & Minor versions ===<br />
<br />
The major version is updated when an incompatible change is integrated in the format (a lib made for a version N will probably not be able to read a version N+1).<br />
<br />
The minor version is updated when a compatible change is integrated (a lib made for a minor version n will be able to read a version n+1).<br />
<br />
The minor version can be:<br />
* 0: the old namespace usage is used (see [[ZIM file format old namespace]])<br />
* 1: the new namespace usage is used (described here).<br />
* 2: libzim (the reference implementation) allows creating alias entries (several entries pointing to the same cluster/blob). This was already allowed by the specification; this is a hint only.<br />
<br />
{| class="wikitable"<br />
|+ ZIM format versions<br />
|-<br />
! Major !! Minor !! Backward compatible !! Description <br />
|-<br />
| 0 || Example || || Example <br />
|-<br />
| 1 || Example || || Example <br />
|-<br />
| 2 || Example || || Example <br />
|-<br />
| 3 || Example || || Example<br />
|-<br />
| 4 || Example || || Example<br />
|-<br />
| 5 || 0 || yes || Introduces extended clusters <br />
|-<br />
| rowspan="3" | 6 || 0 || no || Still uses [[ZIM file format old namespace|"old" namespaces]]<br />
|-<br />
| 1 || no || Introduces [[#Namespaces|"new" namespaces]]<br />
|-<br />
| 2 || yes || Explicitly allows alias entries (several entries pointing to the same cluster/blob)<br />
|}<br />
<br />
== MIME Type List (mimeListPos) ==<br />
The MIME type list always follows directly after the header, so the ''mimeListPos'' also defines the end and size of the ZIM file header.<br />
<br />
The MIME types in this list are zero terminated strings. An empty string marks the end of the MIME type list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st MIME Type> || string || 0 ||zero terminated|| declaration of the <1st MIME Type> <br />
|-<br />
| <2nd MIME Type> || string || n/a ||zero terminated|| declaration of the <2nd MIME Type> <br />
|-<br />
| ... || string || ... ||zero terminated|| ... <br />
|-<br />
| <last entry / end> || string || n/a ||zero terminated|| empty string - end of MIME type list <br />
|}<br />
<br />
== URL Pointer List (urlPtrPos) ==<br />
The URL pointer list is a list of 8 byte offsets to the directory entries.<br />
<br />
The directory entries are always ordered by "full" URL (<code><namespace><path></code>). Ordering is simply done by comparing the URL strings.<br />
<br />
Since directory entries have variable sizes this is needed for random access.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st URL> || integer || 0 || 8 || pointer to the directory entry of <1st URL> <br />
|-<br />
| <2nd URL> || integer || 8 || 8 || pointer to the directory entry of <2nd URL> <br />
|-<br />
| <nth URL> || integer ||(n-1)*8|| 8 || pointer to the directory entry of <nth URL> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
Libzim caches directory entries and references the cached entries via the URL pointers.<br />
<br />
== Title Pointer List (titlePtrPos) ==<br />
The title pointer list is a list of entry indices ordered by title (<code><namespace><title></code>). The title pointer list actually points to entries in the URL pointer list.<br />
<br />
Note that the title pointers are only 4 bytes. They are not offsets in the file but entry numbers.<br />
<br />
To get the offset of an entry from the title pointer list, you have to look it up in the URL pointer list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Title> || integer || 0 || 4 || pointer to the URL pointer of <1st Title> <br />
|-<br />
| <2nd Title> || integer || 4 || 4 || pointer to the URL pointer of <2nd Title> <br />
|-<br />
| <nth Title> || integer ||(n-1)*4|| 4 || pointer to the URL pointer of <nth Title> <br />
|-<br />
| ... || integer || ... || 4 || ... <br />
|}<br />
<br />
The indirection from titles via URLs to directory entries has two reasons:<br />
* the pointer list is only half in size as 4 bytes are enough for each entry<br />
* accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in libzim.<br />
<br />
== Directory Entries ==<br />
Directory entries hold the meta information about all entries, images and other objects in a ZIM archive.<br />
<br />
There are different types of directory entries:<br />
<br />
=== Content Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || MIME type number as defined in the MIME type list <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters (must be 0) <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| cluster number || integer || 8 || 4 || cluster number in which the data of this directory entry is stored <br />
|-<br />
| blob number || integer || 12 || 4 || blob number inside the compressed cluster where the contents are stored <br />
|-<br />
| url || string || 16 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list or empty; in case it is empty, the URL is used as title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
=== Redirect Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || 0xffff for redirect <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| redirect index || integer || 8 || 4 || pointer to the directory entry of the redirect target <br />
|-<br />
| url || string || 12 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list or empty; in case it is empty, the URL is used as title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
None of the strings should have control characters from U+0000 through U+001F.<br />
<br />
=== Linktarget or deleted Entry (DEPRECATED) ===<br />
There are two kinds of deprecated entries that could be found in very old ZIM files (the main developer of libzim reports never having seen one).<br />
<br />
They have a mimetype equal to 0xfffe or 0xfffd. Reader implementations may check for those values and ignore the whole dirent.<br />
<br />
== Cluster Pointer List (clusterPtrPos) ==<br />
The cluster pointer list is a list of 8 byte offsets which point to all data clusters in a ZIM archive.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Cluster> || integer || 0 || 8 || pointer to the <1st Cluster> <br />
|-<br />
| <2nd Cluster> || integer || 8 || 8 || pointer to the <2nd Cluster> <br />
|-<br />
| <nth Cluster> || integer ||(n-1)*8|| 8 || pointer to the <nth Cluster> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
== Clusters ==<br />
The clusters contain the actual data of the directory entries. Clusters can be compressed or uncompressed. The purpose of clusters is that the data of more than one directory entry can be compressed inside one cluster, making the compression much more efficient. Typically, clusters have a size of about 1 MB.<br />
<br />
The first byte of the cluster identifies some information about the cluster.<br />
<br />
The four low bits identify the cluster compression type:<br />
* No compression is indicated by a value of 1<br />
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]] (or more precisely XZ, since there is an XZ header)) or 5 (Zstandard compression).<br />
* Other compression algorithms were used in the past and have since been removed: 2 for zlib and 3 for bzip2.<br />
* 0 is an obsolete code for no compression (inherited from Zeno)<br />
<br />
The fifth bit identifies whether the cluster is extended or not:<br />
* By default (5th bit == 0) the cluster is not extended. This means that the offsets are stored as 4-byte integers, so the contents stored in the cluster cannot exceed 4 GiB.<br />
* If the cluster is extended (5th bit == 1), the offsets are stored as 8-byte integers, so the contents stored in the cluster can exceed 4 GiB.<br />
A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5) the cluster is never extended.<br />
<br />
The libzim uses [http://tukaani.org/xz/ xz-utils] as a C++ implementation of lzma2, for Java see [http://tukaani.org/xz/java.html XZ-Java].<br />
<br />
To find the data of a specific directory entry within a cluster the uncompressed cluster has a list of pointers to blobs within the uncompressed cluster after the first byte.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| cluster information || integer || 0 || 1 || Four low bits: 1: no compression, 4: LZMA2 compressed, 5: zstd compressed<br />
Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8) <br />
|-<br />
| colspan="5" | The following data bytes have to be uncompressed!<br />
|-<br />
| <1st Blob> || integer || 1 || OFFSET_SIZE || offset to the <1st Blob> <br />
|-<br />
| <2nd Blob> || integer || 1+OFFSET_SIZE || OFFSET_SIZE || offset to the <2nd Blob> <br />
|-<br />
| <nth Blob> || integer ||(n-1)*OFFSET_SIZE+1|| OFFSET_SIZE || offset to the <nth Blob> <br />
|-<br />
| ... || integer || ... || OFFSET_SIZE || ... <br />
|-<br />
| <last blob / end> || integer || n/a || OFFSET_SIZE || offset to the end of the cluster <br />
|-<br />
| <1st Blob> || data || n/a || n/a || data of the <1st Blob> <br />
|-<br />
| <2nd Blob> || data || n/a || n/a || data of the <2nd Blob> <br />
|-<br />
| ... || data || ... || n/a || ... <br />
|}<br />
<br />
The offset addresses uncompressed data. The last pointer points to the end of the data area. So there is always one more offset than blobs. Since the first offset points to the start of the first data, the number of offsets can be determined by dividing this offset by OFFSET_SIZE. The size of one blob is calculated by the difference of two consecutive offsets.<br />
<br />
== Namespaces ==<br />
Namespaces separate different types of directory entries - which might have the same title or URL - stored in the ZIM archive format.<br />
<br />
The new namespace usage puts strong semantics on the namespaces. libzim uses these semantics and provides different APIs to access the different kinds of entries.<br />
<br />
libzim hides the namespace from the user, so an entry `foo.html` in namespace `C` will be accessible as `foo.html`. libzim provides a specific API to access metadata.<br />
<br />
Other implementations are free to expose the namespace or not (but need to provide some fallback for all namespace usages)<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Namespace !! Description <br />
|-<br />
| C || User content entries - see [[Article Format]] <br />
|-<br />
| M || ZIM metadata - see [[Metadata]] <br />
|-<br />
| W || Well known entries (MainPage, Favicon) - see [[Well known entries]] <br />
|-<br />
| X || search indexes - see [[Search indexes]]<br />
|}<br />
<br />
== URLs ==<br />
<br />
=== URL Encoding ===<br />
The URLs in the UrlPointerlist are utf-8 and are not url encoded (https://www.ietf.org/rfc/rfc1738.txt)<br />
<br />
Some readers sit behind a request layer that already does the URL decoding internally, whereas most readers handle the URLs directly. In the latter case you have to do the decoding before you pass the parameter to libzim.<br />
<br />
=== Local Anchors ===<br />
Many articles - especially when a table of contents is used - use local anchors to jump within an article. <br />
<br />
<pre><br />
<a href="foo#headline1">jump to article foo, headline 1</a><br />
</pre><br />
<br />
The browser handles these local anchors by itself. It will determine if another article has to be loaded (local anchor inside another article than the currently shown) and will send a request only with the article URL without the local anchor - in our example "foo". After the article has been loaded the browser will then search for the local anchor tag and jump to the right location.<br />
<br />
If you use a common rendering engine or HTML widget you don't have to care about these cases; you can just use the requests as they are submitted by the engine / widget.<br />
<br />
Should you render the article contents yourself, you have to take care of this before you hand requests to libzim.<br />
<br />
== Encodings ==<br />
=== Character Encoding ===<br />
The standard encoding for ZIM archive content is UTF-8. So both article data and URLs should be handled accordingly.<br />
<br />
=== Integer Encoding ===<br />
All types are little-endian.<br />
<br />
All integers are unsigned integers (uint_16, uint_32, uint_64).<br />
<br />
All lengths are bytes.<br />
<br />
== Split ZIM files ==<br />
ZIM archives can be split into multiple files. This is necessary to store big ZIM archives (over 4 GB, for example) on limited file systems (like FAT32). That said, the files can be of any size, but the naming is really important. The ZIM files should be named as follows (the file name extensions matter): ''foobar.zimaa, foobar.zimab, foobar.zimac''...<br />
<br />
== See also ==<br />
* [[Zeno file format]] (deprecated)<br />
* [[ZIM File Format/4]] (deprecated)<br />
* [[ZIM File Example]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=ZIM_file_format&diff=31339ZIM file format2024-02-03T14:26:07Z<p>Kelson: /* Major & Minor versions */</p>
<hr />
<div>Since the beginning of 2021, we have changed the way we handle namespaces in the ZIM file format.<br />
<br />
This is a major change in the way we handle entries in libzim and in the semantics around them, but it is not a break in the binary ZIM format. Old libraries/readers are compatible with new ZIM files.<br />
<br />
This page describes the new format. The old format can be found here: [[ZIM file format old namespace]].[[Image:Schema File Format.png|500px|right]]<br />
== Header ==<br />
A ZIM archive starts with a header :<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !! Offset !! Length !! Description <br />
|-<br />
| magicNumber || integer || 0 || 4 || Magic number to recognise the file format, must be 72173914 (0x44D495A)<br />
|-<br />
| [[#Major_.26_Minor_versions|majorVersion]]<br />
| integer<br />
| 4<br />
| 2<br />
| Major version of the ZIM archive format (6)<br />
|-<br />
| [[#Major_.26_Minor_versions|minorVersion]] || integer || 6 || 2 || Minor version of the ZIM archive format (1 for new namespace usage, 0 for old namespace usage) <br />
|-<br />
| uuid || integer || 8 || 16 || unique id of this zim archive <br />
|-<br />
| entryCount || integer || 24 || 4 || total number of entries <br />
|-<br />
| clusterCount || integer || 28 || 4 || total number of clusters <br />
|-<br />
| urlPtrPos || integer || 32 || 8 || position of the directory pointerlist ordered by URL <br />
|-<br />
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title<br />
This is considered obsolete; readers should use <code>[[Search indexes#Title index v0|X/listing/titleordered/v0]]</code> instead and fall back to <code>titlePtrPos</code> if the entry is not present.<br />
|-<br />
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list <br />
|-<br />
| mimeListPos || integer || 56 || 8 || position of the MIME type list (also header size) <br />
|-<br />
| mainPage || integer || 64 || 4 || main page or 0xffffffff if no main page <br />
|-<br />
| layoutPage || integer || 68 || 4 || [[Layout Page|layout page]] or 0xffffffff if no layout page (deprecated, always 0xffffffff) <br />
|-<br />
| checksumPos || integer || 72 || 8 || pointer to the MD5 checksum of this archive without the checksum itself. This always points 16 bytes before the end of the archive.<br />
|}<br />
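<br />
As an illustration of the table above, the following Python sketch decodes the 80-byte header with the standard <code>struct</code> module. The function name <code>parse_header</code> and the returned dictionary are assumptions made for this example; they are not part of libzim.<br />
<br />
```python
import struct
import uuid

# Field order and sizes follow the header table: little-endian, no padding.
HEADER_FORMAT = "<IHH16sIIQQQQIIQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 80 bytes
ZIM_MAGIC = 72173914  # 0x044D495A

FIELD_NAMES = (
    "magicNumber", "majorVersion", "minorVersion", "uuid",
    "entryCount", "clusterCount", "urlPtrPos", "titlePtrPos",
    "clusterPtrPos", "mimeListPos", "mainPage", "layoutPage", "checksumPos",
)


def parse_header(data: bytes) -> dict:
    """Decode the fixed-size ZIM header (hypothetical helper, not a libzim API)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("header is truncated")
    header = dict(zip(FIELD_NAMES, struct.unpack_from(HEADER_FORMAT, data)))
    if header["magicNumber"] != ZIM_MAGIC:
        raise ValueError("not a ZIM archive")
    # Present the 16 raw bytes as a UUID string for readability.
    header["uuid"] = str(uuid.UUID(bytes=header["uuid"]))
    return header
```
<br />
For an archive embedded in another file, <code>data</code> would have to start at the embedded archive's offset 0 rather than at the start of the containing file.<br />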
<br />
=== Major & Minor versions ===<br />
<br />
Major version is updated when an incompatible change is integrated in the format (a lib made for a version N will probably not be able to read a version N+1)<br />
<br />
Minor version is updated when a compatible change is integrated (a lib made for a minor version n will be able to read a version n+1)<br />
<br />
The minor version can be:<br />
* 0: the old namespace usage is in effect (see [[ZIM file format old namespace]])<br />
* 1: the new namespace usage is in effect (described here).<br />
* 2: libzim (the reference implementation) allows creating alias entries (several entries pointing to the same cluster/blob). This was already allowed by the specification; this is a hint only.<br />
<br />
A ZIM archive may be embedded in another file at a specific offset. In the context of the ZIM format, the start of the ZIM header is offset 0. Readers that support reading an embedded archive must adapt offsets accordingly.<br />
<br />
{| class="wikitable"<br />
|+ ZIM format versions<br />
|-<br />
! Major !! Minor !! Backward compatible !! Description <br />
|-<br />
| 0 || Example || || Example <br />
|-<br />
| 1 || Example || || Example <br />
|-<br />
| 2 || Example || || Example <br />
|-<br />
| 3 || Example || || Example<br />
|-<br />
| 4 || Example || || Example<br />
|-<br />
| 5 || 0 || yes || Introduces extended clusters <br />
|-<br />
| rowspan="3" | 6 || 0 || no || Still uses [[ZIM file format old namespace|"old" namespaces]]<br />
|-<br />
| 1 || no || Introduces [[#Namespaces|"new" namespaces]]<br />
|-<br />
| 2 || yes || Explicitly allows alias entries (several entries pointing to the same cluster/blob)<br />
|}<br />
<br />
== MIME Type List (mimeListPos) ==<br />
The MIME type list always follows directly after the header, so the ''mimeListPos'' also defines the end and size of the ZIM file header.<br />
<br />
The MIME types in this list are zero terminated strings. An empty string marks the end of the MIME type list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st MIME Type> || string || 0 ||zero terminated|| declaration of the <1st MIME Type> <br />
|-<br />
| <2nd MIME Type> || string || n/a ||zero terminated|| declaration of the <2nd MIME Type> <br />
|-<br />
| ... || string || ... ||zero terminated|| ... <br />
|-<br />
| <last entry / end> || string || n/a ||zero terminated|| empty string - end of MIME type list <br />
|}<br />
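<br />
A minimal sketch of reading this list, assuming the whole archive (or at least the region starting at <code>mimeListPos</code>) is available as a <code>bytes</code> object. The helper name <code>parse_mime_list</code> is hypothetical.<br />
<br />
```python
def parse_mime_list(data: bytes, start: int) -> list[str]:
    """Read zero-terminated MIME type strings until the empty string."""
    mime_types = []
    pos = start
    while True:
        end = data.index(b"\x00", pos)  # raises ValueError if unterminated
        if end == pos:
            # Empty string: end of the MIME type list.
            return mime_types
        mime_types.append(data[pos:end].decode("utf-8"))
        pos = end + 1
```
<br />
The position of a string in the returned list is the MIME type number stored in the <code>mimetype</code> field of a directory entry.<br />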
<br />
== URL Pointer List (urlPtrPos) ==<br />
The URL pointer list is a list of 8 byte offsets to the directory entries.<br />
<br />
The directory entries are always ordered by "full" URL (<code><namespace><path></code>). Ordering is simply done by comparing the URL strings.<br />
<br />
Since directory entries have variable sizes this is needed for random access.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st URL> || integer || 0 || 8 || pointer to the directory entry of <1st URL> <br />
|-<br />
| <2nd URL> || integer || 8 || 8 || pointer to the directory entry of <2nd URL> <br />
|-<br />
| <nth URL> || integer ||(n-1)*8|| 8 || pointer to the directory entry of <nth URL> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
Libzim caches directory entries and references the cached entries via the URL pointers.<br />
<br />
== Title Pointer List (titlePtrPos) ==<br />
The title pointer list is a list of entry indices ordered by title (<code><namespace><title></code>). The title pointer list actually points to entries in the URL pointer list.<br />
<br />
Note that the title pointers are only 4 bytes. They are not offsets in the file but entry numbers.<br />
<br />
To get the offset of an entry from the title pointer list, you have to look it up in the URL pointer list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Title> || integer || 0 || 4 || pointer to the URL pointer of <1st Title> <br />
|-<br />
| <2nd Title> || integer || 4 || 4 || pointer to the URL pointer of <2nd Title> <br />
|-<br />
| <nth Title> || integer ||(n-1)*4|| 4 || pointer to the URL pointer of <nth Title> <br />
|-<br />
| ... || integer || ... || 4 || ... <br />
|}<br />
<br />
The indirection from titles via URLs to directory entries exists for two reasons:<br />
* the pointer list is only half the size, as 4 bytes are enough for each entry<br />
* accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in libzim.<br />
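<br />
The two-step lookup described above can be sketched as follows. The function name and arguments are assumptions for this example; <code>url_ptr_pos</code> and <code>title_ptr_pos</code> stand for the header fields <code>urlPtrPos</code> and <code>titlePtrPos</code>.<br />
<br />
```python
import struct


def dirent_offset_for_title(data: bytes, url_ptr_pos: int,
                            title_ptr_pos: int, title_index: int) -> int:
    """Resolve the n-th entry in title order to a directory entry offset."""
    # Step 1: the title pointer is a 4-byte entry number, not a file offset.
    (entry_number,) = struct.unpack_from("<I", data, title_ptr_pos + 4 * title_index)
    # Step 2: the URL pointer list maps that entry number to an 8-byte offset.
    (offset,) = struct.unpack_from("<Q", data, url_ptr_pos + 8 * entry_number)
    return offset
```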
<br />
== Directory Entries ==<br />
Directory entries hold the meta information about all entries, images and other objects in a ZIM archive.<br />
<br />
There are different types of directory entries:<br />
<br />
=== Content Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || MIME type number as defined in the MIME type list <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters (must be 0) <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| cluster number || integer || 8 || 4 || cluster number in which the data of this directory entry is stored <br />
|-<br />
| blob number || integer || 12 || 4 || blob number inside the compressed cluster where the contents are stored <br />
|-<br />
| url || string || 16 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list or empty; in case it is empty, the URL is used as title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
=== Redirect Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || 0xffff for redirect <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| redirect index || integer || 8 || 4 || pointer to the directory entry of the redirect target <br />
|-<br />
| url || string || 12 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list or empty; in case it is empty, the URL is used as title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
None of the strings should have control characters from U+0000 through U+001F.<br />
<br />
=== Linktarget or deleted Entry (DEPRECATED) ===<br />
There are two kinds of deprecated entries that could be found in very old ZIM files (the main developer of libzim reports never having seen one).<br />
<br />
They have a mimetype equal to 0xfffe or 0xfffd. Reader implementations may check for those values and ignore the whole dirent.<br />
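<br />
The entry layouts in this section can be decoded roughly as shown below. This is a sketch under stated assumptions: the names <code>parse_dirent</code> and <code>read_cstring</code> are hypothetical, the unused extra parameters are not read, and deprecated entries are simply reported as <code>None</code>.<br />
<br />
```python
import struct

REDIRECT_MIMETYPE = 0xFFFF
DEPRECATED_MIMETYPES = (0xFFFE, 0xFFFD)


def read_cstring(data: bytes, pos: int) -> tuple[str, int]:
    """Return the zero-terminated UTF-8 string at pos and the next position."""
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("utf-8"), end + 1


def parse_dirent(data: bytes, offset: int):
    """Decode one directory entry; returns None for deprecated entry kinds."""
    # Common prefix: mimetype (2), parameter len (1), namespace (1), revision (4).
    mimetype, _param_len, namespace, _revision = struct.unpack_from("<HBcI", data, offset)
    if mimetype in DEPRECATED_MIMETYPES:
        return None
    entry = {"mimetype": mimetype, "namespace": namespace.decode("ascii")}
    if mimetype == REDIRECT_MIMETYPE:
        (entry["redirect_index"],) = struct.unpack_from("<I", data, offset + 8)
        pos = offset + 12  # the URL starts at offset 12 in a redirect entry
    else:
        entry["cluster_number"], entry["blob_number"] = struct.unpack_from(
            "<II", data, offset + 8)
        pos = offset + 16  # the URL starts at offset 16 in a content entry
    entry["url"], pos = read_cstring(data, pos)
    title, pos = read_cstring(data, pos)
    # An empty title means the URL is used as the title.
    entry["title"] = title or entry["url"]
    return entry
```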
<br />
== Cluster Pointer List (clusterPtrPos) ==<br />
The cluster pointer list is a list of 8 byte offsets which point to all data clusters in a ZIM archive.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Cluster> || integer || 0 || 8 || pointer to the <1st Cluster> <br />
|-<br />
| <2nd Cluster> || integer || 8 || 8 || pointer to the <2nd Cluster> <br />
|-<br />
| <nth Cluster> || integer ||(n-1)*8|| 8 || pointer to the <nth Cluster> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
== Clusters ==<br />
The clusters contain the actual data of the directory entries. Clusters can be compressed or uncompressed. The purpose of clusters is that the data of more than one directory entry can be compressed inside one cluster, making the compression much more efficient. Typically, clusters have a size of about 1 MB.<br />
<br />
The first byte of the cluster identifies some information about the cluster.<br />
<br />
The four low bits identify the cluster compression type:<br />
* No compression is indicated by a value of 1<br />
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]] (or more precisely XZ, since there is an XZ header)) or 5 (Zstandard compression).<br />
* Other compression algorithms were used in the past and have since been removed: 2 for zlib and 3 for bzip2.<br />
* 0 is an obsolete code for no compression (inherited from Zeno)<br />
<br />
The fifth bit identifies whether the cluster is extended or not:<br />
* By default (5th bit == 0) the cluster is not extended. This means that the offsets are stored as 4-byte integers, so the contents stored in the cluster cannot exceed 4 GiB.<br />
* If the cluster is extended (5th bit == 1), the offsets are stored as 8-byte integers, so the contents stored in the cluster can exceed 4 GiB.<br />
A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5) the cluster is never extended.<br />
<br />
The libzim uses [http://tukaani.org/xz/ xz-utils] as a C++ implementation of lzma2, for Java see [http://tukaani.org/xz/java.html XZ-Java].<br />
<br />
To find the data of a specific directory entry within a cluster the uncompressed cluster has a list of pointers to blobs within the uncompressed cluster after the first byte.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| cluster information || integer || 0 || 1 || Four low bits: 1: no compression, 4: LZMA2 compressed, 5: zstd compressed<br />
Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8) <br />
|-<br />
| colspan="5" | The following data bytes have to be uncompressed!<br />
|-<br />
| <1st Blob> || integer || 1 || OFFSET_SIZE || offset to the <1st Blob> <br />
|-<br />
| <2nd Blob> || integer || 1+OFFSET_SIZE || OFFSET_SIZE || offset to the <2nd Blob> <br />
|-<br />
| <nth Blob> || integer ||(n-1)*OFFSET_SIZE+1|| OFFSET_SIZE || offset to the <nth Blob> <br />
|-<br />
| ... || integer || ... || OFFSET_SIZE || ... <br />
|-<br />
| <last blob / end> || integer || n/a || OFFSET_SIZE || offset to the end of the cluster <br />
|-<br />
| <1st Blob> || data || n/a || n/a || data of the <1st Blob> <br />
|-<br />
| <2nd Blob> || data || n/a || n/a || data of the <2nd Blob> <br />
|-<br />
| ... || data || ... || n/a || ... <br />
|}<br />
<br />
The offset addresses uncompressed data. The last pointer points to the end of the data area. So there is always one more offset than blobs. Since the first offset points to the start of the first data, the number of offsets can be determined by dividing this offset by OFFSET_SIZE. The size of one blob is calculated by the difference of two consecutive offsets.<br />
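<br />
The offset arithmetic above can be sketched in Python as follows. Decompression is left out: <code>body</code> is assumed to be the already uncompressed cluster content that follows the first byte, and offsets are assumed to be relative to the start of that content, as implied by the rule that the first offset divided by OFFSET_SIZE gives the number of offsets. Both function names are hypothetical.<br />
<br />
```python
import struct


def parse_cluster_info(info_byte: int) -> tuple[int, int]:
    """Return (compression type, OFFSET_SIZE) from the first cluster byte."""
    compression = info_byte & 0x0F      # four low bits
    extended = bool(info_byte & 0x10)   # fifth bit
    return compression, 8 if extended else 4


def split_blobs(body: bytes, offset_size: int) -> list[bytes]:
    """Cut the uncompressed cluster body into its blobs."""
    fmt = "<Q" if offset_size == 8 else "<I"
    (first_offset,) = struct.unpack_from(fmt, body, 0)
    # The first offset points just past the offset list itself.
    offset_count = first_offset // offset_size
    offsets = [struct.unpack_from(fmt, body, i * offset_size)[0]
               for i in range(offset_count)]
    # One more offset than blobs: each blob spans two consecutive offsets.
    return [body[start:end] for start, end in zip(offsets, offsets[1:])]
```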
<br />
== Namespaces ==<br />
Namespaces separate different types of directory entries - which might have the same title or URL - stored in the ZIM archive format.<br />
<br />
The new namespace usage puts strong semantics on the namespaces. libzim uses these semantics and provides different APIs to access the different kinds of entries.<br />
<br />
libzim hides the namespace from the user, so an entry `foo.html` in namespace `C` will be accessible as `foo.html`. libzim provides a specific API to access metadata.<br />
<br />
Other implementations are free to expose the namespace or not (but need to provide some fallback for all namespace usages)<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Namespace !! Description <br />
|-<br />
| C || User content entries - see [[Article Format]] <br />
|-<br />
| M || ZIM metadata - see [[Metadata]] <br />
|-<br />
| W || Well known entries (MainPage, Favicon) - see [[Well known entries]] <br />
|-<br />
| X || search indexes - see [[Search indexes]]<br />
|}<br />
<br />
== URLs ==<br />
<br />
=== URL Encoding ===<br />
The URLs in the UrlPointerlist are utf-8 and are not url encoded (https://www.ietf.org/rfc/rfc1738.txt)<br />
<br />
Some readers sit behind a request layer that already does the URL decoding internally, whereas most readers handle the URLs directly. In the latter case you have to do the decoding before you pass the parameter to libzim.<br />
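<br />
For a reader that has to decode requests itself, a minimal sketch using Python's standard library could look like this. The function name <code>request_path_to_zim_url</code> is an assumption for this example.<br />
<br />
```python
from urllib.parse import unquote


def request_path_to_zim_url(request_path: str) -> str:
    """Turn a percent-encoded request path into the UTF-8 URL stored in the archive."""
    # unquote() decodes %XX escapes as UTF-8 by default.
    return unquote(request_path)
```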
<br />
=== Local Anchors ===<br />
Many articles - especially when a table of contents is used - use local anchors to jump within an article. <br />
<br />
<pre><br />
<a href="foo#headline1">jump to article foo, headline 1</a><br />
</pre><br />
<br />
The browser handles these local anchors by itself. It will determine if another article has to be loaded (local anchor inside another article than the currently shown) and will send a request only with the article URL without the local anchor - in our example "foo". After the article has been loaded the browser will then search for the local anchor tag and jump to the right location.<br />
<br />
If you use a common rendering engine or HTML widget you don't have to care about these cases; you can just use the requests as they are submitted by the engine / widget.<br />
<br />
Should you render the article contents yourself, you have to take care of this before you hand requests to libzim.<br />
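<br />
A renderer that handles links itself could separate the local anchor from the article URL along these lines. This is only a sketch; <code>split_local_anchor</code> is a hypothetical helper built on the standard <code>urllib.parse.urldefrag</code>.<br />
<br />
```python
from urllib.parse import urldefrag


def split_local_anchor(href: str) -> tuple[str, str]:
    """Split a link target into (article URL, local anchor)."""
    url, fragment = urldefrag(href)
    return url, fragment
```
<br />
Only the first element would be used to look up the article; the second is the anchor to jump to once the article is loaded. An empty first element means the anchor refers to the article currently shown.<br />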
<br />
== Encodings ==<br />
=== Character Encoding ===<br />
The standard encoding for ZIM archive content is UTF-8. So both article data and URLs should be handled accordingly.<br />
<br />
=== Integer Encoding ===<br />
All types are little-endian.<br />
<br />
All integers are unsigned integers (uint_16, uint_32, uint_64).<br />
<br />
All lengths are bytes.<br />
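<br />
As a small illustration of these rules, the sketch below packs the magic number from the header as a little-endian 32-bit integer and reads unsigned integers back. The helper <code>read_uint</code> is hypothetical.<br />
<br />
```python
import struct

# Little-endian: the least significant byte of 0x044D495A comes first,
# so the archive starts with the bytes 5A 49 4D 04.
magic_bytes = struct.pack("<I", 72173914)


def read_uint(data: bytes, offset: int, length: int) -> int:
    """Read an unsigned little-endian integer of the given byte length."""
    return int.from_bytes(data[offset:offset + length], "little", signed=False)
```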
<br />
== Split ZIM files ==<br />
ZIM archives can be split into multiple files. This is necessary to store big ZIM archives (over 4 GB, for example) on limited file systems (like FAT32). That said, the files can be of any size, but the naming is really important. The ZIM files should be named as follows (the file name extensions matter): ''foobar.zimaa, foobar.zimab, foobar.zimac''...<br />
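<br />
The naming scheme can be generated as sketched below. The text only shows the first three names; continuing with <code>zimad</code> ... <code>zimaz</code>, <code>zimba</code> and so on is an assumption of this example, as is the helper name <code>split_part_names</code>.<br />
<br />
```python
from itertools import islice, product
from string import ascii_lowercase


def split_part_names(base: str, count: int) -> list[str]:
    """Return the first `count` part names: base.zimaa, base.zimab, ..."""
    if not 0 <= count <= 26 * 26:
        raise ValueError("two-letter suffixes allow at most 676 parts")
    suffixes = ("".join(pair) for pair in product(ascii_lowercase, repeat=2))
    return [f"{base}.zim{suffix}" for suffix in islice(suffixes, count)]
```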
<br />
== See also ==<br />
* [[Zeno file format]] (deprecated)<br />
* [[ZIM File Format/4]] (deprecated)<br />
* [[ZIM File Example]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=ZIM_file_format&diff=31338ZIM file format2024-02-03T14:24:17Z<p>Kelson: /* Major & Minor versions */</p>
<hr />
<div>Since the beginning of 2021, we have changed the way we handle namespaces in the ZIM file format.<br />
<br />
This is a major change in the way we handle entries in libzim and in the semantics around them, but it is not a break in the binary ZIM format. Old libraries/readers are compatible with new ZIM files.<br />
<br />
This page describes the new format. The old format can be found here: [[ZIM file format old namespace]].[[Image:Schema File Format.png|500px|right]]<br />
== Header ==<br />
A ZIM archive starts with a header :<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !! Offset !! Length !! Description <br />
|-<br />
| magicNumber || integer || 0 || 4 || Magic number to recognise the file format, must be 72173914 (0x44D495A)<br />
|-<br />
| [[#Major_.26_Minor_versions|majorVersion]]<br />
| integer<br />
| 4<br />
| 2<br />
| Major version of the ZIM archive format (6)<br />
|-<br />
| [[#Major_.26_Minor_versions|minorVersion]] || integer || 6 || 2 || Minor version of the ZIM archive format (1 for new namespace usage, 0 for old namespace usage) <br />
|-<br />
| uuid || integer || 8 || 16 || unique id of this zim archive <br />
|-<br />
| entryCount || integer || 24 || 4 || total number of entries <br />
|-<br />
| clusterCount || integer || 28 || 4 || total number of clusters <br />
|-<br />
| urlPtrPos || integer || 32 || 8 || position of the directory pointerlist ordered by URL <br />
|-<br />
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title<br />
This is considered obsolete; readers should use <code>[[Search indexes#Title index v0|X/listing/titleordered/v0]]</code> instead and fall back to <code>titlePtrPos</code> if the entry is not present.<br />
|-<br />
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list <br />
|-<br />
| mimeListPos || integer || 56 || 8 || position of the MIME type list (also header size) <br />
|-<br />
| mainPage || integer || 64 || 4 || main page or 0xffffffff if no main page <br />
|-<br />
| layoutPage || integer || 68 || 4 || [[Layout Page|layout page]] or 0xffffffff if no layout page (deprecated, always 0xffffffff) <br />
|-<br />
| checksumPos || integer || 72 || 8 || pointer to the MD5 checksum of this archive without the checksum itself. This always points 16 bytes before the end of the archive.<br />
|}<br />
<br />
=== Major & Minor versions ===<br />
<br />
Major version is updated when an incompatible change is integrated in the format (a lib made for a version N will probably not be able to read a version N+1)<br />
<br />
Minor version is updated when a compatible change is integrated (a lib made for a minor version n will be able to read a version n+1)<br />
<br />
The minor version can be:<br />
* 0: the old namespace usage is in effect (see [[ZIM file format old namespace]])<br />
* 1: the new namespace usage is in effect (described here).<br />
* 2: libzim (the reference implementation) allows creating alias entries (several entries pointing to the same cluster/blob). This was already allowed by the specification; this is a hint only.<br />
<br />
A ZIM archive may be embedded in another file at a specific offset. In the context of the ZIM format, the start of the ZIM header is offset 0. Readers that support reading an embedded archive must adapt offsets accordingly.<br />
<br />
{| class="wikitable"<br />
|+ ZIM format versions<br />
|-<br />
! Major !! Minor !! Backward compatible !! Description <br />
|-<br />
| 0 || Example || || Example <br />
|-<br />
| 1 || Example || || Example <br />
|-<br />
| 2 || Example || || Example <br />
|-<br />
| 3 || Example || || Example<br />
|-<br />
| 4 || Example || || Example<br />
|-<br />
| 5 || 0 || yes || Introduces extended clusters <br />
|-<br />
| rowspan="3" | 6 || 0 || no || Still uses old namespaces (see [[ZIM file format old namespace]])<br />
|-<br />
| 1 || no || Introduces new namespaces<br />
|-<br />
| 2 || yes || Explicitly allows alias entries (several entries pointing to the same cluster/blob)<br />
|}<br />
<br />
== MIME Type List (mimeListPos) ==<br />
The MIME type list always follows directly after the header, so the ''mimeListPos'' also defines the end and size of the ZIM file header.<br />
<br />
The MIME types in this list are zero terminated strings. An empty string marks the end of the MIME type list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st MIME Type> || string || 0 ||zero terminated|| declaration of the <1st MIME Type> <br />
|-<br />
| <2nd MIME Type> || string || n/a ||zero terminated|| declaration of the <2nd MIME Type> <br />
|-<br />
| ... || string || ... ||zero terminated|| ... <br />
|-<br />
| <last entry / end> || string || n/a ||zero terminated|| empty string - end of MIME type list <br />
|}<br />
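<br />
Reading the list can be sketched as a loop over zero-terminated strings that stops at the first empty one. This is only an illustration in Python; <code>parse_mime_list</code> is a hypothetical helper name, and <code>data</code> is assumed to hold the whole archive (or at least the region starting at ''mimeListPos'').<br />

```python
def parse_mime_list(data: bytes, mime_list_pos: int) -> list[str]:
    """Collect zero-terminated MIME type strings until the empty terminator."""
    mime_types = []
    pos = mime_list_pos
    while True:
        end = data.index(b"\x00", pos)  # position of the next terminator
        if end == pos:                  # empty string: end of the list
            return mime_types
        mime_types.append(data[pos:end].decode("utf-8"))
        pos = end + 1
```

The position of a string in the returned list is the MIME type number that directory entries refer to.<br />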
<br />
== URL Pointer List (urlPtrPos) ==<br />
The URL pointer list is a list of 8 byte offsets to the directory entries.<br />
<br />
The directory entries are always ordered by "full" URL (<code><namespace><path></code>). Ordering is simply done by comparing the URL strings.<br />
<br />
Since directory entries have variable sizes this is needed for random access.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st URL> || integer || 0 || 8 || pointer to the directory entry of <1st URL> <br />
|-<br />
| <2nd URL> || integer || 8 || 8 || pointer to the directory entry of <2nd URL> <br />
|-<br />
| <nth URL> || integer ||(n-1)*8|| 8 || pointer to the directory entry of <nth URL> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
Libzim caches directory entries and references the cached entries via the URL pointers.<br />
<br />
== Title Pointer List (titlePtrPos) ==<br />
The title pointer list is a list of entry indices ordered by title (<code><namespace><title></code>). The title pointer list actually points to entries in the URL pointer list.<br />
<br />
Note that the title pointers are only 4 bytes. They are not offsets in the file but entry numbers.<br />
<br />
To get the offset of an entry from the title pointer list, you have to look it up in the URL pointer list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Title> || integer || 0 || 4 || pointer to the URL pointer of <1st Title> <br />
|-<br />
| <2nd Title> || integer || 4 || 4 || pointer to the URL pointer of <2nd Title> <br />
|-<br />
| <nth Title> || integer ||(n-1)*4|| 4 || pointer to the URL pointer of <nth Title> <br />
|-<br />
| ... || integer || ... || 4 || ... <br />
|}<br />
<br />
There are two reasons for the indirection from titles via URLs to directory entries:<br />
* the title pointer list is only half the size of the URL pointer list, as 4 bytes are enough for each entry<br />
* accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in libzim.<br />
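<br />
The two-step lookup described above (title pointer to entry number, entry number to URL pointer, URL pointer to directory entry offset) could look like the following Python sketch. The function names are illustrative assumptions, and <code>data</code> is assumed to contain the whole archive.<br />

```python
import struct


def dirent_offset_by_url_index(data: bytes, url_ptr_pos: int, index: int) -> int:
    """Return the file offset of the directory entry at position `index` in URL order."""
    return struct.unpack_from("<Q", data, url_ptr_pos + index * 8)[0]


def dirent_offset_by_title_index(data: bytes, url_ptr_pos: int,
                                 title_ptr_pos: int, index: int) -> int:
    """Resolve a title-ordered position to a directory entry offset."""
    # A title pointer is a 4-byte entry number, not a file offset.
    entry_number = struct.unpack_from("<I", data, title_ptr_pos + index * 4)[0]
    return dirent_offset_by_url_index(data, url_ptr_pos, entry_number)
```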
<br />
== Directory Entries ==<br />
Directory entries hold the meta information about all entries, images and other objects in a ZIM archive.<br />
<br />
There are different types of directory entries:<br />
<br />
=== Content Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || MIME type number as defined in the MIME type list <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters (must be 0) <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| cluster number || integer || 8 || 4 || cluster number in which the data of this directory entry is stored <br />
|-<br />
| blob number || integer || 12 || 4 || blob number inside the compressed cluster where the contents are stored <br />
|-<br />
| url || string || 16 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list, or empty; if it is empty, the URL is used as the title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
=== Redirect Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || 0xffff for redirect <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| redirect index || integer || 8 || 4 || pointer to the directory entry of the redirect target <br />
|-<br />
| url || string || 12 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list, or empty; if it is empty, the URL is used as the title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
None of the strings should have control characters from U+0000 through U+001F.<br />
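<br />
As an illustration of the two layouts above, the following Python sketch decodes either a content entry or a redirect entry. <code>parse_dirent</code> and <code>_read_cstring</code> are hypothetical names, the unused extra parameters are ignored, and the deprecated mimetype values 0xfffe and 0xfffd are not handled.<br />

```python
import struct

REDIRECT = 0xFFFF  # mimetype value marking a redirect entry


def _read_cstring(data: bytes, pos: int):
    """Read a zero-terminated UTF-8 string; return it and the next position."""
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("utf-8"), end + 1


def parse_dirent(data: bytes, offset: int) -> dict:
    """Decode a content or redirect directory entry starting at `offset`."""
    mimetype, _param_len, namespace, _revision = struct.unpack_from("<HBcI", data, offset)
    entry = {"mimetype": mimetype, "namespace": namespace.decode("ascii")}
    if mimetype == REDIRECT:
        (entry["redirect_index"],) = struct.unpack_from("<I", data, offset + 8)
        pos = offset + 12
    else:
        entry["cluster"], entry["blob"] = struct.unpack_from("<II", data, offset + 8)
        pos = offset + 16
    entry["url"], pos = _read_cstring(data, pos)
    title, pos = _read_cstring(data, pos)
    entry["title"] = title or entry["url"]  # an empty title falls back to the URL
    return entry
```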
<br />
=== Linktarget or deleted Entry (DEPRECATED) ===<br />
There are two kinds of deprecated entries that could be found in very old ZIM files (I, the main developer of libzim, have never seen one).<br />
<br />
Their mimetype is 0xfffe or 0xfffd. Reader implementations may check for those values and ignore the whole dirent.<br />
<br />
== Cluster Pointer List (clusterPtrPos) ==<br />
The cluster pointer list is a list of 8 byte offsets which point to all data clusters in a ZIM archive.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Cluster> || integer || 0 || 8 || pointer to the <1st Cluster> <br />
|-<br />
| <2nd Cluster> || integer || 8 || 8 || pointer to the <2nd Cluster> <br />
|-<br />
| <nth Cluster> || integer ||(n-1)*8|| 8 || pointer to the <nth Cluster> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
== Clusters ==<br />
The clusters contain the actual data of the directory entries. Clusters can be compressed or uncompressed. The purpose of clusters is that the data of several directory entries can be compressed together inside one cluster, which makes compression much more efficient. Clusters typically have a size of about 1 MB.<br />
<br />
The first byte of the cluster carries information about the cluster.<br />
<br />
The four low bits identify the cluster compression type:<br />
* No compression is indicated by a value of 1<br />
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]], or more precisely XZ, since there is an XZ header) or 5 (Zstandard compression).<br />
* Other compression algorithms were used in the past and have since been removed: 2 for zlib and 3 for bzip2.<br />
* 0 is an obsolete code for no compression (inherited from Zeno)<br />
<br />
The fifth bit indicates whether the cluster is extended:<br />
* By default (5th bit == 0) the cluster is not extended: offsets are stored as 4-byte integers, so the content stored in the cluster cannot exceed 4 GB.<br />
* If the cluster is extended (5th bit == 1), offsets are stored as 8-byte integers, so the content stored in the cluster can exceed 4 GB.<br />
A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5) the cluster is never extended.<br />
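<br />
Decoding the information byte amounts to two bit masks. The Python sketch below is only an illustration; <code>decode_cluster_info</code> and the labels in <code>COMPRESSION</code> are assumptions chosen for readability.<br />

```python
# Labels are illustrative; the numeric codes come from the list above.
COMPRESSION = {
    0: "none (obsolete)",
    1: "none",
    2: "zlib (removed)",
    3: "bzip2 (removed)",
    4: "lzma2/xz",
    5: "zstd",
}


def decode_cluster_info(info_byte: int):
    """Return (compression label, OFFSET_SIZE) for a cluster information byte."""
    compression = info_byte & 0x0F       # four low bits
    extended = bool(info_byte & 0x10)    # fifth bit
    offset_size = 8 if extended else 4
    return COMPRESSION.get(compression, "unknown"), offset_size
```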
<br />
libzim uses [http://tukaani.org/xz/ xz-utils] as its LZMA2 implementation; for Java, see [http://tukaani.org/xz/java.html XZ-Java].<br />
<br />
To locate the data of a specific directory entry within a cluster, the uncompressed cluster holds, right after the first byte, a list of pointers to the blobs it contains.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| cluster information || integer || 0 || 1 || Four low bits: 1: no compression, 4: LZMA2 compressed, 5: zstd compressed<br />
Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8) <br />
|-<br />
| colspan="5" | The following data bytes have to be decompressed first if the cluster is compressed!<br />
|-<br />
| <1st Blob> || integer || 1 || OFFSET_SIZE || offset to the <1st Blob> <br />
|-<br />
| <2nd Blob> || integer || 1+OFFSET_SIZE || OFFSET_SIZE || offset to the <2nd Blob> <br />
|-<br />
| <nth Blob> || integer ||(n-1)*OFFSET_SIZE+1|| OFFSET_SIZE || offset to the <nth Blob> <br />
|-<br />
| ... || integer || ... || OFFSET_SIZE || ... <br />
|-<br />
| <last blob / end> || integer || n/a || OFFSET_SIZE || offset to the end of the cluster <br />
|-<br />
| <1st Blob> || data || n/a || n/a || data of the <1st Blob> <br />
|-<br />
| <2nd Blob> || data || n/a || n/a || data of the <2nd Blob> <br />
|-<br />
| ... || data || ... || n/a || ... <br />
|}<br />
<br />
The offsets address uncompressed data. The last offset points to the end of the data area, so there is always one more offset than there are blobs. Since the first offset points to the start of the first blob's data, the number of offsets can be determined by dividing the first offset by OFFSET_SIZE. The size of a blob is the difference between two consecutive offsets.<br />
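<br />
The offset arithmetic can be sketched as follows in Python. This is a hedged illustration, not libzim's code: <code>read_blob</code> is a hypothetical name, and <code>cluster_data</code> is assumed to be the already decompressed content that follows the information byte, with offsets counted from the start of that content (which is what the division by OFFSET_SIZE above implies).<br />

```python
import struct


def read_blob(cluster_data: bytes, blob_number: int, offset_size: int = 4) -> bytes:
    """Extract one blob from decompressed cluster content (information byte excluded)."""
    fmt = "<I" if offset_size == 4 else "<Q"
    first_offset = struct.unpack_from(fmt, cluster_data, 0)[0]
    offset_count = first_offset // offset_size   # always one more offset than blobs
    if not 0 <= blob_number < offset_count - 1:
        raise IndexError("blob number out of range")
    start = struct.unpack_from(fmt, cluster_data, blob_number * offset_size)[0]
    end = struct.unpack_from(fmt, cluster_data, (blob_number + 1) * offset_size)[0]
    return cluster_data[start:end]               # size = difference of the two offsets
```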
<br />
== Namespaces ==<br />
Namespaces separate different types of directory entries, which might have the same title or URL, stored in a ZIM archive.<br />
<br />
The new namespace usage gives namespaces strong semantics. libzim relies on these semantics and provides different APIs to access the different kinds of entries.<br />
<br />
libzim hides the namespace from the user, so an entry `foo.html` in namespace `C` is accessible as `foo.html`. libzim provides a specific API to access metadata.<br />
<br />
Other implementations are free to expose the namespace or not (but need to provide some fallback for all namespace usages).<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Namespace !! Description <br />
|-<br />
| C || User content entries - see [[Article Format]] <br />
|-<br />
| M || ZIM metadata - see [[Metadata]] <br />
|-<br />
| W || Well known entries (MainPage, Favicon) - see [[Well known entries]] <br />
|-<br />
| X || search indexes - see [[Search indexes]]<br />
|}<br />
<br />
== URLs ==<br />
<br />
=== URL Encoding ===<br />
The URLs in the URL pointer list are UTF-8 and are not URL-encoded (https://www.ietf.org/rfc/rfc1738.txt).<br />
<br />
Some readers process requests that have already been URL-decoded internally, whereas most readers handle the URLs directly. In the latter case you have to decode the URL before you pass it to libzim.<br />
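<br />
For a reader that receives percent-encoded request paths, the decoding step might look like this Python sketch. <code>to_zim_path</code> is a hypothetical helper; only the standard library's <code>urllib.parse.unquote</code> is used.<br />

```python
from urllib.parse import unquote


def to_zim_path(request_path: str) -> str:
    """Turn a percent-encoded request path into the plain UTF-8 form stored in the archive."""
    return unquote(request_path)  # decodes %XX sequences as UTF-8 by default
```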
<br />
=== Local Anchors ===<br />
Many articles - especially when a table of contents is used - use local anchors to jump within an article. <br />
<br />
<pre><br />
<a href="foo#headline1">jump to article foo, headline 1</a><br />
</pre><br />
<br />
The browser handles these local anchors by itself. It determines whether another article has to be loaded (the local anchor is inside an article other than the one currently shown) and sends a request with only the article URL, without the local anchor; in our example, "foo". After the article has been loaded, the browser searches for the local anchor and jumps to the right location.<br />
<br />
If you use a common rendering engine or HTML widget, you do not have to care about these cases; you can use the requests as they are submitted by the engine or widget.<br />
<br />
Should you render the article contents yourself, you have to take care of this before you hand requests to libzim.<br />
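<br />
A custom renderer could separate the article URL from the local anchor before looking the entry up, for example with the Python standard library as sketched below. <code>split_anchor</code> is an illustrative name, not an existing API.<br />

```python
from urllib.parse import urldefrag


def split_anchor(href: str):
    """Split a link target into (article URL, local anchor)."""
    url, fragment = urldefrag(href)
    return url, fragment
```

Only the first element would be handed to the archive lookup; the second is used afterwards to scroll within the rendered article.<br />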
<br />
== Encodings ==<br />
=== Character Encoding ===<br />
The standard encoding for ZIM archive content is UTF-8. So both article data and URLs should be handled accordingly.<br />
<br />
=== Integer Encoding ===<br />
All types are little-endian.<br />
<br />
All integers are unsigned integers (uint_16, uint_32, uint_64).<br />
<br />
All lengths are bytes.<br />
<br />
== Split ZIM files ==<br />
ZIM archives can be split into multiple files. This is necessary to store big ZIM archives (over 4 GB, for example) on limited file systems (like FAT32). The files can be of any size, but the naming is important. The ZIM files should be named as follows (the file name extensions matter): ''foobar.zimaa, foobar.zimab, foobar.zimac''...<br />
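<br />
The naming scheme can be illustrated with a short Python sketch. <code>split_part_names</code> is a hypothetical helper, and the assumption that the suffix continues alphabetically past the three names shown (''ad'', ''ae'', ... then ''ba'') is not stated by this page.<br />

```python
from itertools import islice, product
from string import ascii_lowercase


def split_part_names(base_name: str, count: int) -> list[str]:
    """Generate the first `count` part names: base.zimaa, base.zimab, ..."""
    # Assumption: two-letter suffixes enumerated in alphabetical order.
    suffixes = ("".join(pair) for pair in product(ascii_lowercase, repeat=2))
    return [f"{base_name}.zim{suffix}" for suffix in islice(suffixes, count)]
```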
<br />
== See also ==<br />
* [[Zeno file format]] (deprecated)<br />
* [[ZIM File Format/4]] (deprecated)<br />
* [[ZIM File Example]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=ZIM_file_format&diff=31337ZIM file format2024-02-03T14:12:43Z<p>Kelson: /* Major & Minor versions */</p>
<hr />
<div>Starting in 2021, we changed the way we handle namespaces in the ZIM file format.<br />
<br />
This is a major change in the way libzim handles entries and in the semantics around them, but it does not break the binary ZIM format. Old libraries and readers remain compatible with new ZIM files.<br />
<br />
This page describes the new format. The old format can be found here: [[ZIM file format old namespace]].[[Image:Schema File Format.png|500px|right]]<br />
== Header ==<br />
A ZIM archive starts with a header :<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !! Offset !! Length !! Description <br />
|-<br />
| magicNumber || integer || 0 || 4 || Magic number to recognise the file format, must be 72173914 (0x044D495A)<br />
|-<br />
| [[#Major_.26_Minor_versions|majorVersion]]<br />
| integer<br />
| 4<br />
| 2<br />
| Major version of the ZIM archive format (6)<br />
|-<br />
| [[#Major_.26_Minor_versions|minorVersion]] || integer || 6 || 2 || Minor version of the ZIM archive format (1 for new namespace usage, 0 for old namespace usage) <br />
|-<br />
| uuid || integer || 8 || 16 || unique id of this zim archive <br />
|-<br />
| entryCount || integer || 24 || 4 || total number of entries <br />
|-<br />
| clusterCount || integer || 28 || 4 || total number of clusters <br />
|-<br />
| urlPtrPos || integer || 32 || 8 || position of the directory pointerlist ordered by URL <br />
|-<br />
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title<br />
This is considered obsolete: readers should use <code>[[Search indexes#Title index v0|X/listing/titleordered/v0]]</code> instead and fall back to <code>titlePtrPos</code> only if that entry is not present.<br />
|-<br />
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list <br />
|-<br />
| mimeListPos || integer || 56 || 8 || position of the MIME type list (also header size) <br />
|-<br />
| mainPage || integer || 64 || 4 || main page or 0xffffffff if no main page <br />
|-<br />
| layoutPage || integer || 68 || 4 || [[Layout Page|layout page]] or 0xffffffff if no layout page (deprecated, always 0xffffffff) <br />
|-<br />
| checksumPos || integer || 72 || 8 || pointer to the MD5 checksum of this archive, computed over the archive without the checksum itself. It always points 16 bytes before the end of the archive.<br />
|}<br />
<br />
=== Major & Minor versions ===<br />
<br />
The major version is updated when an incompatible change is integrated into the format (a library written for version N will probably not be able to read version N+1).<br />
<br />
The minor version is updated when a compatible change is integrated (a library written for minor version n will be able to read minor version n+1).<br />
<br />
The current major version is 6. You may find old ZIM archives with major version 5. They are the same as version 6 minus extended clusters, so a major version 5 archive can be read as if it were version 6.<br />
<br />
The minor version can be:<br />
* 0: the old namespace usage applies (see [[ZIM file format old namespace]]).<br />
* 1: the new namespace usage applies (described here).<br />
* 2: libzim (the reference implementation) may create alias entries (several entries pointing to the same cluster/blob). This was already allowed by the specification; the value is only a hint.<br />
<br />
A ZIM archive may be embedded in another file at a specific offset. Within the ZIM format, the start of the ZIM header is offset 0. Readers that support embedded archives must adjust offsets accordingly.<br />
<br />
{| class="wikitable"<br />
|+ ZIM format versions<br />
|-<br />
! Major !! Minor !! Description <br />
|-<br />
| Example || Example || Example <br />
|-<br />
| Example || Example || Example <br />
|-<br />
| Example || Example || Example <br />
|-<br />
| Example || Example || Example<br />
|-<br />
| Example || Example || Example<br />
|-<br />
| Example || Example || Example<br />
|}<br />
<br />
== MIME Type List (mimeListPos) ==<br />
The MIME type list always follows directly after the header, so the ''mimeListPos'' also defines the end and size of the ZIM file header.<br />
<br />
The MIME types in this list are zero terminated strings. An empty string marks the end of the MIME type list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st MIME Type> || string || 0 ||zero terminated|| declaration of the <1st MIME Type> <br />
|-<br />
| <2nd MIME Type> || string || n/a ||zero terminated|| declaration of the <2nd MIME Type> <br />
|-<br />
| ... || string || ... ||zero terminated|| ... <br />
|-<br />
| <last entry / end> || string || n/a ||zero terminated|| empty string - end of MIME type list <br />
|}<br />
<br />
== URL Pointer List (urlPtrPos) ==<br />
The URL pointer list is a list of 8 byte offsets to the directory entries.<br />
<br />
The directory entries are always ordered by "full" URL (<code><namespace><path></code>). Ordering is simply done by comparing the URL strings.<br />
<br />
Since directory entries have variable sizes this is needed for random access.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st URL> || integer || 0 || 8 || pointer to the directory entry of <1st URL> <br />
|-<br />
| <2nd URL> || integer || 8 || 8 || pointer to the directory entry of <2nd URL> <br />
|-<br />
| <nth URL> || integer ||(n-1)*8|| 8 || pointer to the directory entry of <nth URL> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
Libzim caches directory entries and references the cached entries via the URL pointers.<br />
<br />
== Title Pointer List (titlePtrPos) ==<br />
The title pointer list is a list of entry indices ordered by title (<code><namespace><title></code>). The title pointer list actually points to entries in the URL pointer list.<br />
<br />
Note that the title pointers are only 4 bytes. They are not offsets in the file but entry numbers.<br />
<br />
To get the offset of an entry from the title pointer list, you have to look it up in the URL pointer list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Title> || integer || 0 || 4 || pointer to the URL pointer of <1st Title> <br />
|-<br />
| <2nd Title> || integer || 4 || 4 || pointer to the URL pointer of <2nd Title> <br />
|-<br />
| <nth Title> || integer ||(n-1)*4|| 4 || pointer to the URL pointer of <nth Title> <br />
|-<br />
| ... || integer || ... || 4 || ... <br />
|}<br />
<br />
There are two reasons for the indirection from titles via URLs to directory entries:<br />
* the title pointer list is only half the size of the URL pointer list, as 4 bytes are enough for each entry<br />
* accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in libzim.<br />
<br />
== Directory Entries ==<br />
Directory entries hold the meta information about all entries, images and other objects in a ZIM archive.<br />
<br />
There are different types of directory entries:<br />
<br />
=== Content Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || MIME type number as defined in the MIME type list <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters (must be 0) <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| cluster number || integer || 8 || 4 || cluster number in which the data of this directory entry is stored <br />
|-<br />
| blob number || integer || 12 || 4 || blob number inside the compressed cluster where the contents are stored <br />
|-<br />
| url || string || 16 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list, or empty; if it is empty, the URL is used as the title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
=== Redirect Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || 0xffff for redirect <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| redirect index || integer || 8 || 4 || pointer to the directory entry of the redirect target <br />
|-<br />
| url || string || 12 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list, or empty; if it is empty, the URL is used as the title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
None of the strings should have control characters from U+0000 through U+001F.<br />
<br />
=== Linktarget or deleted Entry (DEPRECATED) ===<br />
There are two kinds of deprecated entries that could be found in very old ZIM files (I, the main developer of libzim, have never seen one).<br />
<br />
Their mimetype is 0xfffe or 0xfffd. Reader implementations may check for those values and ignore the whole dirent.<br />
<br />
== Cluster Pointer List (clusterPtrPos) ==<br />
The cluster pointer list is a list of 8 byte offsets which point to all data clusters in a ZIM archive.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Cluster> || integer || 0 || 8 || pointer to the <1st Cluster> <br />
|-<br />
| <2nd Cluster> || integer || 8 || 8 || pointer to the <2nd Cluster> <br />
|-<br />
| <nth Cluster> || integer ||(n-1)*8|| 8 || pointer to the <nth Cluster> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
== Clusters ==<br />
The clusters contain the actual data of the directory entries. Clusters can be compressed or uncompressed. The purpose of clusters is that the data of several directory entries can be compressed together inside one cluster, which makes compression much more efficient. Clusters typically have a size of about 1 MB.<br />
<br />
The first byte of the cluster carries information about the cluster.<br />
<br />
The four low bits identify the cluster compression type:<br />
* No compression is indicated by a value of 1<br />
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]], or more precisely XZ, since there is an XZ header) or 5 (Zstandard compression).<br />
* Other compression algorithms were used in the past and have since been removed: 2 for zlib and 3 for bzip2.<br />
* 0 is an obsolete code for no compression (inherited from Zeno)<br />
<br />
The fifth bit indicates whether the cluster is extended:<br />
* By default (5th bit == 0) the cluster is not extended: offsets are stored as 4-byte integers, so the content stored in the cluster cannot exceed 4 GB.<br />
* If the cluster is extended (5th bit == 1), offsets are stored as 8-byte integers, so the content stored in the cluster can exceed 4 GB.<br />
A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5) the cluster is never extended.<br />
<br />
libzim uses [http://tukaani.org/xz/ xz-utils] as its LZMA2 implementation; for Java, see [http://tukaani.org/xz/java.html XZ-Java].<br />
<br />
To locate the data of a specific directory entry within a cluster, the uncompressed cluster holds, right after the first byte, a list of pointers to the blobs it contains.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| cluster information || integer || 0 || 1 || Four low bits: 1: no compression, 4: LZMA2 compressed, 5: zstd compressed<br />
Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8) <br />
|-<br />
| colspan="5" | The following data bytes have to be decompressed first if the cluster is compressed!<br />
|-<br />
| <1st Blob> || integer || 1 || OFFSET_SIZE || offset to the <1st Blob> <br />
|-<br />
| <2nd Blob> || integer || 1+OFFSET_SIZE || OFFSET_SIZE || offset to the <2nd Blob> <br />
|-<br />
| <nth Blob> || integer ||(n-1)*OFFSET_SIZE+1|| OFFSET_SIZE || offset to the <nth Blob> <br />
|-<br />
| ... || integer || ... || OFFSET_SIZE || ... <br />
|-<br />
| <last blob / end> || integer || n/a || OFFSET_SIZE || offset to the end of the cluster <br />
|-<br />
| <1st Blob> || data || n/a || n/a || data of the <1st Blob> <br />
|-<br />
| <2nd Blob> || data || n/a || n/a || data of the <2nd Blob> <br />
|-<br />
| ... || data || ... || n/a || ... <br />
|}<br />
<br />
The offsets address uncompressed data. The last offset points to the end of the data area, so there is always one more offset than there are blobs. Since the first offset points to the start of the first blob's data, the number of offsets can be determined by dividing the first offset by OFFSET_SIZE. The size of a blob is the difference between two consecutive offsets.<br />
<br />
== Namespaces ==<br />
Namespaces separate different types of directory entries, which might have the same title or URL, stored in a ZIM archive.<br />
<br />
The new namespace usage gives namespaces strong semantics. libzim relies on these semantics and provides different APIs to access the different kinds of entries.<br />
<br />
libzim hides the namespace from the user, so an entry `foo.html` in namespace `C` is accessible as `foo.html`. libzim provides a specific API to access metadata.<br />
<br />
Other implementations are free to expose the namespace or not (but need to provide some fallback for all namespace usages).<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Namespace !! Description <br />
|-<br />
| C || User content entries - see [[Article Format]] <br />
|-<br />
| M || ZIM metadata - see [[Metadata]] <br />
|-<br />
| W || Well known entries (MainPage, Favicon) - see [[Well known entries]] <br />
|-<br />
| X || search indexes - see [[Search indexes]]<br />
|}<br />
<br />
== URLs ==<br />
<br />
=== URL Encoding ===<br />
The URLs in the URL pointer list are UTF-8 encoded and are not URL-encoded (see https://www.ietf.org/rfc/rfc1738.txt).<br />
<br />
Some readers process requests through a component that already performs the URL decoding internally, whereas most readers handle the URLs directly. In the latter case you have to decode the URL before passing it to libzim.<br />
<br />
=== Local Anchors ===<br />
Many articles - especially when a table of contents is used - use local anchors to jump within an article. <br />
<br />
<pre><br />
<a href="foo#headline1">jump to article foo, headline 1</a><br />
</pre><br />
<br />
The browser handles these local anchors by itself. It determines whether another article has to be loaded (that is, whether the local anchor belongs to an article other than the one currently shown) and sends a request containing only the article URL, without the local anchor: in our example "foo". After the article has been loaded, the browser searches for the local anchor and jumps to the right location.<br />
<br />
If you use a common rendering engine or HTML widget, you do not have to care about these cases; you can just use the requests as they are submitted by the engine or widget.<br />
<br />
Should you render the article contents yourself, you have to take care of this before you hand requests to libzim.<br />
<br />
== Encodings ==<br />
=== Character Encoding ===<br />
The standard encoding for ZIM archive content is UTF-8. So both article data and URLs should be handled accordingly.<br />
<br />
=== Integer Encoding ===<br />
All types are little-endian.<br />
<br />
All integers are unsigned integers (uint_16, uint_32, uint_64).<br />
<br />
All lengths are bytes.<br />
<br />
== Split ZIM files ==<br />
ZIM archives can be split into multiple files. This is necessary to store big ZIM archives (over 4 GB, for example) on limited file systems such as FAT32. The files can be of any size, but the naming is important. The files should be named as follows (the file name extensions matter): ''foobar.zimaa, foobar.zimab, foobar.zimac''...<br />
<br />
== See also ==<br />
* [[Zeno file format]] (deprecated)<br />
* [[ZIM File Format/4]] (deprecated)<br />
* [[ZIM File Example]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=ZIM_file_format&diff=31336ZIM file format2024-02-03T14:10:07Z<p>Kelson: /* Header */</p>
<hr />
<div>Since the beginning of 2021, namespaces are handled differently in the ZIM file format.<br />
<br />
This is a major change in the way entries are handled in libzim and in the surrounding semantics, but it does not break the binary ZIM format. Old libraries and readers remain compatible with new ZIM files.<br />
<br />
This page describes the new format. The old format can be found here: [[ZIM file format old namespace]].[[Image:Schema File Format.png|500px|right]]<br />
== Header ==<br />
A ZIM archive starts with a header :<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !! Offset !! Length !! Description <br />
|-<br />
| magicNumber || integer || 0 || 4 || Magic number to recognise the file format, must be 72173914 (0x44D495A)<br />
|-<br />
| [[#Major_.26_Minor_versions|majorVersion]]<br />
| integer<br />
| 4<br />
| 2<br />
| Major version of the ZIM archive format (6)<br />
|-<br />
| [[#Major_.26_Minor_versions|minorVersion]] || integer || 6 || 2 || Minor version of the ZIM archive format (1 for new namespace usage, 0 for old namespace usage) <br />
|-<br />
| uuid || integer || 8 || 16 || unique id of this zim archive <br />
|-<br />
| entryCount || integer || 24 || 4 || total number of entries <br />
|-<br />
| clusterCount || integer || 28 || 4 || total number of clusters <br />
|-<br />
| urlPtrPos || integer || 32 || 8 || position of the directory pointerlist ordered by URL <br />
|-<br />
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title<br />
This is considered obsolete: readers should use <code>[[Search indexes#Title index v0|X/listing/titleordered/v0]]</code> instead and fall back to <code>titlePtrPos</code> if that entry is not present.<br />
|-<br />
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list <br />
|-<br />
| mimeListPos || integer || 56 || 8 || position of the MIME type list (also header size) <br />
|-<br />
| mainPage || integer || 64 || 4 || main page or 0xffffffff if no main page <br />
|-<br />
| layoutPage || integer || 68 || 4 || [[Layout Page|layout page]] or 0xffffffff if no layout page (deprecated, always 0xffffffff) <br />
|-<br />
| checksumPos || integer || 72 || 8 || pointer to the MD5 checksum of this archive, computed without the checksum itself. This always points 16 bytes before the end of the archive.<br />
|}<br />
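<br />
As an illustration of the table above, here is a minimal Python sketch that decodes the 80-byte header. The function name <code>parse_header</code> and the returned dictionary are hypothetical choices made for this example (they are not part of libzim), and the uuid is simply kept as 16 raw bytes.<br />
<br />
```python
import struct

# Little-endian layout copied from the header table (80 bytes in total).
HEADER_FORMAT = "<IHH16sIIQQQQIIQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
ZIM_MAGIC = 72173914  # 0x44D495A

FIELD_NAMES = (
    "magicNumber", "majorVersion", "minorVersion", "uuid",
    "entryCount", "clusterCount", "urlPtrPos", "titlePtrPos",
    "clusterPtrPos", "mimeListPos", "mainPage", "layoutPage",
    "checksumPos",
)


def parse_header(data: bytes) -> dict:
    """Decode the fixed-size ZIM header found at offset 0 of `data`.

    Hypothetical helper for illustration; uuid is returned as raw bytes.
    """
    if len(data) < HEADER_SIZE:
        raise ValueError("a ZIM header needs at least 80 bytes")
    header = dict(zip(FIELD_NAMES, struct.unpack_from(HEADER_FORMAT, data, 0)))
    if header["magicNumber"] != ZIM_MAGIC:
        raise ValueError("magic number mismatch: not a ZIM archive")
    return header
```
<br />
Reading only the first 80 bytes of a file is enough to call this function; the positions it returns can then be used to locate the other structures described on this page.<br />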
<br />
=== Major & Minor versions ===<br />
<br />
The major version is updated when an incompatible change is integrated into the format (a library made for version N will probably not be able to read version N+1).<br />
<br />
The minor version is updated when a compatible change is integrated (a library made for minor version n will be able to read version n+1).<br />
<br />
The current major version is 6. You may find old ZIM archives with major version 5. They are the same as version 6 except that they lack extended clusters, so you can read a major version 5 archive as if it were version 6.<br />
<br />
The minor version can be :<br />
* 0 : We use the old namespace usage (see [[ZIM file format old namespace]])<br />
* 1 : We use the new namespace usage (described here).<br />
* 2 : Means that libzim (the reference implementation) allows creating alias entries (several entries pointing to the same cluster/blob). This was already allowed by the specification; this is only a hint.<br />
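<br />
A reader could turn the version rules above into a small check such as the following sketch. The function name <code>namespace_scheme</code> is hypothetical, and treating every minor version of 1 or more as "new namespace usage" is an assumption derived from the statement that minor versions are compatible changes.<br />
<br />
```python
SUPPORTED_MAJOR_VERSIONS = (5, 6)  # version 5 is version 6 without extended clusters


def namespace_scheme(major: int, minor: int) -> str:
    """Return "old" or "new" depending on the namespace usage of the archive.

    Hypothetical helper: minor >= 1 is assumed to imply the new namespace usage.
    """
    if major not in SUPPORTED_MAJOR_VERSIONS:
        raise ValueError(f"unsupported ZIM major version: {major}")
    return "old" if minor == 0 else "new"
```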
<br />
A ZIM archive may be embedded in another file at a specific offset. Within the ZIM format, the start of the ZIM header is offset 0. Readers that support embedded archives must adjust offsets accordingly.<br />
<br />
== MIME Type List (mimeListPos) ==<br />
The MIME type list always follows directly after the header, so the ''mimeListPos'' also defines the end and size of the ZIM file header.<br />
<br />
The MIME types in this list are zero terminated strings. An empty string marks the end of the MIME type list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st MIME Type> || string || 0 ||zero terminated|| declaration of the <1st MIME Type> <br />
|-<br />
| <2nd MIME Type> || string || n/a ||zero terminated|| declaration of the <2nd MIME Type> <br />
|-<br />
| ... || string || ... ||zero terminated|| ... <br />
|-<br />
| <last entry / end> || string || n/a ||zero terminated|| empty string - end of MIME type list <br />
|}<br />
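<br />
The MIME type list can be read by scanning zero-terminated strings until an empty one is found. The sketch below assumes the whole archive (or at least the relevant part) is available as a <code>bytes</code> object; <code>parse_mime_list</code> is a hypothetical name used only for this example.<br />
<br />
```python
def parse_mime_list(data: bytes, mime_list_pos: int) -> list:
    """Return the MIME types stored at `mime_list_pos`, in declaration order."""
    mime_types = []
    pos = mime_list_pos
    while True:
        end = data.index(b"\x00", pos)  # raises ValueError if a terminator is missing
        if end == pos:
            # An empty string marks the end of the MIME type list.
            return mime_types
        mime_types.append(data[pos:end].decode("utf-8"))
        pos = end + 1
```
<br />
The position of a MIME type in the returned list is the number stored in the <code>mimetype</code> field of a directory entry.<br />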
<br />
== URL Pointer List (urlPtrPos) ==<br />
The URL pointer list is a list of 8 byte offsets to the directory entries.<br />
<br />
The directory entries are always ordered by "full" URL (<code><namespace><path></code>). Ordering is simply done by comparing the URL strings.<br />
<br />
Since directory entries have variable sizes, this list is needed for random access.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st URL> || integer || 0 || 8 || pointer to the directory entry of <1st URL> <br />
|-<br />
| <2nd URL> || integer || 8 || 8 || pointer to the directory entry of <2nd URL> <br />
|-<br />
| <nth URL> || integer ||(n-1)*8|| 8 || pointer to the directory entry of <nth URL> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
Libzim caches directory entries and references the cached entries via the URL pointers.<br />
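<br />
Because every URL pointer is exactly 8 bytes long, the pointer for a given entry can be read directly without scanning the list. A minimal sketch, using a hypothetical helper name and a 0-based entry index (so the nth URL of the table is index n-1):<br />
<br />
```python
import struct


def url_pointer(data: bytes, url_ptr_pos: int, index: int) -> int:
    """Return the file offset of the directory entry number `index` (0-based)."""
    (offset,) = struct.unpack_from("<Q", data, url_ptr_pos + index * 8)
    return offset
```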
<br />
== Title Pointer List (titlePtrPos) ==<br />
The title pointer list is a list of entry indices ordered by title (<code><namespace><title></code>). The title pointer list actually points to entries in the URL pointer list.<br />
<br />
Note that the title pointers are only 4 bytes. They are not offsets in the file but entry numbers.<br />
<br />
To get the offset of an entry from the title pointer list, you have to look it up in the URL pointer list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Title> || integer || 0 || 4 || pointer to the URL pointer of <1st Title> <br />
|-<br />
| <2nd Title> || integer || 4 || 4 || pointer to the URL pointer of <2nd Title> <br />
|-<br />
| <nth Title> || integer ||(n-1)*4|| 4 || pointer to the URL pointer of <nth Title> <br />
|-<br />
| ... || integer || ... || 4 || ... <br />
|}<br />
<br />
There are two reasons for the indirection from titles via URLs to directory entries:<br />
* the pointer list is only half the size, as 4 bytes are enough for each entry<br />
* accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in libzim.<br />
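<br />
The two-step lookup described in this section (title pointer, then URL pointer, then directory entry offset) can be sketched as follows. The function name and its parameters are hypothetical; indices are 0-based.<br />
<br />
```python
import struct


def dirent_offset_by_title_index(data: bytes, url_ptr_pos: int,
                                 title_ptr_pos: int, title_index: int) -> int:
    """Resolve the file offset of the entry at position `title_index` in title order."""
    # Step 1: the 4-byte title pointer is an entry number, not a file offset.
    (entry_number,) = struct.unpack_from("<I", data, title_ptr_pos + title_index * 4)
    # Step 2: the entry number selects an 8-byte offset in the URL pointer list.
    (offset,) = struct.unpack_from("<Q", data, url_ptr_pos + entry_number * 8)
    return offset
```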
<br />
== Directory Entries ==<br />
Directory entries hold the meta information about all entries, images and other objects in a ZIM archive.<br />
<br />
There are different types of directory entries:<br />
<br />
=== Content Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || MIME type number as defined in the MIME type list <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters (must be 0) <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| cluster number || integer || 8 || 4 || cluster number in which the data of this directory entry is stored <br />
|-<br />
| blob number || integer || 12 || 4 || blob number inside the compressed cluster where the contents are stored <br />
|-<br />
| url || string || 16 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list, or empty; if it is empty, the URL is used as the title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
=== Redirect Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || 0xffff for redirect <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| redirect index || integer || 8 || 4 || pointer to the directory entry of the redirect target <br />
|-<br />
| url || string || 12 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list, or empty; if it is empty, the URL is used as the title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
None of the strings should have control characters from U+0000 through U+001F.<br />
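<br />
The two layouts above differ only after the first 8 bytes, so a single function can decode both. This is a minimal sketch with hypothetical names; it ignores the unused parameter data and does not handle the deprecated entry types.<br />
<br />
```python
import struct

REDIRECT_MIMETYPE = 0xFFFF


def _read_cstring(data: bytes, pos: int):
    """Read a zero-terminated UTF-8 string; return it with the next position."""
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("utf-8"), end + 1


def parse_dirent(data: bytes, offset: int) -> dict:
    """Decode a content or redirect directory entry starting at `offset`."""
    mimetype, _param_len, namespace, _revision = struct.unpack_from("<HBcI", data, offset)
    entry = {"mimetype": mimetype, "namespace": namespace.decode("ascii")}
    if mimetype == REDIRECT_MIMETYPE:
        (entry["redirect_index"],) = struct.unpack_from("<I", data, offset + 8)
        pos = offset + 12
    else:
        entry["cluster"], entry["blob"] = struct.unpack_from("<II", data, offset + 8)
        pos = offset + 16
    entry["url"], pos = _read_cstring(data, pos)
    title, pos = _read_cstring(data, pos)
    # An empty title means that the URL is used as the title.
    entry["title"] = title or entry["url"]
    return entry
```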
<br />
=== Linktarget or deleted Entry (DEPRECATED) ===<br />
There are two kinds of deprecated entries that could be found in very old ZIM files (I, the main developer of libzim, have never seen one).<br />
<br />
They have a mimetype equal to 0xfffe or 0xfffd. Reader implementations may check for those values and ignore the whole dirent.<br />
<br />
== Cluster Pointer List (clusterPtrPos) ==<br />
The cluster pointer list is a list of 8 byte offsets which point to all data clusters in a ZIM archive.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Cluster> || integer || 0 || 8 || pointer to the <1st Cluster> <br />
|-<br />
| <2nd Cluster> || integer || 8 || 8 || pointer to the <2nd Cluster> <br />
|-<br />
| <nth Cluster> || integer ||(n-1)*8|| 8 || pointer to the <nth Cluster> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
== Clusters ==<br />
The clusters contain the actual data of the directory entries. Clusters can be compressed or uncompressed. The purpose of clusters is to let the data of more than one directory entry be compressed together inside one cluster, which makes the compression much more efficient. Clusters typically have a size of about 1 MB.<br />
<br />
The first byte of the cluster carries information about the cluster.<br />
<br />
The four low bits identify the cluster compression type:<br />
* No compression is indicated by a value of 1<br />
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]], or more precisely XZ, since there is an XZ header) or 5 (Zstandard compression).<br />
* Other compression algorithms were used in the past and have since been removed: 2 for zlib and 3 for bzip2.<br />
* 0 is an obsolete code for no compression (inherited from Zeno)<br />
<br />
The fifth bit indicates whether the cluster is extended:<br />
* By default (5th bit == 0) the cluster is not extended. The offsets are then stored as 4-byte integers, so the contents stored in the cluster cannot exceed 4 GB.<br />
* If the cluster is extended (5th bit == 1), the offsets are stored as 8-byte integers, so the contents stored in the cluster can exceed 4 GB.<br />
A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5) the cluster is never extended.<br />
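<br />
The first byte of a cluster can be decoded with two bit masks. In this sketch the function name is hypothetical; it returns the raw compression code together with the OFFSET_SIZE implied by the fifth bit.<br />
<br />
```python
def decode_cluster_info(info_byte: int):
    """Split the cluster information byte into (compression code, offset size)."""
    compression = info_byte & 0x0F      # four low bits: compression type
    extended = bool(info_byte & 0x10)   # fifth bit: extended cluster flag
    offset_size = 8 if extended else 4
    return compression, offset_size
```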
<br />
libzim uses [http://tukaani.org/xz/ xz-utils] as its LZMA2 implementation; for Java see [http://tukaani.org/xz/java.html XZ-Java].<br />
<br />
To find the data of a specific directory entry within a cluster, the uncompressed cluster holds, after the first byte, a list of pointers to the blobs it contains.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| cluster information || integer || 0 || 1 || Four low bits: 1: no compression, 4: LZMA2 compressed, 5: zstd compressed<br />
Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8) <br />
|-<br />
| colspan="5" | The following data bytes have to be uncompressed!<br />
|-<br />
| <1st Blob> || integer || 1 || OFFSET_SIZE || offset to the <1st Blob> <br />
|-<br />
| <2nd Blob> || integer || 1+OFFSET_SIZE || OFFSET_SIZE || offset to the <2nd Blob> <br />
|-<br />
| <nth Blob> || integer ||(n-1)*OFFSET_SIZE+1|| OFFSET_SIZE || offset to the <nth Blob> <br />
|-<br />
| ... || integer || ... || OFFSET_SIZE || ... <br />
|-<br />
| <last blob / end> || integer || n/a || OFFSET_SIZE || offset to the end of the cluster <br />
|-<br />
| <1st Blob> || data || n/a || n/a || data of the <1st Blob> <br />
|-<br />
| <2nd Blob> || data || n/a || n/a || data of the <2nd Blob> <br />
|-<br />
| ... || data || ... || n/a || ... <br />
|}<br />
<br />
The offsets address uncompressed data. The last offset points to the end of the data area, so there is always one more offset than there are blobs. Since the first offset points to the start of the first blob's data, the number of offsets can be determined by dividing this first offset by OFFSET_SIZE. The size of a blob is the difference between two consecutive offsets.<br />
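<br />
The rules above are enough to split an uncompressed cluster into its blobs. The sketch below assumes that <code>payload</code> holds the already decompressed bytes that follow the cluster information byte and that offsets are counted from the start of the offset list, which is what the "divide the first offset by OFFSET_SIZE" rule implies. The function name is hypothetical.<br />
<br />
```python
import struct


def split_blobs(payload: bytes, offset_size: int = 4) -> list:
    """Return the blobs of a cluster given its uncompressed payload."""
    fmt = "<I" if offset_size == 4 else "<Q"
    (first,) = struct.unpack_from(fmt, payload, 0)
    count = first // offset_size  # number of offsets, i.e. number of blobs + 1
    offsets = [struct.unpack_from(fmt, payload, i * offset_size)[0]
               for i in range(count)]
    # Each blob spans the range between two consecutive offsets.
    return [payload[start:end] for start, end in zip(offsets, offsets[1:])]
```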
<br />
== Namespaces ==<br />
Namespaces separate the different types of directory entries (which might share the same title or URL) stored in a ZIM archive.<br />
<br />
The new namespace usage puts strong semantics on the namespaces. libzim relies on these semantics and provides a different API for each kind of entry.<br />
<br />
libzim hides the namespace from the user, so an entry `foo.html` in namespace `C` is accessible as `foo.html`. libzim provides a specific API to access metadata.<br />
<br />
Other implementations are free to expose the namespace or not (but need to provide some fallback for all namespace usage).<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Namespace !! Description <br />
|-<br />
| C || User content entries - see [[Article Format]] <br />
|-<br />
| M || ZIM metadata - see [[Metadata]] <br />
|-<br />
| W || Well-known entries (MainPage, Favicon) - see [[Well known entries]] <br />
|-<br />
| X || search indexes - see [[Search indexes]]<br />
|}<br />
<br />
== URLs ==<br />
<br />
=== URL Encoding ===<br />
The URLs in the URL pointer list are UTF-8 encoded and are not URL-encoded (see https://www.ietf.org/rfc/rfc1738.txt).<br />
<br />
Some readers process requests through a component that already performs the URL decoding internally, whereas most readers handle the URLs directly. In the latter case you have to decode the URL before passing it to libzim.<br />
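<br />
For a reader that receives percent-encoded request paths, the decoding step could look like this sketch based on Python's standard <code>urllib.parse.unquote</code>. The wrapper name <code>to_zim_path</code> is hypothetical.<br />
<br />
```python
from urllib.parse import unquote


def to_zim_path(request_path: str) -> str:
    """Decode a percent-encoded request path into the UTF-8 URL stored in the archive."""
    return unquote(request_path)  # unquote decodes escapes as UTF-8 by default
```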
<br />
=== Local Anchors ===<br />
Many articles - especially when a table of contents is used - use local anchors to jump within an article. <br />
<br />
<pre><br />
<a href="foo#headline1">jump to article foo, headline 1</a><br />
</pre><br />
<br />
The browser handles these local anchors by itself. It determines whether another article has to be loaded (that is, whether the local anchor belongs to an article other than the one currently shown) and sends a request containing only the article URL, without the local anchor: in our example "foo". After the article has been loaded, the browser searches for the local anchor and jumps to the right location.<br />
<br />
If you use a common rendering engine or HTML widget, you do not have to care about these cases; you can just use the requests as they are submitted by the engine or widget.<br />
<br />
Should you render the article contents yourself, you have to take care of this before you hand requests to libzim.<br />
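<br />
A renderer that handles links itself could separate the article URL from the local anchor before querying the archive, for instance with the standard <code>urllib.parse.urldefrag</code> function. The wrapper name below is hypothetical.<br />
<br />
```python
from urllib.parse import urldefrag


def split_anchor(href: str):
    """Return (article URL, local anchor); the anchor is "" when absent."""
    url, fragment = urldefrag(href)
    return url, fragment
```
<br />
Only the first element is passed on as the article URL; the second one is used after loading to scroll to the right location.<br />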
<br />
== Encodings ==<br />
=== Character Encoding ===<br />
The standard encoding for ZIM archive content is UTF-8. So both article data and URLs should be handled accordingly.<br />
<br />
=== Integer Encoding ===<br />
All types are little-endian.<br />
<br />
All integers are unsigned integers (uint_16, uint_32, uint_64).<br />
<br />
All lengths are bytes.<br />
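<br />
As a quick illustration of these conventions, an unsigned little-endian integer of any of the listed widths can be read with <code>int.from_bytes</code>. The helper name is hypothetical.<br />
<br />
```python
def read_uint(data: bytes, offset: int, length: int) -> int:
    """Read an unsigned little-endian integer of `length` bytes at `offset`."""
    return int.from_bytes(data[offset:offset + length], "little", signed=False)


# The magic number 72173914 is stored on disk as the bytes 5A 49 4D 04.
magic_bytes = bytes.fromhex("5a494d04")
```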
<br />
== Split ZIM files ==<br />
ZIM archives can be split into multiple files. This is necessary to store big ZIM archives (over 4 GB, for example) on limited file systems such as FAT32. The files can be of any size, but the naming is important. The files should be named as follows (the file name extensions matter): ''foobar.zimaa, foobar.zimab, foobar.zimac''...<br />
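<br />
The expected part names can be generated as in the sketch below. Only the first three names are given on this page; continuing the sequence alphabetically through two-letter suffixes (''az'' followed by ''ba'') is an assumption, and the function name is hypothetical.<br />
<br />
```python
import itertools
import string


def split_part_names(base: str, count: int) -> list:
    """Return the first `count` file names of a split archive called `base`."""
    # Assumption: suffixes run aa, ab, ..., az, ba, ... in alphabetical order.
    suffixes = itertools.product(string.ascii_lowercase, repeat=2)
    return [f"{base}.zim{a}{b}" for a, b in itertools.islice(suffixes, count)]
```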
<br />
== See also ==<br />
* [[Zeno file format]] (deprecated)<br />
* [[ZIM File Format/4]] (deprecated)<br />
* [[ZIM File Example]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=ZIM_file_format&diff=31335ZIM file format2024-02-03T14:09:29Z<p>Kelson: /* Header */</p>
<hr />
<div>Since the beginning of 2021, namespaces are handled differently in the ZIM file format.<br />
<br />
This is a major change in the way entries are handled in libzim and in the surrounding semantics, but it does not break the binary ZIM format. Old libraries and readers remain compatible with new ZIM files.<br />
<br />
This page describes the new format. The old format can be found here: [[ZIM file format old namespace]].[[Image:Schema File Format.png|500px|right]]<br />
== Header ==<br />
A ZIM archive starts with a header :<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !! Offset !! Length !! Description <br />
|-<br />
| magicNumber || integer || 0 || 4 || Magic number to recognise the file format, must be 72173914 (0x44D495A)<br />
|-<br />
| majorVersion<br />
| integer<br />
| 4<br />
| 2<br />
| Major version of the ZIM archive format (6)<br />
|-<br />
| minorVersion || integer || 6 || 2 || Minor version of the ZIM archive format (1 for new namespace usage, 0 for old namespace usage) <br />
|-<br />
| uuid || integer || 8 || 16 || unique id of this zim archive <br />
|-<br />
| entryCount || integer || 24 || 4 || total number of entries <br />
|-<br />
| clusterCount || integer || 28 || 4 || total number of clusters <br />
|-<br />
| urlPtrPos || integer || 32 || 8 || position of the directory pointerlist ordered by URL <br />
|-<br />
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title<br />
This is considered obsolete: readers should use <code>[[Search indexes#Title index v0|X/listing/titleordered/v0]]</code> instead and fall back to <code>titlePtrPos</code> if that entry is not present.<br />
|-<br />
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list <br />
|-<br />
| mimeListPos || integer || 56 || 8 || position of the MIME type list (also header size) <br />
|-<br />
| mainPage || integer || 64 || 4 || main page or 0xffffffff if no main page <br />
|-<br />
| layoutPage || integer || 68 || 4 || [[Layout Page|layout page]] or 0xffffffff if no layout page (deprecated, always 0xffffffff) <br />
|-<br />
| checksumPos || integer || 72 || 8 || pointer to the MD5 checksum of this archive, computed without the checksum itself. This always points 16 bytes before the end of the archive.<br />
|}<br />
<br />
=== Major & Minor versions ===<br />
<br />
The major version is updated when an incompatible change is integrated into the format (a library made for version N will probably not be able to read version N+1).<br />
<br />
The minor version is updated when a compatible change is integrated (a library made for minor version n will be able to read version n+1).<br />
<br />
The current major version is 6. You may find old ZIM archives with major version 5. They are the same as version 6 except that they lack extended clusters, so you can read a major version 5 archive as if it were version 6.<br />
<br />
The minor version can be :<br />
* 0 : We use the old namespace usage (see [[ZIM file format old namespace]])<br />
* 1 : We use the new namespace usage (described here).<br />
* 2 : Means that libzim (the reference implementation) allows creating alias entries (several entries pointing to the same cluster/blob). This was already allowed by the specification; this is only a hint.<br />
<br />
A ZIM archive may be embedded in another file at a specific offset. Within the ZIM format, the start of the ZIM header is offset 0. Readers that support embedded archives must adjust offsets accordingly.<br />
<br />
== MIME Type List (mimeListPos) ==<br />
The MIME type list always follows directly after the header, so the ''mimeListPos'' also defines the end and size of the ZIM file header.<br />
<br />
The MIME types in this list are zero terminated strings. An empty string marks the end of the MIME type list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st MIME Type> || string || 0 ||zero terminated|| declaration of the <1st MIME Type> <br />
|-<br />
| <2nd MIME Type> || string || n/a ||zero terminated|| declaration of the <2nd MIME Type> <br />
|-<br />
| ... || string || ... ||zero terminated|| ... <br />
|-<br />
| <last entry / end> || string || n/a ||zero terminated|| empty string - end of MIME type list <br />
|}<br />
<br />
== URL Pointer List (urlPtrPos) ==<br />
The URL pointer list is a list of 8 byte offsets to the directory entries.<br />
<br />
The directory entries are always ordered by "full" URL (<code><namespace><path></code>). Ordering is simply done by comparing the URL strings.<br />
<br />
Since directory entries have variable sizes, this list is needed for random access.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st URL> || integer || 0 || 8 || pointer to the directory entry of <1st URL> <br />
|-<br />
| <2nd URL> || integer || 8 || 8 || pointer to the directory entry of <2nd URL> <br />
|-<br />
| <nth URL> || integer ||(n-1)*8|| 8 || pointer to the directory entry of <nth URL> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
Libzim caches directory entries and references the cached entries via the URL pointers.<br />
<br />
== Title Pointer List (titlePtrPos) ==<br />
The title pointer list is a list of entry indices ordered by title (<code><namespace><title></code>). The title pointer list actually points to entries in the URL pointer list.<br />
<br />
Note that the title pointers are only 4 bytes. They are not offsets in the file but entry numbers.<br />
<br />
To get the offset of an entry from the title pointer list, you have to look it up in the URL pointer list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Title> || integer || 0 || 4 || pointer to the URL pointer of <1st Title> <br />
|-<br />
| <2nd Title> || integer || 4 || 4 || pointer to the URL pointer of <2nd Title> <br />
|-<br />
| <nth Title> || integer ||(n-1)*4|| 4 || pointer to the URL pointer of <nth Title> <br />
|-<br />
| ... || integer || ... || 4 || ... <br />
|}<br />
<br />
There are two reasons for the indirection from titles via URLs to directory entries:<br />
* the pointer list is only half the size, as 4 bytes are enough for each entry<br />
* accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in libzim.<br />
<br />
== Directory Entries ==<br />
Directory entries hold the meta information about all entries, images and other objects in a ZIM archive.<br />
<br />
There are different types of directory entries:<br />
<br />
=== Content Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || MIME type number as defined in the MIME type list <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters (must be 0) <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| cluster number || integer || 8 || 4 || cluster number in which the data of this directory entry is stored <br />
|-<br />
| blob number || integer || 12 || 4 || blob number inside the compressed cluster where the contents are stored <br />
|-<br />
| url || string || 16 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list, or empty; if it is empty, the URL is used as the title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
=== Redirect Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || 0xffff for redirect <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| redirect index || integer || 8 || 4 || pointer to the directory entry of the redirect target <br />
|-<br />
| url || string || 12 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list, or empty; if it is empty, the URL is used as the title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
None of the strings should have control characters from U+0000 through U+001F.<br />
<br />
=== Linktarget or deleted Entry (DEPRECATED) ===<br />
There are two kinds of deprecated entries that could be found in very old ZIM files (I, the main developer of libzim, have never seen one).<br />
<br />
They have a mimetype equal to 0xfffe or 0xfffd. Reader implementations may check for those values and ignore the whole dirent.<br />
<br />
== Cluster Pointer List (clusterPtrPos) ==<br />
The cluster pointer list is a list of 8 byte offsets which point to all data clusters in a ZIM archive.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Cluster> || integer || 0 || 8 || pointer to the <1st Cluster> <br />
|-<br />
| <2nd Cluster> || integer || 8 || 8 || pointer to the <2nd Cluster> <br />
|-<br />
| <nth Cluster> || integer ||(n-1)*8|| 8 || pointer to the <nth Cluster> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
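<br />
As a rough illustration, the whole list can be decoded in a single call. The sketch below assumes that the archive bytes are already in memory and that the position of the list and the number of clusters are known from the header; the function and parameter names are hypothetical.<br />

```python
import struct


def read_cluster_pointers(data: bytes, cluster_ptr_pos: int, cluster_count: int) -> list[int]:
    """Return the 8-byte little-endian cluster offsets (hypothetical helper)."""
    # "<{n}Q" reads n consecutive unsigned 64-bit little-endian integers.
    return list(struct.unpack_from(f"<{cluster_count}Q", data, cluster_ptr_pos))


# Illustrative data: 4 padding bytes followed by three cluster pointers.
archive = b"\x00" * 4 + struct.pack("<3Q", 100, 2000, 30000)
pointers = read_cluster_pointers(archive, cluster_ptr_pos=4, cluster_count=3)
```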
<br />
== Clusters ==<br />
The clusters contain the actual data of the directory entries. Clusters can be compressed or uncompressed. The purpose of clusters is that the data of more than one directory entry can be compressed inside one cluster, which makes the compression much more efficient. Typically, clusters have a size of about 1 MB.<br />
<br />
The first byte of the cluster identifies some information about the cluster.<br />
<br />
The four low bits identify the cluster compression type:<br />
* No compression is indicated by a value of 1<br />
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]], or more precisely XZ, since there is an XZ header) or 5 (Zstandard compression).<br />
* Other compression algorithms were used in the past and have since been removed: 2 for zlib and 3 for bzip2.<br />
* 0 is an obsolete code for no compression (inherited from Zeno)<br />
<br />
The fifth bit identifies whether the cluster is extended or not:<br />
* By default (5th bit == 0) the cluster is not extended. This means that the offsets are stored as 4-byte integers, so the content stored in the cluster cannot exceed 4 GB.<br />
* If the cluster is extended (5th bit == 1), the offsets are stored as 8-byte integers, so the content stored in the cluster can exceed 4 GB.<br />
A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5) the cluster is never extended.<br />
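<br />
The bit layout of the first byte can be decoded as follows. This is a minimal sketch; the function name and the labels in the lookup table are illustrative, not part of any library.<br />

```python
# Compression codes taken from the list above; the labels are illustrative.
COMPRESSION_NAMES = {
    0: "none (obsolete)",
    1: "none",
    2: "zlib (removed)",
    3: "bzip2 (removed)",
    4: "xz",
    5: "zstd",
}


def decode_cluster_info(info_byte: int) -> tuple[str, int]:
    """Return (compression label, OFFSET_SIZE) for a cluster information byte."""
    compression = info_byte & 0x0F       # four low bits
    extended = bool(info_byte & 0x10)    # fifth bit
    if compression not in COMPRESSION_NAMES:
        raise ValueError(f"unknown compression code {compression}")
    return COMPRESSION_NAMES[compression], 8 if extended else 4


plain = decode_cluster_info(0x01)          # uncompressed, normal cluster
extended_zstd = decode_cluster_info(0x15)  # zstd, extended cluster
```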
<br />
The libzim uses [http://tukaani.org/xz/ xz-utils] as a C++ implementation of lzma2, for Java see [http://tukaani.org/xz/java.html XZ-Java].<br />
<br />
To locate the data of a specific directory entry within a cluster, the uncompressed cluster holds, right after the first byte, a list of pointers to the blobs within the uncompressed cluster.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| cluster information || integer || 0 || 1 || Four low bits: 1: no compression, 4: LZMA2 compressed, 5: zstd compressed<br />
Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8) <br />
|-<br />
| colspan="5" | The following data bytes have to be uncompressed!<br />
|-<br />
| <1st Blob> || integer || 1 || OFFSET_SIZE || offset to the <1st Blob> <br />
|-<br />
| <2nd Blob> || integer || 1+OFFSET_SIZE || OFFSET_SIZE || offset to the <2nd Blob> <br />
|-<br />
| <nth Blob> || integer ||(n-1)*OFFSET_SIZE+1|| OFFSET_SIZE || offset to the <nth Blob> <br />
|-<br />
| ... || integer || ... || OFFSET_SIZE || ... <br />
|-<br />
| <last blob / end> || integer || n/a || OFFSET_SIZE || offset to the end of the cluster <br />
|-<br />
| <1st Blob> || data || n/a || n/a || data of the <1st Blob> <br />
|-<br />
| <2nd Blob> || data || n/a || n/a || data of the <2nd Blob> <br />
|-<br />
| ... || data || ... || n/a || ... <br />
|}<br />
<br />
The offsets address uncompressed data. The last pointer points to the end of the data area, so there is always one more offset than there are blobs. Since the first offset points to the start of the first blob's data, the number of offsets can be determined by dividing this offset by OFFSET_SIZE. The size of one blob is the difference between two consecutive offsets.<br />
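<br />
The offset arithmetic can be sketched as below. The sketch assumes that <code>body</code> holds the already uncompressed bytes that follow the cluster information byte, and that offsets are counted from the start of the offset list (which is what dividing the first offset by OFFSET_SIZE implies). The function name is hypothetical.<br />

```python
import struct


def split_blobs(body: bytes, offset_size: int = 4) -> list[bytes]:
    """Split an uncompressed cluster body into its blobs (hypothetical helper)."""
    fmt = "<I" if offset_size == 4 else "<Q"
    # The first offset points just past the offset list, so it also
    # tells us how many offsets there are.
    first = struct.unpack_from(fmt, body, 0)[0]
    count = first // offset_size
    offsets = [struct.unpack_from(fmt, body, i * offset_size)[0] for i in range(count)]
    # One more offset than blobs: each blob spans two consecutive offsets.
    return [body[start:end] for start, end in zip(offsets, offsets[1:])]


# Illustrative body: three 4-byte offsets (12, 17, 21) followed by two blobs.
body = struct.pack("<3I", 12, 17, 21) + b"hello" + b"zim!"
blobs = split_blobs(body)
```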
<br />
== Namespaces ==<br />
Namespaces separate different types of directory entries - which might have the same title or URL - stored in a ZIM archive.<br />
<br />
The new namespace scheme attaches strong semantics to the namespaces. libzim relies on these semantics and provides different APIs to access the different kinds of entries.<br />
<br />
libzim hides the namespace from the user, so an entry `foo.html` in namespace `C` is accessible as `foo.html`. libzim provides a specific API to access metadata.<br />
<br />
Other implementations are free to expose the namespace or not (but need to provide some fallback for all namespace usages).<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Namespace !! Description <br />
|-<br />
| C || User content entries - see [[Article Format]] <br />
|-<br />
| M || ZIM metadata - see [[Metadata]] <br />
|-<br />
| W || Well-known entries (MainPage, Favicon) - see [[Well known entries]] <br />
|-<br />
| X || search indexes - see [[Search indexes]]<br />
|}<br />
<br />
== URLs ==<br />
<br />
=== URL Encoding ===<br />
The URLs in the UrlPointerlist are UTF-8 encoded and are not URL-encoded (https://www.ietf.org/rfc/rfc1738.txt).<br />
<br />
Some readers sit behind a request layer that already performs the URL decoding, whereas most readers handle the URLs directly. In the latter case you have to do the decoding yourself before you pass the parameter to libzim.<br />
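<br />
For a reader that handles raw request paths, the decoding step might look like the sketch below. It uses Python's standard <code>urllib.parse.unquote</code>; the wrapper name is hypothetical, and how the result is then handed to libzim depends on the binding in use.<br />

```python
from urllib.parse import unquote


def request_path_to_zim_url(request_path: str) -> str:
    """Undo percent-encoding so the result matches the UTF-8 URL stored in the archive."""
    return unquote(request_path)  # decodes %XX sequences as UTF-8 by default


decoded = request_path_to_zim_url("Caf%C3%A9_au_lait")
```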
<br />
=== Local Anchors ===<br />
Many articles - especially when a table of contents is used - use local anchors to jump within an article. <br />
<br />
<pre><br />
<a href="foo#headline1">jump to article foo, headline 1</a><br />
</pre><br />
<br />
The browser handles these local anchors by itself. It determines whether another article has to be loaded (that is, whether the local anchor is inside an article other than the one currently shown) and sends a request with only the article URL, without the local anchor - in our example "foo". After the article has been loaded, the browser searches for the local anchor tag and jumps to the right location.<br />
<br />
If you use a common rendering engine or HTML widget, you don't have to care about these cases; you can just use the requests as they are submitted by the engine / widget.<br />
<br />
Should you render the article contents yourself, you have to take care of this before you hand requests to libzim.<br />
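<br />
A reader that renders content itself could separate the anchor from the article URL along these lines. This is a minimal sketch using Python's standard <code>urllib.parse.urldefrag</code>; the wrapper name is hypothetical.<br />

```python
from urllib.parse import urldefrag


def split_local_anchor(href: str) -> tuple[str, str]:
    """Return (article URL, local anchor); only the article URL is looked up in the archive."""
    url, fragment = urldefrag(href)
    return url, fragment


article, anchor = split_local_anchor("foo#headline1")
```

An empty article URL (for example from <code>#top</code>) means that the anchor refers to the article currently shown.<br />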
<br />
== Encodings ==<br />
=== Character Encoding ===<br />
The standard encoding for ZIM archive content is UTF-8. So both article data and URLs should be handled accordingly.<br />
<br />
=== Integer Encoding ===<br />
All types are little-endian.<br />
<br />
All integers are unsigned integers (uint_16, uint_32, uint_64).<br />
<br />
All lengths are bytes.<br />
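<br />
In practice, these rules mean that every integer field is read as an unsigned little-endian value. A minimal sketch with Python's <code>struct</code> module (the helper names are illustrative):<br />

```python
import struct


def read_u16(buf: bytes, pos: int = 0) -> int:
    return struct.unpack_from("<H", buf, pos)[0]  # unsigned 16-bit, little-endian


def read_u32(buf: bytes, pos: int = 0) -> int:
    return struct.unpack_from("<I", buf, pos)[0]  # unsigned 32-bit, little-endian


def read_u64(buf: bytes, pos: int = 0) -> int:
    return struct.unpack_from("<Q", buf, pos)[0]  # unsigned 64-bit, little-endian


# The least significant byte comes first.
value = read_u32(bytes([0x78, 0x56, 0x34, 0x12]))
```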
<br />
== Split ZIM files ==<br />
ZIM archives can be split into multiple files. This is necessary to store big ZIM archives (over 4 GB, for example) on limited file systems (like FAT32). The files can be of any size, but the naming is really important. The ZIM files should be named as follows (the file name extensions matter): ''foobar.zimaa, foobar.zimab, foobar.zimac''...<br />
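<br />
The naming sequence can be generated as in the sketch below. It assumes that the two-letter suffix keeps counting alphabetically after ''az'' (''ba'', ''bb'', ...), as the Unix <code>split</code> tool does; only the first three names are confirmed by the example above, and the function name is hypothetical.<br />

```python
from itertools import islice, product
from string import ascii_lowercase


def split_zim_names(base: str, count: int) -> list[str]:
    """Return the first `count` chunk names: base.zimaa, base.zimab, ..."""
    # Assumption: two-letter suffixes in alphabetical order (aa, ab, ..., az, ba, ...).
    suffixes = ("".join(pair) for pair in product(ascii_lowercase, repeat=2))
    return [f"{base}.zim{suffix}" for suffix in islice(suffixes, count)]


names = split_zim_names("foobar", 3)
```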
<br />
== See also ==<br />
* [[Zeno file format]] (deprecated)<br />
* [[ZIM File Format/4]] (deprecated)<br />
* [[ZIM File Example]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31334Content team2024-02-02T12:17:36Z<p>Kelson: /* Worflows */</p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format ("books" being understood here as web content stored as single web archives). <br />
== Purpose ==<br />
<br />
Provide web-based educational content to people without internet access, and make the experience as seamless as possible. Access and discovery must be user-friendly and market ready, the content up-to-date and as portable as can technically be.<br />
<br />
== Goals ==<br />
* Book curation must remain focused on educational material, broadly construed;<br />
* Books should have proper visual formatting;<br />
* Books should be up-to-date;<br />
* The Kiwix Library should allow easy and friendly discovery of content.<br />
<br />
== Responsibilities ==<br />
* Content Requests<br />
** Collaborate with requesters to qualify requests properly. Keep them informed.<br />
** Ensure we are allowed and able to fulfill requests<br />
** Initiate new recipes and manage first publishing if new book<br />
** Collaborate with scraper dev. team if necessary<br />
** Keep the tickets up to date<br />
<br />
* Scraping<br />
** Ensure Zimfarm works fine and contribute to its improvements with dev. team<br />
** Analyze failures or unexpected behaviors<br />
** Ensure recipes run properly, fix configuration when necessary and contribute to scraper improvements with dev. team<br />
** Ensure workers are online and are properly configured<br />
** Ensure scrapes lifecycle is correct (Reasonable pipeline size, Running scrapes progressing appropriately, not too many failures)<br />
<br />
* Library management<br />
** Ensure ZIM filenames and location (paths) are correct<br />
** Ensure ZIM Metadata are correct<br />
** Ensure ZIMs are recent and kept up to date (AFAP)<br />
** Ensure library is coherent and user-friendly<br />
<br />
== Policies ==<br />
<br />
=== Publishing ===<br />
* Content has to be legal in Switzerland<br />
* Content should not advertise [https://en.wikipedia.org/wiki/Fringe_theory fringe theory]<br />
* Content should preferably be [https://en.wikipedia.org/wiki/Free_content free content]<br />
* If not free, content should be:<br />
** Open content OR<br />
** Educational content OR<br />
** Covered by an authorization of reproduction<br />
* Any content we publish should<br />
** have (almost) no user visible error<br />
** have proper/correct metadata<br />
** be easily discoverable in the public library<br />
<br />
=== Content Requests ===<br />
* Allow everybody to request new content, or changes to or deletion of existing content<br />
* Track the lifecycle of our content portfolio in full transparency<br />
* New content should be assessed and vetted against the publishing policy (see above)<br />
* Content requests should be closed:<br />
** when fully implemented (user visible)<br />
** when refused or impossible to implement<br />
* ZIM metadata should be provided for new content<br />
* Start scraping only once all prerequisites are satisfied<br />
<br />
=== Scraping ===<br />
* Scraping leadership means the initiative should come from the content team<br />
* First analysis of errors should be done by the content team<br />
* If an error in the scraper is suspected<br />
** An issue should be opened or updated in the corresponding scraper code repository<br />
** Scraper problem analysis does not supersede the content request in any manner<br />
* ZIM quality should be vetted against the publishing policy<br />
* Any recipe should first run successfully in dev before being put in production<br />
* Hardware resources should be saved<br />
<br />
=== Library Management ===<br />
<br />
=== Custom Apps ===<br />
<br />
== Processes ==<br />
<br />
=== Content Requests ===<br />
<br />
=== Scraping ===<br />
<br />
=== Library Management ===<br />
<br />
=== Custom Apps ===<br />
<br />
== Workflows ==<br />
<br />
<br />
=== To create a new recipe for YouTube files ===<br />
<br />
'''It is recommended to clone an existing YouTube recipe.'''<br />
<br />
* Name the recipe according to the [https://github.com/openzim/overview/wiki/Naming-Convention naming conventions].<br />
* In the Language field, choose the language of the website you are creating the recipe for.<br />
* In the Category field, choose (other).<br />
* In the Warehouse path field, always choose (/.hidden/.dev) for the first run so that the resulting file can be tested; once the file has been tested and everything is correct, update the recipe with the proper path (videos).<br />
* Make sure the Status is set to Enabled.<br />
* Set Periodicity to monthly or quarterly.<br />
* In the Offliner field, choose Youtube.<br />
* In the Platform field, choose Youtube.<br />
* Leave the rest unchanged.<br />
<br />
'''In the Youtube command flags:'''<br />
<br />
* In Playlist mode: choose (Not Set) if you are creating the recipe for a whole channel.<br />
* If you are creating the recipe for a playlist, choose (Set).<br />
* In Type: choose (Channel) or (Playlist), depending on the file you need.<br />
* In Youtube ID: type the ID of the channel or the playlist.<br />
* For the API Key: there is a list of keys, assigned mostly according to channel or playlist size; ask for the list to choose the appropriate API key.<br />
* In Zim Name: use the recipe name, following the [https://github.com/openzim/overview/wiki/Naming-Convention naming conventions].<br />
* In Title: type the name you want for the output file.<br />
* In Description: type a short description of the ZIM file you need.<br />
* Leave Optimisation Cache URL as it is (cloned from the old recipe).<br />
* Leave the rest of the fields empty or as in the cloned recipe.<br />
* Finally, click on (Update offliner details) at the bottom.<br />
* Review all your entries once again, then go back to the top of the page and click on (Request).<br />
* After about an hour (or the next day if the source website is large), check whether the recipe failed or succeeded.<br />
* If it succeeded, go to [https://dev.library.kiwix.org/ dev.library.kiwix.org] and check the file you created: check its size and check that it works properly. If the file does not appear, wait a bit, as updates are made every 15 minutes.<br />
* If the file looks good and complete, go back to your recipe and, in the Warehouse path field, change (/.hidden/.dev) to the proper category for your file content (Wikipedia, Wikihow, etc.).<br />
* Click on (Update offliner details) and then click on (Request) again.<br />
* Finally, check the file at https://library.kiwix.org/. If all is good, do not forget to go back to the initial ticket (most likely at zim-requests), add the link to the output file and close the ticket.<br />
<br />
== Members ==<br />
* [https://github.com/Popolechien Popolechien], manager in line<br />
* [https://github.com/RavanJAltaie Ravan], content manager<br />
* [https://github.com/benoit74 Benoit74], scrapers lead dev<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31329Content team2024-01-18T18:29:59Z<p>Kelson: /* Scraping */</p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format ("books" being understood here as web content stored as single web archives). <br />
== Purpose ==<br />
<br />
Provide web-based educational content to people without internet access, and make the experience as seamless as possible. Access and discovery must be user-friendly and market ready, the content up-to-date and as portable as can technically be.<br />
<br />
== Goals ==<br />
* Book curation must remain focused on educational material, broadly construed;<br />
* Books should have proper visual formatting;<br />
* Books should be up-to-date;<br />
* The Kiwix Library should allow easy and friendly discovery of content.<br />
<br />
== Responsibilities ==<br />
* Content Requests<br />
** Collaborate with requesters to qualify requests properly. Keep them informed.<br />
** Ensure we are allowed and able to fulfill requests<br />
** Initiate new recipes and manage first publishing if new book<br />
** Collaborate with scraper dev. team if necessary<br />
** Keep the tickets up to date<br />
<br />
* Scraping<br />
** Ensure Zimfarm works fine and contribute to its improvements with dev. team<br />
** Analyze failures or unexpected behaviors<br />
** Ensure recipes run properly, fix configuration when necessary and contribute to scraper improvements with dev. team<br />
** Ensure workers are online and are properly configured<br />
** Ensure scrapes lifecycle is correct (Reasonable pipeline size, Running scrapes progressing appropriately, not too many failures)<br />
<br />
* Library management<br />
** Ensure ZIM filenames and location (paths) are correct<br />
** Ensure ZIM Metadata are correct<br />
** Ensure ZIMs are recent and kept up to date (AFAP)<br />
** Ensure library is coherent and user-friendly<br />
<br />
== Policies ==<br />
<br />
=== Publishing ===<br />
* Content has to be legal in Switzerland<br />
* Content should not advertise [https://en.wikipedia.org/wiki/Fringe_theory fringe theory]<br />
* Content should preferably be [https://en.wikipedia.org/wiki/Free_content free content]<br />
* If not free, content should be:<br />
** Open content OR<br />
** Educational content OR<br />
** Covered by an authorization of reproduction<br />
* Any content we publish should<br />
** have (almost) no user visible error<br />
** have proper/correct metadata<br />
** be easily discoverable in the public library<br />
<br />
=== Content Requests ===<br />
* Allow everybody to request new content, or changes to or deletion of existing content<br />
* Track the lifecycle of our content portfolio in full transparency<br />
* New content should be assessed and vetted against the publishing policy (see above)<br />
* Content requests should be closed:<br />
** when fully implemented (user visible)<br />
** when refused or impossible to implement<br />
* ZIM metadata should be provided for new content<br />
* Start scraping only once all prerequisites are satisfied<br />
<br />
=== Scraping ===<br />
* Scraping leadership means the initiative should come from the content team<br />
* First analysis of errors should be done by the content team<br />
* If an error in the scraper is suspected<br />
** An issue should be opened or updated in the corresponding scraper code repository<br />
** Scraper problem analysis does not supersede the content request in any manner<br />
* ZIM quality should be vetted against the publishing policy<br />
* Any recipe should first run successfully in dev before being put in production<br />
* Hardware resources should be saved<br />
<br />
=== Library Management ===<br />
<br />
== Processes ==<br />
<br />
=== Content Requests ===<br />
<br />
=== Scraping ===<br />
<br />
=== Library Management ===<br />
<br />
== Workflows ==<br />
<br />
<br />
== Members ==<br />
* [https://github.com/Popolechien Popolechien], manager in line<br />
* [https://github.com/RavanJAltaie Ravan], content manager<br />
* [https://github.com/benoit74 Benoit74], scrapers lead dev<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31328Content team2024-01-18T18:26:14Z<p>Kelson: /* Scraping */</p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format ("books" being understood here as web content stored as single web archives). <br />
== Purpose ==<br />
<br />
Provide web-based educational content to people without internet access, and make the experience as seamless as possible. Access and discovery must be user-friendly and market ready, the content up-to-date and as portable as can technically be.<br />
<br />
== Goals ==<br />
* Book curation must remain focused on educational material, broadly construed;<br />
* Books should have proper visual formatting;<br />
* Books should be up-to-date;<br />
* The Kiwix Library should allow easy and friendly discovery of content.<br />
<br />
== Responsibilities ==<br />
* Content Requests<br />
** Collaborate with requesters to qualify requests properly. Keep them informed.<br />
** Ensure we are allowed and able to fulfill requests<br />
** Initiate new recipes and manage first publishing if new book<br />
** Collaborate with scraper dev. team if necessary<br />
** Keep the tickets up to date<br />
<br />
* Scraping<br />
** Ensure Zimfarm works fine and contribute to its improvements with dev. team<br />
** Analyze failures or unexpected behaviors<br />
** Ensure recipes run properly, fix configuration when necessary and contribute to scraper improvements with dev. team<br />
** Ensure workers are online and are properly configured<br />
** Ensure scrapes lifecycle is correct (Reasonable pipeline size, Running scrapes progressing appropriately, not too many failures)<br />
<br />
* Library management<br />
** Ensure ZIM filenames and location (paths) are correct<br />
** Ensure ZIM Metadata are correct<br />
** Ensure ZIMs are recent and kept up to date (AFAP)<br />
** Ensure library is coherent and user-friendly<br />
<br />
== Policies ==<br />
<br />
=== Publishing ===<br />
* Content has to be legal in Switzerland<br />
* Content should not advertise [https://en.wikipedia.org/wiki/Fringe_theory fringe theory]<br />
* Content should preferably be [https://en.wikipedia.org/wiki/Free_content free content]<br />
* If not free, content should be:<br />
** Open content OR<br />
** Educational content OR<br />
** Covered by an authorization of reproduction<br />
* Any content we publish should<br />
** have (almost) no user visible error<br />
** have proper/correct metadata<br />
** be easily discoverable in the public library<br />
<br />
=== Content Requests ===<br />
* Allow everybody to request new content, or changes to or deletion of existing content<br />
* Track the lifecycle of our content portfolio in full transparency<br />
* New content should be assessed and vetted against the publishing policy (see above)<br />
* Content requests should be closed:<br />
** when fully implemented (user visible)<br />
** when refused or impossible to implement<br />
* ZIM metadata should be provided for new content<br />
* Start scraping only once all prerequisites are satisfied<br />
<br />
=== Scraping ===<br />
* Scraping leadership means the initiative should come from the content team<br />
* First analysis of errors should be done by the content team<br />
* If an error in the scraper is suspected<br />
** An issue should be opened or updated in the corresponding scraper code repository<br />
** Scraper problem analysis does not supersede the content request in any manner<br />
* ZIM quality should be vetted against the publishing policy<br />
* Any recipe should first run successfully in dev before being put in production<br />
* Hardware resources should be saved<br />
<br />
=== Library Management ===<br />
<br />
== Processes ==<br />
<br />
=== Content Requests ===<br />
<br />
=== Scraping ===<br />
<br />
=== Library Management ===<br />
<br />
== Workflows ==<br />
<br />
<br />
== Members ==<br />
* [https://github.com/Popolechien Popolechien], manager in line<br />
* [https://github.com/RavanJAltaie Ravan], content manager<br />
* [https://github.com/benoit74 Benoit74], scrapers lead dev<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31327Content team2024-01-18T18:23:50Z<p>Kelson: </p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format ("books" being understood here as web content stored as single web archives). <br />
== Purpose ==<br />
<br />
Provide web-based educational content to people without internet access, and make the experience as seamless as possible. Access and discovery must be user-friendly and market ready, the content up-to-date and as portable as can technically be.<br />
<br />
== Goals ==<br />
* Book curation must remain focused on educational material, broadly construed;<br />
* Books should have proper visual formatting;<br />
* Books should be up-to-date;<br />
* The Kiwix Library should allow easy and friendly discovery of content.<br />
<br />
== Responsibilities ==<br />
* Content Requests<br />
** Collaborate with requesters to qualify requests properly. Keep them informed.<br />
** Ensure we are allowed and able to fulfill requests<br />
** Initiate new recipes and manage first publishing if new book<br />
** Collaborate with scraper dev. team if necessary<br />
** Keep the tickets up to date<br />
<br />
* Scraping<br />
** Ensure Zimfarm works fine and contribute to its improvements with dev. team<br />
** Analyze failures or unexpected behaviors<br />
** Ensure recipes run properly, fix configuration when necessary and contribute to scraper improvements with dev. team<br />
** Ensure workers are online and are properly configured<br />
** Ensure scrapes lifecycle is correct (Reasonable pipeline size, Running scrapes progressing appropriately, not too many failures)<br />
<br />
* Library management<br />
** Ensure ZIM filenames and location (paths) are correct<br />
** Ensure ZIM Metadata are correct<br />
** Ensure ZIMs are recent and kept up to date (AFAP)<br />
** Ensure library is coherent and user-friendly<br />
<br />
== Policies ==<br />
<br />
=== Publishing ===<br />
* Content has to be legal in Switzerland<br />
* Content should not advertise [https://en.wikipedia.org/wiki/Fringe_theory fringe theory]<br />
* Content should preferably be [https://en.wikipedia.org/wiki/Free_content free content]<br />
* If not free, content should be:<br />
** Open content OR<br />
** Educational content OR<br />
** Covered by an authorization of reproduction<br />
* Any content we publish should<br />
** have (almost) no user visible error<br />
** have proper/correct metadata<br />
** be easily discoverable in the public library<br />
<br />
=== Content Requests ===<br />
* Allow everybody to request new content, or changes to or deletion of existing content<br />
* Track the lifecycle of our content portfolio in full transparency<br />
* New content should be assessed and vetted against the publishing policy (see above)<br />
* Content requests should be closed:<br />
** when fully implemented (user visible)<br />
** when refused or impossible to implement<br />
* ZIM metadata should be provided for new content<br />
* Start scraping only once all prerequisites are satisfied<br />
<br />
=== Scraping ===<br />
<br />
=== Library Management ===<br />
<br />
== Processes ==<br />
<br />
=== Content Requests ===<br />
<br />
=== Scraping ===<br />
<br />
=== Library Management ===<br />
<br />
== Workflows ==<br />
<br />
<br />
== Members ==<br />
* [https://github.com/Popolechien Popolechien], manager in line<br />
* [https://github.com/RavanJAltaie Ravan], content manager<br />
* [https://github.com/benoit74 Benoit74], scrapers lead dev<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31326Content team2024-01-18T18:10:27Z<p>Kelson: /* Content Requests */</p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format ("books" being understood here as web content stored as single web archives). <br />
== Purpose ==<br />
<br />
Provide web-based educational content to people without internet access, and make the experience as seamless as possible. Access and discovery must be user-friendly and market ready, the content up-to-date and as portable as can technically be.<br />
<br />
== Goals ==<br />
* Book curation must remain focused on educational material, broadly construed;<br />
* Books should have proper visual formatting;<br />
* Books should be up-to-date;<br />
* The Kiwix Library should allow easy and friendly discovery of content.<br />
<br />
== Responsibilities ==<br />
* Content Requests<br />
** Collaborate with requesters to qualify requests properly. Keep them informed.<br />
** Ensure we are allowed and able to fulfill requests<br />
** Initiate new recipes and manage first publishing if new book<br />
** Collaborate with scraper dev. team if necessary<br />
** Keep the tickets up to date<br />
<br />
* Scraping<br />
** Ensure Zimfarm works fine and contribute to its improvements with dev. team<br />
** Analyze failures or unexpected behaviors<br />
** Ensure recipes run properly, fix configuration when necessary and contribute to scraper improvements with dev. team<br />
** Ensure workers are online and are properly configured<br />
** Ensure scrapes lifecycle is correct (Reasonable pipeline size, Running scrapes progressing appropriately, not too many failures)<br />
<br />
* Library management<br />
** Ensure ZIM filenames and location (paths) are correct<br />
** Ensure ZIM Metadata are correct<br />
** Ensure ZIMs are recent and kept up to date (AFAP)<br />
** Ensure library is coherent and user-friendly<br />
<br />
== Policies ==<br />
<br />
=== Publishing ===<br />
* Content has to be legal in Switzerland<br />
* Content should not advertise [https://en.wikipedia.org/wiki/Fringe_theory fringe theory]<br />
* Content should preferably be [https://en.wikipedia.org/wiki/Free_content free content]<br />
* If not free, content should be:<br />
** Open content OR<br />
** Educational content OR<br />
** Covered by an authorization of reproduction<br />
<br />
=== Content Requests ===<br />
* Allow everybody to request new content, or changes to or deletion of existing content<br />
* Track the lifecycle of our content portfolio in full transparency<br />
* New content should be assessed and vetted against the publishing policy (see above)<br />
* Content requests should be closed:<br />
** when fully implemented (user visible)<br />
** when refused or impossible to implement<br />
* ZIM metadata should be provided for new content<br />
* Start scraping only once all prerequisites are satisfied<br />
<br />
=== Scraping ===<br />
<br />
=== Library Management ===<br />
<br />
== Processes ==<br />
<br />
=== Content Requests ===<br />
<br />
=== Scraping ===<br />
<br />
=== Library Management ===<br />
<br />
== Workflows ==<br />
<br />
<br />
== Members ==<br />
* [https://github.com/Popolechien Popolechien], manager in line<br />
* [https://github.com/RavanJAltaie Ravan], content manager<br />
* [https://github.com/benoit74 Benoit74], scrapers lead dev<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31325Content team2024-01-18T18:07:35Z<p>Kelson: /* Policies */</p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format ("books" being understood here as web content stored as single web archives). <br />
== Purpose ==<br />
<br />
Provide web-based educational content to people without internet access, and make the experience as seamless as possible. Access and discovery must be user-friendly and market ready, the content up-to-date and as portable as can technically be.<br />
<br />
== Goals ==<br />
* Book curation must remain focused on educational material, broadly construed;<br />
* Books should have proper visual formatting;<br />
* Books should be up-to-date;<br />
* The Kiwix Library should allow easy and friendly discovery of content.<br />
<br />
== Responsibilities ==<br />
* Content Requests<br />
** Collaborate with requesters to qualify requests properly. Keep them informed.<br />
** Ensure we are allowed and able to fulfill requests<br />
** Initiate new recipes and manage first publishing if new book<br />
** Collaborate with scraper dev. team if necessary<br />
** Keep the tickets up to date<br />
<br />
* Scraping<br />
** Ensure Zimfarm works fine and contribute to its improvements with dev. team<br />
** Analyze failures or unexpected behaviors<br />
** Ensure recipes run properly, fix configuration when necessary and contribute to scraper improvements with dev. team<br />
** Ensure workers are online and are properly configured<br />
** Ensure scrapes lifecycle is correct (Reasonable pipeline size, Running scrapes progressing appropriately, not too many failures)<br />
<br />
* Library management<br />
** Ensure ZIM filenames and location (paths) are correct<br />
** Ensure ZIM Metadata are correct<br />
** Ensure ZIMs are recent and kept up to date (AFAP)<br />
** Ensure library is coherent and user-friendly<br />
<br />
== Policies ==<br />
<br />
=== Publishing ===<br />
* Content has to be legal in Switzerland<br />
* Content should not advertise [https://en.wikipedia.org/wiki/Fringe_theory fringe theory]<br />
* Content should preferably be [https://en.wikipedia.org/wiki/Free_content free content]<br />
* If not free, content should be:<br />
** Open content OR<br />
** Educational content OR<br />
** Covered by an authorization of reproduction<br />
<br />
=== Content Requests ===<br />
* Allow everybody to request new content, or changes to or deletion of existing content<br />
* Track the lifecycle of our content portfolio in full transparency<br />
* New content should be assessed and vetted against the publishing policy (see above)<br />
* Content requests should be closed:<br />
** when fully implemented (user visible)<br />
** when refused or impossible to implement<br />
<br />
=== Scraping ===<br />
<br />
=== Library Management ===<br />
<br />
== Processes ==<br />
<br />
=== Content Requests ===<br />
<br />
=== Scraping ===<br />
<br />
=== Library Management ===<br />
<br />
== Workflows ==<br />
<br />
<br />
== Members ==<br />
* [https://github.com/Popolechien Popolechien], manager in line<br />
* [https://github.com/RavanJAltaie Ravan], content manager<br />
* [https://github.com/benoit74 Benoit74], scrapers lead dev<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31324Content team2024-01-18T18:04:47Z<p>Kelson: /* Publishing */</p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format ("books" being understood here as web content stored as single web archives). <br />
== Purpose ==<br />
<br />
Provide web-based educational content to people without internet access, and make the experience as seamless as possible. Access and discovery must be user-friendly and market-ready, and the content up to date and as portable as technically possible.<br />
<br />
== Goals ==<br />
* Book curation must remain focused on educational material, broadly construed;<br />
* Books should have proper visual formatting;<br />
* Books should be up-to-date;<br />
* The Kiwix Library should allow easy and friendly discovery of content.<br />
<br />
== Responsibilities ==<br />
* Content Requests<br />
** Collaborate with requesters to qualify requests properly. Keep them informed.<br />
** Ensure we are allowed and able to fulfill requests<br />
** Initiate new recipes and manage the first publication of new books<br />
** Collaborate with scraper dev. team if necessary<br />
** Keep the tickets up to date<br />
<br />
* Scraping<br />
** Ensure Zimfarm works fine and contribute to its improvements with dev. team<br />
** Analyze failures or unexpected behaviors<br />
** Ensure recipes run properly, fix configuration when necessary and contribute to scraper improvements with dev. team<br />
** Ensure workers are online and are properly configured<br />
** Ensure the scrape lifecycle is correct (reasonable pipeline size, running scrapes progressing appropriately, not too many failures)<br />
<br />
* Library management<br />
** Ensure ZIM filenames and location (paths) are correct<br />
** Ensure ZIM metadata is correct<br />
** Ensure ZIM files are recent and kept up to date (AFAP)<br />
** Ensure library is coherent and user-friendly<br />
<br />
== Policies ==<br />
<br />
=== Publishing ===<br />
* Content has to be legal in Switzerland<br />
* Content should not promote [https://en.wikipedia.org/wiki/Fringe_theory fringe theories]<br />
* Content should preferably be [https://en.wikipedia.org/wiki/Free_content free content]<br />
* If not free, content should be:<br />
** Open content OR<br />
** Educational content OR<br />
** Content with a reproduction authorization<br />
<br />
=== Content Requests ===<br />
* Allow everybody to request new content, or changes to or deletion of existing content<br />
* Track the lifecycle of our content portfolio in full transparency<br />
* New content should be assessed and vetted against the publishing policy (see above)<br />
* Content requests should be closed:<br />
** when fully implemented (user visible)<br />
** when refused or impossible to implement<br />
<br />
=== Scraping ===<br />
<br />
=== Library Management ===<br />
<br />
== Processes ==<br />
<br />
=== Content Requests ===<br />
<br />
=== Scraping ===<br />
<br />
=== Library Management ===<br />
<br />
== Workflows ==<br />
<br />
<br />
== Members ==<br />
* [https://github.com/Popolechien Popolechien], manager in line<br />
* [https://github.com/RavanJAltaie Ravan], content manager<br />
* [https://github.com/benoit74 Benoit74], scrapers lead dev<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31323Content team2024-01-18T18:04:33Z<p>Kelson: /* Policies */</p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format ("books" being understood here as web content stored as single web archives). <br />
== Purpose ==<br />
<br />
Provide web-based educational content to people without internet access, and make the experience as seamless as possible. Access and discovery must be user-friendly and market-ready, and the content up to date and as portable as technically possible.<br />
<br />
== Goals ==<br />
* Book curation must remain focused on educational material, broadly construed;<br />
* Books should have proper visual formatting;<br />
* Books should be up-to-date;<br />
* The Kiwix Library should allow easy and friendly discovery of content.<br />
<br />
== Responsibilities ==<br />
* Content Requests<br />
** Collaborate with requesters to qualify requests properly. Keep them informed.<br />
** Ensure we are allowed and able to fulfill requests<br />
** Initiate new recipes and manage the first publication of new books<br />
** Collaborate with scraper dev. team if necessary<br />
** Keep the tickets up to date<br />
<br />
* Scraping<br />
** Ensure Zimfarm works fine and contribute to its improvements with dev. team<br />
** Analyze failures or unexpected behaviors<br />
** Ensure recipes run properly, fix configuration when necessary and contribute to scraper improvements with dev. team<br />
** Ensure workers are online and are properly configured<br />
** Ensure the scrape lifecycle is correct (reasonable pipeline size, running scrapes progressing appropriately, not too many failures)<br />
<br />
* Library management<br />
** Ensure ZIM filenames and location (paths) are correct<br />
** Ensure ZIM metadata is correct<br />
** Ensure ZIM files are recent and kept up to date (AFAP)<br />
** Ensure library is coherent and user-friendly<br />
<br />
== Policies ==<br />
<br />
=== Publishing ===<br />
* Content has to be legal in Switzerland<br />
* Content should not promote [https://en.wikipedia.org/wiki/Fringe_theory fringe theories]<br />
* Content should preferably be [https://en.wikipedia.org/wiki/Free_content free content]<br />
* If not free, content should be:<br />
** Open content OR<br />
** Educational content OR<br />
** Content with a reproduction authorization<br />
<br />
=== Content Requests ===<br />
* Allow everybody to request new content, or changes to or deletion of existing content<br />
* Track the lifecycle of our content portfolio in full transparency<br />
* New content should be assessed and vetted against the publishing policy (see above)<br />
* Content requests should be closed:<br />
** when fully implemented (user visible)<br />
** when refused or impossible to implement<br />
<br />
=== Scraping ===<br />
<br />
=== Library Management ===<br />
<br />
== Processes ==<br />
<br />
=== Content Requests ===<br />
<br />
=== Scraping ===<br />
<br />
=== Library Management ===<br />
<br />
== Workflows ==<br />
<br />
<br />
== Members ==<br />
* [https://github.com/Popolechien Popolechien], manager in line<br />
* [https://github.com/RavanJAltaie Ravan], content manager<br />
* [https://github.com/benoit74 Benoit74], scrapers lead dev<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31322Content team2024-01-18T17:49:24Z<p>Kelson: </p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format ("books" being understood here as web content stored as single web archives). <br />
== Purpose ==<br />
<br />
Provide web-based educational content to people without internet access, and make the experience as seamless as possible. Access and discovery must be user-friendly and market-ready, and the content up to date and as portable as technically possible.<br />
<br />
== Goals ==<br />
* Book curation must remain focused on educational material, broadly construed;<br />
* Books should have proper visual formatting;<br />
* Books should be up-to-date;<br />
* The Kiwix Library should allow easy and friendly discovery of content.<br />
<br />
== Responsibilities ==<br />
* Content Requests<br />
** Collaborate with requesters to qualify requests properly. Keep them informed.<br />
** Ensure we are allowed and able to fulfill requests<br />
** Initiate new recipes and manage the first publication of new books<br />
** Collaborate with scraper dev. team if necessary<br />
** Keep the tickets up to date<br />
<br />
* Scraping<br />
** Ensure Zimfarm works fine and contribute to its improvements with dev. team<br />
** Analyze failures or unexpected behaviors<br />
** Ensure recipes run properly, fix configuration when necessary and contribute to scraper improvements with dev. team<br />
** Ensure workers are online and are properly configured<br />
** Ensure the scrape lifecycle is correct (reasonable pipeline size, running scrapes progressing appropriately, not too many failures)<br />
<br />
* Library management<br />
** Ensure ZIM filenames and location (paths) are correct<br />
** Ensure ZIM metadata is correct<br />
** Ensure ZIM files are recent and kept up to date (AFAP)<br />
** Ensure library is coherent and user-friendly<br />
<br />
== Policies ==<br />
<br />
=== Content Requests ===<br />
<br />
=== Scraping ===<br />
<br />
=== Library Management ===<br />
<br />
== Processes ==<br />
<br />
=== Content Requests ===<br />
<br />
=== Scraping ===<br />
<br />
=== Library Management ===<br />
<br />
== Workflows ==<br />
<br />
<br />
== Members ==<br />
* [https://github.com/Popolechien Popolechien], manager in line<br />
* [https://github.com/RavanJAltaie Ravan], content manager<br />
* [https://github.com/benoit74 Benoit74], scrapers lead dev<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31321Content team2024-01-18T17:48:55Z<p>Kelson: </p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format ("books" being understood here as web content stored as single web archives). <br />
== Purpose ==<br />
<br />
Provide web-based educational content to people without internet access, and make the experience as seamless as possible. Access and discovery must be user-friendly and market-ready, and the content up to date and as portable as technically possible.<br />
<br />
== Goals ==<br />
* Book curation must remain focused on educational material, broadly construed;<br />
* Books should have proper visual formatting;<br />
* Books should be up-to-date;<br />
* The Kiwix Library should allow easy and friendly discovery of content.<br />
<br />
== Responsibilities ==<br />
* Content Requests<br />
** Collaborate with requesters to qualify requests properly. Keep them informed.<br />
** Ensure we are allowed and able to fulfill requests<br />
** Initiate new recipes and manage the first publication of new books<br />
** Collaborate with scraper dev. team if necessary<br />
** Keep the tickets up to date<br />
<br />
* Scraping<br />
** Ensure Zimfarm works fine and contribute to its improvements with dev. team<br />
** Analyze failures or unexpected behaviors<br />
** Ensure recipes run properly, fix configuration when necessary and contribute to scraper improvements with dev. team<br />
** Ensure workers are online and are properly configured<br />
** Ensure the scrape lifecycle is correct (reasonable pipeline size, running scrapes progressing appropriately, not too many failures)<br />
<br />
* Library management<br />
** Ensure ZIM filenames and location (paths) are correct<br />
** Ensure ZIM metadata is correct<br />
** Ensure ZIM files are recent and kept up to date (AFAP)<br />
** Ensure library is coherent and user-friendly<br />
<br />
== Policies ==<br />
<br />
=== Content Requests ===<br />
<br />
=== Scraping ===<br />
<br />
=== Library Management ===<br />
<br />
== Processes ==<br />
<br />
=== Content Requests ===<br />
<br />
=== Scraping ===<br />
<br />
=== Library Management ===<br />
<br />
== Workflows ==<br />
<br />
<br />
== Members ==<br />
* [https://github.com/Popolechien Popolechien], manager in line<br />
* [https://github.com/RavanJAltaie Ravan], content manager<br />
* [https://github.com/benoit74 Benoit74], scrapers lead dev<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31319Content team2023-12-15T12:38:58Z<p>Kelson: /* Duties */</p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format. It emerged as an effort in the early 2020s as part of a strong [[content strategy]].<br />
<br />
== Purpose ==<br />
<br />
Provide the best possible (appealing, up-to-date, small, 'market-ready', ...) library of ZIM-based books.<br />
<br />
== Goals ==<br />
* Book topics should match actual needs<br />
* Books should have proper visual formatting<br />
* Books should be up-to-date<br />
* The library should allow quick and pleasant discovery of (new) books<br />
<br />
== Duties ==<br />
* Requests<br />
** Collaborate with requesters to qualify requests properly. Keep them informed.<br />
** Ensure we are allowed and able to fulfill requests<br />
** Initiate new recipes and manage the first publication of new books<br />
** Collaborate with scraper dev. team if necessary<br />
** Keep the tickets up to date<br />
<br />
* Scraping<br />
** Ensure Zimfarm works fine and contribute to its improvements with dev. team<br />
** Analyze failures or unexpected behaviors<br />
** Ensure recipes run properly, fix configuration when necessary and contribute to scraper improvements with dev. team<br />
** Ensure workers are online and are properly configured<br />
** Ensure the scrape lifecycle is correct (reasonable pipeline size, running scrapes progressing appropriately, not too many failures)<br />
<br />
* Library management<br />
** Ensure ZIM filenames and location (paths) are correct<br />
** Ensure ZIM metadata is correct<br />
** Ensure ZIM files are recent and kept up to date (AFAP)<br />
** Ensure library is coherent and user-friendly<br />
<br />
== Members ==<br />
* [https://github.com/Popolechien Popolechien], manager in line<br />
* [https://github.com/RavanJAltaie Ravan], content manager<br />
* [https://github.com/benoit74 Benoit74], scrapers lead dev<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31318Content team2023-12-15T12:38:03Z<p>Kelson: /* Duties */</p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format. It emerged as an effort in the early 2020s as part of a strong [[content strategy]].<br />
<br />
== Purpose ==<br />
<br />
Provide the best possible (appealing, up-to-date, small, 'market-ready', ...) library of ZIM-based books.<br />
<br />
== Goals ==<br />
* Book topics should match actual needs<br />
* Books should have proper visual formatting<br />
* Books should be up-to-date<br />
* The library should allow quick and pleasant discovery of (new) books<br />
<br />
== Duties ==<br />
* Requests<br />
** Collaborate with requesters to qualify requests properly. Keep them informed.<br />
** Ensure we are allowed and able to fulfill requests<br />
** Initiate new recipes and manage the first publication of new books<br />
** Collaborate with scraper dev. team if necessary<br />
** Keep the tickets up to date<br />
<br />
* Scraping<br />
** Ensure Zimfarm works fine and contribute to its improvements with dev. team<br />
** Analyze failures or unexpected behaviors<br />
** Ensure recipes run properly, fix configuration when necessary and contribute to scraper improvements with dev. team<br />
** Ensure workers are online and are properly configured<br />
** Ensure the scrape lifecycle is correct (reasonable pipeline size, running scrapes progressing appropriately, not too many failures)<br />
<br />
* CMS<br />
** Ensure ZIM filenames and location (paths) are correct<br />
** Ensure ZIM metadata is correct<br />
** Ensure ZIM files are recent and kept up to date (AFAP)<br />
** Ensure library is coherent and user-friendly<br />
<br />
== Members ==<br />
* [https://github.com/Popolechien Popolechien], manager in line<br />
* [https://github.com/RavanJAltaie Ravan], content manager<br />
* [https://github.com/benoit74 Benoit74], scrapers lead dev<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31316Content team2023-12-15T11:45:14Z<p>Kelson: </p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format. It emerged as an effort in the early 2020s as part of a strong [[content strategy]].<br />
<br />
== Purpose ==<br />
<br />
Provide the best possible (appealing, up-to-date, small, 'market-ready', ...) library of ZIM-based books.<br />
<br />
== Goals ==<br />
* Book topics should match actual needs<br />
* Books should have proper visual formatting<br />
* Books should be up-to-date<br />
* The library should allow quick and pleasant discovery of (new) books<br />
<br />
== Duties ==<br />
* Requests<br />
** Collaborate with requesters to qualify requests properly. Keep them informed.<br />
** Ensure we are allowed and able to fulfill requests<br />
** Initiate new recipes and manage the first publication of new books<br />
** Collaborate with scraper dev. team if necessary<br />
** Keep the tickets up to date<br />
<br />
* Zimfarm<br />
** Ensure Zimfarm works fine and contribute to its improvements with dev. team<br />
** Analyze failures or unexpected behaviors<br />
** Ensure recipes run properly, fix configuration when necessary and contribute to scraper improvements with dev. team<br />
** Ensure workers are online and are properly configured<br />
** Ensure the scrape lifecycle is correct (reasonable pipeline size, running scrapes progressing appropriately, not too many failures)<br />
<br />
* CMS<br />
** Ensure ZIM filenames and location (paths) are correct<br />
** Ensure ZIM metadata is correct<br />
** Ensure ZIM files are recent and kept up to date (AFAP)<br />
** Ensure library is coherent and user-friendly<br />
<br />
== Members ==<br />
* [https://github.com/Popolechien Popolechien], manager in line<br />
* [https://github.com/RavanJAltaie Ravan], content manager<br />
* [https://github.com/benoit74 Benoit74], scrapers lead dev<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31315Content team2023-12-15T11:31:03Z<p>Kelson: /* Duties */</p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format. It emerged as an effort in the early 2020s as part of a strong [[content strategy]].<br />
<br />
== Purpose ==<br />
<br />
Provide the best possible (appealing, up-to-date, small, 'market-ready', ...) library of ZIM-based books.<br />
<br />
== Goals ==<br />
* Book topics should match actual needs<br />
* Books should have proper visual formatting<br />
* Books should be up-to-date<br />
* The library should allow quick and pleasant discovery of (new) books<br />
<br />
== Duties ==<br />
* Requests<br />
** Collaborate with requesters to qualify requests properly. Keep them informed.<br />
** Ensure we are allowed and able to fulfill requests<br />
** Initiate new recipes and manage the first publication of new books<br />
** Collaborate with scraper dev. team if necessary<br />
** Keep the tickets up to date<br />
<br />
* Zimfarm<br />
** Ensure Zimfarm works fine and contribute to its improvements with dev. team<br />
** Analyze failures or unexpected behaviors<br />
** Ensure recipes run properly, fix configuration when necessary and contribute to scraper improvements with dev. team<br />
** Ensure workers are online and are properly configured<br />
** Ensure the scrape lifecycle is correct (reasonable pipeline size, running scrapes progressing appropriately, not too many failures)<br />
<br />
* CMS<br />
** Ensure ZIM filenames and location (paths) are correct<br />
** Ensure ZIM metadata is correct<br />
** Ensure ZIM files are recent and kept up to date (AFAP)<br />
** Ensure library is coherent and user-friendly<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31314Content team2023-12-15T11:28:05Z<p>Kelson: </p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format. It emerged as an effort in the early 2020s as part of a strong [[content strategy]].<br />
<br />
== Purpose ==<br />
<br />
Provide the best possible (appealing, up-to-date, small, 'market-ready', ...) library of ZIM-based books.<br />
<br />
== Goals ==<br />
* Book topics should match actual needs<br />
* Books should have proper visual formatting<br />
* Books should be up-to-date<br />
* The library should allow quick and pleasant discovery of (new) books<br />
<br />
== Duties ==<br />
* Zimfarm<br />
** Ensure Zimfarm works fine and contribute to its improvements with dev. team<br />
** Analyze failures or unexpected behaviors<br />
** Ensure recipes run properly, fix configuration when necessary and contribute to scraper improvements with dev. team<br />
** Ensure workers are online and are properly configured<br />
** Ensure the scrape lifecycle is correct (reasonable pipeline size, running scrapes progressing appropriately, not too many failures)<br />
<br />
* CMS<br />
** Ensure ZIM filenames and location (paths) are correct<br />
** Ensure ZIM metadata is correct<br />
** Ensure ZIM files are recent and kept up to date (AFAP)<br />
** Ensure library is coherent and user-friendly<br />
<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31313Content team2023-12-15T11:16:01Z<p>Kelson: </p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format. It emerged as an effort in the early 2020s as part of a strong [[content strategy]].<br />
<br />
== Purpose ==<br />
<br />
Provide the best possible (appealing, up-to-date, small, 'market-ready', ...) library of ZIM-based books.<br />
<br />
== Goals ==<br />
* Book topics should match actual needs<br />
* Books should have proper visual formatting<br />
* Books should be up-to-date<br />
* The library should allow quick and pleasant discovery of (new) books<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31312Content team2023-12-15T11:15:10Z<p>Kelson: </p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format. It emerged as an effort in the early 2020s as part of a strong [[content strategy]].<br />
<br />
== Vision ==<br />
<br />
Provide the best possible (appealing, up-to-date, small, 'market-ready', ...) library of ZIM-based books.<br />
<br />
== Goals ==<br />
* Book topics should match actual needs<br />
* Books should have proper visual formatting<br />
* Books should be up-to-date<br />
* The library should allow quick and pleasant discovery of (new) books<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31311Content team2023-12-15T11:10:23Z<p>Kelson: </p>
<hr />
<div>The '''Content team''' gathers people in charge of providing books in the ZIM format. It emerged as an effort in the early 2020s as part of a strong [[content strategy]].<br />
<br />
== Vision ==<br />
<br />
Provide the best possible (appealing, up-to-date, small, 'market-ready', ...) library of ZIM-based books.<br />
<br />
== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_Team&diff=31310Content Team2023-12-15T11:04:06Z<p>Kelson: Kelson moved page Content Team to Content team</p>
<hr />
<div>#REDIRECT [[Content team]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31309Content team2023-12-15T11:04:06Z<p>Kelson: Kelson moved page Content Team to Content team</p>
<hr />
<div>== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_team&diff=31308Content team2023-12-15T11:03:57Z<p>Kelson: Created page with "== See also == * Content strategy"</p>
<hr />
<div>== See also ==<br />
* [[Content strategy]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Content_strategy&diff=31307Content strategy2023-12-15T11:03:54Z<p>Kelson: </p>
<hr />
<div>== Bit of History ==<br />
The openZIM project was created almost 15 years ago. Its primary goal was to specify a file-based storage solution to efficiently store and access Wikipedia (MediaWiki) offline; the implementation of an official reader/writer software library came right after. The open ZIM specification and libzim are still at the core of the project. A few years later, the ZIM tools were launched to easily inspect and manipulate ZIM files on the command line.<br />
<br />
But, over the following years, most of the project activity has grown around scrapers, and we now actively maintain more than 15 of them. After a while, it became clear that we should share code to avoid too much duplication and to rationalize maintenance. Therefore, a few libzim bindings and scraper libraries were created.<br />
<br />
In the last few years, facing increasing demand, we have developed the project to industrialise the ZIM creation process. This effort led to the launch of the Zimfarm and of our custom CMS. We have also taken care of the further development of WP1, the Wikipedia selection infrastructure.<br />
<br />
== Context ==<br />
Today, the openZIM project looks pretty different from what it was at the start: it actually delivers far more than first expected. It has moved from a pure software project to what increasingly looks like a publishing organisation.<br />
<br />
While these efforts were being led on openZIM, the Kiwix project continued its own journey. Even if maintaining the Kiwix software stack is a never-ending story, this is a mature portfolio and there is no disruptive plan. But one interesting lesson is that people are not interested in Kiwix itself, but in the content that is made available offline. This is pretty clear if we consider the success of our Android custom apps.<br />
<br />
In 2021, we finally isolated the kiwix-hotspot activities in a dedicated "OffSpot silo", which helps to better appreciate what Kiwix is as an organisation. The OffSpot activities are strategic, but are not much concerned by what follows.<br />
<br />
== Content push ==<br />
<br />
Because of this history and the current context, we believe that we could deliver better if we were more focused on content. We are convinced that we should shift from a software-centric approach to a more content-centric approach. Without leaving the field of software development at all, we should move further toward the publishing field.<br />
<br />
Because users are ultimately more interested in content than in software (which is only a means), we believe this is also a way to improve our access to funding.<br />
<br />
== Goals ==<br />
<br />
The goals are to offer more and better content in the ZIM format, where our software stack can really make a difference.<br />
<br />
By "better" we mean:<br />
* Ensuring the content is attractive and user-friendly. It should not suffer from bad layout, broken links, or similar weaknesses, which are too often the case currently.<br />
* Better-adapted content, which means an effort at the curation level: checked revisions, selections only, any kind of curation.<br />
<br />
By "more" we mean:<br />
* Always ensuring that new versions can be delivered properly<br />
* Providing solutions that allow people to create their own ZIM files following a self-service workflow<br />
* Diversifying the content offer, including non-free content.<br />
<br />
Developing a quality content offer makes no sense if the portfolio is not comprehensive. It is therefore important to develop a comprehensive library.<br />
<br />
== Approach ==<br />
<br />
Pursuing these goals will be a long journey, but we are not starting from nothing. Serious achievements have been made in recent years around the scrapers, the industrialisation of publishing processes and even the quality of the library.<br />
<br />
The primary axes would be:<br />
<br />
* Improve the quality of ZIM files<br />
** Focus investments on scrapers<br />
** Develop the JS-API to enable better interactions between readers and content<br />
<br />
* Continue the development of the tooling to have better control over the publishing of content:<br />
** Improve the automatic quality assurance (QA) chain with "zimcheck" and its integration<br />
** Develop the CMS to have efficient human-driven / semi-automatic control over the library<br />
** Recruit people to focus on publishing/moderation work<br />
<br />
* Develop self-service tools:<br />
** Continue the WP1 project transformation to allow anybody to make selections from Wikimedia projects (not only Wikipedia in English) and better choose revisions<br />
** Improve the publishing solution to allow non-technical people to publish custom applications (maybe not only on Android)<br />
** Develop<br />
<br />
* Have a non-free content approach:<br />
** Allow a ZIM creation and publishing toolchain which is private/proprietary and non-public, primarily for the offspot (cardshop).<br />
** Establish clear pricing for non-free content, with a price that is proportional (the most proprietary content being the most expensive)<br />
<br />
* Improve the library<br />
** Improve search/filtering/relevance of the results shown in the library<br />
** Develop library.kiwix.org into a fancy content store<br />
** Allow proprietary library creation/maintenance on demand<br />
** Finish/improve library integration in Kiwix ports<br />
<br />
== Financing ==<br />
<br />
== Measure of success ==<br />
<br />
== See also ==<br />
* [[Content team]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Search_indexes&diff=31305Search indexes2023-11-25T16:37:28Z<p>Kelson: /* Tittle index */</p>
<hr />
<div>ZIM archives contain specific indexes in the <code>X</code> namespace.<br />
<br />
They are content that reader implementations can use to locate entries for the user.<br />
<br />
All indexes are optional.<br />
<br />
All indexes and listing items MUST be stored in uncompressed clusters.<br />
<br />
== Xapian Indexes ==<br />
Xapian indexes are Xapian databases. They have to be opened using the Xapian library.<br />
<br />
=== Fulltext index ===<br />
'''Namespace''': <code>X</code><br />
<br />
'''Path''': <code>fulltext/xapian</code><br />
<br />
'''Mimetype''' : <code>application/octet-stream+xapian</code><br />
<br />
=== Title index ===<br />
'''Namespace''': <code>X</code><br />
<br />
'''Path''': <code>title/xapian</code><br />
<br />
'''Mimetype''' : <code>application/octet-stream+xapian</code><br />
<br />
== Listing ==<br />
<br />
Listings are lists of entries.<br />
<br />
The content of a listing is a binary array of entry numbers. Each entry number is a 4-byte (little-endian) unsigned integer. An entry number is the index of the entry in the URL pointer list.<br />
<br />
To get the offset of an entry listed in the title pointer list, you have to look it up in the URL pointer list.<br />
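<br />
The listing layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name <code>decode_listing</code> and the sample bytes are hypothetical, and the listing content is assumed to have already been read from the archive into memory.<br />

```python
import struct

ENTRY_NUMBER_SIZE = 4  # each entry number is a 4-byte unsigned integer


def decode_listing(blob: bytes) -> list[int]:
    """Decode raw listing content into a list of entry numbers.

    Hypothetical helper: `blob` is assumed to be the full content of a
    listing item, already extracted from the archive.
    """
    if len(blob) % ENTRY_NUMBER_SIZE != 0:
        raise ValueError("listing size must be a multiple of 4 bytes")
    count = len(blob) // ENTRY_NUMBER_SIZE
    # "<" selects little-endian, "I" is a 4-byte unsigned integer
    return list(struct.unpack(f"<{count}I", blob))


# Hand-built sample listing with three entry numbers (illustration only)
sample = struct.pack("<3I", 7, 0, 42)
print(decode_listing(sample))
```

Each decoded number is an index into the URL pointer list, which still has to be consulted to find the entry itself.<br />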
<br />
=== Title index v0 ===<br />
'''Namespace''': <code>X</code><br />
<br />
'''Path''': <code>listing/titleOrdered/v0</code><br />
<br />
'''Mimetype''' : <code>application/octet-stream+zimlisting</code><br />
<br />
The content of the listing is the list of all entries in the zim archive (all namespaces included, including <code>X/listing/titleOrdered/v0</code> itself).<br />
<br />
Entries are sorted using the key <code><namespace><title></code><br />
<br />
Content size is <code>4 * <nbEntries></code><br />
<br />
This is exactly the same content as the data pointed to by <code>titlePtrPos</code>.<br />
<br />
If present, <code>titlePtrPos</code> should directly point to the data of this entry.<br />
<br />
=== Title index v1 ===<br />
'''Namespace''': <code>X</code><br />
<br />
'''Path''': <code>listing/titleOrdered/v1</code><br />
<br />
'''Mimetype''' : <code>application/octet-stream+zimlisting</code><br />
<br />
The content of the listing is the list of all "article entries" in the zim archive.<br />
<br />
Those "article entries" may be redirects (what an article entry is is not really defined in the spec; it is up to the creator to decide which category an entry belongs to).<br />
<br />
Entries are sorted using the key <code><title></code> (All article entries are in <code>C</code> namespace by definition)<br />
<br />
Content size is <code>4 * <nbArticle></code><br />
<br />
<code>listing/titleOrdered/v1</code> may be used to pick random articles or to search article by title and be sure that no resource entries are included.</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=ZIM_file_format&diff=31304ZIM file format2023-11-15T20:48:52Z<p>Kelson: /* Directory Entries */</p>
<hr />
<div>Since the beginning of 2021, the way namespaces are handled in the ZIM file format has changed.<br />
<br />
This is a major change in the way entries are handled in libzim and in the semantics around them, but it does not break the binary zim format. Old libraries/readers are compatible with new zim files.<br />
<br />
This page describes the new format. The old format can be found here: [[ZIM file format old namespace]].[[Image:Schema File Format.png|500px|right]]<br />
== Header ==<br />
A ZIM archive starts with a header :<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !! Offset !! Length !! Description <br />
|-<br />
| magicNumber || integer || 0 || 4 || Magic number to recognise the file format, must be 72173914 (0x44D495A)<br />
|-<br />
|majorVersion<br />
|integer<br />
|4<br />
|2<br />
|Major version of the ZIM archive format (6)<br />
|-<br />
| minorVersion || integer || 6 || 2 || Minor version of the ZIM archive format (1 for new namespace usage, 0 for old namespace usage) <br />
|-<br />
| uuid || integer || 8 || 16 || unique id of this zim archive <br />
|-<br />
| entryCount || integer || 24 || 4 || total number of entries <br />
|-<br />
| clusterCount || integer || 28 || 4 || total number of clusters <br />
|-<br />
| urlPtrPos || integer || 32 || 8 || position of the directory pointerlist ordered by URL <br />
|-<br />
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title<br />
This is considered obsolete; readers should use <code>[[Search indexes#Title index v0|X/listing/titleOrdered/v0]]</code> instead and fall back to <code>titlePtrPos</code> if the entry is not present.<br />
|-<br />
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list <br />
|-<br />
| mimeListPos || integer || 56 || 8 || position of the MIME type list (also header size) <br />
|-<br />
| mainPage || integer || 64 || 4 || main page or 0xffffffff if no main page <br />
|-<br />
| layoutPage || integer || 68 || 4 || [[Layout Page|layout page]] or 0xffffffff if no layout page (deprecated, always 0xffffffff) <br />
|-<br />
| checksumPos || integer || 72 || 8 || pointer to the MD5 checksum of this archive, computed without the checksum itself. This always points 16 bytes before the end of the archive.<br />
|}<br />
<br />
Major version is updated when an incompatible change is integrated in the format (a lib made for a version N will probably not be able to read a version N+1)<br />
<br />
Minor version is updated when a compatible change is integrated (a lib made for a minor version n will be able to read a version n+1)<br />
<br />
The current major version is 6. You may find old zim archives with major version 5. They are the same as version 6 minus extended clusters, so you can read a major version 5 archive as if it were version 6.<br />
<br />
The minor version can be :<br />
* 0 : We use the old namespace usage (see [[ZIM file format old namespace]])<br />
* 1 : We use the new namespace usage (described here).<br />
A zim archive may be embedded in another file at a specific offset. In the context of the zim format, the start of the zim header is offset 0. Readers that support embedded archives must adjust offsets accordingly.<br />
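<br />
The header table can be read in one pass with Python's <code>struct</code> module. The sketch below is only an illustration of the field layout: the names <code>ZimHeader</code>, <code>parse_header</code> and the optional <code>base_offset</code> argument (for embedded archives) are hypothetical, not libzim API.<br />

```python
import struct
from typing import NamedTuple

# Little-endian, no padding: 4+2+2+16+4+4+8+8+8+8+4+4+8 = 80 bytes.
HEADER_FORMAT = "<IHH16sIIQQQQIIQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
ZIM_MAGIC = 72173914  # 0x44D495A, stored on disk as the bytes b"ZIM\x04"


class ZimHeader(NamedTuple):
    magic_number: int
    major_version: int
    minor_version: int
    uuid: bytes
    entry_count: int
    cluster_count: int
    url_ptr_pos: int
    title_ptr_pos: int
    cluster_ptr_pos: int
    mime_list_pos: int
    main_page: int
    layout_page: int
    checksum_pos: int


def parse_header(data: bytes, base_offset: int = 0) -> ZimHeader:
    """Parse the header; base_offset is where the archive starts in data."""
    header = ZimHeader(*struct.unpack_from(HEADER_FORMAT, data, base_offset))
    if header.magic_number != ZIM_MAGIC:
        raise ValueError("not a ZIM archive (bad magic number)")
    return header


# Hand-made header with arbitrary positions, for illustration only.
raw = struct.pack(HEADER_FORMAT, ZIM_MAGIC, 6, 1, bytes(16),
                  3, 1, 100, 124, 136, 80, 0, 0xFFFFFFFF, 200)
print(parse_header(raw))
```

All positions in the parsed header are relative to the start of the archive, so a reader of an embedded archive would add <code>base_offset</code> to each of them before seeking.<br />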
<br />
== MIME Type List (mimeListPos) ==<br />
The MIME type list always follows directly after the header, so the ''mimeListPos'' also defines the end and size of the ZIM file header.<br />
<br />
The MIME types in this list are zero terminated strings. An empty string marks the end of the MIME type list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st MIME Type> || string || 0 ||zero terminated|| declaration of the <1st MIME Type> <br />
|-<br />
| <2nd MIME Type> || string || n/a ||zero terminated|| declaration of the <2nd MIME Type> <br />
|-<br />
| ... || string || ... ||zero terminated|| ... <br />
|-<br />
| <last entry / end> || string || n/a ||zero terminated|| empty string - end of MIME type list <br />
|}<br />
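<br />
A minimal sketch of reading this list, assuming the whole archive (or at least the region starting at ''mimeListPos'') is available as a <code>bytes</code> object; <code>parse_mime_list</code> is a hypothetical helper name.<br />

```python
def parse_mime_list(data: bytes, mime_list_pos: int) -> list[str]:
    """Read zero-terminated MIME type strings until an empty string is found."""
    mime_types = []
    pos = mime_list_pos
    while True:
        end = data.index(b"\x00", pos)  # position of the terminating zero byte
        if end == pos:                  # empty string: end of the list
            break
        mime_types.append(data[pos:end].decode("utf-8"))
        pos = end + 1
    return mime_types


sample = b"text/html\x00image/png\x00\x00"
print(parse_mime_list(sample, 0))
```

The position of a MIME type in the returned list is the number stored in the <code>mimetype</code> field of directory entries.<br />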
<br />
== URL Pointer List (urlPtrPos) ==<br />
The URL pointer list is a list of 8 byte offsets to the directory entries.<br />
<br />
The directory entries are always ordered by "full" URL (<code><namespace><path></code>). Ordering is simply done by comparing the URL strings.<br />
<br />
Since directory entries have variable sizes this is needed for random access.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st URL> || integer || 0 || 8 || pointer to the directory entry of <1st URL> <br />
|-<br />
| <2nd URL> || integer || 8 || 8 || pointer to the directory entry of <2nd URL> <br />
|-<br />
| <nth URL> || integer ||(n-1)*8|| 8 || pointer to the directory entry of <nth URL> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
Libzim caches directory entries and references the cached entries via the URL pointers.<br />
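<br />
As an illustration of the random access this list provides, the sketch below reads the pointer for a given entry number. It uses zero-based entry numbers (the table above counts from 1), and the function name <code>url_pointer</code> is hypothetical.<br />

```python
import struct


def url_pointer(data: bytes, url_ptr_pos: int, entry_number: int) -> int:
    """Return the file offset of the directory entry with this entry number."""
    (offset,) = struct.unpack_from("<Q", data, url_ptr_pos + entry_number * 8)
    return offset


# Hand-made pointer list with three directory entry offsets.
pointers = struct.pack("<3Q", 1000, 1040, 1100)
print(url_pointer(pointers, 0, 1))
```

Because the directory entries are sorted by full URL, a reader can combine this lookup with a binary search over entry numbers to find an entry by URL.<br />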
<br />
== Title Pointer List (titlePtrPos) ==<br />
The title pointer list is a list of entry indices ordered by title (<code><namespace><title></code>). The title pointer list actually points to entries in the URL pointer list.<br />
<br />
Note that the title pointers are only 4 bytes. They are not offsets in the file but entry numbers.<br />
<br />
To get the offset of an entry from the title pointer list, you have to look it up in the URL pointer list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Title> || integer || 0 || 4 || pointer to the URL pointer of <1st Title> <br />
|-<br />
| <2nd Title> || integer || 4 || 4 || pointer to the URL pointer of <2nd Title> <br />
|-<br />
| <nth Title> || integer ||(n-1)*4|| 4 || pointer to the URL pointer of <nth Title> <br />
|-<br />
| ... || integer || ... || 4 || ... <br />
|}<br />
<br />
The indirection from titles via URLs to directory entries has two reasons:<br />
* the pointer list is only half the size, as 4 bytes are enough for each entry<br />
* accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in libzim.<br />
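<br />
The two-step lookup can be sketched as follows, with zero-based positions and a hypothetical function name; it assumes both pointer lists are reachable in one <code>bytes</code> object.<br />

```python
import struct


def dirent_offset_by_title_rank(data: bytes, title_ptr_pos: int,
                                url_ptr_pos: int, rank: int) -> int:
    """Resolve the rank-th title (zero-based) to a directory entry offset."""
    # Step 1: the title pointer list stores 4-byte entry numbers.
    (entry_number,) = struct.unpack_from("<I", data, title_ptr_pos + rank * 4)
    # Step 2: the URL pointer list stores 8-byte file offsets.
    (offset,) = struct.unpack_from("<Q", data, url_ptr_pos + entry_number * 8)
    return offset


# URL pointer list at position 0, title pointer list right after it.
url_list = struct.pack("<3Q", 100, 200, 300)
title_list = struct.pack("<3I", 2, 0, 1)
data = url_list + title_list
print(dirent_offset_by_title_rank(data, len(url_list), 0, 0))
```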
<br />
== Directory Entries ==<br />
Directory entries hold the meta information about all entries, images and other objects in a ZIM archive.<br />
<br />
There are different types of directory entries:<br />
<br />
=== Content Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || MIME type number as defined in the MIME type list <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters (must be 0) <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| cluster number || integer || 8 || 4 || cluster number in which the data of this directory entry is stored <br />
|-<br />
| blob number || integer || 12 || 4 || blob number inside the compressed cluster where the contents are stored <br />
|-<br />
| url || string || 16 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list or empty; in case it is empty, the URL is used as title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
=== Redirect Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || 0xffff for redirect <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0) <br />
|-<br />
| redirect index || integer || 8 || 4 || pointer to the directory entry of the redirect target <br />
|-<br />
| url || string || 12 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list or empty; in case it is empty, the URL is used as title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
None of the strings should have control characters from U+0000 through U+001F.<br />
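<br />
The two tables above can be turned into a small parser. The following is a sketch under stated assumptions, not the libzim implementation: the <code>Dirent</code> class and the helper names are hypothetical, extra parameters are ignored (their length must be 0), and the deprecated mimetypes 0xfffe and 0xfffd are rejected rather than decoded.<br />

```python
import struct
from dataclasses import dataclass
from typing import Optional

REDIRECT_MIMETYPE = 0xFFFF
DEPRECATED_MIMETYPES = (0xFFFE, 0xFFFD)


@dataclass
class Dirent:
    namespace: str
    url: str
    title: str
    mimetype: Optional[int] = None        # content entries only
    cluster_number: Optional[int] = None  # content entries only
    blob_number: Optional[int] = None     # content entries only
    redirect_index: Optional[int] = None  # redirect entries only


def read_cstring(data: bytes, pos: int) -> tuple[str, int]:
    """Read a zero-terminated UTF-8 string; return it and the next position."""
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("utf-8"), end + 1


def parse_dirent(data: bytes, offset: int = 0) -> Dirent:
    # Common prefix: mimetype (2), parameter len (1), namespace (1), revision (4).
    mimetype, _param_len, namespace, _revision = struct.unpack_from(
        "<HBcI", data, offset)
    if mimetype in DEPRECATED_MIMETYPES:
        raise ValueError("deprecated linktarget/deleted entry")
    fields = {}
    if mimetype == REDIRECT_MIMETYPE:
        (fields["redirect_index"],) = struct.unpack_from("<I", data, offset + 8)
        pos = offset + 12  # url starts at offset 12 for redirects
    else:
        fields["mimetype"] = mimetype
        fields["cluster_number"], fields["blob_number"] = struct.unpack_from(
            "<II", data, offset + 8)
        pos = offset + 16  # url starts at offset 16 for content entries
    url, pos = read_cstring(data, pos)
    title, _ = read_cstring(data, pos)
    # An empty title means the URL is used as title.
    return Dirent(namespace.decode("ascii"), url, title or url, **fields)


raw = struct.pack("<HBcIII", 0, 0, b"C", 0, 7, 2) + b"foo.html\x00Foo\x00"
print(parse_dirent(raw))
```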
<br />
=== Linktarget or deleted Entry (DEPRECATED) ===<br />
There are two kinds of deprecated entries that could be found in very old zim files (the main developer of libzim reports never having seen one).<br />
<br />
They have a mimetype equal to 0xfffe or 0xfffd. Reader implementations may check for those values and ignore the whole dirent.<br />
<br />
== Cluster Pointer List (clusterPtrPos) ==<br />
The cluster pointer list is a list of 8 byte offsets which point to all data clusters in a ZIM archive.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Cluster> || integer || 0 || 8 || pointer to the <1st Cluster> <br />
|-<br />
| <2nd Cluster> || integer || 8 || 8 || pointer to the <2nd Cluster> <br />
|-<br />
| <nth Cluster> || integer ||(n-1)*8|| 8 || pointer to the <nth Cluster> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
== Clusters ==<br />
The clusters contain the actual data of the directory entries. Clusters can be compressed or uncompressed. The purpose of clusters is that the data of more than one directory entry can be compressed inside one cluster, making the compression much more efficient. Typically clusters have a size of about 1 MB.<br />
<br />
The first byte of the cluster identifies some information about the cluster.<br />
<br />
The four low bits identify the cluster compression type:<br />
* No compression is indicated by a value of 1<br />
* Compressed clusters are indicated by a value of 4 ([[LZMA2 compression]] (or more precisely XZ, since there is an XZ header)) or 5 (Zstandard compression).<br />
* Other compression algorithms were used in the past and have since been removed: 2 for zlib and 3 for bzip2.<br />
* 0 is an obsolete code for no compression (inherited from Zeno)<br />
<br />
The fifth bit identifies whether the cluster is extended or not:<br />
* By default (5th bit == 0) the cluster is not extended. This means that the offsets are stored as 4-byte integers. Thus the contents stored in the cluster cannot exceed 4 GiB.<br />
* If the cluster is extended (5th bit == 1), the offsets are stored as 8-byte integers. Thus the contents stored in the cluster can exceed 4 GiB.<br />
A cluster can be extended only if the zim major version is 6. Otherwise (major version == 5) clusters are never extended.<br />
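<br />
A short sketch of decoding the cluster information byte as described above; the function name and the returned tuple shape are illustrative choices, not libzim API.<br />

```python
def decode_cluster_info(info_byte: int) -> tuple[int, int]:
    """Return (compression code, offset size in bytes) for a cluster info byte."""
    compression = info_byte & 0x0F      # four low bits: compression type
    extended = bool(info_byte & 0x10)   # fifth bit: extended cluster flag
    offset_size = 8 if extended else 4
    return compression, offset_size


print(decode_cluster_info(0x05))  # zstd, not extended
print(decode_cluster_info(0x14))  # LZMA2/XZ, extended
```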
<br />
The libzim uses [http://tukaani.org/xz/ xz-utils] as a C++ implementation of lzma2, for Java see [http://tukaani.org/xz/java.html XZ-Java].<br />
<br />
To locate the data of a specific directory entry within a cluster, the uncompressed cluster contains, right after the first byte, a list of pointers to the blobs stored in it.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| cluster information || integer || 0 || 1 || Four low bits : 1: no compression, 4: LZMA2 compressed, 5: zstd compressed
Fifth bit : 0: normal (OFFSET_SIZE=4) 1: extended (OFFSET_SIZE=8) <br />
|-<br />
| colspan="5" | The following data bytes have to be decompressed first if the cluster is compressed!<br />
|-<br />
| <1st Blob> || integer || 1 || OFFSET_SIZE || offset to the <1st Blob> <br />
|-<br />
| <2nd Blob> || integer || 1+OFFSET_SIZE || OFFSET_SIZE || offset to the <2nd Blob> <br />
|-<br />
| <nth Blob> || integer ||(n-1)*OFFSET_SIZE+1|| OFFSET_SIZE || offset to the <nth Blob> <br />
|-<br />
| ... || integer || ... || OFFSET_SIZE || ... <br />
|-<br />
| <last blob / end> || integer || n/a || OFFSET_SIZE || offset to the end of the cluster <br />
|-<br />
| <1st Blob> || data || n/a || n/a || data of the <1st Blob> <br />
|-<br />
| <2nd Blob> || data || n/a || n/a || data of the <2nd Blob> <br />
|-<br />
| ... || data || ... || n/a || ... <br />
|}<br />
<br />
The offset addresses uncompressed data. The last pointer points to the end of the data area. So there is always one more offset than blobs. Since the first offset points to the start of the first data, the number of offsets can be determined by dividing this offset by OFFSET_SIZE. The size of one blob is calculated by the difference of two consecutive offsets.<br />
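<br />
The offset arithmetic above can be sketched as follows. The snippet assumes <code>payload</code> is the already decompressed data that follows the cluster information byte and that offsets are counted from the start of that payload, which is what the "divide the first offset by OFFSET_SIZE" rule implies; <code>split_blobs</code> is a hypothetical name.<br />

```python
import struct


def split_blobs(payload: bytes, offset_size: int = 4) -> list[bytes]:
    """Split a decompressed cluster payload into its blobs."""
    fmt = "<I" if offset_size == 4 else "<Q"
    # The first offset points just past the offset list, so it gives its length.
    (first,) = struct.unpack_from(fmt, payload, 0)
    count = first // offset_size  # number of offsets = number of blobs + 1
    offsets = [struct.unpack_from(fmt, payload, i * offset_size)[0]
               for i in range(count)]
    # Each blob spans two consecutive offsets.
    return [payload[start:end] for start, end in zip(offsets, offsets[1:])]


# Two blobs need three offsets: 3 * 4 = 12 bytes of offsets, then the data.
payload = struct.pack("<3I", 12, 17, 21) + b"hello" + b"zim!"
print(split_blobs(payload))
```

For an extended cluster the same function would be called with <code>offset_size=8</code>.<br />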
<br />
== Namespaces ==<br />
Namespaces separate different types of directory entries - which might have the same title or url - stored in a ZIM archive.<br />
<br />
The new namespace usage attaches strong semantics to the namespaces. libzim relies on these semantics and provides different APIs to access the different kinds of entries.<br />
<br />
libzim hides the namespace from the user, so an entry <code>foo.html</code> in namespace <code>C</code> will be accessible as <code>foo.html</code>. libzim provides a specific API to access metadata.<br />
<br />
Other implementations are free to expose the namespace or not (but need to provide some fallback for all namespace usages)<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Namespace !! Description <br />
|-<br />
| C || User content entries - see [[Article Format]] <br />
|-<br />
| M || ZIM metadata - see [[Metadata]] <br />
|-<br />
| W || Well known entries (MainPage, Favicon) - see [[Well known entries]] <br />
|-<br />
| X || search indexes - see [[Search indexes]]<br />
|}<br />
<br />
== URLs ==<br />
<br />
=== URL Encoding ===<br />
The URLs in the URL pointer list are UTF-8 and are not URL-encoded (https://www.ietf.org/rfc/rfc1738.txt)<br />
<br />
Some readers receive their requests through a layer that already does the URL decoding internally, whereas most readers handle the URLs directly. In the latter case you have to do the decoding before you pass the parameter to libzim.<br />
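<br />
For a reader that handles raw request paths itself, the decoding step might look like the sketch below, using Python's standard <code>urllib.parse.unquote</code>. The function name is hypothetical.<br />

```python
from urllib.parse import unquote


def request_path_to_zim_url(request_path: str) -> str:
    """Percent-decode a request path into the UTF-8 URL stored in the archive."""
    return unquote(request_path)  # decodes %XX sequences as UTF-8 by default


print(request_path_to_zim_url("Caf%C3%A9_au_lait"))
```

The decoding should happen exactly once: decoding an already decoded URL would corrupt paths that contain a literal <code>%</code> character.<br />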
<br />
=== Local Anchors ===<br />
Many articles - especially when a table of contents is used - use local anchors to jump within an article. <br />
<br />
<pre><br />
<a href="foo#headline1">jump to article foo, headline 1</a><br />
</pre><br />
<br />
The browser handles these local anchors by itself. It will determine if another article has to be loaded (local anchor inside another article than the currently shown) and will send a request only with the article URL without the local anchor - in our example "foo". After the article has been loaded the browser will then search for the local anchor tag and jump to the right location.<br />
<br />
If you use a common rendering engine or HTML widget you don't have to care about these cases; you can just use the requests as they are submitted by the engine / widget.<br />
<br />
Should you render the article contents by yourself you have to consider this and take care of it before you hand requests to libzim.<br />
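<br />
If you do handle links yourself, separating the article URL from the local anchor can be sketched with Python's standard <code>urllib.parse.urldefrag</code>; the wrapper name is hypothetical.<br />

```python
from urllib.parse import urldefrag


def split_local_anchor(href: str) -> tuple[str, str]:
    """Split a link into (article URL, local anchor)."""
    url, fragment = urldefrag(href)
    return url, fragment


print(split_local_anchor("foo#headline1"))
```

Only the first element would be used to look up the entry; the second is kept for scrolling once the article is displayed.<br />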
<br />
== Encodings ==<br />
=== Character Encoding ===<br />
The standard encoding for ZIM archive content is UTF-8. So both article data and URLs should be handled accordingly.<br />
<br />
=== Integer Encoding ===<br />
All types are little-endian.<br />
<br />
All integers are unsigned integers (uint_16, uint_32, uint_64).<br />
<br />
All lengths are bytes.<br />
<br />
== Split ZIM files ==<br />
ZIM archives can be split into multiple files. This is necessary to be able to store big (over 4GB for example) ZIM archives on limited file systems (like FAT32). That said, the files can be of any size, but the naming is really important. The ZIM files should be named as follows (the file name extensions matter): ''foobar.zimaa, foobar.zimab, foobar.zimac''...<br />
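<br />
The naming scheme can be sketched as below. The text above only shows the first three names, so the continuation after <code>.zimaz</code> (assumed here to be <code>.zimba</code>, as with two-letter alphabetical suffixes) is an assumption, and <code>split_part_names</code> is a hypothetical helper.<br />

```python
from itertools import islice, product
from string import ascii_lowercase


def split_part_names(base: str, count: int) -> list[str]:
    """Return the first `count` part names: base.zimaa, base.zimab, ..."""
    if count > 26 * 26:
        raise ValueError("two-letter suffixes allow at most 676 parts")
    suffixes = ("".join(pair) for pair in product(ascii_lowercase, repeat=2))
    return [f"{base}.zim{suffix}" for suffix in islice(suffixes, count)]


print(split_part_names("foobar", 3))
```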
<br />
== See also ==<br />
* [[Zeno file format]] (deprecated)<br />
* [[ZIM File Format/4]] (deprecated)<br />
* [[ZIM File Example]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Metadata&diff=31303Metadata2023-11-11T18:44:50Z<p>Kelson: /* Keys */ Small precision how "Counter" Metdata is computed</p>
<hr />
<div>In order to provide a description to each ZIM file that can be easily extracted we defined a special '''namespace M''' and a standardized set of keywords that should be used.<br />
<br />
Every key is defined like an article, the key name is used as the article name, the key value is put into the article text. This way also metadata is compressed, but extendable. Further keys could be used in a ZIM file without breaking the standard but please be aware that maybe the openZIM project will define additional keys in the future. Any ZIM library reading this metadata should ignore missing keys / values and just return NULL values in such cases.<br />
<br />
== Keys ==<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Key !! Mandatory !! Description !! Example<br />
|-<br />
! Name<br />
| yes<br />
| A human readable identifier for the resource. It's the same across versions (should be stable across time).<br />
| ''wikipedia_fr_football''<br />
|-<br />
! Title<br />
| yes<br />
| title of zim file. 30 characters maximum recommended.<br />
| ''English Wikipedia''<br />
|-<br />
! Creator<br />
| yes<br />
| creator(s) of the ZIM file content<br />
| ''English speaking Wikipedia contributors''<br />
|-<br />
! Publisher<br />
| yes<br />
| creator of the ZIM file itself<br />
| ''Wikipedia user Foobar''<br />
|-<br />
! Date<br />
| yes<br />
| ZIM creation date (ISO - YYYY-MM-DD)<br />
| ''2009-11-21''<br />
|-<br />
! Description <br />
| yes<br />
| description of content (one short sentence). 80 characters maximum recommended.<br />
| ''All articles (without images) from the english Wikipedia''<br />
|-<br />
! LongDescription <br />
| no<br />
| extended description of content. Carriage return allowed. 4000 characters maximum recommended.<br />
| ''This ZIM file contains all articles (without images) from the english Wikipedia by 2009-11-10. The topics are ...''<br />
|-<br />
! Language<br />
| yes<br />
| [http://www.sil.org/iso639-3/codes.asp ISO639-3 language identifier]. If there are several, they are comma separated and ordered by "importance" (which should be the number of entries, but in edge cases another criterion can be used).<br />
| ''eng''<br />
|-<br />
! License<br />
| no<br />
| License code of the content.<br />
| ''CC-BY''<br />
|-<br />
! Tags<br />
| no<br />
| A list of [[tags]]<br />
| ''wikipedia;_category:wikipedia;_pictures:no;_videos:no;_details:yes;_ftindex:yes''<br />
|-<br />
! Relation<br />
| no<br />
| URI of external related resources<br />
| <br />
|-<br />
! Flavour<br />
| no<br />
| A human readable string describing how the content has been scraped. It's the same across versions (should be stable across time).<br />
| ''nopic''<br />
|-<br />
! Source<br />
| no<br />
| URI of the original source<br />
| ''https://en.wikipedia.org/''<br />
|-<br />
! Counter<br />
| no<br />
| Number of non-redirect entries per mime-type in the [[ZIM_file_format#Namespaces|C namespace]]<br />
| image/jpeg=5;image/gif=3;image/png=2;...<br />
|-<br />
! Scraper<br />
| no<br />
| Details about the software used to scrape the content, with its version<br />
| mwoffliner 1.2.3<br />
|-<br />
!Illustration_[height]x[width]@[scale]<br />
|yes<br />
|A PNG image (resolution [height] by [width]) to illustrate the zim file.<br />
This must be binary content (PNG) with mimetype <code>image/png</code>.<br />
<br />
We follow the same specification as freedesktop https://specifications.freedesktop.org/icon-theme-spec/icon-theme-spec-latest.html for the size and scale of the icon.<br />
<br />
<code>height</code>, <code>width</code>, <code>scale</code> describe the '''target''' size (where the icon is intended to be displayed) :<br />
<br />
- <code>Illustration_48x48@1</code> is a 48x48 pixels image to be displayed as a 48x48 icon on a scale 1 screen.<br />
<br />
- <code>Illustration_48x48@2</code> is a 96x96 pixels image to be displayed as a 48x48 icon on a scale 2 screen.<br />
<br />
- <code>Illustration_96x96@1</code> is a 96x96 pixels image to be displayed as a 96x96 icon on a scale 1 screen.<br />
<br />
<br />
<code>Illustration_48x48@1</code> is mandatory. Others are optional.<br />
!<br />
|}<br />
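<br />
The relation between an illustration key and the actual pixel size of the stored image can be sketched as follows. The regular expression and the function name are illustrative, and the sketch assumes an integer scale, as in the examples above.<br />

```python
import re

ILLUSTRATION_KEY = re.compile(r"^Illustration_(\d+)x(\d+)@(\d+)$")


def illustration_pixel_size(key: str) -> tuple[int, int]:
    """Return the pixel dimensions of the image stored under this key."""
    match = ILLUSTRATION_KEY.match(key)
    if match is None:
        raise ValueError(f"not an illustration key: {key!r}")
    first, second, scale = (int(group) for group in match.groups())
    # The key gives the target size; the image itself is scaled up.
    return first * scale, second * scale


print(illustration_pixel_size("Illustration_48x48@2"))
```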
<br />
== Favicon (Old zim file) ==<br />
<br />
Old zim files may have an illustration in <code>-/favicon</code> (it can be a redirect to the real content).<br />
<br />
Readers must be able to read the illustration using this path.<br />
<br />
Writers must not set a <code>-/favicon</code> entry.<br />
<br />
== See also ==<br />
* [http://dublincore.org/documents/dces/ Dublin Core]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=ZIM_file_format_old_namespace&diff=31302ZIM file format old namespace2023-11-11T18:43:58Z<p>Kelson: </p>
<hr />
<div>[[Image:Schema File Format.png|500px|right]]<br />
The '''ZIM file format''' is based on the old and deprecated [[Zeno File Format]]. See also a walk through example at [[ZIM File Example]].<br />
It starts with a header, which is described here:<br />
<br />
== Header ==<br />
A ZIM file starts with a header. This is offset 0.<br />
<br />
Length in bytes, all types are little-endian.<br />
<br />
All integers are unsigned integers (uint_16, uint_32, uint_64).<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !! Offset !! Length !! Description <br />
|-<br />
| magicNumber || integer || 0 || 4 || Magic number to recognise the file format, must be 72173914 (0x44D495A)<br />
|-<br />
|majorVersion<br />
|integer<br />
|4<br />
|2<br />
|Major version of the ZIM file format (5 or 6)<br />
|-<br />
| minorVersion || integer || 6 || 2 || Minor version of the ZIM file format <br />
|-<br />
| uuid || integer || 8 || 16 || unique id of this zim file <br />
|-<br />
| articleCount || integer || 24 || 4 || total number of articles <br />
|-<br />
| clusterCount || integer || 28 || 4 || total number of clusters <br />
|-<br />
| urlPtrPos || integer || 32 || 8 || position of the directory pointerlist ordered by URL <br />
|-<br />
| titlePtrPos || integer || 40 || 8 || position of the directory pointerlist ordered by Title <br />
|-<br />
| clusterPtrPos || integer || 48 || 8 || position of the cluster pointer list <br />
|-<br />
| mimeListPos || integer || 56 || 8 || position of the MIME type list (also header size) <br />
|-<br />
| mainPage || integer || 64 || 4 || main page or 0xffffffff if no main page <br />
|-<br />
| layoutPage || integer || 68 || 4 || [[Layout Page|layout page]] or 0xffffffff if no layout page <br />
|-<br />
| checksumPos || integer || 72 || 8 || pointer to the MD5 checksum of this file, computed without the checksum itself. This always points 16 bytes before the end of the file.<br />
|}<br />
<br />
Major version is updated when an incompatible change is integrated in the format (a lib made for a version N will probably not be able to read a version N+1)<br />
<br />
Minor version is updated when a compatible change is integrated (a lib made for a minor version n will be able to read a version n+1)<br />
<br />
There are currently 2 major versions :<br />
* The version 5<br />
* The version 6 (the same as version 5 plus potentially extended clusters)<br />
<br />
== MIME Type List (mimeListPos) ==<br />
The MIME type list always follows directly after the header, so the ''mimeListPos'' also defines the end and size of the ZIM file header.<br />
<br />
The MIME types in this list are zero terminated strings. An empty string marks the end of the MIME type list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st MIME Type> || string || 0 ||zero terminated|| declaration of the <1st MIME Type> <br />
|-<br />
| <2nd MIME Type> || string || n/a ||zero terminated|| declaration of the <2nd MIME Type> <br />
|-<br />
| ... || string || ... ||zero terminated|| ... <br />
|-<br />
| <last entry / end> || string || n/a ||zero terminated|| empty string - end of MIME type list <br />
|}<br />
<br />
== URL Pointer List (urlPtrPos) ==<br />
The URL pointer list is a list of 8 byte offsets to the directory entries.<br />
<br />
The directory entries are always ordered by URL. Ordering is simply done by comparing the URL strings.<br />
<br />
Since directory entries have variable sizes this is needed for random access.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st URL> || integer || 0 || 8 || pointer to the directory entry of <1st URL> <br />
|-<br />
| <2nd URL> || integer || 8 || 8 || pointer to the directory entry of <2nd URL> <br />
|-<br />
| <nth URL> || integer ||(n-1)*8|| 8 || pointer to the directory entry of <nth URL> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
Zimlib caches directory entries and references the cached entries via the URL pointers.<br />
<br />
== Title Pointer List (titlePtrPos) ==<br />
The title pointer list is a list of article indices ordered by title. The title pointer list actually points to entries<br />
in the URL pointer list. Note that the title pointers are only 4 bytes. They are not offsets in the file but article numbers.<br />
To get the offset of an article from the title pointer list, you have to look it up in the URL pointer list.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Title> || integer || 0 || 4 || pointer to the URL pointer of <1st Title> <br />
|-<br />
| <2nd Title> || integer || 4 || 4 || pointer to the URL pointer of <2nd Title> <br />
|-<br />
| <nth Title> || integer ||(n-1)*4|| 4 || pointer to the URL pointer of <nth Title> <br />
|-<br />
| ... || integer || ... || 4 || ... <br />
|}<br />
<br />
The indirection from titles via URLs to directory entries has two reasons:<br />
* the pointer list is only half in size as 4 bytes are enough for each entry<br />
* accessing directory entries by title also makes use of cached directory entries which are referenced by the URL pointers, as implemented in zimlib.<br />
<br />
== Directory Entries ==<br />
Directory entries hold the meta information about all articles, images and other objects in a ZIM file.<br />
<br />
There are many types of directory entries:<br />
<br />
=== Article Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || MIME type number as defined in the MIME type list <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (optional) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history <br />
|-<br />
| cluster number || integer || 8 || 4 || cluster number in which the data of this directory entry is stored <br />
|-<br />
| blob number || integer || 12 || 4 || blob number inside the compressed cluster where the contents are stored <br />
|-<br />
| url || string || 16 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list, or empty; if it is empty, the URL is used as the title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
=== Redirect Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || 0xffff for redirect <br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (optional) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history <br />
|-<br />
| redirect index || integer || 8 || 4 || pointer to the directory entry of the redirect target <br />
|-<br />
| url || string || 12 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list, or empty; if it is empty, the URL is used as the title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
<br />
=== Linktarget or deleted Entry ===<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| mimetype || integer || 0 || 2 || 0xfffe for linktarget, 0xfffd for deleted entry<br />
|-<br />
| parameter len || byte || 2 || 1 || (not used) length of extra parameters <br />
|-<br />
| namespace || char || 3 || 1 || defines to which namespace this directory entry belongs <br />
|-<br />
| revision || integer || 4 || 4 || (optional) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history <br />
|-<br />
| url || string || 16 ||zero terminated|| string with the URL as referred to in the URL pointer list <br />
|-<br />
| title || string || n/a ||zero terminated|| string with a title as referred to in the Title pointer list, or empty; if it is empty, the URL is used as the title <br />
|-<br />
| parameter || data || ||see parameter len|| (not used) extra parameters <br />
|}<br />
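<br />
The three layouts above can be read with one small routine. The following is a minimal sketch, not zimlib's implementation: it assumes little-endian integers, UTF-8 strings, and an in-memory byte buffer, and the names ''parse_dirent'' and ''read_cstring'' are made up for illustration. Field offsets follow the tables above.<br />
<br />
```python
import struct

REDIRECT, LINKTARGET, DELETED = 0xFFFF, 0xFFFE, 0xFFFD

def read_cstring(data, pos):
    """Return (decoded string, position after the terminating zero byte)."""
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("utf-8"), end + 1

def parse_dirent(data, pos=0):
    # Common prefix: mimetype (2), parameter len (1, not used), namespace (1), revision (4).
    mimetype, param_len, namespace, revision = struct.unpack_from("<HBcI", data, pos)
    entry = {"mimetype": mimetype, "namespace": namespace.decode("ascii"), "revision": revision}
    if mimetype == REDIRECT:
        (entry["redirect_index"],) = struct.unpack_from("<I", data, pos + 8)
        strings_at = pos + 12
    elif mimetype in (LINKTARGET, DELETED):
        strings_at = pos + 16  # URL offset as listed in the table above
    else:
        entry["cluster"], entry["blob"] = struct.unpack_from("<II", data, pos + 8)
        strings_at = pos + 16
    entry["url"], title_at = read_cstring(data, strings_at)
    title, _ = read_cstring(data, title_at)
    entry["title"] = title or entry["url"]  # an empty title falls back to the URL
    return entry
```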
<br />
== Cluster Pointer List (clusterPtrPos) ==<br />
The cluster pointer list is a list of 8-byte offsets which point to all data clusters in a ZIM file.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| <1st Cluster> || integer || 0 || 8 || pointer to the <1st Cluster> <br />
|-<br />
| <2nd Cluster> || integer || 8 || 8 || pointer to the <2nd Cluster> <br />
|-<br />
| <nth Cluster> || integer ||(n-1)*8|| 8 || pointer to the <nth Cluster> <br />
|-<br />
| ... || integer || ... || 8 || ... <br />
|}<br />
<br />
== Clusters ==<br />
The clusters contain the actual data of the directory entries. Clusters can be compressed or uncompressed. The purpose of clusters is that the data of more than one directory entry can be compressed inside one cluster, making the compression much more efficient. Typically clusters have a size of about 1 MB.<br />
<br />
The first byte of the cluster identifies some information about the cluster.<br />
<br />
The four low bits indicate whether the cluster is compressed. Uncompressed clusters are indicated by a value of 0 (the default) or 1 (obsolete, inherited from Zeno). Compressed clusters are indicated by a value of 4 for [[LZMA2 compression]] (or more precisely XZ, since there is an XZ header) or 5 for Zstandard compression. Other compression algorithms were used before (2: zlib, 3: bzip2) but have been removed. The zimlib uses [http://tukaani.org/xz/ xz-utils] as its LZMA2 implementation; for Java see [http://tukaani.org/xz/java.html XZ-Java].<br />
<br />
The fifth bit indicates whether the cluster is extended:<br />
* By default (5th bit == 0) the cluster is not extended: offsets are stored as 4-byte integers, so the contents stored in the cluster cannot exceed 4 GB.<br />
* If the cluster is extended (5th bit == 1), offsets are stored as 8-byte integers, so the contents stored in the cluster can exceed 4 GB.<br />
A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5) clusters are never extended.<br />
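<br />
A minimal sketch of decoding the cluster information byte as described above. The function name ''decode_cluster_info'' and the string labels are illustrative only.<br />
<br />
```python
# Values of the four low bits, as listed above (2 and 3 have been removed).
COMPRESSION = {0: "none", 1: "none", 4: "lzma2", 5: "zstd"}

def decode_cluster_info(info: int):
    """Return (compression label, OFFSET_SIZE in bytes) for a cluster info byte."""
    compression = COMPRESSION.get(info & 0x0F, "unknown")  # four low bits
    offset_size = 8 if info & 0x10 else 4                  # fifth bit: extended cluster
    return compression, offset_size
```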
<br />
To find the data of a specific directory entry within a cluster the uncompressed cluster has a list of pointers to blobs within the uncompressed cluster after the first byte.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Field Name !! Type !!Offset!!Length!! Description <br />
|-<br />
| cluster information || integer || 0 || 1 || Four low bits: 0: default (no compression), 1: none (inherited from Zeno), 4: LZMA2 compressed, 5: Zstandard compressed<br />
Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8) <br />
|-<br />
| colspan="5" | If the cluster is compressed, the following data bytes have to be decompressed first!<br />
|-<br />
| <1st Blob> || integer || 1 || OFFSET_SIZE || offset to the <1st Blob> <br />
|-<br />
| <2nd Blob> || integer || 1+OFFSET_SIZE || OFFSET_SIZE || offset to the <2nd Blob> <br />
|-<br />
| <nth Blob> || integer ||(n-1)*OFFSET_SIZE+1|| OFFSET_SIZE || offset to the <nth Blob> <br />
|-<br />
| ... || integer || ... || OFFSET_SIZE || ... <br />
|-<br />
| <last blob / end> || integer || n/a || OFFSET_SIZE || offset to the end of the cluster <br />
|-<br />
| <1st Blob> || data || n/a || n/a || data of the <1st Blob> <br />
|-<br />
| <2nd Blob> || data || n/a || n/a || data of the <2nd Blob> <br />
|-<br />
| ... || data || ... || n/a || ... <br />
|}<br />
<br />
The offset addresses uncompressed data. The last pointer points to the end of the data area. So there is always one more offset than blobs. Since the first offset points to the start of the first data, the number of offsets can be determined by dividing this offset by OFFSET_SIZE. The size of one blob is calculated by the difference of two consecutive offsets.<br />
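<br />
The offset arithmetic can be sketched as follows for an uncompressed cluster held in memory. This is an illustration under stated assumptions, not zimlib's code: integers are taken to be little-endian and offsets are taken to be relative to the first byte after the cluster information byte. The function name ''read_blobs'' is made up.<br />
<br />
```python
import struct

def read_blobs(cluster: bytes):
    """Split an uncompressed cluster into its blobs."""
    info = cluster[0]
    if info & 0x0F not in (0, 1):
        raise ValueError("cluster is compressed; decompress the body first")
    offset_size = 8 if info & 0x10 else 4
    fmt = "<Q" if offset_size == 8 else "<I"
    body = cluster[1:]  # offsets are assumed relative to this point
    # The first offset points just past the offset list, which gives its length.
    (first,) = struct.unpack_from(fmt, body, 0)
    count = first // offset_size
    offsets = [struct.unpack_from(fmt, body, i * offset_size)[0] for i in range(count)]
    # There is one more offset than blobs; each blob spans two consecutive offsets.
    return [body[offsets[i]:offsets[i + 1]] for i in range(count - 1)]
```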
<br />
== Namespaces ==<br />
Namespaces separate different types of directory entries - which might have the same title - stored in the ZIM File Format.<br />
<br />
They can be distinguished by prepending the article namespace to the article name in the URL path, e.g. ''http://localhost/A/Articlename''.<br />
<br />
{| class="sortable" style="border-width:1px; border-style:solid; border-color:#888888; background-color:#eeeeee; border-collapse:collapse; empty-cells:show" cellspacing="0" cellpadding="4" {{Prettytable}}<br />
! Namespace !! Description <br />
|-<br />
| - || layout, e.g. the LayoutPage, CSS, favicon.png (48x48), JavaScript and images not related to the articles <br />
|-<br />
| A || articles - see [[Article Format]] <br />
|-<br />
| B || article meta data - see [[Article Format]] <br />
|-<br />
| I || images, files - see [[Image Handling]] <br />
|-<br />
| J || images, text - see [[Image Handling]] <br />
|-<br />
| M || ZIM metadata - see [[Metadata]] <br />
|-<br />
| U || categories, text - see [[Category Handling]] <br />
|-<br />
| V || categories, article list - see [[Category Handling]] <br />
|-<br />
| W || categories per article, category list - see [[Category Handling]] <br />
|-<br />
| X || search indexes<br />
|}<br />
<br />
== URLs ==<br />
<br />
ZIM contents are addressed using URLs fitting the following pattern: <namespace>/<article_url>. The references in the HTML code of articles (''<a href=""></a>'', ''<img src="">'', etc.) are URL-encoded following the [http://www.ietf.org/rfc/rfc1738.txt RFC 1738] rules.<br />
<br />
Absolute URLs, i.e. with a leading slash (''/''), are forbidden because they would prevent including the ZIM contents in any HTTP sub-hierarchy. ZIM content URLs must consequently be relative.<br />
<br />
The URLs in the UrlPointerlist are not encoded. Some readers rely on request handling that already does the decoding internally, whereas most readers handle the URLs directly. In the latter case you have to decode the URL before you pass it to zimlib; zimlib provides a method to do so.<br />
<br />
=== Local Anchors ===<br />
Many articles - especially when a table of contents is used - use local anchors to jump within an article. <br />
<br />
<pre><br />
<a href="../A/foo#headline1">jump to article foo, headline 1</a><br />
</pre><br />
<br />
The browser handles these local anchors by itself. It will determine if another article has to be loaded (local anchor inside another article than the currently shown) and will send a request only with the article URL without the local anchor - in our example "../A/foo". After the article has been loaded the browser will then search for the local anchor tag and jump to the right location.<br />
<br />
If you use a common rendering engine or HTML widget you don't have to care about these cases; you can just use the requests as they are submitted by the engine / widget.<br />
<br />
Should you render the article contents by yourself you have to consider this and take care of it before you hand requests to zimlib.<br />
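<br />
For a reader that handles links itself, the two preparation steps described above (dropping the local anchor and undoing the URL encoding) could look like the following sketch. It uses Python's standard library rather than zimlib's own decoding method, and the function name ''zim_request_path'' is made up for illustration.<br />
<br />
```python
from urllib.parse import unquote, urldefrag

def zim_request_path(href: str):
    """Return (decoded path to look up, local anchor to jump to afterwards)."""
    path, fragment = urldefrag(href)  # split off "#headline1"
    return unquote(path), fragment    # undo percent-encoding of the path only
```
<br />
With the example link above, ''zim_request_path("../A/foo#headline1")'' yields the path ''../A/foo'' and the anchor ''headline1''.<br />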
<br />
== Encodings ==<br />
=== Character Encoding ===<br />
The standard encoding for ZIM file content is UTF-8. So both article data and URLs should be handled accordingly.<br />
<br />
Old Zeno files used a mixture of Latin1 and UTF-8 so there is still some "auto detection" code left in the ''zimlib'', a workaround for this bug. This will be removed in future versions. Zeno files are not supported anymore.<br />
<br />
=== Integer Encoding ===<br />
For integer encoding the same algorithm as UTF-8 encoding is used. This encoding is also known as "integer compression". It saves some bytes by using variable-length integer fields, depending on the actual value of the number.<br />
<br />
See also http://en.wikipedia.org/wiki/UTF-8#Design.<br />
<br />
Old Zeno files used the QUnicode library instead. By switching to UTF-8 the new format is more standard-adherent and easier to understand.<br />
<br />
== Split ZIM files ==<br />
ZIM files can be split into multiple chunks. This is necessary to store big ZIM files (over 4 GB, for example) on limited file systems (like FAT32). The chunks can be of any size, but the naming is important. The ZIM file chunks should be named as follows (the file name extensions matter): ''foobar.zimaa, foobar.zimab, foobar.zimac''...<br />
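<br />
The naming scheme can be generated as in the sketch below. It assumes the two-letter suffix simply continues alphabetically (''aa'', ''ab'', ... ''az'', ''ba'', ...), which the examples above suggest but do not spell out; the function name ''chunk_names'' is made up.<br />
<br />
```python
from itertools import islice, product
from string import ascii_lowercase

def chunk_names(base: str, count: int):
    """Return the first `count` chunk file names for a split ZIM file."""
    suffixes = ("".join(pair) for pair in product(ascii_lowercase, repeat=2))
    return [f"{base}.zim{suffix}" for suffix in islice(suffixes, count)]
```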
<br />
== See also ==<br />
* [[ZIM file format]]<br />
* [[Zeno file format]] (deprecated)<br />
* [[ZIM File Format/4]] (deprecated)<br />
* [[ZIM File Example]]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Zimit&diff=31287Zimit2023-05-23T15:02:51Z<p>Kelson: /* Source code */</p>
<hr />
<div>'''Zimit''' is a tool for creating a ZIM file of "any" Web site.<br />
<br />
== Context ==<br />
<br />
openZIM provides many scrapers dedicated to specific sources of content such as TED, Wikipedia (MediaWiki), Project Gutenberg, etc. This is a great way to provide quality ZIM files, but developing and maintaining each of them is costly.<br />
<br />
Zimit is our approach to scraping a "random" Web site and getting an acceptable snapshot to be used offline.<br />
<br />
One important point is that specific embedded pieces of JavaScript code, in particular those used to play videos, continue to work.<br />
<br />
== Principle ==<br />
<br />
The principles of Zimit are:<br />
* Crawl the remote Web site to retrieve all the necessary content<br />
* Save all the retrieved content in WARC file(s)<br />
* Convert WARC file(s) to one ZIM file (this implies embedding a reader in the WARC file, so this is a kind of offline Web App)<br />
* Read the ZIM file in any Kiwix reader<br />
<br />
== Player ==<br />
<br />
* The ServiceWorker (SW) is installed on the welcome page. If any other page is loaded while the SW is not yet installed, a redirection to the homepage happens to install the SW, followed by an automatic return to the original page. To achieve that, the HEAD node of each page is modified during the WARC-to-ZIM conversion to insert the appropriate piece of JavaScript.<br />
* In the reader, Wabac.js, only one part is specific to the ZIM content structure, and this is "RemoteWARCProxy". This part knows how to retrieve content from the specific ZIM storage backend. The rest of the code is the same as before.<br />
* Regarding URL rewriting itself, we have two kinds:<br />
** Static URL rewriting, which is done with Wombat (mostly code-driven)<br />
** Fuzzy matching, which is done within the ServiceWorker (mostly data-driven)<br />
* The URL rewriting is done at two levels:<br />
** When the JavaScript code calls specific browser APIs, these calls are overridden and ultimately call Wombat<br />
** When a URL is requested, it goes through the ServiceWorker, which does the fuzzy matching and the URL rewriting.<br />
<br />
== Source code ==<br />
* [https://github.com/webrecorder/browsertrix-crawler Browsertrix crawler], the Web crawler which gathers everything in a [https://en.wikipedia.org/wiki/Web_ARChive WARC] file<br />
** [https://github.com/webrecorder/wombat Wombat], a standalone client-side URL rewriting system<br />
* [https://github.com/openzim/warc2zim Warc2zim], a command line tool transforming a WARC file to a ZIM file<br />
* [https://github.com/webrecorder/wabac.js Wabac.js], the ServiceWorker-based reader for the content<br />
* [https://github.com/openzim/zimit Zimit], the packaging within a Docker image of both Browsertrix and Warc2zim<br />
* [https://github.com/openzim/zimit-frontend Zimit frontend], the Web UI used for the [https://youzim.it Zimit SaaS solution youzim.it]<br />
<br />
== Questions ==<br />
=== Kelson ===<br />
* How well maintained is the Python server Pywb? Who uses it?<br />
* Besides "RemoteWARCProxy", are there other places in Wabac/Wombat with JavaScript code dedicated to Kiwix?<br />
* Is URL rewriting really data-driven? Same question for fuzzy matching?<br />
* Can we easily use Wombat without the rest of Wabac?</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Zimit&diff=31286Zimit2023-05-23T15:02:18Z<p>Kelson: /* Player */</p>
<hr />
<div>'''Zimit''' is a tool for creating a ZIM file of "any" Web site.<br />
<br />
== Context ==<br />
<br />
openZIM provides many scrapers dedicated to specific sources of content such as TED, Wikipedia (MediaWiki), Project Gutenberg, etc. This is a great way to provide quality ZIM files, but developing and maintaining each of them is costly.<br />
<br />
Zimit is our approach to scraping a "random" Web site and getting an acceptable snapshot to be used offline.<br />
<br />
One important point is that specific embedded pieces of JavaScript code, in particular those used to play videos, continue to work.<br />
<br />
== Principle ==<br />
<br />
The principles of Zimit are:<br />
* Crawl the remote Web site to retrieve all the necessary content<br />
* Save all the retrieved content in WARC file(s)<br />
* Convert WARC file(s) to one ZIM file (this implies embedding a reader in the WARC file, so this is a kind of offline Web App)<br />
* Read the ZIM file in any Kiwix reader<br />
<br />
== Player ==<br />
<br />
* The ServiceWorker (SW) is installed on the welcome page. If any other page is loaded while the SW is not yet installed, a redirection to the homepage happens to install the SW, followed by an automatic return to the original page. To achieve that, the HEAD node of each page is modified during the WARC-to-ZIM conversion to insert the appropriate piece of JavaScript.<br />
* In the reader, Wabac.js, only one part is specific to the ZIM content structure, and this is "RemoteWARCProxy". This part knows how to retrieve content from the specific ZIM storage backend. The rest of the code is the same as before.<br />
* Regarding URL rewriting itself, we have two kinds:<br />
** Static URL rewriting, which is done with Wombat (mostly code-driven)<br />
** Fuzzy matching, which is done within the ServiceWorker (mostly data-driven)<br />
* The URL rewriting is done at two levels:<br />
** When the JavaScript code calls specific browser APIs, these calls are overridden and ultimately call Wombat<br />
** When a URL is requested, it goes through the ServiceWorker, which does the fuzzy matching and the URL rewriting.<br />
<br />
== Source code ==<br />
* [https://github.com/webrecorder/browsertrix-crawler Browsertrix crawler], the Web crawler which gathers everything in a [https://en.wikipedia.org/wiki/Web_ARChive WARC] file<br />
** [https://github.com/webrecorder/wombat Wombat], a standalone client-side URL rewriting system<br />
* [https://github.com/openzim/warc2zim Warc2zim], a command line tool transforming a WARC file to a ZIM file<br />
* [https://github.com/webrecorder/wabac.js Wabac.js], the ServiceWorker-based reader for the content<br />
* [https://github.com/openzim/zimit Zimit], the packaging within a Docker image of both Browsertrix and Warc2zim<br />
* [https://github.com/openzim/zimit-frontend Zimit frontend], the Web UI used for the [https://youzim.it Zimit SaaS solution youzim.it]<br />
<br />
== Questions ==<br />
=== Kelson ===<br />
* How well maintained is the Python server Pywb? Who uses it?<br />
* Besides "RemoteWARCProxy", are there other places in Wabac/Wombat with JavaScript code dedicated to Kiwix?<br />
* Is URL rewriting really data-driven? Same question for fuzzy matching?<br />
* Can we easily use Wombat without the rest of Wabac?</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Zimit&diff=31285Zimit2023-05-23T13:33:26Z<p>Kelson: /* Questions */</p>
<hr />
<div>'''Zimit''' is a tool for creating a ZIM file of "any" Web site.<br />
<br />
== Context ==<br />
<br />
openZIM provides many scrapers dedicated to specific sources of content such as TED, Wikipedia (MediaWiki), Project Gutenberg, etc. This is a great way to provide quality ZIM files, but developing and maintaining each of them is costly.<br />
<br />
Zimit is our approach to scraping a "random" Web site and getting an acceptable snapshot to be used offline.<br />
<br />
One important point is that specific embedded pieces of JavaScript code, in particular those used to play videos, continue to work.<br />
<br />
== Principle ==<br />
<br />
The principles of Zimit are:<br />
* Crawl the remote Web site to retrieve all the necessary content<br />
* Save all the retrieved content in WARC file(s)<br />
* Convert WARC file(s) to one ZIM file (this implies embedding a reader in the WARC file, so this is a kind of offline Web App)<br />
* Read the ZIM file in any Kiwix reader<br />
<br />
== Player ==<br />
<br />
* The ServiceWorker (SW) is installed on the welcome page. If any other page is loaded while the SW is not yet installed, a redirection to the homepage happens to install the SW, followed by an automatic return to the original page. To achieve that, the HEAD node of each page is modified during the WARC-to-ZIM conversion to insert the appropriate piece of JavaScript.<br />
* In the reader, Wabac.js, only one part is specific to the ZIM content structure, and this is "RemoteWARCProxy". This part knows how to retrieve content from the specific ZIM storage backend. The rest of the code is the same as before.<br />
* Regarding URL rewriting itself, we have two kinds, both data-driven:<br />
** Static URL rewriting, which is done with Wombat<br />
** Fuzzy matching, which is done within the ServiceWorker<br />
* The URL rewriting is done at two levels:<br />
** When the JavaScript code calls specific browser APIs, these calls are overridden and ultimately call Wombat<br />
** When a URL is requested, it goes through the ServiceWorker, which does the fuzzy matching and the URL rewriting.<br />
<br />
== Source code ==<br />
* [https://github.com/webrecorder/browsertrix-crawler Browsertrix crawler], the Web crawler which gathers everything in a [https://en.wikipedia.org/wiki/Web_ARChive WARC] file<br />
** [https://github.com/webrecorder/wombat Wombat], a standalone client-side URL rewriting system<br />
* [https://github.com/openzim/warc2zim Warc2zim], a command line tool transforming a WARC file to a ZIM file<br />
* [https://github.com/webrecorder/wabac.js Wabac.js], the ServiceWorker-based reader for the content<br />
* [https://github.com/openzim/zimit Zimit], the packaging within a Docker image of both Browsertrix and Warc2zim<br />
* [https://github.com/openzim/zimit-frontend Zimit frontend], the Web UI used for the [https://youzim.it Zimit SaaS solution youzim.it]<br />
<br />
== Questions ==<br />
=== Kelson ===<br />
* How well maintained is the Python server Pywb? Who uses it?<br />
* Besides "RemoteWARCProxy", are there other places in Wabac/Wombat with JavaScript code dedicated to Kiwix?<br />
* Is URL rewriting really data-driven? Same question for fuzzy matching?<br />
* Can we easily use Wombat without the rest of Wabac?</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Zimit&diff=31284Zimit2023-05-23T13:31:39Z<p>Kelson: </p>
<hr />
<div>'''Zimit''' is a tool for creating a ZIM file of "any" Web site.<br />
<br />
== Context ==<br />
<br />
openZIM provides many scrapers dedicated to specific sources of content such as TED, Wikipedia (MediaWiki), Project Gutenberg, etc. This is a great way to provide quality ZIM files, but developing and maintaining each of them is costly.<br />
<br />
Zimit is our approach to scraping a "random" Web site and getting an acceptable snapshot to be used offline.<br />
<br />
One important point is that specific embedded pieces of JavaScript code, in particular those used to play videos, continue to work.<br />
<br />
== Principle ==<br />
<br />
The principles of Zimit are:<br />
* Crawl the remote Web site to retrieve all the necessary content<br />
* Save all the retrieved content in WARC file(s)<br />
* Convert WARC file(s) to one ZIM file (this implies embedding a reader in the WARC file, so this is a kind of offline Web App)<br />
* Read the ZIM file in any Kiwix reader<br />
<br />
== Player ==<br />
<br />
* The ServiceWorker (SW) is installed on the welcome page. If any other page is loaded while the SW is not yet installed, a redirection to the homepage happens to install the SW, followed by an automatic return to the original page. To achieve that, the HEAD node of each page is modified during the WARC-to-ZIM conversion to insert the appropriate piece of JavaScript.<br />
* In the reader, Wabac.js, only one part is specific to the ZIM content structure, and this is "RemoteWARCProxy". This part knows how to retrieve content from the specific ZIM storage backend. The rest of the code is the same as before.<br />
* Regarding URL rewriting itself, we have two kinds, both data-driven:<br />
** Static URL rewriting, which is done with Wombat<br />
** Fuzzy matching, which is done within the ServiceWorker<br />
* The URL rewriting is done at two levels:<br />
** When the JavaScript code calls specific browser APIs, these calls are overridden and ultimately call Wombat<br />
** When a URL is requested, it goes through the ServiceWorker, which does the fuzzy matching and the URL rewriting.<br />
<br />
== Source code ==<br />
* [https://github.com/webrecorder/browsertrix-crawler Browsertrix crawler], the Web crawler which gathers everything in a [https://en.wikipedia.org/wiki/Web_ARChive WARC] file<br />
** [https://github.com/webrecorder/wombat Wombat], a standalone client-side URL rewriting system<br />
* [https://github.com/openzim/warc2zim Warc2zim], a command line tool transforming a WARC file to a ZIM file<br />
* [https://github.com/webrecorder/wabac.js Wabac.js], the ServiceWorker-based reader for the content<br />
* [https://github.com/openzim/zimit Zimit], the packaging within a Docker image of both Browsertrix and Warc2zim<br />
* [https://github.com/openzim/zimit-frontend Zimit frontend], the Web UI used for the [https://youzim.it Zimit SaaS solution youzim.it]<br />
<br />
== Questions ==<br />
=== Kelson ===<br />
* How well maintained is the Python server Pywb? Who uses it?<br />
* Besides "RemoteWARCProxy", are there other places in Wabac/Wombat with JavaScript code dedicated to Kiwix?<br />
* Is URL rewriting really data-driven? Same question for fuzzy matching?</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Zimit&diff=31283Zimit2023-05-23T13:24:15Z<p>Kelson: /* Player */</p>
<hr />
<div>'''Zimit''' is a tool for creating a ZIM file of "any" Web site.<br />
<br />
== Context ==<br />
<br />
openZIM provides many scrapers dedicated to specific sources of content such as TED, Wikipedia (MediaWiki), Project Gutenberg, etc. This is a great way to provide quality ZIM files, but developing and maintaining each of them is costly.<br />
<br />
Zimit is our approach to scraping a "random" Web site and getting an acceptable snapshot to be used offline.<br />
<br />
One important point is that specific embedded pieces of JavaScript code, in particular those used to play videos, continue to work.<br />
<br />
== Principle ==<br />
<br />
The principles of Zimit are:<br />
* Crawl the remote Web site to retrieve all the necessary content<br />
* Save all the retrieved content in WARC file(s)<br />
* Convert WARC file(s) to one ZIM file (this implies embedding a reader in the WARC file, so this is a kind of offline Web App)<br />
* Read the ZIM file in any Kiwix reader<br />
<br />
== Player ==<br />
<br />
* The ServiceWorker (SW) is installed on the welcome page. If any other page is loaded while the SW is not yet installed, a redirection to the homepage happens to install the SW, followed by an automatic return to the original page. To achieve that, the HEAD node of each page is modified during the WARC-to-ZIM conversion to insert the appropriate piece of JavaScript.<br />
* In the reader, Wabac.js, only one part is specific to the ZIM content structure, and this is "RemoteWARCProxy". This part knows how to retrieve content from the specific ZIM storage backend. The rest of the code is the same as before.<br />
* Regarding URL rewriting itself, we have two kinds, both data-driven:<br />
** Static URL rewriting, which is done with Wombat<br />
** Fuzzy matching, which is done within the ServiceWorker<br />
* The URL rewriting is done at two levels:<br />
** When the JavaScript code calls specific browser APIs, these calls are overridden and ultimately call Wombat<br />
** When a URL is requested, it goes through the ServiceWorker, which does the fuzzy matching and the URL rewriting.<br />
<br />
== Source code ==<br />
* [https://github.com/webrecorder/browsertrix-crawler Browsertrix crawler], the Web crawler which gathers everything in a [https://en.wikipedia.org/wiki/Web_ARChive WARC] file<br />
** [https://github.com/webrecorder/wombat Wombat], a standalone client-side URL rewriting system<br />
* [https://github.com/openzim/warc2zim Warc2zim], a command line tool transforming a WARC file to a ZIM file<br />
* [https://github.com/webrecorder/wabac.js Wabac.js], the ServiceWorker-based reader for the content<br />
* [https://github.com/openzim/zimit Zimit], the packaging within a Docker image of both Browsertrix and Warc2zim<br />
* [https://github.com/openzim/zimit-frontend Zimit frontend], the Web UI used for the [https://youzim.it Zimit SaaS solution youzim.it]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Zimit&diff=31281Zimit2023-05-23T10:34:01Z<p>Kelson: /* Player */</p>
<hr />
<div>'''Zimit''' is a tool for creating a ZIM file of "any" Web site.<br />
<br />
== Context ==<br />
<br />
openZIM provides many scrapers dedicated to specific sources of content such as TED, Wikipedia (MediaWiki), Project Gutenberg, etc. This is a great way to provide quality ZIM files, but developing and maintaining each of them is costly.<br />
<br />
Zimit is our approach to scraping a "random" Web site and getting an acceptable snapshot to be used offline.<br />
<br />
One important point is that specific embedded pieces of JavaScript code, in particular those used to play videos, continue to work.<br />
<br />
== Principle ==<br />
<br />
The principles of Zimit are:<br />
* Crawl the remote Web site to retrieve all the necessary content<br />
* Save all the retrieved content in WARC file(s)<br />
* Convert WARC file(s) to one ZIM file (this implies embedding a reader in the WARC file, so this is a kind of offline Web App)<br />
* Read the ZIM file in any Kiwix reader<br />
<br />
== Player ==<br />
<br />
* The ServiceWorker (SW) is installed on the welcome page. If any other page is loaded while the SW is not yet installed, a redirection to the homepage happens to install the SW, followed by an automatic return to the original page. To achieve that, the HEAD node of each page is modified during the WARC-to-ZIM conversion to insert the appropriate piece of JavaScript.<br />
* In the reader, Wabac.js, only one part is specific to the ZIM content structure, and this is "RemoteWARCProxy". This part knows how to retrieve content from the specific ZIM storage backend. The rest of the code is the same as before.<br />
* Regarding URL rewriting itself, we have two kinds, both data-driven:<br />
** Static URL rewriting, which is done with Wombat<br />
** Fuzzy matching, which is done within the ServiceWorker<br />
* The URL rewriting is done at two levels:<br />
** When the JavaScript code calls specific browser APIs, these calls are overridden and ultimately call Wombat<br />
** When a URL is requested, it goes through the ServiceWorker, which does the fuzzy matching and the URL rewriting.<br />
<br />
== Source code ==<br />
* [https://github.com/webrecorder/browsertrix-crawler Browsertrix crawler], the Web crawler which gathers everything in a [https://en.wikipedia.org/wiki/Web_ARChive WARC] file<br />
** [https://github.com/webrecorder/wombat Wombat], a standalone client-side URL rewriting system<br />
* [https://github.com/openzim/warc2zim Warc2zim], a command line tool transforming a WARC file to a ZIM file<br />
* [https://github.com/webrecorder/wabac.js Wabac.js], the ServiceWorker-based reader for the content<br />
* [https://github.com/openzim/zimit Zimit], the packaging within a Docker image of both Browsertrix and Warc2zim<br />
* [https://github.com/openzim/zimit-frontend Zimit frontend], the Web UI used for the [https://youzim.it Zimit SaaS solution youzim.it]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Zimit&diff=31280Zimit2023-05-22T12:54:10Z<p>Kelson: /* Player */</p>
<hr />
<div>'''Zimit''' is a tool for creating a ZIM file of "any" Web site.<br />
<br />
== Context ==<br />
<br />
openZIM provides many scrapers dedicated to specific sources of content such as TED, Wikipedia (MediaWiki), Project Gutenberg, etc. This is a great way to provide quality ZIM files, but developing and maintaining each of them is costly.<br />
<br />
Zimit is our approach to scraping a "random" Web site and getting an acceptable snapshot to be used offline.<br />
<br />
One important point is that specific embedded pieces of JavaScript code, in particular those used to play videos, continue to work.<br />
<br />
== Principle ==<br />
<br />
Rhe principles of Zimit are:<br />
* Crawl the remote WebSite to retrieve all the necessary content<br />
* Save all the retrieved content in WARC file(s)<br />
* Convert the WARC file(s) to one ZIM file (this implies embedding a reader in the ZIM file, so the result is a kind of offline Web app)<br />
* Read the ZIM file in any Kiwix reader<br />
<br />
== Player ==<br />
<br />
* The ServiceWorker (SW) is installed from the welcome page. If any other page is loaded while the SW is not yet installed, the browser is redirected to the welcome page to install the SW and then automatically returns to the original page. To achieve this, the HEAD node of each page is modified during the WARC-to-ZIM conversion to insert the appropriate piece of JavaScript.<br />
* In the reader, Wabac.js, there is one part specific to the ZIM content structure: "RemoteWARCProxy". The rest of the code is the same as before.<br />
<br />
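The HEAD modification performed at conversion time can be pictured with the following minimal Python sketch. It is an assumption-laden illustration, not the actual warc2zim implementation: the function name <code>inject_head_script</code> and the script path <code>sw_check.js</code> are hypothetical, and a regular expression stands in for whatever HTML processing the real converter uses.<br />

```python
import re

# Hypothetical script path; the file actually inserted by the converter may differ.
SW_CHECK_SCRIPT = '<script src="./sw_check.js"></script>'

# Matches an opening <head> tag (with optional attributes), but not <header>.
_HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def inject_head_script(html: str, snippet: str = SW_CHECK_SCRIPT) -> str:
    """Insert `snippet` right after the opening <head> tag, if there is one."""
    match = _HEAD_OPEN.search(html)
    if match is None:
        # No <head> element: prepend the snippet so it still runs first.
        return snippet + html
    return html[: match.end()] + snippet + html[match.end():]
```

Placing the snippet first in HEAD lets the check for the ServiceWorker run before any other script of the page.<br />
<br />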
== Source code ==<br />
* [https://github.com/webrecorder/browsertrix-crawler Browsertrix crawler], the Web crawler which gathers everything in a [https://en.wikipedia.org/wiki/Web_ARChive WARC] file<br />
** [https://github.com/webrecorder/wombat Wombat], a standalone client-side URL rewriting system<br />
* [https://github.com/openzim/warc2zim Warc2zim], a command line tool transforming a WARC file to a ZIM file<br />
* [https://github.com/webrecorder/wabac.js Wabac.js], the ServiceWorker-based reader for the content<br />
* [https://github.com/openzim/zimit Zimit], the packaging within a Docker image of both Browsertrix and Warc2zim<br />
* [https://github.com/openzim/zimit-frontend Zimit frontend], the Web UI used for the [https://youzim.it Zimit SaaS solution youzim.it]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Zimit&diff=31279Zimit2023-05-22T08:44:00Z<p>Kelson: /* Player */</p>
<hr />
<div>'''Zimit''' is a tool for creating a ZIM file from "any" website.<br />
<br />
== Context ==<br />
<br />
openZIM provides many scrapers dedicated to specific sources of content, such as TED, Wikipedia (MediaWiki), Project Gutenberg, etc. This is a great way to produce quality ZIM files, but developing and maintaining each scraper is costly.<br />
<br />
Zimit is our approach to scraping a "random" website and getting an acceptable snapshot that can be used offline.<br />
<br />
One important point is that specific pieces of embedded JavaScript code, in particular those needed to play videos, continue to work.<br />
<br />
== Principle ==<br />
<br />
The principles of Zimit are:<br />
* Crawl the remote website to retrieve all the necessary content<br />
* Save all the retrieved content in WARC file(s)<br />
* Convert the WARC file(s) to one ZIM file (this implies embedding a reader in the ZIM file, so the result is a kind of offline Web app)<br />
* Read the ZIM file in any Kiwix reader<br />
<br />
== Player ==<br />
<br />
* The ServiceWorker (SW) is installed from the welcome page. If any other page is loaded while the SW is not yet installed, the browser is redirected to the welcome page to install the SW and then automatically returns to the original page. To achieve this, the HEAD node of each page is modified during the WARC-to-ZIM conversion to insert the appropriate piece of JavaScript.<br />
<br />
== Source code ==<br />
* [https://github.com/webrecorder/browsertrix-crawler Browsertrix crawler], the Web crawler which gathers everything in a [https://en.wikipedia.org/wiki/Web_ARChive WARC] file<br />
** [https://github.com/webrecorder/wombat Wombat], a standalone client-side URL rewriting system<br />
* [https://github.com/openzim/warc2zim Warc2zim], a command line tool transforming a WARC file to a ZIM file<br />
* [https://github.com/webrecorder/wabac.js Wabac.js], the ServiceWorker-based reader for the content<br />
* [https://github.com/openzim/zimit Zimit], the packaging within a Docker image of both Browsertrix and Warc2zim<br />
* [https://github.com/openzim/zimit-frontend Zimit frontend], the Web UI used for the [https://youzim.it Zimit SaaS solution youzim.it]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Zimit&diff=31278Zimit2023-05-22T08:42:18Z<p>Kelson: /* URL rewriting */</p>
<hr />
<div>'''Zimit''' is a tool for creating a ZIM file from "any" website.<br />
<br />
== Context ==<br />
<br />
openZIM provides many scrapers dedicated to specific sources of content, such as TED, Wikipedia (MediaWiki), Project Gutenberg, etc. This is a great way to produce quality ZIM files, but developing and maintaining each scraper is costly.<br />
<br />
Zimit is our approach to scraping a "random" website and getting an acceptable snapshot that can be used offline.<br />
<br />
One important point is that specific pieces of embedded JavaScript code, in particular those needed to play videos, continue to work.<br />
<br />
== Principle ==<br />
<br />
The principles of Zimit are:<br />
* Crawl the remote website to retrieve all the necessary content<br />
* Save all the retrieved content in WARC file(s)<br />
* Convert the WARC file(s) to one ZIM file (this implies embedding a reader in the ZIM file, so the result is a kind of offline Web app)<br />
* Read the ZIM file in any Kiwix reader<br />
<br />
== Player ==<br />
<br />
* The ServiceWorker (SW) is installed from the welcome page. If any other page is loaded while the SW is not yet installed, the browser is redirected to the welcome page to install the SW and then automatically returns to the original page.<br />
<br />
== Source code ==<br />
* [https://github.com/webrecorder/browsertrix-crawler Browsertrix crawler], the Web crawler which gathers everything in a [https://en.wikipedia.org/wiki/Web_ARChive WARC] file<br />
** [https://github.com/webrecorder/wombat Wombat], a standalone client-side URL rewriting system<br />
* [https://github.com/openzim/warc2zim Warc2zim], a command line tool transforming a WARC file to a ZIM file<br />
* [https://github.com/webrecorder/wabac.js Wabac.js], the ServiceWorker-based reader for the content<br />
* [https://github.com/openzim/zimit Zimit], the packaging within a Docker image of both Browsertrix and Warc2zim<br />
* [https://github.com/openzim/zimit-frontend Zimit frontend], the Web UI used for the [https://youzim.it Zimit SaaS solution youzim.it]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Zimit&diff=31277Zimit2023-05-22T08:37:46Z<p>Kelson: /* Source code */</p>
<hr />
<div>'''Zimit''' is a tool for creating a ZIM file from "any" website.<br />
<br />
== Context ==<br />
<br />
openZIM provides many scrapers dedicated to specific sources of content, such as TED, Wikipedia (MediaWiki), Project Gutenberg, etc. This is a great way to produce quality ZIM files, but developing and maintaining each scraper is costly.<br />
<br />
Zimit is our approach to scraping a "random" website and getting an acceptable snapshot that can be used offline.<br />
<br />
One important point is that specific pieces of embedded JavaScript code, in particular those needed to play videos, continue to work.<br />
<br />
== Principle ==<br />
<br />
The principles of Zimit are:<br />
* Crawl the remote website to retrieve all the necessary content<br />
* Save all the retrieved content in WARC file(s)<br />
* Convert the WARC file(s) to one ZIM file (this implies embedding a reader in the ZIM file, so the result is a kind of offline Web app)<br />
* Read the ZIM file in any Kiwix reader<br />
<br />
== URL rewriting ==<br />
<br />
== Source code ==<br />
* [https://github.com/webrecorder/browsertrix-crawler Browsertrix crawler], the Web crawler which gathers everything in a [https://en.wikipedia.org/wiki/Web_ARChive WARC] file<br />
** [https://github.com/webrecorder/wombat Wombat], a standalone client-side URL rewriting system<br />
* [https://github.com/openzim/warc2zim Warc2zim], a command line tool transforming a WARC file to a ZIM file<br />
* [https://github.com/webrecorder/wabac.js Wabac.js], the ServiceWorker-based reader for the content<br />
* [https://github.com/openzim/zimit Zimit], the packaging within a Docker image of both Browsertrix and Warc2zim<br />
* [https://github.com/openzim/zimit-frontend Zimit frontend], the Web UI used for the [https://youzim.it Zimit SaaS solution youzim.it]</div>Kelsonhttps://wiki.openzim.org/w/index.php?title=Zimit&diff=31276Zimit2023-05-22T08:23:40Z<p>Kelson: /* Principle */</p>
<hr />
<div>'''Zimit''' is a tool for creating a ZIM file from "any" website.<br />
<br />
== Context ==<br />
<br />
openZIM provides many scrapers dedicated to specific sources of content, such as TED, Wikipedia (MediaWiki), Project Gutenberg, etc. This is a great way to produce quality ZIM files, but developing and maintaining each scraper is costly.<br />
<br />
Zimit is our approach to scraping a "random" website and getting an acceptable snapshot that can be used offline.<br />
<br />
One important point is that specific pieces of embedded JavaScript code, in particular those needed to play videos, continue to work.<br />
<br />
== Principle ==<br />
<br />
The principles of Zimit are:<br />
* Crawl the remote website to retrieve all the necessary content<br />
* Save all the retrieved content in WARC file(s)<br />
* Convert the WARC file(s) to one ZIM file (this implies embedding a reader in the ZIM file, so the result is a kind of offline Web app)<br />
* Read the ZIM file in any Kiwix reader<br />
<br />
== URL rewriting ==<br />
<br />
== Source code ==<br />
* [https://github.com/webrecorder/browsertrix-crawler Browsertrix crawler], the Web crawler which gathers everything in a [https://en.wikipedia.org/wiki/Web_ARChive WARC] file<br />
* [https://github.com/openzim/warc2zim Warc2zim], a command line tool transforming a WARC file to a ZIM file<br />
* [https://github.com/webrecorder/wabac.js Wabac.js], the ServiceWorker-based reader for the content<br />
* [https://github.com/openzim/zimit Zimit], the packaging within a Docker image of both Browsertrix and Warc2zim<br />
* [https://github.com/openzim/zimit-frontend Zimit frontend], the Web UI used for the [https://youzim.it Zimit SaaS solution youzim.it]</div>Kelson