Zipdoc

Encode/decode filter that stores zipped document formats such as docx or odt in the repository as uncompressed ZIP archives. This improves delta compression and reduces the space the repository consumes.

1. Status

This extension is not distributed with Mercurial.

Author: Andreas Gobell

Repository: https://bitbucket.org/gobell/hg-zipdoc/

2. Overview

Some document formats, such as docx and odt, are really ZIP archives that contain primarily XML documents. If these documents are version controlled with Mercurial, delta compression is not very efficient: the files are compressed binary data, and even a small change to the document can significantly change the bytes of the ZIP archive (that is, the docx or odt file). When a file is stored as an uncompressed ZIP archive, it contains the plain XML files plus some header information, so a change affects only the corresponding parts of the XML files. Deltas can therefore be computed more efficiently and delta compression improves. (Note: docx and odt files also contain small binary thumbnail images that change as well.)
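The difference between the two storage forms can be illustrated with a small Python sketch. It is not part of the extension: the member name `content.xml`, the sample payload, and the helper `build_zip` are all made up for illustration, and the standard `zipfile` module stands in for whatever application writes the archive.

```python
import io
import zipfile


def build_zip(payload: bytes, compression: int) -> bytes:
    """Return the bytes of a one-member ZIP archive (hypothetical member name)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=compression) as zf:
        zf.writestr("content.xml", payload)
    return buf.getvalue()


# Made-up XML payload, repetitive enough to compress well.
xml = b"<doc>" + b"<p>Hello, version control.</p>" * 100 + b"</doc>"

stored = build_zip(xml, zipfile.ZIP_STORED)
deflated = build_zip(xml, zipfile.ZIP_DEFLATED)

# The stored archive embeds the XML bytes verbatim, so a delta algorithm
# sees the same bytes that were edited.
xml_visible_in_stored = xml in stored

# The deflated archive holds a compressed byte stream instead of the XML.
xml_visible_in_deflated = xml in deflated
```

In the stored archive the XML appears unchanged between the ZIP headers, which is what lets a small edit produce a small delta.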

3. Configuration

Enable the extension by adding the following lines to your .hgrc:

[extensions]
hgext.zipdoc = /path/to/zipdoc.py

For every file format that is a zipped archive and should be stored uncompressed, an encode/decode pair has to be added:

[encode]
**.docx = zipdocencode
**.odt = zipdocencode

[decode]
**.docx = zipdocdecode
**.odt = zipdocdecode

4. How it works

On every write to the repository (e.g. commit), the encode filter rewrites the zipped document file without any compression. Mercurial manages this uncompressed version. At first, the repository may consume more space than it would if the compressed document were version controlled directly, but after some changes to the document the better delta compression of the uncompressed file results in noticeably less space consumption.

On every read from the repository (e.g. update, archive), the decode filter rewrites the zipped document file with compression. This way the file consumes less space in the working directory.
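The two directions can be sketched in Python with the standard `zipfile` module. This is a minimal illustration of the idea, not the extension's actual code: the function names `recompress`, `encode`, and `decode` are hypothetical, and the sketch works on in-memory bytes rather than through Mercurial's filter interface.

```python
import io
import zipfile


def recompress(data: bytes, compression: int) -> bytes:
    """Rewrite a ZIP archive held in memory using the given compression method."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src:
        with zipfile.ZipFile(out, "w", compression=compression) as dst:
            for info in src.infolist():
                # Copy each member, keeping its name, timestamp and attributes.
                new_info = zipfile.ZipInfo(info.filename, date_time=info.date_time)
                new_info.compress_type = compression
                new_info.external_attr = info.external_attr
                dst.writestr(new_info, src.read(info))
    return out.getvalue()


def encode(data: bytes) -> bytes:
    """Working directory -> repository: store every member uncompressed."""
    return recompress(data, zipfile.ZIP_STORED)


def decode(data: bytes) -> bytes:
    """Repository -> working directory: compress every member again."""
    return recompress(data, zipfile.ZIP_DEFLATED)
```

Member contents survive the round trip unchanged, but the archive bytes produced by `decode` need not match those the application originally wrote.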

5. Notes and Tips

5.1. Differing file sizes

A file read from a Mercurial repository might be smaller than the same file saved by the respective application. This is because the filter and the application use different compression levels. For example, if you commit a docx file saved by Microsoft Word and then archive the committed file with Mercurial (or do an update that replaces the file), the resulting file will probably be smaller than the one written by Word. This is not a problem. Just keep it in mind when comparing file sizes, or when a file suddenly gets larger after being saved with the application.
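The effect of the compression level on archive size can be reproduced with a short Python sketch. The levels 1 and 9, the member name `content.xml`, the payload, and the helper `zip_size` are arbitrary choices for illustration; they are not the settings used by this filter or by any particular application. The `compresslevel` parameter requires Python 3.7 or later.

```python
import io
import zipfile


def zip_size(payload: bytes, level: int) -> int:
    """Size in bytes of a one-member deflated archive written at the given level."""
    buf = io.BytesIO()
    with zipfile.ZipFile(
        buf, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=level
    ) as zf:
        zf.writestr("content.xml", payload)
    return len(buf.getvalue())


# Made-up XML-like payload.
payload = b"".join(b"<p>paragraph %d</p>" % i for i in range(2000))

size_fast = zip_size(payload, 1)  # fastest deflate level
size_best = zip_size(payload, 9)  # strongest deflate level

# Same content, two archive sizes; a higher level typically gives a smaller file.
print(len(payload), size_fast, size_best)
```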

5.2. Viewing the document's XML text in diff

When using the diff command, specifying the -a (or --text) option tells Mercurial to treat the file as text. This lets you see the changes to the XML contained in the uncompressed ZIP archive stored in the repository. Note that this does not work if git-style diff is used.

5.3. Plain zip files

This extension makes no assumptions about the specific format of the filtered zip file. Thus any file that is a valid zip archive can be processed with this filter.
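Before adding a new pattern to the encode/decode sections, you can check whether a given file is a ZIP archive at all. The following Python sketch uses the standard `zipfile.is_zipfile` function; the helper name `looks_like_zip` is hypothetical, and whether the extension itself performs such a check is not stated here.

```python
import io
import zipfile


def looks_like_zip(data: bytes) -> bool:
    """Return True if the bytes look like a ZIP archive."""
    return zipfile.is_zipfile(io.BytesIO(data))


# Build a tiny archive in memory to try the check on.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", b"hello")

is_archive = looks_like_zip(buf.getvalue())
is_plain_text = looks_like_zip(b"just some plain text")
```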



ZipdocExtension (last edited 2012-11-04 02:37:29 by mpm)