[PATCH 1 of 3] help: clarify revision / chunk behavior

Thu Mar 2 22:19:25 UTC 2017

Excerpts from Gregory Szorc's message of 2017-02-27 12:54:00 -0800:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc at gmail.com>
> # Date 1488226671 28800
> #      Mon Feb 27 12:17:51 2017 -0800
> # Node ID ded4aedfaffbabce6c083f660fc5feeeeb287f0c
> # Parent  abb92b3d370e116b29eba4d2e3154e9691c8edbb
> help: clarify revision / chunk behavior
> 
> Try to make it easier to understand the differences between the logical
> and physical model of revlog storage.
> 
> diff --git a/mercurial/help/internals/revlogs.txt b/mercurial/help/internals/revlogs.txt
> --- a/mercurial/help/internals/revlogs.txt
> +++ b/mercurial/help/internals/revlogs.txt
> @@ -2,17 +2,18 @@ Revision logs - or *revlogs* - are an ap
>  storing discrete entries, or *revisions*. They are the primary storage
>  mechanism of repository data.
>  
> +A revlog revision logically consists of 2 parts: metadata and a content

"revision" is undefined to a new person reading here. How about moving it to
the paragraph below and replacing it with "node" (or making it clear that a
"node" is a "revision")?

> +blob. Metadata includes the hash of the revision's content, sizes, and
> +links to its *parent* entries. The collective metadata is referred
> +to as the *index* and the revision content is the *data*.
> +
>  Revlogs effectively model a directed acyclic graph (DAG). Each node
>  has edges to 1 or 2 *parent* nodes. Each node contains metadata and
>  the raw value for that node.
>  
> -Revlogs consist of entries which have metadata and revision data.
> -Metadata includes the hash of the revision's content, sizes, and
> -links to its *parent* entries. The collective metadata is referred
> -to as the *index* and the revision data is the *data*.

Actually, I think the old version is good enough and in a better order:
it introduces the DAG concept first, then explains the details.

> -
> -Revision data is stored as a series of compressed deltas against previous
> -revisions.

I'd keep the above sentence - it's concise and does not hurt.

> +The revision data physically stored in a revlog entry is referred to as

"entry" vs "revision" vs "node" could confuse new people.

> +a *chunk*. A *chunk* is either the raw fulltext of a revision or a delta
> +against a previous fulltext. In both cases, a *chunk* may be compressed.

I'd say "against another revision". "previous" may imply rev-1, and "fulltext"
may imply that a delta base cannot itself be a delta.
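
To illustrate why the wording matters, here is a minimal Python sketch of
resolving a revision through a chain of deltas. It is a toy model, not
Mercurial code: the `chunks` list, the `resolve` and `apply_delta` names, and
the (start, end, replacement) delta format are all made up for this example.
The point is only that the delta base need not be rev-1 and may itself be a
delta.

```python
def apply_delta(text, delta):
    """Apply a toy delta: sorted (start, end, replacement) edits against text."""
    out = []
    pos = 0
    for start, end, replacement in delta:
        out.append(text[pos:start])  # keep the untouched span
        out.append(replacement)      # substitute the edited span
        pos = end
    out.append(text[pos:])
    return b"".join(out)


def resolve(chunks, rev):
    """Rebuild the fulltext of rev by walking base pointers back to a fulltext."""
    chain = []
    while chunks[rev]["base"] != -1:  # -1 marks a chunk that holds a fulltext
        chain.append(chunks[rev]["delta"])
        rev = chunks[rev]["base"]
    text = chunks[rev]["fulltext"]
    for delta in reversed(chain):     # apply the oldest delta first
        text = apply_delta(text, delta)
    return text


# Hypothetical revlog: rev 3 is a delta against rev 1 (not rev 2),
# and rev 1 is itself a delta against rev 0.
chunks = [
    {"base": -1, "fulltext": b"hello world\n"},
    {"base": 0, "delta": [(0, 5, b"HELLO")]},
    {"base": -1, "fulltext": b"unrelated\n"},
    {"base": 1, "delta": [(6, 11, b"there")]},
]

print(resolve(chunks, 3))
```

Here rev 3 is rebuilt from rev 0's fulltext plus the deltas stored in rev 1
and rev 3; rev 2 is never touched.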

>  Revlogs are written in an append-only fashion. We never need to rewrite
>  a file to insert nor do we need to remove data. Rolling back in-progress
> @@ -87,7 +88,7 @@ 0-3 (4 bytes) (rev 0 only)
>     Revlog header
>  
>  0-5 (6 bytes)
> -   Absolute offset of revision data from beginning of revlog.
> +   Absolute offset of revision chunk from beginning of revlog.
>  
>  6-7 (2 bytes)
>     Bit flags impacting revision behavior. The following bit offsets define:
> @@ -100,15 +101,15 @@ 6-7 (2 bytes)
>     2: REVIDX_EXTSTORED revision data is stored externally.
>  
>  8-11 (4 bytes)
> -   Compressed length of revision data / chunk as stored in revlog.
> +   Compressed length of revision chunk as stored in revlog.
>  
>  12-15 (4 bytes)
>     Uncompressed length of revision data. This is the size of the full
> -   revision data, not the size of the chunk post decompression.
> +   revision data (as opposed to the delta/chunk).
>  
>  16-19 (4 bytes)
>     Base or previous revision this revision's delta was produced against.
> -   -1 means this revision holds full text (as opposed to a delta).
> +   -1 means this chunk holds full text (as opposed to a delta).
>     For generaldelta repos, this is the previous revision in the delta
>     chain. For non-generaldelta repos, this is the base or first
>     revision in the delta chain.
> @@ -185,16 +186,16 @@ The actual layout of revlog files on dis
>  *store format*. Typically, a ``.i`` file represents the index revlog
>  (possibly containing inline data) and a ``.d`` file holds the revision data.
>  
> -Revision Entries
> -================
> +Revision Chunks
> +===============
>  
> -Revision entries consist of an optional 1 byte header followed by an
> -encoding of the revision data. The headers are as follows:
> +Chunks in revision entries consist of an optional 1 byte header followed
> +by an encoding of the chunk data. The headers are as follows:
>  
>  \0 (0x00)
> -   Revision data is the entirety of the entry, including this header.
> +   Chunk data is the entirety of the entry, including this header.
>  u (0x75)
> -   Raw revision data follows.
> +   Raw chunk data follows.
>  x (0x78)
>     zlib (RFC 1950) data.
>  

These changes look good to me.
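
For anyone following along, here is a rough Python sketch of how the fields
and chunk headers described in these hunks fit together. It only reads the
first 20 bytes of an index entry (the fields quoted above) and then decodes
the chunk by its 1 byte header. The function names are made up for this
example, the big-endian layout is my assumption (the quoted text does not
state it), and the special case for rev 0 (where the first 4 bytes hold the
revlog header) is ignored.

```python
import struct
import zlib

# Bytes 0-19 of an index entry: 6 byte offset + 2 byte flags packed in one
# 8 byte integer, then chunk length, fulltext length, and base revision.
# Assumption: big-endian, signed 4 byte integers.
ENTRY_PREFIX = struct.Struct(">Qiii")


def parse_entry_prefix(entry):
    """Split the first 20 bytes of an index entry into named fields."""
    offset_flags, chunk_len, fulltext_len, base_rev = ENTRY_PREFIX.unpack(entry[:20])
    return {
        "offset": offset_flags >> 16,     # bytes 0-5
        "flags": offset_flags & 0xFFFF,   # bytes 6-7
        "chunk_length": chunk_len,        # bytes 8-11, size as stored
        "fulltext_length": fulltext_len,  # bytes 12-15, size of full revision
        "base_rev": base_rev,             # bytes 16-19, -1 means fulltext
    }


def decode_chunk(chunk):
    """Decode a stored chunk according to its optional 1 byte header."""
    if not chunk:
        return chunk  # assumption: an empty chunk decodes to empty data
    header = chunk[0:1]
    if header == b"\0":
        return chunk              # header is part of the data
    if header == b"u":
        return chunk[1:]          # raw data follows the header
    if header == b"x":
        return zlib.decompress(chunk)  # 0x78 is the first byte of the zlib stream
    raise ValueError("unknown chunk header: %r" % header)


def read_chunk(entry, data):
    """Locate a chunk via its index entry and return (fields, decoded chunk)."""
    fields = parse_entry_prefix(entry)
    start = fields["offset"]
    raw = data[start:start + fields["chunk_length"]]
    return fields, decode_chunk(raw)


# Hypothetical data file: a 6 byte raw chunk followed by a zlib chunk.
payload = b"some revision text\n" * 4
compressed = zlib.compress(payload)
data = b"ufirst" + compressed
entry = ENTRY_PREFIX.pack((6 << 16) | 0, len(compressed), len(payload), -1)

fields, text = read_chunk(entry, data)
print(fields["base_rev"], text == payload)
```

Since `base_rev` is -1 in this example, the decoded chunk is the fulltext
itself; otherwise it would be a delta that still has to be applied to its
base.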