RFC: Shallow Cloning, Take 2

Sat May 16 02:05:02 CDT 2009

On Sat, May 16, 2009 at 8:58 AM, Peter Arrenbrecht
<peter.arrenbrecht at gmail.com> wrote:
> [deliberate top post]
>
> Folks, I've been thinking some more about these two approaches to
> shallow clones:
>
>  * punching: omitting only diff data from unwanted filelog entries
>
>  * pruning: omitting unwanted filelog entries entirely and nulling the
> parent IDs that point to them
>
> Pruning is what I am using at the moment in my prototypes for manifest
> and file logs (let's call them data logs). The changelog is now
> always cloned in full. mpm is still worried about the faked parents in
> the data logs.
>
>
> A key problem for both is how they cope with formerly absent revs
> suddenly being needed. This typically happens when a merge brings in
> filerevs from a formerly unrelated branch [1].
>
> With punching, the diff data for absent revs is not stored in the
> revlog, but all linkage is (nodeid, parentrevs, linkrev). So when a
> formerly absent rev is needed, I see two choices:
>
>  * Rewrite the revlog to add the formerly missing data. This breaks
> append-onlyness during pull.
>
>  * Append a new entry for the revision, this time including the data,
> and with a flag indicating it is a dupe. This could work, but it might
> seriously break the rev -> index offset assumptions in the lazy index.
>
> With pruning, the formerly absent rev was never stored at all. So we
> can simply append it now. However, if any present node already had a
> reference to the formerly absent node, that reference was faked to
> nullid. We cannot undo this unless we also break append-onlyness. And
> even if we did, we would suddenly end up with a parentrev > rev, a
> serious break of existing assumptions. However, I think this is a very
> minor issue, as it only affects tracing of file history (and only into
> unwanted territory), not of changeset history. And merge parent
> determination is based on the latter.
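
To make the "parentrev > rev" problem concrete, here is a minimal sketch.
It is a toy model, not the prototype's code: ToyLog, NULLREV and the
"faked" marker are names invented for illustration, and it only tracks one
parent per revision. It assumes X(b) was stored with a faked parent and
that X(a) becomes needed later.

```python
NULLREV = -1  # stand-in for a null parent; illustrative only


class ToyLog:
    """Toy append-only log: one (node, parentrev, faked) tuple per revision."""

    def __init__(self):
        self.entries = []  # only ever appended to, never rewritten
        self.revof = {}    # node -> revision number

    def add(self, node, parent=None):
        # A parent we do not hold is stored as NULLREV and marked as faked.
        faked = parent is not None and parent not in self.revof
        parentrev = self.revof.get(parent, NULLREV)
        rev = len(self.entries)
        self.entries.append((node, parentrev, faked))
        self.revof[node] = rev
        return rev


log = ToyLog()
log.add("X(b)", "X(a)")  # shallow clone: X(a) is absent, so the parent is faked
log.add("X(a)")          # a later pull needs X(a) after all, so it is appended

child_rev = log.revof["X(b)"]
real_parent_rev = log.revof["X(a)"]
# Undoing the fake would mean rewriting rev 0 to point at rev 1.
print(log.entries[child_rev], "real parent now lives at rev", real_parent_rev)
```

In this model the fake can only be repaired by rewriting an existing entry,
and the repaired entry would point forward in the log.
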
>
>
> A second issue is bandwidth and disk usage. Pruning clearly has an
> advantage here, especially in repos which have seen a bunch of
> deletions (through renames, possibly). Pruning does not create any
> .i/.d files for files that were deleted before the root of the shallow
> clone.
>
> I am assuming that we do not punch the changelog. Otherwise getting
> rid of unwanted changesets could be an advantage for punching.
> However, without special treatment, that would get rid of all the data
> for unwanted changesets. In any case, this could be done for the
> pruning approach as well, as in the changelog you cannot have formerly
> absent revs.
>
>
> Finally, how do the approaches affect client code? Pruning presents a
> fairly robust picture of what's actually present. And it gracefully
> stops client code from inadvertently straying into absent territory
> (by clearing manifests, nulling parentids and redirecting linkrevs
> that would lead into it). But client code that is aware of shallowness
> can still query the real parentids (though not the real linkrevs).

Note that linkrevs can get changed even in normal hg usage:

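  hg init main; cd main
  echo a > a
  hg ci -A -m a1          # rev 0 adds file a
  hg up null
  echo a > a
  hg ci -A -m a2          # rev 1 adds the identical file a on a new root
  hg log a                # shows changeset a1
  hg init part; cd part   # nested repo, so ".." is main
  hg pull -r 1 ..         # pull only a2
  hg log a                # shows changeset a2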
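
Why this happens can be sketched in a few lines of Python. The hash below
follows my understanding of how revlog computes node IDs (SHA-1 over the
sorted parents plus the text); build_filelog and the tuple layout are made
up for illustration and ignore everything else a real filelog does.

```python
import hashlib

NULLID = b"\0" * 20


def filenode(text, p1=NULLID, p2=NULLID):
    """Node ID: SHA-1 over the two parents in sorted order, then the text."""
    a, b = sorted([p1, p2])
    return hashlib.sha1(a + b + text).digest()


def build_filelog(changesets):
    """changesets: list of (description, file text). Returns {filenode: linkrev}."""
    filelog = {}
    for rev, (desc, text) in enumerate(changesets):
        # Both commits add the file on a fresh root, so both file parents are null.
        node = filenode(text)
        filelog.setdefault(node, rev)  # an existing entry keeps its first linkrev
    return filelog


main = [("a1", b"a\n"), ("a2", b"a\n")]
part = main[1:]  # what "hg pull -r 1" transfers: only a2

for name, repo in (("main", main), ("part", part)):
    for node, linkrev in build_filelog(repo).items():
        print(name, node.hex()[:12], "->", repo[linkrev][0])
```

Both repos hold the same single filenode, but its linkrev resolves to a1 in
main and to a2 in part.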

-parren

> Punching, as I see it, will expose all client code to exceptions due
> to missing data. This can be good and bad. If we don't punch the
> changelog, I guess there are not that many places where we have to catch
> these exceptions.
>
>
> So, pruning is better on space efficiency and I think with respect to
> the formerly absent filerev problem. Punching is simpler and maybe
> more robust, provided we can find a good solution for the formerly
> absent filerevs.
>
> For pruning, I already have a fairly robust prototype. Punching still
> feels less well-charted to me. A lot of the work in the prototype for
> negotiating changegroups probably applies equally to both.
>
> I'm still with pruning.
> -parren
>
>
> Notes:
>
> [1] Say you have the following, where all csets change a file X:
>
>  a - b
>    \
>      c
>
> You shallow-clone this at b, yielding just b. The only filerev of X
> you're interested in is X(b). Now someone merges in the main repo,
> discarding b's changes to X and keeping c's. So d's manifest points to
> X(c):
>
>  a - b - d
>    \   /
>      c
>
> When you pull this into the shallow repo, you get "b - d". But now
> X(c) is suddenly needed, while before the merge it was not.
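
The scenario in note [1] can be replayed with a small sketch. The
dictionaries encode the graph and the manifests from the note; the helper
names (reachable, has_root, needed_filerevs) are invented here, and the
"needed" rule (every X revision referenced by a changeset at or after the
shallow root) is my simplification.

```python
# History from note [1]: each changeset's manifest names one revision of X.
parents = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
manifest_x = {"a": "X(a)", "b": "X(b)", "c": "X(c)", "d": "X(c)"}  # d keeps c's X


def reachable(heads):
    """All changesets that are ancestors of (or equal to) the given heads."""
    seen, todo = set(), list(heads)
    while todo:
        cs = todo.pop()
        if cs not in seen:
            seen.add(cs)
            todo.extend(parents[cs])
    return seen


def has_root(cs, root):
    """True if cs is the shallow root or descends from it."""
    return cs == root or any(has_root(p, root) for p in parents[cs])


def needed_filerevs(root, heads):
    """Revisions of X referenced by changesets at or after the shallow root."""
    return {manifest_x[cs] for cs in reachable(heads) if has_root(cs, root)}


print(sorted(needed_filerevs("b", ["b", "c"])))  # before the merge: ['X(b)']
print(sorted(needed_filerevs("b", ["d"])))       # after pulling d: ['X(b)', 'X(c)']
```
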
>
>
> On Wed, May 13, 2009 at 10:34 PM, Peter Arrenbrecht
> <peter.arrenbrecht at gmail.com> wrote:
>> Matt, folks,
>>
>> I've been tinkering with shallow cloning again. This time around, I
>> keep the entire changelog (as you suggested, Matt), but still prune
>> the manifest and file logs.
>>
>> On the plus side:
>>
>>  + always know about full node graph
>>  + no change to incoming negotiation
>>  + chosen merge parent is consistent with full repo
>>  + get a glimpse of full activity in log
>>  + if node from shallower repo references parent not in bundle, I can
>> tell whether I would need it
>>  + corresponds to envisioned narrow clones which know entire manifest, too
>>
>> Less nice, but acceptable I think:
>>
>>  - can have absent tip after pulling -> empty repo (might want to
>> modify tip so it points to the newest present node)
>>  - irrelevant incoming
>>  - irrelevant log entries
>>  - irrelevant log entries might prevent me from pushing my own work to
>> a full clone (if the full clone does not yet have some of those
>> entries; can be worked around by only pushing selected heads)
>>  - transmits superfluous changesets as part of the changelog
>>
>> To prune, I still fake parent IDs to nullid. But the new version keeps
>> the real parent IDs around (in thefile.p alongside thefile.i/d). So it
>> never puts fake information on the wire or into bundles. I changed the
>> revlog format by introducing a per-rev bit that indicates faked
>> information (there already is room for such bits). And I added the .p
>> files. Their handling is still a bit suboptimal (need to check the fs
>> to see if they are there, for instance).
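
For readers unfamiliar with where such a per-rev bit can live: as far as I
recall the RevlogNG layout, each 64-byte index entry starts with a 64-bit
field holding a 48-bit data offset and 16 bits of flags. The sketch below
packs a flag into that field. FLAG_FAKED_PARENTS and its bit position are
hypothetical; the mail does not say which bit the prototype uses.

```python
import struct

INDEX_FORMAT = ">Qiiiiii20s12x"  # RevlogNG entry layout as I recall it, 64 bytes
FLAG_FAKED_PARENTS = 1 << 0      # hypothetical bit, chosen only for this sketch


def pack_entry(offset, flags, clen, ulen, base, linkrev, p1, p2, node):
    """Pack one index entry; offset and flags share the leading 64-bit field."""
    return struct.pack(INDEX_FORMAT, (offset << 16) | flags,
                       clen, ulen, base, linkrev, p1, p2, node)


def unpack_flags(entry):
    """Return the low 16 bits of the leading field, i.e. the per-rev flags."""
    offset_flags = struct.unpack(INDEX_FORMAT, entry)[0]
    return offset_flags & 0xFFFF


# An entry whose stored parents (-1, -1) are fakes for absent revisions.
entry = pack_entry(128, FLAG_FAKED_PARENTS, 10, 12, 3, 7, -1, -1, b"\x11" * 20)
print(len(entry), bool(unpack_flags(entry) & FLAG_FAKED_PARENTS))
```

Code that sees the bit set would then know to look up the real parent IDs
elsewhere (the .p file in the prototype).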
>>
>> I also had to change the wire/bundle format. Currently, the prototype
>> simply assumes only the new format. If the approach is deemed
>> reasonable, I shall look into making it work with older
>> servers/clients. I added two things:
>>
>>  * A header which records the shallow root(s) for which the bundle was
>> generated. This is used to initialize fresh clones of the bundle to
>> said roots. It also could be used to omit certain completeness checks
>> when unbundling. The roots also determine diff parents (see below).
>>
>>  * Every group entry has an additional char flag indicating the diff
>> parent used:
>>     . = normal chaining (as today)
>>     0 = full revision
>>     1 = parent 1
>>     2 = parent 2
>>   This is necessary because a shallow clone might not have the normal
>> chain parent present, so the server must not send diffs against it.
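
A sketch of how a sender might pick that flag for each entry. The function
name, its arguments and the order of preference (chain first, then parent
1, then parent 2, then a full revision) are my assumptions for
illustration; the prototype may well decide differently, and the special
handling of the first entry in a group is left out.

```python
def diff_parent_flag(prev_sent, p1, p2, present):
    """Pick the diff-parent flag for one group entry.

    prev_sent: node sent just before this one in the group, or None
    p1, p2:    the entry's real parents, None for a null parent
    present:   nodes the receiver already has or has just been sent
    """
    if prev_sent is not None and prev_sent in present:
        return "."  # normal chaining, as today
    if p1 is not None and p1 in present:
        return "1"  # diff against parent 1
    if p2 is not None and p2 in present:
        return "2"  # diff against parent 2
    return "0"      # no usable base: send a full revision


receiver_has = {"X(b)"}
print(diff_parent_flag("X(b)", "X(a)", None, receiver_has))    # chain base present
print(diff_parent_flag("X(c)", "X(b)", None, receiver_has))    # fall back to parent 1
print(diff_parent_flag(None, "X(a)", "X(b)", receiver_has))    # fall back to parent 2
print(diff_parent_flag(None, "X(a)", None, receiver_has))      # full revision
```
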
>>
>> The prototype is already quite promising. It still lacks
>>
>>  - support for older bundle formats
>>  - verify we got all required revs in addchangegroup (unnecessary if
>> bundle roots = own roots)
>>  - visual feedback in log about which entries are shallow
>>  - wire implementation (it handles bundles, so wire should not pose
>> hard problems)
>>  - some tuning (scan manifest diffs instead of manifests, for instance)
>>  - tests for various scenarios
>>
>> The code is here (pbranch; just clone, then do `hg diff -r default:shallow.p`):
>>
>>  http://bitbucket.org/parren/hg-partial-pbranch/overview/
>>
>> The tested and working scenarios are here:
>>
>>  http://bitbucket.org/parren/hg-partial-pbranch/src/tip/tests/test-shallow-situations
>>
>> Tests for the pruning revlog are here:
>>
>>  http://bitbucket.org/parren/hg-partial-pbranch/src/tip/tests/test-revlog-partial.py
>>
>> I can also shallow clone hg itself and some other repos of mine.
>>
>> Does this approach sound acceptable? If so, I shall start pushing it
>> towards something releasable (my target would be hg 1.4). Any help
>> would be appreciated, too.
>>
>> -parren
>>
>