RFC: collaborative branch cache between filtering level

Pierre-Yves David pierre-yves.david at ens-lyon.org
Thu Dec 27 09:10:18 CST 2012


Here is a detailed summary of the current branchcache approach and a proposal
to use filtering for a simpler, more robust and faster approach. Each
section has a TL;DR for lazy people. The initial suggestion comes from Matt.

This proposal is both a necessity for branchmap to work efficiently with
filtering and an improvement over the previous approach.

I've been running with an implementation of this since last week.

=== Current branchmap state of the art (prior to filtering) ===

TL;DR; branchmap is expensive to compute from scratch. You can cheaply update
       it in the append-only case. We have workarounds for everything that strips.

Branchmap is used to store the branch heads of a repo. It is a
`{'branch': [heads]}` mapping. It is expensive to compute from scratch. To work
around this expensive computation we do two things:

- branchmaps are cached in memory and on disk.
- We use the fact that if you have a branchmap for a subset of the repo, you
  can compute the branchmap for the full repo from it, using only the
  changesets not already included in the subset.

  eg: You have a branchmap for changesets 0:42. You add 18 changesets. You can
      use the previous cache and changesets 43:60 to compute a cache for 0:60.
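The append-only update can be sketched as follows. This is a hedged
illustration, not Mercurial's actual code: the data model (dicts with
`node`, `branch` and `parents` keys) and the function name are hypothetical.

```python
def update_branchmap(cache, new_changesets):
    """Extend a branchmap (branch name -> list of head nodes) that covers
    revisions 0:N using only the changesets added after N."""
    for cs in new_changesets:
        heads = cache.setdefault(cs["branch"], [])
        # Any parent on the same branch stops being a head...
        for parent in cs["parents"]:
            if parent in heads:
                heads.remove(parent)
        # ...and the new changeset itself becomes one.
        heads.append(cs["node"])
    return cache
```

Starting from the 0:42 cache above, feeding in changesets 43:60 yields the
0:60 cache without ever walking the first 43 changesets again.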

But this only works for subsets of the repo. If the branchmap has been computed
with changesets that are no longer in the repo, we have to throw it away.
Every function that uses strip would be impacted, but:

- MQ has a special workaround that ensures the branchmap written on disk does
  not include qbase.
- Strip and branchmap update have a special hack to reuse the branchmap after
  a strip.

Note: I just broke the strip hack with the last pushed series.

=== Additional issues related to filtering ===

TL;DR; Filtered repositories need their own cache. This cache is even more
       likely to be invalidated, and we have no workarounds installed as we
       do for strip.

We can now "filter" a repository, pretending some revisions are not there at
all. This is used to filter "hidden" (obsolete) revisions and to exclude
"unserved" (hidden + secret) revisions during exchange.

A filtered repository cannot use the same branchmap as the unfiltered version
because it has *fewer* revisions than the unfiltered version. It needs to
maintain its own cache.

But this cache is even more fragile than the unfiltered one:

- We do not have a workaround to handle revisions "removed" from the repo (as
  we have around strip).
- People may decide to move changesets to the secret phase (this is reinforced
  by the lack of a --secret flag on `hg commit`).
- People using changeset evolution will turn a lot of changesets "hidden".
- The way the cache is validated for filtered versions is too simple and can
  lead to invalidation even in the append-only case.
  (The full list of revisions filtered when generating a cache may be too big
  to be written on disk. We only write a hash of the filtered revisions. This
  hash only allows detecting that the filtering is different, not what the
  difference is. This leads to invalidation of the cache for any difference.
  This last point could be improved.)
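The hash-based validation can be illustrated like this. The function name and
key format are made up for the sketch; only the idea (write a fixed-size
digest of the filtered set, compare it exactly) comes from the text above.

```python
import hashlib

def filter_key(filtered_revs):
    """Digest the set of filtered revision numbers into a fixed-size key
    small enough to store alongside the on-disk cache."""
    h = hashlib.sha1()
    for rev in sorted(filtered_revs):
        h.update(b"%d;" % rev)
    return h.hexdigest()
```

Validation then boils down to an exact comparison of keys: any difference in
the filtered set, even a pure addition, yields a different key and throws the
cache away, which is exactly the over-invalidation described above.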

=== Proposal for a better future ===

TL;DR; filtered versions are subsets of each other. If we use the nearest
       subset's cache as a base, we get an efficient rebuild while dropping
       all the dedicated hacks.

Filtered repositories are *subsets* of the unfiltered repository. This means
that a filtered branchmap can be used to compute the unfiltered version.

And the filtered versions happen to be subsets of each other:
- "all() - unserved()" is a subset of "all() - hidden()"
- "all() - hidden()" is a subset of "all()"

This means that the "unserved" branchmap can be used as a base for the
"hidden" branchmap, which itself can be used as a base for the unfiltered
branchmap.

    unserved < hidden < None

But both the "hidden" and "unserved" sets are a bit volatile. We can add more
stable filterings like "mutable" (secret() + draft()) and even "impactable"
(everything with a revision number higher than something mutable, like MQ did
with qbase).

This gives:

    impactable < public < unserved < hidden < None

When a subset detects that its cache is invalid, it can use the nearest
subset's cache as a base to rebuild its own. This results in small and fast
partial updates.
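The lookup above can be sketched as a walk down the subset chain. The names
mirror the "impactable < public < unserved < hidden < None" ordering from the
text (None standing for the unfiltered repo); the functions themselves are
hypothetical, not the actual implementation.

```python
# Subsets in ascending order of size; each entry is a subset of the next.
SUBSET_ORDER = ["impactable", "public", "unserved", "hidden", None]

def candidate_bases(filtername):
    """Strict subsets of `filtername`, nearest (largest) first."""
    idx = SUBSET_ORDER.index(filtername)
    return SUBSET_ORDER[:idx][::-1]

def pick_base(filtername, valid_caches):
    """Return the nearest subset whose cache is still valid, if any."""
    for name in candidate_bases(filtername):
        if name in valid_caches:
            return name
    return "scratch"  # no usable base: full recomputation
```

For example, an invalid unfiltered cache would try "hidden" first, then
"unserved", and so on, so the partial update only covers the revisions that
separate the two filtering levels.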

Some of the existing workarounds can be dropped entirely (the MQ hack) and
others will need to be adapted (strip).


The implementation is dead simple. You can have a look at it here:

    http://hg-lab.logilab.org/wip/hg/rev/2fb52bac5bb9#l1.12

(Do not review the whole changeset; it still needs cleanup.)

-- 
Pierre-Yves David
