RFC: collaborative branch cache between filtering level
Pierre-Yves David
pierre-yves.david at ens-lyon.org
Thu Dec 27 09:10:18 CST 2012
Here is a detailed summary of the current branchcache approach and a proposal
to use filtering to build a simpler, more robust and faster approach. Each
section has a TL;DR for lazy people. The initial suggestion comes from Matt.
This proposal is both a necessity for branchmap to work efficiently with
filtering and an improvement over the previous approach.
I've been running an implementation of this since last week.
=== Current branchmap state of the art (prior to filtering) ===
TL;DR; branchmap is expensive to compute from scratch. You can cheaply update
it in the append-only case. We have workarounds for everything that strips.
Branchmap is used to store the branch heads of a repo. It is a mapping from
branch name to the list of that branch's heads (`{branch: [heads]}`). It is
expensive to compute from scratch. To work around this expensive computation
we do two things:
- branchmaps are cached in memory and on disk.
- We use the fact that if you have a branchmap for a subset of the repo, you
can compute the branchmap for the full repo from it, using only changesets not
already included in the subset.
eg: You have the branchmap for changesets 0:42. You add 18 changesets. You can
use the previous cache and changesets 43:60 to compute a cache for 0:60.
But this only works for subsets of the repo. If the branchmap has been computed
with changesets that are not in the repo anymore, we have to throw it away.
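The append-only update described above can be sketched with a toy model (a
hypothetical illustration, not Mercurial's actual data structures): a repo is
just a list of changesets, each recording its branch and parent revisions.

```python
# Toy model: repo[rev] is a dict with 'branch' and 'parents' (parent
# revision numbers). A sketch only, not Mercurial's real API.

def update_branchmap(branchmap, cached_tip, repo):
    """Extend a branchmap valid for revs 0..cached_tip up to the new tip,
    reading only the changesets added since the cache was written."""
    for rev in range(cached_tip + 1, len(repo)):
        heads = branchmap.setdefault(repo[rev]['branch'], [])
        for parent in repo[rev]['parents']:
            if parent in heads:      # a parent that was a head no longer is
                heads.remove(parent)
        heads.append(rev)
    return branchmap
```

E.g. starting from a cache `{'default': [0]}` valid up to rev 0, adding a
child of 0 on default and a new branch rooted at 0 yields
`{'default': [1], 'feature': [2]}` without ever revisiting rev 0.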
Every function that use strip would be impacted but:
- Mq have a special work around that ensures the branchmap written on disk does
not includes qbase.
- Strip and branchmap update have special hack to reuse branchmap after a strip.
Note: I just broke the strip hack with the last pushed series.
=== Additional issues related to filtering ===
TL;DR; Filtered revisions need their own cache. This cache is even more likely
to be invalidated. We do not have workarounds installed like those for strip.
We can now "filter" a repository, pretending revisions are not there at all.
This is used to filter "hidden" (obsolete) revisions and to exclude "unserved"
(hidden + secret) revisions during exchange.
A filtered repository cannot use the same branchmap as the unfiltered version
because it has *fewer* revisions than the unfiltered version. It needs to
maintain its own cache.
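To see why one cache cannot serve both views, here is a hypothetical sketch
(a toy repo modeled as a list of `{'branch', 'parents'}` dicts, not
Mercurial's API) where hiding a revision changes which changesets are heads:

```python
def branchheads(repo, hidden=frozenset()):
    """Branch heads of the visible part of a toy repo. 'hidden' is a
    descendant-closed set of revision numbers to ignore."""
    heads = {}
    for rev, cs in enumerate(repo):
        if rev in hidden:
            continue
        lst = heads.setdefault(cs['branch'], [])
        for parent in cs['parents']:
            if parent in lst:        # parent stops being a head
                lst.remove(parent)
        lst.append(rev)
    return heads
```

Hiding the tip of a branch makes its parent a head again, so the filtered and
unfiltered views disagree on the very data the cache stores.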
But this cache is even more fragile than the unfiltered one:
- We do not have workarounds to handle revisions "removed" from the repo (as we
have around strip).
- People may decide to move changesets to the secret phase (this is reinforced
by the lack of a --secret flag on `hg commit`).
- People using changeset evolution will turn a lot of changesets "hidden".
- The way the cache is validated for filtered versions is too simple and can
lead to invalidation even in the append-only case.
(The full list of revisions filtered when generating a cache may be too big to
be written on disk. We only write a hash of the filtered revisions. This hash
only allows detecting that the filtering is different, not what the difference
is. This leads to invalidation of the cache on any difference. This last point
could be improved.)
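The hash-based validation described above could be sketched like this (a
hypothetical illustration; the names and on-disk layout are not Mercurial's):

```python
import hashlib

def filter_key(filtered_revs):
    """Collapse the whole set of filtered revisions into one short key."""
    data = ','.join(str(r) for r in sorted(filtered_revs)).encode()
    return hashlib.sha1(data).hexdigest()

def cache_valid(cached_key, filtered_revs):
    # Only equality is checkable: a single extra filtered revision
    # changes the key, so the cache is thrown away even though the old
    # branchmap could have served as a base for an incremental update.
    return cached_key == filter_key(filtered_revs)
```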
=== Proposal for a better future ===
TL;DR; Filtered versions are subsets of each other. If we use the nearest
subset's cache as a base to rebuild our own, we get an efficient rebuild while
dropping all the dedicated hacks.
Filtered repositories are *subsets* of the unfiltered repository. This means
that a filtered branchmap can be used to compute the unfiltered version.
And filtered versions happen to be subsets of each other:
- "all() - unserved()" is a subset of "all() - hidden()"
- "all() - hidden()" is a subset of "all()"
This means that the branchmap for the "unserved" filter can be used as a base
for the "hidden" branchmap, which itself can be used as a base for the
unfiltered branchmap.
unserved < hidden < None
But both the "hidden" and "unserved" sets are a bit volatile. We can add more
stable filterings like "mutable" (secret() + draft()) and even "impactable"
(everything with a revision number higher than something mutable, as MQ did
with qbase).
This gives:
impactable < public < unserved < hidden < None
When a subset detects that its cache is invalid, it can use the nearest
subset's cache as a base to rebuild its own. This results in small and fast
partial updates.
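The nearest-subset lookup could look like the following sketch (the ordering
mirrors the hierarchy above; the cache store is a hypothetical dict mapping
filter name to branchmap, with None standing for the unfiltered repo):

```python
# Filter levels from smallest subset to full repo, as in the text.
SUBSETS = ['impactable', 'public', 'unserved', 'hidden', None]

def nearest_base(level, caches):
    """Return the closest strictly-smaller subset that has a usable
    cache, or 'scratch' if the branchmap must be fully recomputed."""
    idx = SUBSETS.index(level)
    for candidate in reversed(SUBSETS[:idx]):
        if candidate in caches:
            return candidate
    return 'scratch'
```

E.g. if only the "unserved" cache is valid, both the "hidden" and the
unfiltered branchmaps are rebuilt from it, each reading only the few
revisions its own level adds.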
Some of the existing workarounds may be dropped entirely (the MQ hack) and
others will need to be adapted (strip).
The implementation is dead simple. You can have a look at it here:
http://hg-lab.logilab.org/wip/hg/rev/2fb52bac5bb9#l1.12
(Do not review the whole changeset; it needs cleanup.)
--
Pierre-Yves David