[PATCH 3 of 6] branchcache: introduce revbranchcache for caching of revision branch names

Pierre-Yves David pierre-yves.david at ens-lyon.org
Sun Dec 14 18:17:34 CST 2014



On 12/14/2014 10:34 AM, Mads Kiilerich wrote:
> # HG changeset patch
> # User Mads Kiilerich <madski at unity3d.com>
> # Date 1418581983 -3600
> #      Sun Dec 14 19:33:03 2014 +0100
> # Node ID eeeae034dc7aa8501c3e8f510a0788a77e54bc9f
> # Parent  4f99561d776947cc3911b645c485be1bc2cf0c1f
> branchcache: introduce revbranchcache for caching of revision branch names
>
> It is expensive to retrieve the branch name of a revision. Very expensive when
> creating a changectx and calling .branch() every time - slightly less when
> using changelog.branchinfo().
>
> Now, to really speed things up, we cache the results on disk. To get constant
> size records (so we can make efficient lookups based on revision number) and
> avoid storing the same branch name over and ever, we store each branch name
> once with a fixed index. For each repo revision, we store the node hash and the
> index of the branch name. The whole cache file is rewritten atomically every
> time it is updated. To make the system 100% stable against repository
> mutations, we always check the node hash for each revision before using the
> cache content.
>
> This new method is still unused but promise to make some operations up 20 times
> faster once it actually is used.

I like the idea and support the feature, but I have multiple pieces of 
feedback on the implementation. Here is a summary:

1) The cache file will grow huge. We should have a more stable file 
where the revision records do not need to be rewritten to disk at every 
new branch. This means extracting the branch list into another file.

2) We should avoid reference cycles and not keep a repository instance 
in the object.

3) The binary format needs more extensive documentation.

More detailed comments inline below.

>
> diff --git a/mercurial/branchmap.py b/mercurial/branchmap.py
> --- a/mercurial/branchmap.py
> +++ b/mercurial/branchmap.py
> @@ -9,6 +9,7 @@
>   import encoding
>   import util
>   import time
> +import struct, array
>
>   def _filename(repo):
>       """name of a branchcache file for a given repo or repoview"""
> @@ -286,3 +287,110 @@
>           duration = time.time() - starttime
>           repo.ui.log('branchcache', 'updated %s branch cache in %.4f seconds\n',
>                       repo.filtername, duration)
> +
> +# Revision branch name cache
> +
> +rbcfilename = 'cache/revbranchnames'
> +rbcheaderfmt = '>LL' # file header: records start, records length
> +rbcrecfmt = '>20sH' # a record: node hash, branch name reference
> +rbcheadersize = struct.calcsize(rbcheaderfmt)
> +rbcrecsize = struct.calcsize(rbcrecfmt)
> +rbcclosebitmask = 1 << (8 * struct.calcsize('>H') - 1)
> +rbcbranchidxmask = rbcclosebitmask - 1

The class docstring has some details about the format, but it is very 
light. If you look at the bundle2 or obsmarker formats, we have a much 
more detailed explanation of what the binary format is (with binary 
types). This would also help explain some of the dance we do when 
defining the format constants.
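For illustration, a bundle2/obsmarker-style description of this format 
could look like the comment below (the byte layout is taken from the 
constants in the patch; the wording is my sketch):

```python
import struct

# Sketch of a more explicit format description for the constants above:
#
#   header: 8 bytes
#     uint32 (big endian): byte offset where the records start
#     uint32 (big endian): total length of the record section
#   names section: from byte 8 up to the records offset
#     \0-separated UTF-8 branch names; a name's index is its position
#   records: 22 bytes each, one per revision number
#     20 bytes: sha1 node hash of that revision
#     uint16 (big endian): high bit = branch-closed flag,
#                          low 15 bits = index into the names section
rbcheaderfmt = '>LL'
rbcrecfmt = '>20sH'
rbcheadersize = struct.calcsize(rbcheaderfmt)   # 8
rbcrecsize = struct.calcsize(rbcrecfmt)         # 22
rbcclosebitmask = 1 << 15                       # 0x8000
rbcbranchidxmask = rbcclosebitmask - 1          # 0x7fff
```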

> +
> +class revbranchcache(object):
> +    """Persistent cache mapping from revision number to branch name.
> +    Consistency is guaranteed by verifying the node hash before use.
> +
> +    The file format is binary, everything in one file to ensure consistency:
> +    First a header (rbcheaderfmt) with start and length of the records.
> +    After header and until records starts: UTF-8 branch names separated by 0.
> +    Branch names are added to the list on demand, with indices fixed once added.
> +    Records run until end of file and is records (rbcrecfmt) with the hash of
> +    the corresponding node, the index of the corresponding branch name, and a
> +    flag for closed.
> +    """

Having a single file is tempting, but it makes the cache too unstable to 
be used on very large repositories (we are not willing to write multiple 
MB to disk at every change). So we need a way to extract the branch list 
into another file.

I can think of one way to do it:

- Branch names are stored (as a list) in an extra cache file.

- The main file (with the revision records) contains a hash to make 
sure the branch names file is valid.

- This hash is the result of hashing all branch names in sequence.

- Each entry in the branch names list stores both the name and the 
running hash up to that index, so the main file can tell whether its 
hash differs because it has seen fewer names than the branch list, or 
because the two files genuinely mismatch.

With the setup described above, you get consistent separate files. The 
main file can be updated in place, and the branch names list is 
append-only.
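A minimal sketch of the hash chain idea in plain Python (the helper name 
is hypothetical; nothing like it exists in the patch):

```python
import hashlib

def chainedhashes(names):
    """Hash of all branch names in sequence, one entry per prefix.

    Storing hashes[i] next to names[i] lets the record file distinguish
    a names file that merely grew (its stored hash still appears earlier
    in the chain) from one that genuinely mismatches (its hash appears
    nowhere in the chain).
    """
    h = hashlib.sha1()
    out = []
    for name in names:
        h.update(name.encode('utf-8'))
        out.append(h.hexdigest())
    return out
```

The main record file would then store only the last chain hash it knew 
about, while the append-only names file stores the full chain.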

> +    def __init__(self, repo):
> +        self._repo = repo

We should avoid reference cycles. We only use a limited amount of the 
repo content:

- The vfs (which could be permanently cached).
- The changelog (we could restrict ourselves to storing that, plus some 
logic to update it if needed).

Also, not using the repo directly would help avoid repoview hazards.
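Something along these lines (a sketch of the narrower constructor, not 
the patch's actual code):

```python
class revbranchcache(object):
    """Sketch: keep only the pieces of the repo we actually need,
    instead of a reference to the whole (filtered) repository."""

    def __init__(self, vfs, changelog):
        self._vfs = vfs              # used to read/write the cache file
        self._changelog = changelog  # used for node() and branchinfo()
        self._dirty = False
```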

> +        self._dirty = False
> +        reporecslen = len(self._repo) * rbcrecsize

Not sure why we compute the repo record length here.


> +
> +        try:
> +            data = self._repo.vfs.open(rbcfilename).read()
> +        except (IOError, OSError, util.Abort):
> +            data = ''
> +
> +        self._namesutf8 = [] # utf8 branch names referenced from the records
> +        self._records = array.array('c') # bytes with structs of type rbcrecfmt
> +        if len(data) >= rbcheadersize:
> +            # header
> +            recsstart, recslen = struct.unpack_from(rbcheaderfmt, data)
> +            if len(data) == recsstart + recslen:
> +                # between header and records: \0 separated branch names
> +                if recsstart != rbcheadersize:
> +                    self._namesutf8 = \
> +                        data[rbcheadersize:recsstart].split('\0')
> +                # read records, cap at repo size
> +                self._records.fromstring(
> +                    buffer(data, recsstart, min(recslen, reporecslen)))
> +
> +        # pad to repo size
> +        if len(self._records) < reporecslen:
> +            self._records.extend(
> +                '\xff' * (reporecslen - len(self._records)))

- That '\xff' is a bit obscure. Uninitialized data is all ones?
- Not sure why we update the length here, as the getter function will do 
it if needed anyway.
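If I read it right, the all-ones padding is chosen so that an unwritten 
record can never pass the node check; a quick standalone illustration 
(a comment in the code saying as much would help):

```python
import struct

rbcrecfmt = '>20sH'
padding = b'\xff' * struct.calcsize(rbcrecfmt)  # one uninitialized record
node, branchidx = struct.unpack(rbcrecfmt, padding)
# node is 20 bytes of \xff, which is not the hash of any real revision,
# so the `cachenode == node` fast path fails and the code falls back to
# reading the branch name from the changelog.
```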

> +
> +        self._branchnamesindex = dict((b, r)
> +                                      for r, b in enumerate(self._namesutf8))
> +
> +    def branchinfoutf8(self, rev):
> +        """Return branch name and close flag for rev, using and updating
> +        persistent cache."""
> +        node = self._repo.changelog.node(rev)
> +        recordindex = rev * rbcrecsize
> +        if len(self._records) <= recordindex:
> +            self._records.extend(
> +                '\xff' * (len(self._repo) * rbcrecsize - len(self._records)))
> +
> +        # fast path: extract data from cache, use it if node is matching
> +        cachenode, branchidx = struct.unpack_from(rbcrecfmt, self._records,
> +                                                  recordindex)
> +        close = branchidx & rbcclosebitmask
> +        branchidx &= rbcbranchidxmask
> +        if cachenode == node and branchidx < len(self._namesutf8):
> +            butf8 = self._namesutf8[branchidx]
> +            return butf8, close
> +
> +        # slow path: retrieve from changelog and store in cache
> +        butf8, close = self._repo.changelog.branchinfoutf8(rev)
> +        # reuse branch name entry to add a new one
> +        if butf8 in self._branchnamesindex:
> +            branchidx = self._branchnamesindex[butf8]
> +        else:
> +            branchidx = len(self._namesutf8)
> +            if branchidx >= rbcbranchidxmask:
> +                # fall-back to no caching if cache full
> +                return butf8, close
> +            self._namesutf8.append(butf8)
> +            self._branchnamesindex[butf8] = branchidx
> +        if close:
> +            branchidx |= rbcclosebitmask
> +        struct.pack_into(rbcrecfmt, self._records, recordindex,
> +                         node, branchidx)
> +        self._dirty = True
> +        return butf8, close
> +
> +    def save(self):
> +        """Save branch cache if it is dirty."""
> +        if self._dirty:
> +            try:
> +                f = self._repo.vfs.open(rbcfilename, 'w', atomictemp=True)
> +                s = '\0'.join(self._namesutf8)
> +                f.write(struct.pack(rbcheaderfmt, rbcheadersize + len(s),
> +                                    len(self._records)))
> +                f.write(s)
> +                f.write(self._records)
> +                f.close()
> +            except (IOError, OSError, util.Abort):
> +                pass
> +            self._dirty = False

We should try locking the repo before doing this. This will prevent 
writers from overwriting each other for bad reasons. It will also be 
necessary for updating the file in place.
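For instance, something like the generic sketch below. `repo.wlock` 
would be the real lock function to pass in; the cache interface here is 
hypothetical:

```python
def savelocked(cache, lockfn):
    """Sketch: take the write lock around saving the cache.

    `lockfn` stands in for repo.wlock; the returned lock object is
    released in a finally block so a failed write cannot leak the lock.
    """
    if not cache.dirty:
        return
    lock = lockfn()
    try:
        cache.write()
        cache.dirty = False
    finally:
        lock.release()
```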

-- 
Pierre-Yves David

