[PATCH 1 of 2 v4] localrepo: persistent caching of branch names

Fri Oct 17 00:11:29 CDT 2014

On 10/16/2014 07:45 AM, Mads Kiilerich wrote:
> On 10/16/2014 09:27 AM, Pierre-Yves David wrote:
>> On 10/15/2014 06:48 PM, Mads Kiilerich wrote:
>>> # HG changeset patch
>>> # User Mads Kiilerich <madski at unity3d.com>
>>> # Date 1413424006 -7200
>>> #      Thu Oct 16 03:46:46 2014 +0200
>>> # Node ID d87009e720063a8d6d80afdbe6bb9323e2849030
>>> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
>>> localrepo: persistent caching of branch names
>>>
>>> It is expensive to retrieve the branch name. Very expensive when
>>> creating a
>>> changectx and calling .branch() - slightly less when using
>>> changelog.branchinfo().
>>>
>>> Now, to really speed things up, cache the results on disk. To get
>>> efficient
>>> lookup for revisions (constant size records) and avoid storing the
>>> same branch
>>> name over and ever, store the name of each branch once with a fixed
>>> ordering.
>>> For each repo revision, store the node hash and the index of the
>>> branch name.
>>> To make it 100% stable against repository mutations, always check the
>>> node hash
>>> before using the cache content.
>>>
>>> The code for this is kind of similar to the branchmap handling and is
>>> placed in
>>> the same module even though the name is not completely spot on.
>>>
>>> This new method promise to make some operations up 20 times faster
>>> once it
>>> actually is used.
>>>
>>> A simpler approach that didn't store and validate node hashes for every
>>> revision was significantly faster (x2) but could be tricked when
>>> modifying
>>> history. The usual worst case would be that the whole cache was
>>> invalidated
>>> when the repository history was modified, but when trying very hard
>>> it could be
>>> tricked into not noticing changes.
>>>
>>> diff --git a/mercurial/branchmap.py b/mercurial/branchmap.py
>>> --- a/mercurial/branchmap.py
>>> +++ b/mercurial/branchmap.py
>>> @@ -9,6 +9,7 @@ from node import bin, hex, nullid, nullr
>>>   import encoding
>>>   import util
>>>   import time
>>> +import struct, array
>>>
>>>   def _filename(repo):
>>>       """name of a branchcache file for a given repo or repoview"""
>>> @@ -285,3 +286,97 @@ class branchcache(dict):
>>>           duration = time.time() - starttime
>>>           repo.ui.log('branchcache', 'updated %s branch cache in %.4f
>>> seconds\n',
>>>                       repo.filtername, duration)
>>> +
>>> +class revbranchcache(object):
>>> +    """Persistent cache mapping from revision number to branch name.
>>> +    Consistency is guaranteed by verifying the node hash."""
>>> +
>>> +    filename = 'cache/branchnames'
>>> +    formatversion = 2345164374
>>> +    headerfmt = '>LLL' # file header: version, records start,
>>> records length
>>> +    recfmt = '>20sH' # a record: node hash, branch name reference
>>> +    headersize = struct.calcsize(headerfmt)
>>> +    recsize = struct.calcsize(recfmt)
>>
>> We could use a textual description of the format I've trouble
>> identifying how branch name are first introduced.
>>
>> Your formatversion value looks temporary
>
> I can rename it to magic. That makes it more resilient than starting
> with number 0 as many other files do. (Or we could just leave it out. It
> is just a cache and errors will be detected and we would probably just
> rebuild the cache after format change anyway.)

Having this kind of version number to fail fast and gracefully is useful.

Please start it to 0 or 1 (or something sensible that helps tracking 
error down.

>>> +
>>> +    def __init__(self, repo):
>>> +        self._repo = repo
>>> +        self._loaded = False
>>> +        self._dirty = False
>>> +        self._names = [] # branch names referenced from recfmt records
>>> +        self._records = array.array('c') # bytes with structs of
>>> type recfmt
>>> +
>>> +    def _load(self):
>>> +        """Load cached branch names."""
>>> +        try:
>>> +            data = self._repo.vfs.open(self.filename).read()
>>> +        except IOError:
>>> +            data = ''
>>
>> missing the usual.
>>
>>   if err.errno != errno.ENOENT:
>>       raise
>
> No. This is just a cache. I really don't care why the read failed.
> Problems with this file should fall back to normal operation, never
> block anything.

Why don't you just except Exception then?

I think restricting error catching is valuable and we should use the 
same approach than for any other cache.

>> + self._dirty = True
>>
>> Not sure why we default to dirty as we just read the data from disk?
>
> It will be set to False after the file successfully has been parsed.

Empty cache still seems to be something clean to me. I do not get what 
the value of initializing to False here.

>>> +        reporecslen = len(self._repo) * self.recsize
>>> +        if len(data) >= self.headersize:
>>> +            # header
>>> +            v, recsstart, recslen =
>>> struct.unpack_from(self.headerfmt, data)
>>> +            if v == self.formatversion and len(data) == recsstart +
>>> recslen:
>>> +                # between header and records: \0 separated branch names
>>> +                if recsstart != self.headersize:
>>> +                    self._names = \
>>> + data[self.headersize:recsstart].split('\0')
>>
>> Please use an intermediate variable instead of ugly \
>> continuation.
>
> Please don't say that. There is nothing in the Mercurial coding style
> about that. It says mostly pep8, and pep8 explicitly allows use of \
> when there is no obvious way to place () around it. If you want to
> enforce your own ugly style, please make Matt put it in the coding
> style. My attention span is not limited to one textual line and I
> strongly dislike the use of ugly intermediate redundant variables.

Given than there is only 28 such usages of "\" in the whole Mercurial 
code base and that I've already been asked to not use them by other 
reviewers (when I was young and foolish) I stick to my request to remove it.

>> Now I know how you handle branch name (but should stil be documented.
>> I see this as a minor issue that the cache has to be wiped for every
>> new branch.
>
> No. What makes you think that? It is just appended to the list and no
> indices will change.

Right, as I understood later in this patches, the whole content is 
written to disc everytime something change. So there is no data position 
to preserve.

However, This is going to be quite costly for big repo. I'm concerned 
about this. 1 millions revs means a 20MB caches. That is not small thing 
to flush to disk every so often.

>
>>
>>> +                # read records, cap at repo size
>>> +                self._records.fromstring(
>>> +                    buffer(data, recsstart, min(recslen, reporecslen)))
>>
>> wrong identation, should be aligned with the opening parents. You
>> could use and intermediate variable.
>
> That kind of indentation is explicitly allowed in pep8. Please don't
> call it wrong unless you change the rules so it become wrong.

The mercurial code base is mostly consistent in aligning line warping to 
the opening bracket. I guess this comes from Matt using emacs. This has 
been consistently enforced by multiple reviewers. Please stay consistent 
with the rest of the codebase.

>>> +                # only dirty if too many records (after strip)
>>> +                self._dirty = recslen > reporecslen
>>> +            else:
>>> +                self._repo.ui.debug('branch cache file was invalid\n')
>>
>> Consider reversing the logic and putting the short branch of the
>> if/else at start. This having having to refetch the context N line
>> higher when meeting this else.
>
> Considered ... but not liked. As it is now, the condition has no
> redundant negation, it focus on what is important, and the handling of
> the important case comes first.
>
> But the debug statement is arguably not important and I can remove it.

The debug statement is valuable. Keep it around.

>>> +        # pad to repo size
>>> +        if len(self._records) < reporecslen:
>>> +            self._records.extend(
>>> +                '\xff' * (reporecslen - len(self._records)))
>>> +
>>> +        self._branchnamesindex = dict((b, r)
>>> +                                      for r, b in
>>> enumerate(self._names))
>>> +        self._node = self._repo.changelog.node
>>> +        self._branchinfo = self._repo.changelog.branchinfo
>>> +        self._loaded = True
>>> +
>>> +    def branch(self, rev):
>>> +        """Return branch name of rev, using and updating persistent
>>> cache."""
>>> +        if not self._loaded:
>>> +            self._load()
>>> +
>>> +        node = self._node(rev)
>>> +        cachenode, branchidx = struct.unpack_from(self.recfmt,
>>> self._records,
>>> +                                                  rev * self.recsize)
>>
>> The runtime packing and de-packing make me shivers a bit but it seems
>> fairly reasonable.
>
> Unpacking everything up front and storing updates in a different format
> would make me shiver. This way of doing it is quite efficient in time
> and memory. Even more so if Python had mutable arrays of structs.

My statement above meant: I do not like it much, but it seems a valid 
and efficient option. We can keep it for now.

>
>>
>>> +        if cachenode == node and branchidx < len(self._names):
>>> +            return self._names[branchidx]
>>> +        b, _close = self._branchinfo(rev)
>>> +        if b in self._branchnamesindex:
>>> +            branchidx = self._branchnamesindex[b]
>>> +        else:
>>> +            branchidx = len(self._names)
>>> +            self._names.append(b)
>>> +            self._branchnamesindex[b] = branchidx
>>> +        struct.pack_into(self.recfmt, self._records, rev *
>>> self.recsize,
>>> +                         node, branchidx)
>>> +        self._dirty = True
>>> +        return b
>>> +
>>> +    def save(self):
>>> +        """Save branch cache if it is dirty."""
>>> +        if self._dirty:
>>> +            self._repo.ui.debug('writing branch cache file\n')
>>> +            try:
>>> +                f = self._repo.vfs.open(self.filename, 'w',
>>> atomictemp=True)
>>> +                s = '\0'.join(self._names)
>>> +                f.write(struct.pack(self.headerfmt, self.formatversion,
>>> +                                    self.headersize + len(s),
>>> +                                    len(self._records)))
>>> +                f.write(s)
>>> +                f.write(self._records)
>>> +                f.close()
>>> +            except IOError:
>>
>> Same here need to propage some of them
>
> Same: it is just a cache. Failing to update it should not cause a
> failure, no matter what the reason was.

Same same!

>
>>
>>> +                pass
>>> +            self._dirty = False
>>> +
>>> diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
>>> --- a/mercurial/localrepo.py
>>> +++ b/mercurial/localrepo.py
>>> @@ -297,8 +297,11 @@ class localrepository(object):
>>>           # - bookmark changes
>>>           self.filteredrevcache = {}
>>>
>>> +        self.revbranchcache = branchmap.revbranchcache(self)
>>> +
>>>       def close(self):
>>> -        pass
>>> +        if self.revbranchcache:
>>> +            self.revbranchcache.save()
>>
>> Should we do that sooner instead?
>
> Why and when?

Hooking to something like unlocking of the repo (if a lock exist) would 
let this be used server side (also read commands server) and limit the 
risk of race where old data rewrite newer one.

The fact it is used only by revset now makes it less important but this 
should probably have the same write cycle than the branchmap.

-- 
Pierre-Yves David