[PATCH 1 of 2 v4] localrepo: persistent caching of branch names

Mads Kiilerich mads at kiilerich.com
Fri Oct 17 11:40:17 CDT 2014


On 10/17/2014 07:11 AM, Pierre-Yves David wrote:
> On 10/16/2014 07:45 AM, Mads Kiilerich wrote:
>> On 10/16/2014 09:27 AM, Pierre-Yves David wrote:
>>> Your formatversion value looks temporary
>>
>> I can rename it to magic. That makes it more resilient than starting
>> with number 0 as many other files do. (Or we could just leave it out. It
>> is just a cache and errors will be detected and we would probably just
>> rebuild the cache after format change anyway.)
>
> Having this kind of version number to fail fast and gracefully is useful.
>
> Please start it at 0 or 1 (or something sensible that helps track 
> errors down).

A magic number is even better at failing fast and gracefully and with 
less risk of incorrect acceptance of bad file formats.
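For illustration, a minimal sketch of magic-based validation; the 
literal magic value, the function name, and the plain-file handling are 
hypothetical, not taken from the patch:

```python
# Hypothetical magic string; a mismatch means a wrong or outdated
# format, so we fail fast and let the caller rebuild the cache.
CACHE_MAGIC = b'rbc-v1\n'

def load_cache(path):
    """Return the cache payload, or None when absent or unrecognized."""
    try:
        with open(path, 'rb') as f:
            data = f.read()
    except IOError:
        return None          # no cache: fall back to normal operation
    if not data.startswith(CACHE_MAGIC):
        return None          # bad or old format: rebuild, don't misparse
    return data[len(CACHE_MAGIC):]
```

Compared to a plain integer version field, a multi-byte magic is much 
less likely to match unrelated file content by accident.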

>
>>>> +    def _load(self):
>>>> +        """Load cached branch names."""
>>>> +        try:
>>>> +            data = self._repo.vfs.open(self.filename).read()
>>>> +        except IOError:
>>>> +            data = ''
>>>
>>> missing the usual.
>>>
>>>   if err.errno != errno.ENOENT:
>>>       raise
>>
>> No. This is just a cache. I really don't care why the read failed.
>> Problems with this file should fall back to normal operation, never
>> block anything.
>
> Why don't you just except Exception then?

Because it is just IOErrors that I don't care about. I do not want to 
enumerate the IOErrors I don't care about (EPERM would be another one 
besides ENOENT). Other kinds of errors should still be reported.
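The contrast between the two idioms can be sketched as follows 
(`readdata` stands in for `self._repo.vfs.open(self.filename).read`; 
the function names are hypothetical):

```python
import errno

def read_strict(readdata):
    """Usual Mercurial idiom: only a missing file is acceptable."""
    try:
        return readdata()
    except IOError as err:
        if err.errno != errno.ENOENT:
            raise            # EPERM etc. are real problems here
        return ''

def read_cache(readdata):
    """Cache idiom argued for above: any IOError means 'no cache'."""
    try:
        return readdata()
    except IOError:
        return ''            # ENOENT, EPERM, ...: just skip the cache
```

Note that neither variant catches non-IOError exceptions, so genuine 
bugs still propagate in both cases.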

> I think restricting error catching is valuable and we should use the 
> same approach as for any other cache.

I see it more like how Python handles .pyc files. They are used and 
(re)written when possible, otherwise silently ignored.

>
>>> + self._dirty = True
>>>
>>> Not sure why we default to dirty as we just read the data from disk?
>>
>> It will be set to False after the file successfully has been parsed.
>
> Empty cache still seems to be something clean to me. 

An empty cache will load successfully and will be clean. Only a missing 
cache is not clean, and it will be written the first time.

> I do not get what the value of initializing to False here.

The value is that we have exactly one place in the code where the cache 
has been correctly and cleanly loaded, but more than one place where 
loading can fail. Instead of marking it dirty at every failure point, I 
mark it clean at the single point where it doesn't fail.

There might be different opinions on whether a failing load should be 
marked dirty or not. We could postpone the cleanup to the first 
following write, but I prefer to mark it dirty so we get it cleaned up 
ASAP. Postponing the cleanup would make the code two lines simpler, so 
that would be fine with me too.
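A minimal sketch of that load pattern; the record format (a count 
followed by NUL-separated names) is hypothetical, and only the `_dirty` 
handling mirrors the argument above:

```python
import struct

class revbranchcache(object):
    """Dirty until a complete, successful parse proves otherwise."""
    def __init__(self, readdata):
        self._names = []
        self._dirty = True               # assume dirty by default
        try:
            data = readdata()
        except IOError:
            data = b''                   # no cache file: stays dirty
        try:
            if data:
                count, = struct.unpack('>I', data[:4])
                self._names = data[4:].split(b'\0')[:count]
                self._dirty = False      # the one place it becomes clean
        except struct.error:
            pass                         # truncated/corrupt: stays dirty
```

Every failure path leaves `_dirty` set, so a bad or missing cache file 
gets rewritten (and thus cleaned up) at the next save.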

>>> Now I know how you handle branch names (but it should still be
>>> documented). I see it as a minor issue that the cache has to be wiped
>>> for every new branch.
>>
>> No. What makes you think that? It is just appended to the list and no
>> indices will change.
>
> Right, as I understood later in this patch, the whole content is 
> written to disk every time something changes. So there are no data 
> positions to preserve.
>
> However, this is going to be quite costly for big repos. I'm concerned 
> about this. 1 million revs means a 20MB cache. That is not a small 
> thing to flush to disk every so often.

Yes. That is one reason to prefer the initial more fragile version that 
doesn't store the hashes. Right now the size will be comparable to the 
size of the changelog index.
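As a rough sanity check on the quoted 20MB figure (the per-revision 
record layout here is a guess for illustration, not the patch's actual 
format):

```python
# Hypothetical record: a 16-byte node-hash prefix plus a 4-byte
# index into the branch-name table, stored for every revision.
RECORD_SIZE = 16 + 4
revs = 1000 * 1000
cache_bytes = revs * RECORD_SIZE
print(cache_bytes)  # 20000000 bytes, about 20 MB for a million revs
```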

It will however usually only be written by the first branch filtering 
operation after a repository change. And it will only be read/written 
when doing operations that tend to request branch information for "all" 
revisions and thus would be expensive anyway. ("Real" indices mapping 
from branch names to revisions would of course be even better for a 
whole category of queries.)

Instead, we could make the file format append-only, possibly by keeping 
the branch names and indices in separate files. That would however take 
the atomicity away, and we would have to use locking and a different 
write-back scheme.
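A minimal sketch of such an append-only variant (the file names, record 
layout, and function name are all hypothetical):

```python
import struct

def append_entries(indexpath, namespath, entries, newnames):
    """Append fixed-size (rev, branchindex) records and any new branch
    names without rewriting existing data. There is no atomic
    whole-file replacement here, so readers and writers would need
    repository-level locking to avoid seeing partial appends."""
    with open(namespath, 'ab') as f:
        for name in newnames:
            f.write(name + b'\x00')    # NUL-terminated branch names
    with open(indexpath, 'ab') as f:
        for rev, branchidx in entries:
            f.write(struct.pack('>II', rev, branchidx))
```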

Different implementations can be considered later. This is just a cache.

>
>>
>>>
>>>> +                pass
>>>> +            self._dirty = False
>>>> +
>>>> diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
>>>> --- a/mercurial/localrepo.py
>>>> +++ b/mercurial/localrepo.py
>>>> @@ -297,8 +297,11 @@ class localrepository(object):
>>>>           # - bookmark changes
>>>>           self.filteredrevcache = {}
>>>>
>>>> +        self.revbranchcache = branchmap.revbranchcache(self)
>>>> +
>>>>       def close(self):
>>>> -        pass
>>>> +        if self.revbranchcache:
>>>> +            self.revbranchcache.save()
>>>
>>> Should we do that sooner instead?
>>
>> Why and when?
>
> Hooking to something like unlocking of the repo (if a lock exists) 
> would let this be used server side (also in command servers) and 
> limit the risk of a race where old data rewrites newer data.

I understand there are plans/thoughts about moving all cache writing to 
unlocking / transaction commit. This cache can go there together with 
the other caches.

> The fact it is used only by revset now makes it less important, but 
> this should probably have the same write cycle as the branchmap.

Except that branchmaps are 100% computed in one go and could be written 
immediately. This cache will only cache results that have been 
requested, and we do not know when there will be no more results.

Other implementations that keep the cache 100% updated all the time can 
be considered later. This is just a cache and the file format and 
implementation can be changed.

/Mads

