[PATCH 1 of 2 v4] localrepo: persistent caching of branch names

Fri Oct 17 17:44:49 CDT 2014

On 10/17/2014 09:40 AM, Mads Kiilerich wrote:
> On 10/17/2014 07:11 AM, Pierre-Yves David wrote:
>> On 10/16/2014 07:45 AM, Mads Kiilerich wrote:
>>> On 10/16/2014 09:27 AM, Pierre-Yves David wrote:
>>>> Your formatversion value looks temporary
>>>
>>> I can rename it to magic. That makes it more resilient than starting
>>> with number 0 as many other files do. (Or we could just leave it out. It
>>> is just a cache and errors will be detected and we would probably just
>>> rebuild the cache after format change anyway.)
>>
>> Having this kind of version number to fail fast and gracefully is useful.
>>
>> Please start it to 0 or 1 (or something sensible that helps tracking
>> error down.
>
> A magic number is even better at failing fast and gracefully and with
> less risk of incorrect acceptance of bad file formats.

It does about the same job at failing, a better job at detection cross 
format conflict and a less good job at readability.

But I guess this is worth that much fighting ☺

>>>>> +    def _load(self):
>>>>> +        """Load cached branch names."""
>>>>> +        try:
>>>>> +            data = self._repo.vfs.open(self.filename).read()
>>>>> +        except IOError:
>>>>> +            data = ''
>>>>
>>>> missing the usual.
>>>>
>>>>   if err.errno != errno.ENOENT:
>>>>       raise
>>>
>>> No. This is just a cache. I really don't care why the read failed.
>>> Problems with this file should fall back to normal operation, never
>>> block anything.
>>
>> Why don't you just except Exception then?
>
> Because it just is IOErrors I don't care about. I do not want to
> enumerate IOErrors I don't care about (but EPERM would be another).
> Other kinds of errors should still be reported.

I double checked the other cache handling code in Mercurial and they 
agree with your approach. You won this one ;-)

>>>> + self._dirty = True
>>>>
>>>> Not sure why we default to dirty as we just read the data from disk?
>>>
>>> It will be set to False after the file successfully has been parsed.
>>
>> Empty cache still seems to be something clean to me.
>
> Empty cache will load successfully and will be clean. It is only no
> cache that not is clean and will be written first time.

So if the repo is empty, we ensure we write and empty cache file? Seems 
a bit backward to me.

>> I do not get what the value of initializing to False here.
>
> The value is that we have exactly one place in the code where the cache
> has been correctly and cleanly loaded, but more than one place where it
> fails. Instead of marking it dirty when it fails I mark it as clean when
> it doesn't fail.
>
> There might be different opinions on whether a failing load should be
> marked dirty or not. We could postpone the cleanup to the first
> following write, but I prefer to mark it dirty so we get it cleaned up
> ASAP. Postponing the cleanup will make the code two lines simpler so
> that would be fine with me too.

I think I would rather see stuff being explicitly marked as dirty when 
failing. (from the explicit is better than implicit)

>>>> Now I know how you handle branch name (but should still be documented.
>>>> I see this as a minor issue that the cache has to be wiped for every
>>>> new branch.
>>>
>>> No. What makes you think that? It is just appended to the list and no
>>> indices will change.
>>
>> Right, as I understood later in this patches, the whole content is
>> written to disc everytime something change. So there is no data
>> position to preserve.
>>
>> However, This is going to be quite costly for big repo. I'm concerned
>> about this. 1 millions revs means a 20MB caches. That is not small
>> thing to flush to disk every so often.
>
> Yes. That is one reason to prefer the initial more fragile version that
> doesn't store the hashes. Right now the size will be comparable to the
> size of the changelog index.
>
> It will however usually only be written on the first branch filtering
> operations after a repository change. And it will only be read/written
> when doing operations that tend to request branch information for "all"
> revisions and thus would be expensive anyway. ("Real" indices mapping
> from branch names to revisions would of course be even better for a
> whole category of queries.)
>
> Instead, we could make the file format append only, possibly by keeping
> the branch names and indices in different files. That would however take
> the atomicity away and we would have to use locking and a different
> write-back scheme.
>
> Different implementations can be considered later. This is just a cache.

I'm still concerned about this. I dunno if we should get it in for the 
freeze and disable if too impactfull or if we should delay that for 
early 3.2 cycles.

>>>>> +                pass
>>>>> +            self._dirty = False
>>>>> +
>>>>> diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
>>>>> --- a/mercurial/localrepo.py
>>>>> +++ b/mercurial/localrepo.py
>>>>> @@ -297,8 +297,11 @@ class localrepository(object):
>>>>>           # - bookmark changes
>>>>>           self.filteredrevcache = {}
>>>>>
>>>>> +        self.revbranchcache = branchmap.revbranchcache(self)
>>>>> +
>>>>>       def close(self):
>>>>> -        pass
>>>>> +        if self.revbranchcache:
>>>>> +            self.revbranchcache.save()
>>>>
>>>> Should we do that sooner instead?
>>>
>>> Why and when?
>>
>> Hooking to something like unlocking of the repo (if a lock exist)
>> would let this be used server side (also read commands server) and
>> limit the risk of race where old data rewrite newer one.
>
> I understand there are plans/thoughts about moving all cache writing to
> unlocking / transaction commit. This cache can go there together with
> the other caches.

Writing on unlock seems a similar amount of work than writing on close. 
What about we do it too?

>
>> The fact it is used only by revset now makes it less important but
>> this should probably have the same write cycle than the branchmap.
>
> Except that branchmaps are 100% com
>
>>
>>>
>>>>puted in one go and could be written
> immediately. This cache will only cache results that has been requested
> and we do not know when there will be no more results.
>
> Other implementations that keep the cache 100% updated all the time can
> be considered later. This is just a cache and the file format and
> implementation can be changed.

If you use this for branchmaps, you will ensure it is fully updated at 
the same time as branchmap and can mostly stick to its update cycle.

-- 
Pierre-Yves David