[PATCH 2 of 7] revlog: make sure _cache only contain raw content

Wed Mar 29 08:39:29 EDT 2017

On 3/28/17 8:49 AM, Jun Wu wrote:
> # HG changeset patch
> # User Jun Wu <quark at fb.com>
> # Date 1490685282 25200
> #      Tue Mar 28 00:14:42 2017 -0700
> # Node ID 99cfe31d37df62b50e53a126f0eb31a1e352ac67
> # Parent  1e84f9bd4385a8f95ac1ec15dee14c723071ab34
> revlog: make sure _cache only contain raw content
>
> Previously, revlog._cache could cache both raw and non-raw content. That is
> wrong, because _cache will be used directly for delta application. If the
> non-raw content in _cache is used, it may corrupt delta application, or
> return wrong content when deltas applied successfully but incorrectly, and a
> flagprocessor says skip the hash check.
>
> This patch changes _cache assignments to make sure only only raw contents
> are assigned to it.

This patch overall makes sense to me. We definitely want to only be 
caching the same type of thing in a particular cache. I've reviewed 
revlog.py to find places were self._cache is assigned and it looks like 
you got both places where it's assigned to something other than null.

>
> diff --git a/mercurial/revlog.py b/mercurial/revlog.py
> --- a/mercurial/revlog.py
> +++ b/mercurial/revlog.py
> @@ -1284,11 +1284,11 @@ class revlog(object):
>           text = mdiff.patches(text, bins)
>   
> -        text, validatehash = self._processflags(text, self.flags(rev), 'read',
> -                                                raw=raw)
> +        newtext, validatehash = self._processflags(text, self.flags(rev),
> +                                                   'read', raw=raw)
>           if validatehash:
> -            self.checkhash(text, node, rev=rev)
> +            self.checkhash(newtext, node, rev=rev)
>   
>           self._cache = (node, rev, text)
> -        return text
> +        return newtext

Rather than introducing newtext, wouldn't it be sufficient to move the 
self._cache assignment line up to above the processflags call? Or do you 
want to only cache if the checkhash call succeeds?

If the later, I'd suggest renaming "text" to "rawtext" and reducing 
churn here. "newtext" doesn't really tell me about what is stored in it 
("processedtext" would but it much longer). "rawtext" is pretty short 
and descriptive though.

>   
>       def hash(self, text, p1, p2):
> @@ -1715,5 +1715,6 @@ class revlog(object):
>   
>           if type(text) == str: # only accept immutable objects
> -            self._cache = (node, curr, text)
> +            if raw or not flags:
> +                self._cache = (node, curr, text)
>           self._chainbasecache[curr] = chainbase
>           return node
>

Ugh, this _addrevision code is incredibly dense. A 151-line "helper" 
function with two sub-functions defined inside of it. What a joy!

However, I think this change should also update the comment about "only 
accept...". I'd structure it something like this:

# Only cache immutable objects objects that are raw or have no flags.
# Mutable values are dangerous to cache because they can be changed
# after they are passed to usor after they are returned from cache,
# so we would need to deep copy them on both input and output which
# could be expensive.
# Strings are the only immutable objects we expect to see, to check for them
# If raw is true, it means we are bypassing any flag processors, so we 
have cacheable data.
# If there are no flags set, that means no flag processors will be run, 
so we also have cacheable data
if type(text) == str and (raw or not flags):
     self._cache = (node, curr, text)

For my own edification, what are the types that can show up here? Do we 
accept arbitrary things that have string representations? When would we 
get something that isn't a string here? I hate the lack of types here, 
it leads to so much ambiguity about what is going on.

Any comments you can add to the code from what you have learned mucking 
around here would be appreciated.