[PATCH 2 of 7] revlog: make sure _cache only contain raw content

Jun Wu quark at fb.com
Wed Mar 29 11:13:04 EDT 2017


Excerpts from Ryan McElroy's message of 2017-03-29 13:39:29 +0100:
> On 3/28/17 8:49 AM, Jun Wu wrote:
> > # HG changeset patch
> > # User Jun Wu <quark at fb.com>
> > # Date 1490685282 25200
> > #      Tue Mar 28 00:14:42 2017 -0700
> > # Node ID 99cfe31d37df62b50e53a126f0eb31a1e352ac67
> > # Parent  1e84f9bd4385a8f95ac1ec15dee14c723071ab34
> > revlog: make sure _cache only contain raw content
> >
> > Previously, revlog._cache could cache both raw and non-raw content. That is
> > wrong, because _cache will be used directly for delta application. If the
> > non-raw content in _cache is used, it may corrupt delta application, or
> > return wrong content when deltas applied successfully but incorrectly, and a
> > flagprocessor says skip the hash check.
> >
> > This patch changes _cache assignments to make sure only only raw contents
> > are assigned to it.
> 
> This patch overall makes sense to me. We definitely want to only be 
> caching the same type of thing in a particular cache. I've reviewed 
> revlog.py to find places were self._cache is assigned and it looks like 
> you got both places where it's assigned to something other than null.
> 
> >
> > diff --git a/mercurial/revlog.py b/mercurial/revlog.py
> > --- a/mercurial/revlog.py
> > +++ b/mercurial/revlog.py
> > @@ -1284,11 +1284,11 @@ class revlog(object):
> >           text = mdiff.patches(text, bins)
> >   
> > -        text, validatehash = self._processflags(text, self.flags(rev), 'read',
> > -                                                raw=raw)
> > +        newtext, validatehash = self._processflags(text, self.flags(rev),
> > +                                                   'read', raw=raw)
> >           if validatehash:
> > -            self.checkhash(text, node, rev=rev)
> > +            self.checkhash(newtext, node, rev=rev)
> >   
> >           self._cache = (node, rev, text)
> > -        return text
> > +        return newtext
> 
> Rather than introducing newtext, wouldn't it be sufficient to move the 
> self._cache assignment line up to above the processflags call? Or do you 
> want to only cache if the checkhash call succeeds?
> 
> If the later, I'd suggest renaming "text" to "rawtext" and reducing 
> churn here. "newtext" doesn't really tell me about what is stored in it 
> ("processedtext" would but it much longer). "rawtext" is pretty short 
> and descriptive though.

Will rename to "rawtext".

> 
> >   
> >       def hash(self, text, p1, p2):
> > @@ -1715,5 +1715,6 @@ class revlog(object):
> >   
> >           if type(text) == str: # only accept immutable objects
> > -            self._cache = (node, curr, text)
> > +            if raw or not flags:
> > +                self._cache = (node, curr, text)
> >           self._chainbasecache[curr] = chainbase
> >           return node
> >
> 
> Ugh, this _addrevision code is incredibly dense. A 151-line "helper" 
> function with two sub-functions defined inside of it. What a joy!

This function is also broken in multiple other ways regarding on raw
handling. I'll send a separate patch.

> However, I think this change should also update the comment about "only 
> accept...". I'd structure it something like this:
> 
> # Only cache immutable objects objects that are raw or have no flags.
> # Mutable values are dangerous to cache because they can be changed
> # after they are passed to usor after they are returned from cache,
> # so we would need to deep copy them on both input and output which
> # could be expensive.
> # Strings are the only immutable objects we expect to see, to check for them
> # If raw is true, it means we are bypassing any flag processors, so we 
> have cacheable data.
> # If there are no flags set, that means no flag processors will be run, 
> so we also have cacheable data
> if type(text) == str and (raw or not flags):
>      self._cache = (node, curr, text)
> 
> For my own edification, what are the types that can show up here? Do we 
> accept arbitrary things that have string representations? When would we 
> get something that isn't a string here? I hate the lack of types here, 
> it leads to so much ambiguity about what is going on.
> 
> Any comments you can add to the code from what you have learned mucking 
> around here would be appreciated.

Thanks for the suggestion. This series looks relatively easy so I didn't
spend much time on adding comments. I'll do so for the "_addrevision"
change.


More information about the Mercurial-devel mailing list