[PATCH 4 of 5 flagprocessor v6] revlog: processflags()

Mon Jan 2 08:51:28 UTC 2017

On 12/24/2016 05:36 PM, Remi Chaintron wrote:
> # HG changeset patch
> # User Remi Chaintron <remi at fb.com>
> # Date 1482596881 18000
> #      Sat Dec 24 11:28:01 2016 -0500
> # Node ID c94d2907a470ca03b4a4a8da514b66d2f8952bce
> # Parent  b1d220e584e6fc861a40a5aeefa0c3b19ae09126
> revlog: processflags()
>
> Add the ability for revlog objects to process revision flags and apply
> registered transforms on read/write operations.

We are making good progress toward a final API here. But I've spotted a 
couple of correctness issue in the current implementation:

- In 'processhash' we are looping on the flags to composing their 
transform function. However, we only compose the text transformation (by 
feeding the result of the previous in the next one) but we do not 
compose the 'validatehash' result. Instead we just use the value of last 
"transform" called. Seems wrong, as soon as a flag imply a transform 
touching the text content, the hash won't match and we have to stop try 
to validate it, the order in which they are applied does not change that 
property. So we should also compose the 'validatehash' value (new = old 
and current)

- The current implementation around reverse is missing something. The 
idea around reverse is that if we apply three transforms:

       stored = x(y(z(original)))

   We need to apply the reverse to get the original text back. This mean 
running "reverse-transform" in reverse order.

       original = zâ»Â¹(yâ»Â¹(xâ»Â¹(stored)))

   However, the current code only have the order reversal. It reverse 
the order in which the function are run, but still run the same function 
in both case. Reversing one without the other does not make sense. So we 
either need to have the two functions available to process flag or we do 
not actually need to run in reverse in the actual design (as the 
transform from original to stored is to be applied by extension by an 
addrevision wrapping).

   According to the current design, the "transform" function is 
introduced by the extensionÂ¹ with a wrapping of "addrevision" (so 
outside of any core reach), and the reverse transform is stored in the 
transform mapping. The 'processflag' function only have access to the 
mapping so it is restricted to applying the transformation in a single 
direction only. So to summarize, that 'reverse' argument to 
'processflag' is inconsistent with the fact 'processflag' only have 
access to a single transform direction.

   The reason this did not explode in your local test cases seems to be 
related to the limited spot were 'reverse' is used and how this only 
triggers for case were no flag are in use or were "raw" is used.

   The fact it worked in the "raw" case, highlight that all the 
'applyraw' from '_flagtransforms' are actually not transforming anything 
and only return a validate hash function. You have another paragraph 
about "raw" lower in your description. I'll go into more details below 
that paragraph

At that point, I think we need to introduce some real testing to make 
sure the next iteration is actually providing correct result. For 
example we can add a small extensions in the tests, that adds some flags 
(and the end of the spectrum) and test they combination. Exceptionally, 
the flag should be added in "known flags" and "order" by the extensions 
to make sure nothing else than the test ever saw them used (so that we 
can reuse them for real later if needed). Here is some idea of transform 
we could use (in that order):

   FLAG1: no transform - no transformation and validatehash always True,
   FLAG2: gzip - use ungzipped version of a file for storage,
   FLAG3: base64 - store a base64 version of the file

Then we can have 9 tests cases for all the available combination.
(Of course, feel free to tests your own way if think of something better)

In addition to these issue in the implementation, I wrote various 
feedback only on smaller point. There is also an handful of nits that 
would not have blocked the changeset but as we are about to do an extra 
roundtrip we can as well fix them.

> This patch introduces the 'revlog.processflags()' method that looks at revision
> flags and applies transforms registered on them. Due to the need to handle
> non-commutative operations, flag transforms are applied in stable order but the
> order in which the transforms are applied is reversed between read and write operations.

If we end up with an implementation that keeps "reverse" (that is now 
unclear to me. I think we can improve the working. I know why we need 
the two directions, but I often needs to check which is "direct" and 
which is "reverse". What about we reword this as "action" and accept two 
values 'read' and 'write'? (or other way appropriate work: store, 
restore; code, decode; apply, â€¦)

> The 'flagtransform' namedtuple is also introduced, and is to be used by
> extensions to define the behavior of transforms via its 'apply' and 'applyraw'
> fields.
> The 'processflags()' method uses the 'raw' argument of 'revision()' and
> '_addrevision()' to determine which field to use when applying transforms.

Second wording point (yeah), "transform" does not feel right, because 
some of these function will not transform anything, What about using 
"process" to match the method name?

> The 'raw' argument usually indicates that the contents are not to be modified by
> the transform, and might not be validable for hash integrity. This is expected
> for specific workflows such as changegroup generation and debug commands that
> are interested in the actual contents of revlogs rather than the transformed
> contents after application of a transform.

So, this paragraph highlight the fact '_flagtransforms[flag][1]' is 
actually not transforming anything. Should we actually enforce that in 
the API? From what I'm seeing now:

  * write processing:
    - transform not handled by the API;
    - hashcheck must be skipped if a transformation happened, the API 
handle that.

  * read process:
    - transform handled by the API
    - result is expected to match the hash in pretty much all cases 
(censors is a bit special for now), hash checking handle by the API

  * read raw processing:
    - no transform is needed
    - hashcheck always skipped (skip is API handled)

  * write raw processing:
    - no transform is needed (data are provided transformed already),
    - hashcheck always skipped  (skip is API handled)

Did I got that right?

I so, should we simplify the API to stick to fact exposed here?
In particular, "write processing" and "write raw processing" seems to 
behave in the very same way, so we can probably merge the code path for 
write and drop "raw" to _addrevision .

I'm not saying we should change it now (we can simplify it later). But 
if that helps we could consider it. If we keep the ability to apply 
transformation for raw reads and writes, we should add a fourth FLAG to 
the test checking that.

>  diff --git a/mercurial/bundlerepo.py b/mercurial/bundlerepo.py
> --- a/mercurial/bundlerepo.py
> +++ b/mercurial/bundlerepo.py
> @@ -148,7 +148,9 @@
>              delta = self._chunk(chain.pop())
>              text = mdiff.patches(text, [delta])
>
> -        self.checkhash(text, node, rev=rev)
> +        text, validatehash = self.processflags(text, self.flags(rev), raw=raw)
> +        if validatehash is True:
> +            self.checkhash(text, node, rev=rev)

Hey, more of this "if x is True:" construct. Can you turn that in "if 
x:". We are at V6 of this series and I'm still consistently finding 
these, can you make sure  to double check the next iterations for theses?

>          self._cache = (node, rev, text)
>          return text
>
> diff --git a/mercurial/revlog.py b/mercurial/revlog.py
> --- a/mercurial/revlog.py
> +++ b/mercurial/revlog.py
> @@ -56,6 +56,10 @@
>  REVIDX_ISCENSORED = (1 << 15) # revision has censor metadata, must be verified
>  REVIDX_DEFAULT_FLAGS = 0
>  REVIDX_KNOWN_FLAGS = REVIDX_ISCENSORED
> +# stable order in which flags need to be processed and their transforms applied
> +REVIDX_FLAGS_ORDER = [
> +    REVIDX_ISCENSORED,
> +]
>
>  # max size of revlog with inline data
>  _maxinline = 131072
> @@ -65,6 +69,33 @@
>  LookupError = error.LookupError
>  CensoredNodeError = error.CensoredNodeError
>
> +# 'flagtransform' namedtuple type for extensions to use to register transforms
> +# on revlog flags.
> +# - The apply transform will be used as a default behavior.
> +# - The applyraw transform will be used for changegroup generation and debug
> +# methods.
> +flagtransform = collections.namedtuple('flagtransform', ['apply', 'applyraw'])
> +# Store flag transforms (cf. 'addflagtransform()' to register)
> +_flagtransforms = { }

(note: this would be affected by the rewording proposal made earlier)

Using name tuple here seems overkill, especially since we do not 
actually use the tuple names anywhere. Using a regular tuple seems fine 
given how restricted and small the API code is.

> +def addflagtransform(flag, transform):
> +    """Register transforms on revision data flags.
> +
> +    Invariant:
> +    - Flags need to be defined in REVIDX_KNOWN_FLAGS and REVIDX_FLAGS_ORDER.
> +    - Only one transform can be registered on a specific flag.
> +    - the transform must be of type 'flagtransform'
> +    """
> +    if not flag & REVIDX_KNOWN_FLAGS:
> +        raise error.Abort(_("cannot register transform on unknown flag."))

Thanks to Jun, we now have a "ProgrammingError" class that seems 
appropriate here. In addition, it is good to make sure error contains 
enough data for the "user" to be able to fix it. In this case, can you 
include the flag in the error message.

> +    if not flag in REVIDX_FLAGS_ORDER:
> +        raise error.Abort(_("flag need to be defined in REVIDX_FLAGS_ORDER."))

ditto (ProgrammingError + info about the flag)

> +    if _flagtransforms.get(flag, None) is not None:

Small nits: This looks like a simple membership test. You could directly do:

   if flag in _flagtransforms:

> +        raise error.Abort(_("cannot register multiple transforms on flag."))

ditto about including information about the problematic flag. however, 
unlike the others this one is probably better as an Abort because users 
have more chance to be exposed to them/

> +    if not isinstance(transform, flagtransform):
> +        raise error.Abort(_("invalid transform type."))

Meh, this is Python, manual type check are extremely rarely. Here we are 
just checking for a classical type error and we should just let the code 
crash as we do for every other things. So please drop the 'instance(â€¦)' 
branch.

> +    _flagtransforms[flag] = transform
> +
>  def getoffset(q):
>      return int(q >> 16)
>
> @@ -1248,7 +1279,11 @@
>              bins = bins[1:]
>
>          text = mdiff.patches(text, bins)
> -        self.checkhash(text, node, rev=rev)
> +
> +        text, validatehash = self.processflags(text, self.flags(rev), raw=raw)
> +        if validatehash:
> +            self.checkhash(text, node, rev=rev)
> +
>          self._cache = (node, rev, text)
>          return text
>
> @@ -1260,6 +1295,56 @@
>          """
>          return hash(text, p1, p2)
>
> +    def processflags(self, text, flags, raw=False, reverse=False):

That method is for revlog's internal usage only, should be '_processflags'.

> +        """Process revision flags and apply registered transforms.
> +
> +        ``text`` - the revision data to process
> +        ``flags`` - the revision flags
> +        ``raw`` - an optional argument describing if the raw transform should be
> +        applied.
> +        ``reverse`` - an optional argument describing whether the flags should
> +        be processed in order or reverse order.

(note: if we reword "reverse", that will be affected)

> +        This method processes the flags in the order (or reverse order if
> +        ``reverse`` is False) defined by REVIDX_FLAGS_ORDER, applying the
> +        transforms registered for present flags. The order of flags defined in
> +        REVIDX_FLAGS_ORDER needs to be stable for non-commutative
> +        operations.
> +
> +        If a ``flagtransform`` namedtuple has been registered on a flag, the
> +        ``raw`` argument is used to determine which transform apply. If ``raw``
> +        is True, the ``applyraw`` field is used, ``apply`` if not (default).
> +        If the ``raw`` argument is set, it indicates we are currently processing
> +        flags in the context of changegroup generation or debug commands.
> +
> +        Returns a 2-tuple of ``(text, validatehash)`` where ``text`` is the
> +        processed text and ``validatehash`` is a bool indicating whether the
> +        returned text should be checked for hash integrity.
> +        """
> +        # Check all flags are known.
> +        if flags & ~REVIDX_KNOWN_FLAGS:
> +            raise RevlogError(_('incompatible revision flag %x') %
> +                              (flags & ~REVIDX_KNOWN_FLAGS))
> +        validatehash = True
> +        # Depending on the operation (read or write), the order might be
> +        # reversed due to non-commutative transforms.
> +        orderedflags = REVIDX_FLAGS_ORDER
> +        if reverse:
> +            orderedflags = reversed(orderedflags)
> +
> +        for flag in orderedflags:
> +            # If a transform has been registered for a known flag, apply and
> +            # update result tuple.
> +            if flag & flags:
> +                transform = _flagtransforms.get(flag, None)
> +                if transform is not None:
> +                    apply, applyraw = transform
> +                    f = apply
> +                    if raw:
> +                        f = applyraw
 > +                    text, validatehash = f(self, text)
 > +        return text, validatehash

Here is the code were the "validatehash" composition is missing. This 
should be:

   text, vhash = f(self, text)
   validatehash = validatehash and vhash

In addition, if we find out we actually needs a read/write switch, this 
is the spot were we'll need to carry the read/write information.

Small nits:
Given that we only use one value, we could directly use:

     apply = transform[0]
     if raw:
         apply = transform[1]

> +
>      def checkhash(self, text, node, p1=None, p2=None, rev=None):
>          """Check node hash integrity.
>
> @@ -1447,7 +1532,11 @@
>                  btext[0] = mdiff.patch(basetext, delta)
>
>              try:
> -                self.checkhash(btext[0], node, p1=p1, p2=p2)
> +                btext[0], validatehash = self.processflags(btext[0], flags,
> +                                                           raw=raw,
> +                                                           reverse=True)

Here the part I was referring earlier about "reverse" usage.

This is a piece of the 'buildtext' closure in revlog._addrevision. 
Despite being in "_addrevision the action performed here is a "read" not 
a "write". _addrevision can be provided a pure blob of text to store, 
but i can also be provided with a base + delta pair and be asked to 
reconstruct the text itself. As this is a read, we need to apply the 
"transform" in the same order as for the other cases. However, we are a 
low level, "after" the transforms happened so we should not actually 
apply any text transform. This highlight the apparent full overlap 
between "write processing" and "write raw processing".

The above is a bit hairy, so I would not be surprise one got out of it 
confused but here is a couple of shorter "conclusion"

- if actual transform are applied in _addrevision, they need to be 
applied in the same order as the other (no reverse).

- even if no transform is applied, we still need to decide if hash can 
be validated. Since hash validation are composed we don't care about the 
order (so reverse is irrelevant).

> +                if validatehash is True:
> +                    self.checkhash(btext[0], node, p1=p1, p2=p2)

Another 'if x is True:'

>                  if flags & REVIDX_ISCENSORED:
>                      raise RevlogError(_('node %s is not censored') % node)
>              except CensoredNodeError:

Cheers,

-- 
Pierre-Yves David

[1] If flag start being handled by core (without an extension) current 
design says it will still have custom code in core deciding to introduce 
it and applying the transform.