D3990: linelog: add a Python implementation of the linelog datastructure

indygreg (Gregory Szorc) phabricator at mercurial-scm.org
Wed Aug 1 14:00:02 EDT 2018


indygreg requested changes to this revision.
indygreg added a comment.
This revision now requires changes to proceed.


  I wrote several comments. But overall this code seems very reasonable as a first implementation. Especially if we want to maintain backwards compatibility with the C implementation for the initial import. My mind was blown when I realized linelog was an interpreted bytecode. Crazy town.
  
  Most of my comments can be deferred to follow-ups. I'd be tempted to defer all the performance-related changes to follow-ups so we can measure the impact they have. I'm actually curious about that.
  
  I probably could grant review. I figured I'd send this back to you in case you wanted to make some changes. I also think this may want another set of eyes because it is a big piece of code. And I'm not that great with algorithms and kind of skimmed some of the lower-level logic around operation traversal. There's a lot to digest!

INLINE COMMENTS

> linelog.py:30
> +
> +_llentry = struct.Struct('!II')
> +

Nit: I think we use `>` to indicate big-endian elsewhere.

Also, I'm a fan of using little-endian for on-disk formats to save conversion operations since x86 is little-endian. Not that it matters given the overhead of Python. But it can come into play when e.g. implementing these things in Rust. I'm inclined to ignore it for now. As long as we have a mechanism for versioning the on-disk and exchanged formats.

> linelog.py:35
> +
> + at attr.s
> +class lineinfo(object):

We may want to define `slots=True` on this and `annotateresult` so objects take up less space. Could be done as a follow-up.

> linelog.py:55
> +
> +    __metaclass__ = abc.ABCMeta
> +

`abc` requires module import time computation, which adds overhead.

I'd encourage you to use `interfaceutil` add supplement `test-check-interfaces.py` to perform the interface conformance tests not at run time.

> linelog.py:77-88
> +    def execute(self, rev, pc, emit):
> +        """Execute this instruction.
> +
> +        Args:
> +          rev: The revision we're annotating.
> +          pc: The current offset in the linelog program.
> +          emit: A function that accepts a single lineinfo object.

While I like the abstraction of instructions, given the simplicity of the language and the overhead of function calls in Python, I wonder if we'd be better off with the execution logic inlined. The performance speedup is already significant with this code. So deferring on performance optimization seems reasonable.

> linelog.py:90-95
> +class _jge(_llinstruction):
> +    """If the current rev is greater than or equal to op1, jump to op2."""
> +
> +    def __init__(self, op1, op2):
> +        self._cmprev = op1
> +        self._target = op2

It feels like we may want to use `attrs` with `slots=True` for these types.

> linelog.py:210
> +    op1 = op1 >> 2
> +    if opcode == 0:
> +        if op1 == 0:

Constants might be a bit nicer to read.

> linelog.py:222-223
> +
> +class linelog(object):
> +    def __init__(self, program=None, maxrev=0):
> +        if program is None:

I think these want docstrings.

> linelog.py:239-240
> +    def __repr__(self):
> +        return '<linelog at %s: maxrev=%d size=%d>' % (
> +            hex(id(self)), self._maxrev, len(self._program))
> +

Nit: maybe display the instruction count instead / as well?

> linelog.py:247
> +
> +    @classmethod
> +    def fromdata(cls, buf):

We don't really use `@classmethod` in Mercurial. Consider breaking out into a normal function.

> linelog.py:253
> +                    len(buf), _llentry.size))
> +        expected = (len(buf) / _llentry.size)
> +        fakejge = _decodeone(buf, 0)

Nit: drop the parens

> linelog.py:265
> +        instructions = [_eof(0, 0)]
> +        for offset in range(1, numentries):
> +            instructions.append(_decodeone(buf, offset * _llentry.size))

Use `pycompat.xrange` (we should probably establish a lint for this).

> linelog.py:266
> +        for offset in range(1, numentries):
> +            instructions.append(_decodeone(buf, offset * _llentry.size))
> +        return cls(instructions, maxrev=maxrev)

Would it be faster to implement this as a list comprehension? I can't recall if Python optimizes away the overhead of `list.append` in that case. Could be done as a follow-up.

> linelog.py:269-271
> +    def encode(self):
> +        hdr = _jge(self._maxrev, len(self._program)).encode()
> +        return hdr + ''.join(i.encode() for i in self._program[1:])

I'm assuming programs can get a bit large? We may want to turn this into a generator of chunks.

> linelog.py:370-376
> +    # Stateful methods which depend on the value of the last
> +    # annotation run. This API is for compatiblity with the original
> +    # linelog, and we should probably consider refactoring it.
> +    @property
> +    def annotateresult(self):
> +        """Return the last annotation result. C linelog code exposed this."""
> +        return [(l.rev, l.linenum) for l in self._lastannotate.lines]

I agree. I'm not a fan of the API. But this can be cleaned up later.

> linelog.py:378-379
> +
> +    def getoffset(self, line):
> +        return self._lastannotate.lines[line]._offset
> +

`self._lastannotate` may be `None`. I assume this is part of the API for the same reason as `annotateresult`.

> linelog.py:393
> +        # then, something is badly broken.
> +        for step in range(len(self._program)):
> +            inst = self._program[pc]

`pycompat.xrange`.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D3990

To: durin42, #hg-reviewers, indygreg
Cc: indygreg, mercurial-devel


More information about the Mercurial-devel mailing list