[PATCH 2 of 8 V7] bookmarks: introduce binary encoding

Gregory Szorc gregory.szorc at gmail.com
Fri Nov 11 00:20:13 EST 2016


On Wed, Nov 2, 2016 at 7:06 AM, Stanislau Hlebik <stash at fb.com> wrote:

> # HG changeset patch
> # User Stanislau Hlebik <stash at fb.com>
> # Date 1478016027 25200
> #      Tue Nov 01 09:00:27 2016 -0700
> # Branch stable
> # Node ID f3da841e3d47ccbb0be3892b521607c09fdeab13
> # Parent  a56a624a8a42b93ec980d2c284756a38719dffe6
> bookmarks: introduce binary encoding
>
> Bookmarks binary encoding will be used for `bookmarks` bundle2 part.
> The format is: <4 bytes - bookmark size, big-endian><bookmark>
>                <1 byte - 0 if node is empty 1 otherwise><20 bytes node>
> BookmarksEncodeError and BookmarksDecodeError maybe thrown if input is
> incorrect.
>
> diff --git a/mercurial/bookmarks.py b/mercurial/bookmarks.py
> --- a/mercurial/bookmarks.py
> +++ b/mercurial/bookmarks.py
> @@ -7,8 +7,10 @@
>
>  from __future__ import absolute_import
>
> +import StringIO
>

This module isn't available on Python 3. Use util.stringio or (preferably)
io.BytesIO (since it assumes it is operating on bytes and raises if it sees
unicode types).


>  import errno
>  import os
> +import struct
>
>  from .i18n import _
>  from .node import (
> @@ -23,6 +25,70 @@
>      util,
>  )
>
> +_NONEMPTYNODE = struct.pack('?', False)
> +_EMPTYNODE = struct.pack('?', True)
>

TIL "?" is a valid format character for struct!


> +
> +def _unpackbookmarksize(stream):
> +    """Returns 0 if stream is empty.
> +    """
> +
> +    expectedsize = struct.calcsize('>i')
> +    encodedbookmarksize = stream.read(expectedsize)
> +    if len(encodedbookmarksize) == 0:
> +        return 0
>

if not encodedbookmarksize:


> +    if len(encodedbookmarksize) != expectedsize:
> +        raise ValueError(
> +            _('cannot decode bookmark size: '
> +              'expected size: %d, actual size: %d') %
> +            (expectedsize, len(encodedbookmarksize)))
> +    return struct.unpack('>i', encodedbookmarksize)[0]
> +
> +def encodebookmarks(bookmarks):
> +    """Encodes bookmark to node mapping to the byte string.
> +
> +    Format: <4 bytes - bookmark size><bookmark>
> +            <1 byte - 0 if node is empty 1 otherwise><20 bytes node>
> +    Node may be 20 byte binary string, 40 byte hex string or empty
> +    """
>

I could make the argument for grouping the data by all sizes+nodes then all
the names. This allows nice byte aligned packing and unpacking and some
opportunities for indexing. It may also yield slightly better compression
as the bookmark names will all be next to each other (although you'd have
to measure that because compression window sizes may save us). But I'm
having difficulty convincing myself it is worth it. Of course, having this
data layout would also prohibit streaming. I think it is worth mentioning
the alternative but probably not worth switching.


> +
> +    for bookmark, node in bookmarks.iteritems():
>

You'll want to use a sorted() here or eventual tests for multiple bookmarks
will be unstable.


> +        yield struct.pack('>i', (len(bookmark)))
> +        yield encoding.fromlocal(bookmark)
> +        if node:
> +            if len(node) != 20 and len(node) != 40:
> +                raise ValueError(_('node must be 20 or bytes long'))
> +            if len(node) == 40:
> +                node = bin(node)
>

I think the API should accept a single node type: the 20 byte binary node.


> +            yield _NONEMPTYNODE
> +            yield node
> +        else:
> +            yield _EMPTYNODE
>

I'm on the fence regarding this single byte flag. On one hand, being
explicit is nice. On the other, a sequence of \x00 identifies the null or
empty node in so many other places. What's your reasoning here?


> +
> +def decodebookmarks(buf):
> +    """Decodes buffer into bookmark to node mapping.
> +
> +    Node is either 20 bytes or empty.
> +    """
> +
> +    stream = StringIO.StringIO(buf)
> +    bookmarks = {}
> +    bookmarksize = _unpackbookmarksize(stream)
> +    while bookmarksize:
> +        bookmark = stream.read(bookmarksize)
> +        if len(bookmark) != bookmarksize:
> +            raise ValueError(
> +                _('cannot decode bookmark: expected size: %d, '
> +                'actual size: %d') % (bookmarksize, len(bookmark)))
> +        bookmark = encoding.tolocal(bookmark)
> +        packedemptynodeflag = stream.read(struct.calcsize('?'))
> +        emptynode = struct.unpack('?', packedemptynodeflag)[0]
>

Yes, `struct` does cache things. But I don't like relying on it inside a
loop. Please create a `struct.Struct` instance outside the loop.


> +        node = ''
> +        if not emptynode:
> +            node = stream.read(20)
> +        bookmarks[bookmark] = node
>

Alternatively, we could yield key, value pairs. That would facilitate
streaming and allow the part to adapt to storing multiple values for the
same bookmark. (YAGNI applies so if multiple nodes per bookmark isn't
something you think we would need in the immediate future, don't bother.)
New bundle parts aren't that hard to invent.


> +        bookmarksize = _unpackbookmarksize(stream)
> +    return bookmarks
> +
>  def _getbkfile(repo):
>      """Hook so that extensions that mess with the store can hook bm
> storage.
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.mercurial-scm.org/pipermail/mercurial-devel/attachments/20161110/27479814/attachment.html>


More information about the Mercurial-devel mailing list