[PATCH 1 of 6 RFC] radixlink: introduce a new data structure

Augie Fackler raf at durin42.com
Wed May 24 17:11:51 EDT 2017


On Sun, May 21, 2017 at 06:31:08PM -0700, Jun Wu wrote:
> # HG changeset patch
> # User Jun Wu <quark at fb.com>
> # Date 1495396237 25200
> #      Sun May 21 12:50:37 2017 -0700
> # Node ID fec85b1af16b0360e7bd78cd26d4d21edf678962
> # Parent  4e51b2a99847904f1cc5a9c74784f19d69e829d0
> # Available At https://bitbucket.org/quark-zju/hg-draft
> #              hg pull https://bitbucket.org/quark-zju/hg-draft -r fec85b1af16b
> radixlink: introduce a new data structure

Clever. I like it.

>
> The radixlink structure is designed to address several hard-to-solve
> performance issues in Mercurial. It is an on-disk radix-tree as index,
> linking to linked list as data (so it could store multiple values per key).
>
> For long, core Mercurial has no real key-value data structure, and various
> operations have suboptimal time complexity:
>
>   1. changelog.rev(node): O(len(repo))
>   2. obsstore.precursors[node]: O(len(markers))
>   3. adjustlinkrev slow path (a complex problem)
>
> The new data structure could make the first two O(len(node)), and provide a
> cache for linkrevs where the keys could be "filenode + path + '\0'".
>
> The next patches will try to solve the obsstore performance issue.
>
> diff --git a/mercurial/radixlink.py b/mercurial/radixlink.py
> new file mode 100644
> --- /dev/null
> +++ b/mercurial/radixlink.py
> @@ -0,0 +1,244 @@
> +# radixlink.py - on-disk radixtree index pointing to linked list data
> +#
> +# Copyright 2017 Facebook, Inc.
> +#
> +# This software may be used and distributed according to the terms of the
> +# GNU General Public License version 2 or any later version.
> +from __future__ import absolute_import
> +
> +import os
> +import struct
> +
> +from . import (
> +    error,
> +)
> +
> +def _enlarge(buf, size):
> +    """enlarge a bytearray to at least given size"""
> +    l = len(buf)
> +    if l < size:
> +        buf += bytearray(size - l)
> +
> +def _tob16(buf):
> +    """convert buffer to base16 bytearray"""
> +    result = bytearray(len(buf) * 2)
> +    for i, b in enumerate(bytearray(buf)):
> +        result[i * 2] = b >> 4
> +        result[i * 2 + 1] = b & 0xf
> +    return result
> +
> +class radixlink(object):
> +    """multimap with O(len(key) + len([value])) lookup and O(len(key)) insertion
> +
> +    A "radixlink" structure consists of 2 files: indexfile, and linkfile.
> +
> +    The indexfile is the radix tree with pointers to linkfile:
> +
> +        indexdata        := index-header + index-entry-list
> +        index-header     := source of truth offset (8B)

does "source of truth offset" mean "number of bytes in the source of
truth file as of when this index was valid"?

> +        index-entry-list := '' | index-entry-list + index-entry
> +        index-entry      := radix-entry | leaf-entry
> +        radix-entry      := '\0' + index-offset (4B) * 16
> +        leaf-entry       := '\1' + link-offset (4B) + key-length (2B) + key

[...]
> +    @property
> +    def truthoffset(self):
> +        """position of the append-only source of truth the index contains"""
> +        if len(self.indexdata) >= self.indexheader.size:
> +            return self.indexheader.unpack_from(self.indexdata)[0]
> +        return 0
> +
> +    @truthoffset.setter
> +    def truthoffset(self, newoffset):
> +        if self.truthoffset == newoffset:
> +            return
> +        self.indexheader.pack_into(self.indexdata, 0, newoffset)
> +        self._dirty = True

> diff --git a/tests/test-radixlink.py b/tests/test-radixlink.py
> new file mode 100644
> --- /dev/null
> +++ b/tests/test-radixlink.py

If we're gonna have a new test, let's use our silenttestrunner instead
of .py and .out files. It's a lot easier for newcomers to
understand. :)

> @@ -0,0 +1,61 @@
> +from __future__ import absolute_import
> +
> +import os
> +import random
> +
> +from mercurial import (
> +    error,
> +    radixlink,
> +    vfs,
> +)
> +
> +randint = random.randint
> +
> +def ensure(expr):
> +    if not expr:
> +        raise RuntimeError, 'unexpected'

nit: `raise RuntimeError('unexpected')` is a more idiomatic spelling
(does that not work for some reason?)

> +
> +fs = vfs.vfs(os.environ['TESTTMP'])


More information about the Mercurial-devel mailing list