[PATCH 02 of 22] radixlink: introduce a new data structure
Jun Wu
quark at fb.com
Tue Jun 6 21:17:49 EDT 2017
Excerpts from Yuya Nishihara's message of 2017-06-07 00:53:09 +0900:
> [...]
>
> I just had a look at _get() and _splitleafentry() code paths. The idea
> generally looks good to me. I have random thoughts on both data structure
> and API. Some are just nitpick and some are crazy.
>
> > +class radixlink(object):
> > + """multimap. keys are variable-length bytes, values are lists of uint32s.
>
> Do we really need to support variable-length key? IIUC, that's the main reason
> why both radix-entry and leaf-entry have link-offset field.
For linkrev use-case, we might need to store paths.
> > + link-entry := next-link-offset (4B) + value (4B)
> Nit: I slightly prefer "link-entry := value + next-link-offset" for
> consistency with the other entries.
(mentioned below that we might want to support multiple types in link-entry)
> Crazy idea: If we can have some restrictions on the key format (e.g. fixed
> length, maximum length, etc.), maybe we could shrink the minimum size of the
> entry and embed a few values onto leaf-entry.
>
> radix-entry := index-offset * 16
> leaf-entry := link-offset + b16-key-length (1byte)
> + number-of-inline-value (1byte) + b8-key + padding
> padding := inline-value (4byte) * (as much as possible)
> link-entry := link-offset + value
Sorry - my commit message about revlog.c using pointers was wrong.
revlog.c does a really well job in terms of both performance and space
usage. It uses int[16] for an index entry. Stores +index or -rev. And does
not have a separate leaf format.
We might want to just dump indexObject->nt, if we put endian-ness in the
file name.
(some format changing ideas mentioned below)
> Suppose our data structure is at least 8-byte aligned, we have 3bits from
> LSB which can be (ab)used as flags (as glibc malloc does.)
Space-wise, the leaf-entry padding is the main issue. Removing that might
cut the index file size in half.
If we want to further reduce the size, possible changes are:
a) Do not store full key, use int32. This makes the interface more
complex - people have to provide a function to convert int32 to a key.
b) Do not allow key1 to be a prefix of key2. This removes "link-offset" in
radix entry.
c) Support different kinds of values other than a linked list, like just
an int32 representing something.
"a" will make us closer to revlog.c but it requires some low-level changes
to work with obsstore (that I don't want to do it now for fm1 obsstore).
"b" seems a weird limitation. But may be worthy to catch up with revlog.c.
"c" could be implemented later. "link" and "index" are relatively separate.
> *-entry := *-offset (29bits) + flags (3bits) + ...
> flags := { RADIX (0), LEAF (1), LINK (2) }
>
> > + def __len__(self):
> > + """raw buffer length, could be used to do "isdirty" check"""
> > + return len(self._data)
>
> API: This would be confusing with the length of the entries since it supports
> __getitem__(). I prefer more explicit name, e.g. rawsize().
Good idea.
> > + def _get(self, key):
> > + """lookup a key, return the value list, or an empty array"""
> > + ioff = self.headersize
> > + bkey = _tob16(key)
>
> Future API: we'll need to support odd-length key in order to replace
> revlog._partialmatch().
Good reminder. Now radixlink is 70% slower than revlog.c If we make "a", "b"
and "c" changes, the performance will be closer and the major difference
will be endian-ness. I'll try to get some perf data on that. If it's closer
enough, we can make a switch.
> > + def insert(self, key, value):
>
> (not yet reviewed)
Note: format may change given the above "a", "b", "c" points.
> > + def _followindex(self, offset, b16list, b16offset, splitleaf=False):
> > + """follow an index-entry at offset
> > +
> > + For a radix-entry, lookup b16list[b16offset], and return new pointers.
> > + For a leaf-entry, check the remaining of b16list.
> > +
> > + If splitleaf is True, split a leaf-entry where b16list does not match.
> > + Useful for adding new index-entries.
> > +
> > + return (nextoffset, b16offset, poffset) or raise. poffset is the offset
>
> Nit: maybe "or raise" would be outdated?
struct.Struct.unpack_from may raise if it reads outside the boundary.
> [...]
> Nit: Some comment would be helpful to understand why we re-read and assert klen.
Will do.
> [...]
> Nit: Since we dropped support for Python 2.6, maybe we could use
> memoryview to simulate pointer arithmetic.
Good idea. I thought memoryview was read-only.
> [...]
> Nit: I believe we'll soon need "hg debugradixlink" command. How about splitting
> the module into core radixlink + pure impl + cext impl?
The radixlink type needs to be in C to make overhead minimal (so __getitem__
won't need a hash lookup). I think it might be fine to duplicate struct
definitions in the debug command.
More information about the Mercurial-devel
mailing list