[PATCH 02 of 22] radixlink: introduce a new data structure

Yuya Nishihara yuya at tcha.org
Wed Jun 7 11:21:58 EDT 2017


On Tue, 6 Jun 2017 18:17:49 -0700, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2017-06-07 00:53:09 +0900:
> > > +class radixlink(object):
> > > +    """multimap. keys are variable-length bytes, values are lists of uint32s.
> > 
> > Do we really need to support variable-length key? IIUC, that's the main reason
> > why both radix-entry and leaf-entry have link-offset field.
> 
> For linkrev use-case, we might need to store paths.

Store path as a key? What I meant is "b) Do not allow key1 to be a prefix
of key2."

> > > +        link-entry    := next-link-offset (4B) + value (4B)
> > Nit: I slightly prefer "link-entry := value + next-link-offset" for
> > consistency with the other entries.
> 
> (mentioned below that we might want to support multiple types in link-entry)

Got why you made a separate entry for value node, thanks.

> revlog.c does a really well job in terms of both performance and space
> usage. It uses int[16] for an index entry. Stores +index or -rev. And does
> not have a separate leaf format.

Yea, it's crazily space efficient.

> We might want to just dump indexObject->nt, if we put endian-ness in the
> file name.
> 
> (some format changing ideas mentioned below)
> 
> > Suppose our data structure is at least 8-byte aligned, we have 3bits from
> > LSB which can be (ab)used as flags (as glibc malloc does.)
> 
> Space-wise, the leaf-entry padding is the main issue. Removing that might
> cut the index file size in half.
> 
> If we want to further reduce the size, possible changes are:
> 
>   a) Do not store full key, use int32. This makes the interface more
>      complex - people have to provide a function to convert int32 to a key.
>   b) Do not allow key1 to be a prefix of key2. This removes "link-offset" in
>      radix entry.
>   c) Support different kinds of values other than a linked list, like just
>      an int32 representing something.
> 
> "a" will make us closer to revlog.c but it requires some low-level changes
> to work with obsstore (that I don't want to do it now for fm1 obsstore).
> 
> "b" seems a weird limitation. But may be worthy to catch up with revlog.c.
> 
> "c" could be implemented later. "link" and "index" are relatively separate.

I was thinking that the size of the obsstore index would be quite big, and
reading/writing/jumping-around-in-memory it would be somewhat costly. If you
have an idea to mitigate these costs, shrinking the cache size wouldn't be
important.

> > [...]
> > Nit: I believe we'll soon need "hg debugradixlink" command. How about splitting
> > the module into core radixlink + pure impl + cext impl?
> 
> The radixlink type needs to be in C to make overhead minimal (so __getitem__
> won't need a hash lookup). I think it might be fine to duplicate struct
> definitions in the debug command.

I meant mercurial/radixlink.py could import pure/ attributes to implement
dumpradixlink().


More information about the Mercurial-devel mailing list