RFC: bitmap storage for precursors and phases

Stanislau Hlebik stash at fb.com
Fri Feb 17 16:59:48 EST 2017


Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800:
> On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu <quark at fb.com> wrote:
> 
> > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +0000:
> > > As I said before we will load all non-public revs in one set and all
> >
> > The problem is, loading a Python set from disk is O(size-of-the-set).
> >
> > Bitmap's loading cost should be basically 0 (with mmap). I think that's why
> > we want bitmap at the first place. There are other choices like packfile
> > index, hash tables, but bitmap is the simplest and most efficient.
> >
> 
> Hey folks,
> 
> I haven't yet seen mention of some considerations that seem very important
> in driving the decision-making, so I'd appreciate it if someone could fill
> me in.
> 
> Firstly, what's our current understanding of the sizes and compositions of
> these sets of numbers? In theory, we have a lot of data from practical
> application at Facebook, but nobody's brought it up.

I assume that both sets (set for nonpublic commits and set for
obsstore) are going to be very small comparing to the repo size. I
expect both sets < 1% of the repo size. And the sets is going to be
sparse.

> 
> To clarify why this matters, if an obsstore contains say 100 entries, then
> a very naive implementation should be fine. If it's related to something
> else, such as the size of, amount of activity in, or local age of a repo,
> then maybe we have a different set of decisions to make.
> 
> Secondly, what operations are being performed on these sets today, and do
> we know if those operations are going to change?
> 
> For instance, if a may be extended via new elements added to its end, then
> an mmap-based implementation immediately becomes awkward. If a set is
> traversed or we perform operations like union or intersection, then a
> bitmap might make sense if the set is very small and dense, but not if it's
> big and sparse.

We need three operations: iteration over the whole
set, testing if an element belongs to the set and adding new elements.
New elements can be added to the end of the set.
Not sure if more operations will be added, probably not.

Agree, we can't use bitmaps for iteration.
But potentially we can clean all of the places where iteration is used.
Most of them are pretty straight-forward except for one tricky case in
scmutil.filteredhash() function.

> 
> (Jun's idea that mmap is effectively free is also not borne out in
> practice, but that's a second-order issue here.)
> 
> So tell us what the needed API is, what the characteristics of the data
> are, and that'll help understand how to make a good decision.


More information about the Mercurial-devel mailing list