RFC: bitmap storage for precursors and phases

Mon Feb 20 03:36:29 EST 2017

Excerpts from Augie Fackler's message of 2017-02-19 21:06:53 -0500:
> On Fri, Feb 17, 2017 at 09:59:48PM +0000, Stanislau Hlebik wrote:
> > Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800:
> > > On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu <quark at fb.com> wrote:
> > >
> > > > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +0000:
> > > > > As I said before we will load all non-public revs in one set and all
> > > >
> > > > The problem is, loading a Python set from disk is O(size-of-the-set).
> > > >
> > > > Bitmap's loading cost should be basically 0 (with mmap). I think that's why
> > > > we want bitmap in the first place. There are other choices like packfile
> > > > index, hash tables, but bitmap is the simplest and most efficient.
> > > >
> > >
> > > Hey folks,
> > >
> > > I haven't yet seen mention of some considerations that seem very important
> > > in driving the decision-making, so I'd appreciate it if someone could fill
> > > me in.
> > >
> > > Firstly, what's our current understanding of the sizes and compositions of
> > > these sets of numbers? In theory, we have a lot of data from practical
> > > application at Facebook, but nobody's brought it up.
> >
> > I assume that both sets (the set for nonpublic commits and the set
> > for obsstore) are going to be very small compared to the repo size. I
> > expect both sets to be < 1% of the repo size. And the sets are going
> > to be sparse.
> 
> I replied elsewhere in the thread, but in my clone of hg it's on the
> order of 25-30% of the history, so assuming it's going to be very
> sparse is probably unwise.

In that case it's better to use bitmaps. But to do that we need to get rid
of the iteration over filteredrevs in the scmutil.filteredhash() function.
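
To make the loading-cost argument concrete, here is a rough sketch of an
mmap-backed bitmap in which bit N is set when rev N is in the set. The
RevBitmap class, its methods and the on-disk layout are hypothetical names
made up for illustration; this is not existing Mercurial code and not a
proposed file format.

```python
import mmap
import os


class RevBitmap(object):
    """Hypothetical mmap-backed bitmap: bit N is set if rev N is in the set."""

    def __init__(self, path, nrevs):
        # One bit per revision, rounded up to whole bytes (at least one
        # byte, because an empty file cannot be mapped).
        size = max(1, (nrevs + 7) // 8)
        cur = os.path.getsize(path) if os.path.exists(path) else 0
        # Create the file, or pad it with zero bytes if the repo has grown.
        with open(path, 'ab') as f:
            if cur < size:
                f.write(b'\0' * (size - cur))
        self._file = open(path, 'r+b')
        # Mapping the file does not read it eagerly; pages are faulted in
        # only when a bit is actually looked at.
        self._map = mmap.mmap(self._file.fileno(), 0)

    def __contains__(self, rev):
        # Byte index is rev // 8, bit index within that byte is rev % 8.
        return bool(self._map[rev >> 3] & (1 << (rev & 7)))

    def add(self, rev):
        self._map[rev >> 3] |= 1 << (rev & 7)

    def close(self):
        self._map.close()
        self._file.close()
```

With something like this, "rev in bitmap" touches a single byte and opening
the file costs the same no matter how many bits are set, whereas building a
Python set means reading every member up front.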