RFC: bitmap storage for precursors and phases

Jun Wu quark at fb.com
Fri Feb 17 22:14:12 EST 2017


Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800:
> On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu <quark at fb.com> wrote:
> 
> > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +0000:
> > > As I said before we will load all non-public revs in one set and all
> >
> > The problem is, loading a Python set from disk is O(size-of-the-set).
> >
> > Bitmap's loading cost should be basically 0 (with mmap). I think that's why
> > we want bitmap at the first place. There are other choices like packfile
> > index, hash tables, but bitmap is the simplest and most efficient.
> >
> 
> Hey folks,
> 
> I haven't yet seen mention of some considerations that seem very important
> in driving the decision-making, so I'd appreciate it if someone could fill
> me in.
> 
> Firstly, what's our current understanding of the sizes and compositions of
> these sets of numbers? In theory, we have a lot of data from practical
> application at Facebook, but nobody's brought it up.
> 
> To clarify why this matters, if an obsstore contains say 100 entries, then
> a very naive implementation should be fine. If it's related to something
> else, such as the size of, amount of activity in, or local age of a repo,
> then maybe we have a different set of decisions to make.

First, I agree that if N is small, O(N) or O(log(N)), or O(1), do not matter
much in practice. I tend to always reason about future-proof solutions,
which may or may not be a good habit - I'll adjust in the future.

I think there are multiple topics being discussed:

  1. How to solve the overhead loading hiddenrevs with minimal changes?
  2. Why is the bitmap format interesting (future use-cases)?

For 1,

  I once tried to make one bit of revlog flag as the source of truth of
  "hidden", which looks a nice solution. But it got rejected because revlog
  has to be append-only. Moving the flag bits out formed the bitmap idea
  naturally.

  Bitmap fits here but it does not mean we have to use it to solve this
  particular problem. For example, stateful chg will get us some easy wins,
  and see the problem analysis and proposed solution below.

  Note that we actually have a simple caching format for the "hiddenrevs".
  See repoview.py, computehidden.

  But what's wrong with "computehidden" is the choice of the cache key - it
  calls "hideablerevs", which takes 0.5 seconds in obsolete.getrevs(repo,
  'obsolete'), which reads the giant obsstore (in hg-committed) and do some
  complex computation. The result is then hashed and used as a cache key of
  "computehidden".

  So, if we do not want to know the "obsolete()" revisions, but just want to
  know what's hidden. Changing the cache key would be a very straightforward
  and effective solution which does not require a new file format.

  That'll help commands like "hg st", "hg log -T foo", etc. But it won't
  help complex "log"s which do require the "obsolete()" revisions. To speed
  up the latter, we can probably just copy the caching infrastructure from
  repoview.py.

For 2,

  len(filteredrevs) in my hg-committed is 2155. It's small. I tend to think
  about solutions that scale longer and are not too complex to build. That
  may or may not be a good habit.
  
  That said, I agree if the set is small and perf, then whatever easier to
  implement makes sense.
  
  I have some crazy ideas about the future where a giant bitmap is
  practically useful. For example, source controlling "visible heads" and
  the obsstore, then adding some C code to quickly fill a bitmap (requires
  "set" and "setancestors/descendants"), then we can show the user whenever
  their repo used to look like back in any time by using the bitmap filter.
  That'll help investigate hard issues, and implement "hg undo".

> Secondly, what operations are being performed on these sets today, and do
> we know if those operations are going to change?
>
> For instance, if a may be extended via new elements added to its end, then
> an mmap-based implementation immediately becomes awkward. If a set is

Not that bad if we preallocate space.

> traversed or we perform operations like union or intersection, then a
> bitmap might make sense if the set is very small and dense, but not if it's
> big and sparse.

Not necessarily true if we use an advanced variant, like the "roaringbitmap"
mentioned by you last year [1].

[1]: https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-September/087864.html

> (Jun's idea that mmap is effectively free is also not borne out in
> practice, but that's a second-order issue here.)
> 
> So tell us what the needed API is, what the characteristics of the data
> are, and that'll help understand how to make a good decision.



More information about the Mercurial-devel mailing list