RFC: bitmap storage for precursors and phases

Sean Farley sean at farley.io
Tue Feb 21 18:43:52 EST 2017


Jun Wu <quark at fb.com> writes:

> Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800:
>> On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu <quark at fb.com> wrote:
>> 
>> > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +0000:
>> > > As I said before we will load all non-public revs in one set and all
>> >
>> > The problem is, loading a Python set from disk is O(size-of-the-set).
>> >
>> > Bitmap's loading cost should be basically 0 (with mmap). I think that's why
>> > we want bitmap at the first place. There are other choices like packfile
>> > index, hash tables, but bitmap is the simplest and most efficient.
>> >
>> 
>> Hey folks,
>> 
>> I haven't yet seen mention of some considerations that seem very important
>> in driving the decision-making, so I'd appreciate it if someone could fill
>> me in.
>> 
>> Firstly, what's our current understanding of the sizes and compositions of
>> these sets of numbers? In theory, we have a lot of data from practical
>> application at Facebook, but nobody's brought it up.
>> 
>> To clarify why this matters, if an obsstore contains say 100 entries, then
>> a very naive implementation should be fine. If it's related to something
>> else, such as the size of, amount of activity in, or local age of a repo,
>> then maybe we have a different set of decisions to make.
>
> First, I agree that if N is small, O(N) or O(log(N)), or O(1), do not matter
> much in practice. I tend to always reason about future-proof solutions,
> which may or may not be a good habit - I'll adjust in the future.
>
> I think there are multiple topics being discussed:
>
>   1. How to solve the overhead loading hiddenrevs with minimal changes?
>   2. Why is the bitmap format interesting (future use-cases)?
>
> For 1,
>
>   I once tried to make one bit of revlog flag as the source of truth of
>   "hidden", which looks a nice solution. But it got rejected because revlog
>   has to be append-only. Moving the flag bits out formed the bitmap idea
>   naturally.
>
>   Bitmap fits here but it does not mean we have to use it to solve this
>   particular problem. For example, stateful chg will get us some easy wins,
>   and see the problem analysis and proposed solution below.
>
>   Note that we actually have a simple caching format for the "hiddenrevs".
>   See repoview.py, computehidden.
>
>   But what's wrong with "computehidden" is the choice of the cache key - it
>   calls "hideablerevs", which takes 0.5 seconds in obsolete.getrevs(repo,
>   'obsolete'), which reads the giant obsstore (in hg-committed) and do some
>   complex computation. The result is then hashed and used as a cache key of
>   "computehidden".
>
>   So, if we do not want to know the "obsolete()" revisions, but just want to
>   know what's hidden. Changing the cache key would be a very straightforward
>   and effective solution which does not require a new file format.
>
>   That'll help commands like "hg st", "hg log -T foo", etc. But it won't
>   help complex "log"s which do require the "obsolete()" revisions. To speed
>   up the latter, we can probably just copy the caching infrastructure from
>   repoview.py.
>
> For 2,
>
>   len(filteredrevs) in my hg-committed is 2155. It's small. I tend to think

Another data point for my hg-committed:

len(filteredrevs) > 8k

len(obs markers) > 120k


More information about the Mercurial-devel mailing list