RFC: Incoming/outgoing discovery with fewer roundtrips / more bandwidth

Wed Sep 30 07:09:20 CDT 2009

On Tue, Sep 29, 2009 at 9:40 AM, Benoit Boissinot
<benoit.boissinot at ens-lyon.org> wrote:
> On Tue, Sep 29, 2009 at 08:46:38AM +0200, Dirkjan Ochtman wrote:
>> On Tue, Sep 29, 2009 at 08:28, Peter Arrenbrecht
>> <peter.arrenbrecht at gmail.com> wrote:
>> > But I think I have an idea on how to solve all of the glaring problems
>> > I have so far discovered. It revolves around:
>> >
>> >  * Taking samples along graph paths starting at the heads, not just
>> > from the somewhat arbitrary order of revs in a repo.
>> >  * Doing this breadth-first, maybe with higher sample density near the
>> > repo heads, until we have gathered enough samples for this roundtrip.
>> >  * Having, per sample, an upper bound on the lengths of the paths that
>> > go from this path towards tip or the next sample. This limits the
>> > amount of data we have to send for detailed pruning.
>> >  * Using an iterative approach where the two parties exchange samples
>> > until all uncharted regions have been discovered sufficiently. Each
>> > sample may yield hints on how to restrict the range of further samples
>> > to only still uncharted regions. I expect that in most cases the
>> > initial sample will already do the trick.
>>
>> Oh, and I think Benoit had some ideas for using an algorithm to
>> communicate the graph more efficiently, I think he used something
>> called dominators or dominance.
>
> Yep, I've been thinking (with ideas from others: Matt, Bryan, etc.)
> about that since at least the 2005 London sprint :)
>
> Some of the ideas that floated around:
>
> - use a rolling checksum (ala rsync) on the list of nodes
> - use a bloom filter
>
> Other than that, there is at least a very obvious "fix" for the
> protocol:
>
> Currently it is asymetric, that means that even if you have no
> local changes (the local repo is a subset of upstream) you will do a
> bunch of roundtrips to discover the difference with upstream.
> I think a simple solution where we explore both DAGs in parallel, it
> would dramatically improve the common case (very few or no local
> changes, many csets upstream).
>
>
> I played with dominator (basically extending the current algorithm to
> skip branches until they are re-united). For example (history goes from
> left to right):
>
>    3
>   / \
>  1-2-4
>  /     \
> 0-----5-6
>
> would return (6, 5, 4, 0) and if the client knows about 0, it skips the
> branch exploration. You can do that for every 2**n dominator.
> The problem is that it doesn't play that nice with a Linux-type DAG.
>
> I think there is a way to exploit a typical pull-based (Linux-type)
> tree, but I haven't found the right way yet.

The Linux tree is giving me headaches, too. The Netbeans one is even
worse. They force my repo sampler to take a lot of samples at less
than the skip I'm aiming for (where I define the "skip" to be the max
length of the path from a given node to any descendant, but stopping
at samples nearer the heads of the tree). Meaning the samples are too
dense and don't cover as much of the tree as I want them to.

Some more thinking is in order.

Meanwhile, if anyone wants to take a look, I have attached the code
for the sampling I'm using right now.

My next step will probably be to build a test harness where we can
easily define interesting DAGs, run them through discovery protocols,
and visualize their workings (probably using the glog code).

BTW, for people wanting to play with their own ideas, cloning the
Linux kernel and Netbeans repos takes ages. But to play with
discovery, all we need is their 00changelog.i files. So you can just
kill an `hg clone` after it says it's getting the manifest. Then clean
up the repo:

 * In .hg, delete all but 00changelog.i, requires, and store
 * In .hg/store, move00changelog.i.a to00changelog.i, then delete
everything else.

> And if we change that, we need to change changegroup too, to send the
> common nodes instead of the "first remote nodes that we don't know
> about".

I fully expect to have to completely redo the protocol that leads up
to the final bundle stream for pull/in and push/out.

-parren
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dagutil.py
Type: text/x-python
Size: 7801 bytes
Desc: not available
Url : http://selenic.com/pipermail/mercurial-devel/attachments/20090930/c5532d57/attachment.py 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: discovery.py
Type: text/x-python
Size: 2303 bytes
Desc: not available
Url : http://selenic.com/pipermail/mercurial-devel/attachments/20090930/c5532d57/attachment-0001.py