RFC: Incoming/outgoing discovery with fewer roundtrips / more bandwidth

Dirkjan Ochtman dirkjan at ochtman.nl
Tue Sep 29 01:44:18 CDT 2009


On Tue, Sep 29, 2009 at 08:28, Peter Arrenbrecht
<peter.arrenbrecht at gmail.com> wrote:
> So I tried to come up with a different approach.

Awesome.

> In essence, it trades bandwidth for roundtrips. For this to be
> reasonable, I'm assuming that the majority of Hg users today have at
> least broadband connections to their repos, but often still suffer
> from noticeable latency (my ping time to intevation.org varies,
> yesterday it was 70ms, today it's 40ms; but it seems to me that a
> single roundtrip uses rather more than just this latency). It also
> tries to minimize upstream bandwidth at the cost of downstream from
> the client's point of view (because broadband is often very
> asymmetrical and I assume servers like bitbucket.org have good
> outgoing bandwidth).
>
> The approach is based on sending samples of what one party knows about
> a repo in order to quickly come to an agreement. I have done a very
> rough exploration of the basic idea which shows that with about 100
> samples I can already do the same pull as above with only these
> packets:
>
>  -> sample (1.7K)
>  <- set to be pruned (23K)
>  -> pruning result (7.7K)
>  <- bundle stream
>
> This very simplistic prototype just sends a sample of revs taken at
> even distances from the repo. As such, it does not scale with history
> size and has a bunch of other serious flaws. It just serves to outline
> the data sizes I'm thinking of sending to and fro.

Right. I've been thinking about this recently as well. I think it also
means we need to adapt the wire protocol (because IIRC the current
version sends heads as GET parameters, and the maximum request-line
length that works reliably with servers, caches, etc. is limited).
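The even-distance sampling in the prototype might look something like
this (a toy sketch; `even_sample` and its arguments are made up for
illustration, not actual Mercurial internals):

```python
def even_sample(repo_len, num_samples=100):
    # Pick sample revision numbers at (roughly) even distances across
    # the whole repo, revs 0..repo_len-1.  Hypothetical helper showing
    # the prototype's strategy, not Mercurial API.
    if repo_len <= num_samples:
        return list(range(repo_len))
    step = repo_len / num_samples
    return [int(i * step) for i in range(num_samples)]
```

As Peter notes, sampling by rev number alone ignores the graph
structure entirely, so the sample density falls off as history grows.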

> But I think I have an idea on how to solve all of the glaring problems
> I have so far discovered. It revolves around:
>
>  * Taking samples along graph paths starting at the heads, not just
> from the somewhat arbitrary order of revs in a repo.
>  * Doing this breadth-first, maybe with higher sample density near the
> repo heads, until we have gathered enough samples for this roundtrip.
>  * Having, per sample, an upper bound on the lengths of the paths that
> go from this sample towards the tip or the next sample. This limits the
> amount of data we have to send for detailed pruning.
>  * Using an iterative approach where the two parties exchange samples
> until all uncharted regions have been discovered sufficiently. Each
> sample may yield hints on how to restrict the range of further samples
> to only still uncharted regions. I expect that in most cases the
> initial sample will already do the trick.
>
> I would like to build a prototype of this, but only if there is basic
> agreement that an approach trading some bandwidth for roundtrips could
> eventually be accepted into Mercurial.

Sounds great. One idea I had is to do a kind of backing-off algorithm:
send the 32 tipmost nodes, then thin out the rest to explore the
general vicinity, e.g. revs 1-32, then 64, 128, 256, 512, 1024, etc.
for the first request. I think a prototype needs to be a good,
documented exploration of where the costs are in our current method,
what the common scenarios are for both push and pull, and what
algorithms we have available to optimize those scenarios without
hurting any other ones too much (i.e. we need good bandwidth vs.
round-trips/latency stats here).
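For concreteness, that exponential backing-off from the tip might be
sketched as follows (purely illustrative, not Mercurial code):

```python
def backoff_sample(tiprev):
    # Offsets measured backwards from the tip: the 32 tipmost revs,
    # then exponentially sparser samples (64, 128, 256, ... back).
    # Sketch of the backing-off idea, not a real discovery request.
    offsets = list(range(32))
    step = 64
    while step <= tiprev:
        offsets.append(step)
        step *= 2
    return [tiprev - off for off in offsets if tiprev - off >= 0]
```

This keeps the first request small while still covering both recent
history densely and old history coarsely.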

> We could consider having this as an alternative to the existing
> algorithm for people with broadband. But I'd rather not. This would be
> a fairly important code path and having alternatives might mean that
> one of them will be much less tested than the other in practice.

Agreed. I think we should be able to make a good trade-off here.

Cheers,

Dirkjan



More information about the Mercurial-devel mailing list