Next steps for discovery

Peter Arrenbrecht peter.arrenbrecht at gmail.com
Tue Apr 26 02:40:48 CDT 2011


On Sun, Mar 13, 2011 at 2:19 PM, Peter Arrenbrecht
<peter.arrenbrecht at gmail.com> wrote:
> Benoit and I just discussed what we want to do for discovery next:
>
> 1. Implement command batching in wireproto.
> 2. Wait for real-life logs to come in so we can test how much effort
> is still reasonable.
> 3. Try first with sampling only from the heads towards roots (cheaper
> because we don't have to invert the graph). Use a cheaper way to
> eliminate descendents of known-to-be-missing nodes from the unknown
> region (iterate revlog array instead of pure graph ops).
> 4. See if we really need the endgame shortcutting tweak (makes the
> server's API less generic); usually saves one roundtrip (out of few,
> so it might be nice).
> 5. If we do need to sample from the bottom, see how we can speed up
> the current implementation, maybe by introducing a cache for
> descendant information.

Re 5: I now have a version that builds the inverse graph lazily
(exploiting the topological ordering of the revlog). It performs very
well in my tests. I also expect it to make a huge difference in
practice where repos have a lot of common nodes and relatively few
mutually unknown ones. Three key observations here:

 * We build the initial sample from heads to roots only, so we don't
need the inverse graph here at all.

 * For every response from the server, we need to trace descendants
for missing nodes (sampled nodes the server reports as unknown). In
the typical case we will have relatively few such nodes, and they will
live close to the end of the revlog. So building the inverse graph for
them is quick.

 * We build every subsequent sample over the remaining undecided
region from heads to roots *and vice versa*. So we need to inverse the
this region to find its roots and trace their descendants. The
expectation here is that the initial sample already sufficiently
reaches into the common region so that remaining undecided regions are
concentrated near the end of the revlog. This could probably break for
repos with tons of heads, or repos with very many new nodes. We might
consider placing a limit on the revlog length we would need to inverse
before starting to send full samples (trading more roundtrips for less
client CPU), but at the moment I'd rather not.

Finally, I'm still hoping I can get some halfway realistic scenarios
out of the discovery logs I received from Brodie.

-parren


More information about the Mercurial-devel mailing list