[PATCH 1 of 9 changegroup-apis] exchange: use early return

Wed Aug 3 04:56:42 EDT 2016

On 08/02/2016 04:50 AM, Gregory Szorc wrote:
> On Mon, Aug 1, 2016 at 6:44 PM, Pierre-Yves David
> <pierre-yves.david at ens-lyon.org <mailto:pierre-yves.david at ens-lyon.org>>
> wrote:
> 
>     On 08/01/2016 08:20 PM, Gregory Szorc wrote:
>     > This is the first half of the series. The second half can be found at
>     > https://hg.mozilla.org/users/gszorc_mozilla.com/hg/rev/26eeb1e66e88 if
>     > someone wants to review the whole thing.
> 
>     I've made a first review pass on this series. As much as I like the idea
>     of cleaning this up, the proposed implementation makes me uncomfortable.
>     The main points that troubles me are:
> 
>     * the extra new class is a bit of a mix of (1) outgoing object (2) cg1
>     bundler object. This makes a the boundary of these object more fuzzy
> 
> 
> Not sure what you mean here. An outgoing object is basically a fancy
> tuple holding the set of common and missing nodes.

My point here is that Outgoing is a class meant to represent this
information, across various user. We should aim at increasing it usage
and visibility instead of hiding it under custom API exposing part of
it. For example, the 'emitter.changesetcount' should probably just be
`emiter.outgoing` with outgoing exposed elsewhere (eg: pushoperation)
programmer will eventually just go "ho, an outgoing object, I know that".

> I'm not sure what a "cg1 bundler" is. A non-bundle2 is just a v1 changegroup with a 6 byte
> header.
> 
> If you are referring to returning a cg<N>unpacker instance, I'm not a
> huge fan of that. I only did that for parity with existing consumers
> because I didn't want to scope bloat to create a 40+ patch series.

My point is that we already have class emiting changegroup data. I'm
aware this area is not great. And I trust you in you will to clean up
all this. However, it is currently unclear to me why we cannot improve
them to fit your needs instead of introducing a new class.

> I do intend to do some follow-up work here because our handling of
> changegroup generation is just bonkers. For example, sending a bundle 2
> to the wire protocol with the current code involves:
> 
> 1. Construct a cg<N>packer and call generate() to obtain a generator of
> changegroup chunks
> 2. Store a reference to the generator on a bundle2 part
> 3. Convert generator to a util.chunkbuffer (a file-like object) as part
> of bundlepart._payloadchunks() when serializing the bundle2 part
> 4. Emit a generator of 4k chunks because that's what the
> bundlepart.getchunks() does
> 5. Feed this generator of chunks into a util.chunkbuffer holding the
> entire bundle2 data (in exchange.getbundle())
> 6. Construct a zlib compressor
> 7. read(4096) from the util.chunkbuffer into the zlib compressor
> 8. Emit a generator of zlib compressed chunks to be sent to WSGI

To make sure I'm clear, I'm not challenging you diagnostic that there is
improvement to be made. In your example. there is multiple things
unrelated to the changegroup generation. I'm usually a proponent of
improving existing layers instead of introducing a new ones (does not
mean you cannot convince me that you approach here is right).

> util.chunkbuffer overhead has appeared in profiling results when cloning
> Firefox. The thing is that none of these util.chunkbuffer are really
> necessary for the wire protocol case: we could feed a generator of
> chunks of various sizes into zlib then emit whatever chunks it spits
> out, possibly normalizing them to equal sizes if we really wanted.

Yep, there is a lot of small improvement we could do. As you point out
the step (5) (6) could probably be easily simplified. But this seems
unrelated to the current rework of changegroup emition. Is it?

> I think the end state of this series involves a "changegroup emitter"
> being fed into some kind of "bundler," preferably as a generator. Using
> the cg1unpacker for this purpose is kind of silly, as it necessitates a
> util.chunkbuffer. The good news is the "changegroupemitter" as it stands
> can easily be represented as both a generator of chunks or a file
> handle. So by introducing this object and requiring callers to use it,
> we can start transitioning callers to not use util.chunkbuffer without
> having to change APIs on changegroup.py. It's much more flexible.

This paragraph highlight what seems to be the most important thing to
address in this discussion. You seems to try to avoid touching the
existing changegroup API as much as possible and I do not understand
why. My general feeling is that it should be possible to upgrade the
existing class to fit your needs, but your seems to think otherwise. I'm
curious to find out why as you usually have pretty good reason for doing
things?

>     and I not convinced yet your cleanup could not happen with the
>     existing objects,
> 
> 
> The existing functions are horrible and not very useful. As this series
> demonstrates, we had something like 6 functions producing either:
> 
> * a generator of changegroup chunks
> * a cg<N>unpacker instance
> 
> from
> 
> * an outgoing instance
> * a set of bases and heads
> * a set common nodes and heads
> 
> sometimes with knobs to control the changegroup version.
> 
> The API was unclear and far too low-level (e.g. sometimes you wanted to
> access the number of changesets or the nodes in the changegroup
> afterwards and the API hid these from you). To be honest, it wasn't
> clear to me that this was how things worked until I wrote the series.
> That's not a good sign.

(Reminder: I'm aware the current code could use improvement)

You new code seems to keep the 3 options for specifying the outgoing
set. So I assume you focussed on getting a simpler return type. Can you
elaborate on why do we need multiple return type in the first place?

Also, could we simplify the other side of this. For example we could
just keep the outgoing instance option to keep this layer simpler.

> The new API solves makes thing simpler by allowing you to construct a
> stateful object to representing a generated changegroup and then get a
> changegroup representation various ways. It is much more explicit.

"Stateful" seems to be an important key to lift my confusion. Can you
elaborate?

>     * many @class method usage while our coding style usually avoid them
>     along other "class as namespace" usage (as per Matt taste),
> 
> 
> That's fine. These can be pulled out into module-level functions. e.g.
> emitterfromrootsandheads(repo, roots, heads) -> changegroupemitter
>  
> 
> 
>     * the
>     superlongnameforeverything.thatcombinewithclassnamspacing.foroverninethousand
>     charline we could probably me more compact using a abbreviation for
>     common parts as we do elsewhere in the code.
> 
> 
> I agree the names are a bit long. But I'm not a fan of abbreviations
> because they make readability hard, especially for those who don't speak
> English natively.

We use common abbreviation in the Mercurial codebase in many place: rev,
ctx, cg, obs, etc… I'm non-native speaker and code well with that. We
also have a wiki page with some details on those. We could update it and
point people to it more.

https://www.mercurial-scm.org/wiki/CodingStyle#Naming_conventions

> I think remaining "changegroupemitter" to "emitter" (because
> "changegroup" is redundant with the module name) and moving the
> @classmethods to module functions will make things much better. e.g.
> 
> emitter = changegroup.changegroupemitter.fromheadsandcommon(repo, heads, common)
> 
> emitter = changegroup.emitterfromheadsandcommon(repo, heads, common)

I agree, that would be a good step forward

> 
> Or if wanted to write a complex __init__ function, we could have it
> accept all arguments. But I don't like complexity and prefer separate
> "constructor" functions.

It seems like we have 4 possible arguments: outgoing, common, base,
heads. The complexity seems manageable. We could maybe have a factoring
function for outgoing working that way and just delegate to it in the
constructor.

Cheers,

-- 
Pierre-Yves David