[PATCH 1 of 9 changegroup-apis] exchange: use early return

Sat Aug 6 18:02:15 EDT 2016

On 08/04/2016 05:53 AM, Gregory Szorc wrote:
> On Wed, Aug 3, 2016 at 1:56 AM, Pierre-Yves David
> <pierre-yves.david at ens-lyon.org <mailto:pierre-yves.david at ens-lyon.org>>
> wrote:
>
>     On 08/02/2016 04:50 AM, Gregory Szorc wrote:
>     > On Mon, Aug 1, 2016 at 6:44 PM, Pierre-Yves David
>     > <pierre-yves.david at ens-lyon.org
>     <mailto:pierre-yves.david at ens-lyon.org>
>     <mailto:pierre-yves.david at ens-lyon.org
>     <mailto:pierre-yves.david at ens-lyon.org>>>
>     > wrote:
>     >
>     >     On 08/01/2016 08:20 PM, Gregory Szorc wrote:
>     >     > This is the first half of the series. The second half can be found at
>     >     > https://hg.mozilla.org/users/gszorc_mozilla.com/hg/rev/26eeb1e66e88 if
>     >     > someone wants to review the whole thing.
>     >
>     >     I've made a first review pass on this series. As much as I like the idea
>     >     of cleaning this up, the proposed implementation makes me uncomfortable.
>     >     The main points that troubles me are:
>     >
>     >     * the extra new class is a bit of a mix of (1) outgoing object (2) cg1
>     >     bundler object. This makes a the boundary of these object more fuzzy
>     >
>     >
>     > Not sure what you mean here. An outgoing object is basically a fancy
>     > tuple holding the set of common and missing nodes.
>
>     My point here is that Outgoing is a class meant to represent this
>     information, across various user. We should aim at increasing it usage
>     and visibility instead of hiding it under custom API exposing part of
>     it. For example, the 'emitter.changesetcount' should probably just be
>     `emiter.outgoing` with outgoing exposed elsewhere (eg: pushoperation)
>     programmer will eventually just go "ho, an outgoing object, I know
>     that".
>
>
> I agree that discovery.outgoing is a decent type to represent changesets
> that will be "sent" to another location or will represent the contents
> of a changegroup. I think most of the code currently in
> changegroup.changegroupsubset() should move to the discovery module. I
> *don't* think discovery.outgoing instances should gain APIs to emit
> changegroups, however. Instead, outgoing instances are things to be
> passed to a changegroup generator instructing it how to behave.

We are on the same page here. the outgoing object is meant to be 
descriptive only. If might get more data to get more descriptive, but 
adding any generation to it should be avoided.

>     > I'm not sure what a "cg1 bundler" is. A non-bundle2 is just a v1 changegroup with a 6 byte
>     > header.
>     >
>     > If you are referring to returning a cg<N>unpacker instance, I'm not a
>     > huge fan of that. I only did that for parity with existing consumers
>     > because I didn't want to scope bloat to create a 40+ patch series.
>
>     My point is that we already have class emiting changegroup data. I'm
>     aware this area is not great. And I trust you in you will to clean up
>     all this. However, it is currently unclear to me why we cannot improve
>     them to fit your needs instead of introducing a new class.
>
>
> I wanted to establish a common class to represent "pending changegroup
> data" so that consumers all operated on a common type and could access
> changegroup data any way they saw fit *and* in an explicit manner.. I
> view this as simpler than requiring clients to hold onto a reference to
> sets of
> nodes (or an outgoing instance) along with a generator or chunkbuffer.
> The explicit part is important: the changegroup functions today have
> very opaque return values. e.g. sometimes a "cg" variable refers to a
> generator of chunks. Sometimes it is a chunkbuffer. By having a class
> where you call "getX()" it is much easier to understand what the code is
> doing.

My main objection here is about adding yet another type of object. The 
changegroup generation stack is already quite complicated with at least 
three thing I can think on the top of my head (packer, unpacker, 
bundler). I do not see the addition of a new one as an improvement 
(unless we manage to kill another one).

In the case you are mentionning, these iterators/chunkbuffer come from 
somewhere. We could probably improve the object that provide what their 
consumer needs.

>     > I do intend to do some follow-up work here because our handling of
>     > changegroup generation is just bonkers. For example, sending a bundle 2
>     > to the wire protocol with the current code involves:
>     >
>     > 1. Construct a cg<N>packer and call generate() to obtain a generator of
>     > changegroup chunks
>     > 2. Store a reference to the generator on a bundle2 part
>     > 3. Convert generator to a util.chunkbuffer (a file-like object) as part
>     > of bundlepart._payloadchunks() when serializing the bundle2 part
>     > 4. Emit a generator of 4k chunks because that's what the
>     > bundlepart.getchunks() does
>     > 5. Feed this generator of chunks into a util.chunkbuffer holding the
>     > entire bundle2 data (in exchange.getbundle())
>     > 6. Construct a zlib compressor
>     > 7. read(4096) from the util.chunkbuffer into the zlib compressor
>     > 8. Emit a generator of zlib compressed chunks to be sent to WSGI
>
>     To make sure I'm clear, I'm not challenging you diagnostic that there is
>     improvement to be made. In your example. there is multiple things
>     unrelated to the changegroup generation. I'm usually a proponent of
>     improving existing layers instead of introducing a new ones (does not
>     mean you cannot convince me that you approach here is right).
>
> I argue I improved the changegroup API layer by consolidating 8 highly
> redundant functions into a single new class that does everything they
> did and more :)

To me it sounded a bit like new class with 8 redundant methods and a 
task of existing object. I would like to be optimistic a believe we can 
consolidate the API without making more layers.

>     > util.chunkbuffer overhead has appeared in profiling results when cloning
>     > Firefox. The thing is that none of these util.chunkbuffer are really
>     > necessary for the wire protocol case: we could feed a generator of
>     > chunks of various sizes into zlib then emit whatever chunks it spits
>     > out, possibly normalizing them to equal sizes if we really wanted.
>
>     Yep, there is a lot of small improvement we could do. As you point out
>     the step (5) (6) could probably be easily simplified. But this seems
>     unrelated to the current rework of changegroup emition. Is it?
>
>
> It's difficult to do now because it isn't clear what each changegroup
> function does or when it is appropriate to call.

(reminder, I'm not challenging that the current API is confusing.)

>     > I think the end state of this series involves a "changegroup emitter"
>     > being fed into some kind of "bundler," preferably as a generator. Using
>     > the cg1unpacker for this purpose is kind of silly, as it necessitates a
>     > util.chunkbuffer. The good news is the "changegroupemitter" as it stands
>     > can easily be represented as both a generator of chunks or a file
>     > handle. So by introducing this object and requiring callers to use it,
>     > we can start transitioning callers to not use util.chunkbuffer without
>     > having to change APIs on changegroup.py. It's much more flexible.
>
>     This paragraph highlight what seems to be the most important thing to
>     address in this discussion. You seems to try to avoid touching the
>     existing changegroup API as much as possible and I do not understand
>     why. My general feeling is that it should be possible to upgrade the
>     existing class to fit your needs, but your seems to think otherwise. I'm
>     curious to find out why as you usually have pretty good reason for doing
>     things?
>
>
> I avoided changing the cg<N>packer and cg<N>unpacker classes because I
> didn't want to write a 40 part series :)

You are talking about 40 patches series as if it was something bad. I 
know you can do patch series that are x5 longer ;-)

> The cg<N>unpacker class in particular is a bit gnarly because it
> performs multiple tasks:
>
> 1) Exposes a file API from a possibly compressed input file object
> (read, seek, tell, close)
> 2) Defines the composition of the changegroup framing protocol
> 3) Converts a file object to a generator of changegroup chunks
> 4) Applies a changegroup to a repo
>
> This class is doing too much and needs to be split. To be honest, I'm
> not sure how. I do know refactoring it will be a long series :)

I think I'm starting to understant your issue here. I'll have to have 
closer look at what all these class are doing. (or we could discuss it 
if you have it all loaded in your head).

>     >     and I not convinced yet your cleanup could not happen with the
>     >     existing objects,
>     >
>     >
>     > The existing functions are horrible and not very useful. As this series
>     > demonstrates, we had something like 6 functions producing either:
>     >
>     > * a generator of changegroup chunks
>     > * a cg<N>unpacker instance
>     >
>     > from
>     >
>     > * an outgoing instance
>     > * a set of bases and heads
>     > * a set common nodes and heads
>     >
>     > sometimes with knobs to control the changegroup version.
>     >
>     > The API was unclear and far too low-level (e.g. sometimes you wanted to
>     > access the number of changesets or the nodes in the changegroup
>     > afterwards and the API hid these from you). To be honest, it wasn't
>     > clear to me that this was how things worked until I wrote the series.
>     > That's not a good sign.
>
>     (Reminder: I'm aware the current code could use improvement)
>
>     You new code seems to keep the 3 options for specifying the outgoing
>     set. So I assume you focussed on getting a simpler return type. Can you
>     elaborate on why do we need multiple return type in the first place?
>
>
> Flexibility for callers. Now they can do things like obtain multiple
> changegroup version representations from the same type. Perhaps you want
> to write a server extension that stashes both a v1 and v2 bundle from
> pushed data. It's now much simpler to do that.

I believe it would be better to clean up the current code before adding 
more advance feature to it. We might end not not needing this in core if 
the API is nice enough

Cheers,

-- 
Pierre-Yves David