[PATCH 1 of 9 changegroup-apis] exchange: use early return

Wed Aug 3 23:53:34 EDT 2016

On Wed, Aug 3, 2016 at 1:56 AM, Pierre-Yves David <
pierre-yves.david at ens-lyon.org> wrote:

> On 08/02/2016 04:50 AM, Gregory Szorc wrote:
> > On Mon, Aug 1, 2016 at 6:44 PM, Pierre-Yves David
> > <pierre-yves.david at ens-lyon.org <mailto:pierre-yves.david at ens-lyon.org>>
> > wrote:
> >
> >     On 08/01/2016 08:20 PM, Gregory Szorc wrote:
> >     > This is the first half of the series. The second half can be found
> at
> >     >
> https://hg.mozilla.org/users/gszorc_mozilla.com/hg/rev/26eeb1e66e88 if
> >     > someone wants to review the whole thing.
> >
> >     I've made a first review pass on this series. As much as I like the
> idea
> >     of cleaning this up, the proposed implementation makes me
> uncomfortable.
> >     The main points that troubles me are:
> >
> >     * the extra new class is a bit of a mix of (1) outgoing object (2)
> cg1
> >     bundler object. This makes a the boundary of these object more fuzzy
> >
> >
> > Not sure what you mean here. An outgoing object is basically a fancy
> > tuple holding the set of common and missing nodes.
>
> My point here is that Outgoing is a class meant to represent this
> information, across various user. We should aim at increasing it usage
> and visibility instead of hiding it under custom API exposing part of
> it. For example, the 'emitter.changesetcount' should probably just be
> `emiter.outgoing` with outgoing exposed elsewhere (eg: pushoperation)
> programmer will eventually just go "ho, an outgoing object, I know that".
>

I agree that discovery.outgoing is a decent type to represent changesets
that will be "sent" to another location or will represent the contents of a
changegroup. I think most of the code currently in
changegroup.changegroupsubset() should move to the discovery module. I
*don't* think discovery.outgoing instances should gain APIs to emit
changegroups, however. Instead, outgoing instances are things to be passed
to a changegroup generator instructing it how to behave.

>
> > I'm not sure what a "cg1 bundler" is. A non-bundle2 is just a v1
> changegroup with a 6 byte
> > header.
> >
> > If you are referring to returning a cg<N>unpacker instance, I'm not a
> > huge fan of that. I only did that for parity with existing consumers
> > because I didn't want to scope bloat to create a 40+ patch series.
>
> My point is that we already have class emiting changegroup data. I'm
> aware this area is not great. And I trust you in you will to clean up
> all this. However, it is currently unclear to me why we cannot improve
> them to fit your needs instead of introducing a new class.
>

I wanted to establish a common class to represent "pending changegroup
data" so that consumers all operated on a common type and could access
changegroup data any way they saw fit *and* in an explicit manner.. I view
this as simpler than requiring clients to hold onto a reference to sets of
nodes (or an outgoing instance) along with a generator or chunkbuffer. The
explicit part is important: the changegroup functions today have very
opaque return values. e.g. sometimes a "cg" variable refers to a generator
of chunks. Sometimes it is a chunkbuffer. By having a class where you call
"getX()" it is much easier to understand what the code is doing.

>
> > I do intend to do some follow-up work here because our handling of
> > changegroup generation is just bonkers. For example, sending a bundle 2
> > to the wire protocol with the current code involves:
> >
> > 1. Construct a cg<N>packer and call generate() to obtain a generator of
> > changegroup chunks
> > 2. Store a reference to the generator on a bundle2 part
> > 3. Convert generator to a util.chunkbuffer (a file-like object) as part
> > of bundlepart._payloadchunks() when serializing the bundle2 part
> > 4. Emit a generator of 4k chunks because that's what the
> > bundlepart.getchunks() does
> > 5. Feed this generator of chunks into a util.chunkbuffer holding the
> > entire bundle2 data (in exchange.getbundle())
> > 6. Construct a zlib compressor
> > 7. read(4096) from the util.chunkbuffer into the zlib compressor
> > 8. Emit a generator of zlib compressed chunks to be sent to WSGI
>
> To make sure I'm clear, I'm not challenging you diagnostic that there is
> improvement to be made. In your example. there is multiple things
> unrelated to the changegroup generation. I'm usually a proponent of
> improving existing layers instead of introducing a new ones (does not
> mean you cannot convince me that you approach here is right).
>

I argue I improved the changegroup API layer by consolidating 8 highly
redundant functions into a single new class that does everything they did
and more :)

>
> > util.chunkbuffer overhead has appeared in profiling results when cloning
> > Firefox. The thing is that none of these util.chunkbuffer are really
> > necessary for the wire protocol case: we could feed a generator of
> > chunks of various sizes into zlib then emit whatever chunks it spits
> > out, possibly normalizing them to equal sizes if we really wanted.
>
> Yep, there is a lot of small improvement we could do. As you point out
> the step (5) (6) could probably be easily simplified. But this seems
> unrelated to the current rework of changegroup emition. Is it?
>

It's difficult to do now because it isn't clear what each changegroup
function does or when it is appropriate to call.

>
>
> > I think the end state of this series involves a "changegroup emitter"
> > being fed into some kind of "bundler," preferably as a generator. Using
> > the cg1unpacker for this purpose is kind of silly, as it necessitates a
> > util.chunkbuffer. The good news is the "changegroupemitter" as it stands
> > can easily be represented as both a generator of chunks or a file
> > handle. So by introducing this object and requiring callers to use it,
> > we can start transitioning callers to not use util.chunkbuffer without
> > having to change APIs on changegroup.py. It's much more flexible.
>
> This paragraph highlight what seems to be the most important thing to
> address in this discussion. You seems to try to avoid touching the
> existing changegroup API as much as possible and I do not understand
> why. My general feeling is that it should be possible to upgrade the
> existing class to fit your needs, but your seems to think otherwise. I'm
> curious to find out why as you usually have pretty good reason for doing
> things?
>

I avoided changing the cg<N>packer and cg<N>unpacker classes because I
didn't want to write a 40 part series :)

The cg<N>unpacker class in particular is a bit gnarly because it performs
multiple tasks:

1) Exposes a file API from a possibly compressed input file object (read,
seek, tell, close)
2) Defines the composition of the changegroup framing protocol
3) Converts a file object to a generator of changegroup chunks
4) Applies a changegroup to a repo

This class is doing too much and needs to be split. To be honest, I'm not
sure how. I do know refactoring it will be a long series :)

>
> >     and I not convinced yet your cleanup could not happen with the
> >     existing objects,
> >
> >
> > The existing functions are horrible and not very useful. As this series
> > demonstrates, we had something like 6 functions producing either:
> >
> > * a generator of changegroup chunks
> > * a cg<N>unpacker instance
> >
> > from
> >
> > * an outgoing instance
> > * a set of bases and heads
> > * a set common nodes and heads
> >
> > sometimes with knobs to control the changegroup version.
> >
> > The API was unclear and far too low-level (e.g. sometimes you wanted to
> > access the number of changesets or the nodes in the changegroup
> > afterwards and the API hid these from you). To be honest, it wasn't
> > clear to me that this was how things worked until I wrote the series.
> > That's not a good sign.
>
> (Reminder: I'm aware the current code could use improvement)
>
> You new code seems to keep the 3 options for specifying the outgoing
> set. So I assume you focussed on getting a simpler return type. Can you
> elaborate on why do we need multiple return type in the first place?
>

Flexibility for callers. Now they can do things like obtain multiple
changegroup version representations from the same type. Perhaps you want to
write a server extension that stashes both a v1 and v2 bundle from pushed
data. It's now much simpler to do that.

>
> Also, could we simplify the other side of this. For example we could
> just keep the outgoing instance option to keep this layer simpler.
>
> > The new API solves makes thing simpler by allowing you to construct a
> > stateful object to representing a generated changegroup and then get a
> > changegroup representation various ways. It is much more explicit.
>
> "Stateful" seems to be an important key to lift my confusion. Can you
> elaborate?
>

Basically an entity that lets you obtain a representation of a changegroup
N ways without having to muck about low-level changegroup.py functions.

>
> >     * many @class method usage while our coding style usually avoid them
> >     along other "class as namespace" usage (as per Matt taste),
> >
> >
> > That's fine. These can be pulled out into module-level functions. e.g.
> > emitterfromrootsandheads(repo, roots, heads) -> changegroupemitter
> >
> >
> >
> >     * the
> >
>  superlongnameforeverything.thatcombinewithclassnamspacing.foroverninethousand
> >     charline we could probably me more compact using a abbreviation for
> >     common parts as we do elsewhere in the code.
> >
> >
> > I agree the names are a bit long. But I'm not a fan of abbreviations
> > because they make readability hard, especially for those who don't speak
> > English natively.
>
> We use common abbreviation in the Mercurial codebase in many place: rev,
> ctx, cg, obs, etc… I'm non-native speaker and code well with that. We
> also have a wiki page with some details on those. We could update it and
> point people to it more.
>
> https://www.mercurial-scm.org/wiki/CodingStyle#Naming_conventions
>
> > I think remaining "changegroupemitter" to "emitter" (because
> > "changegroup" is redundant with the module name) and moving the
> > @classmethods to module functions will make things much better. e.g.
> >
> > emitter = changegroup.changegroupemitter.fromheadsandcommon(repo, heads,
> common)
> >
> > emitter = changegroup.emitterfromheadsandcommon(repo, heads, common)
>
> I agree, that would be a good step forward
>
> >
> > Or if wanted to write a complex __init__ function, we could have it
> > accept all arguments. But I don't like complexity and prefer separate
> > "constructor" functions.
>
> It seems like we have 4 possible arguments: outgoing, common, base,
> heads. The complexity seems manageable. We could maybe have a factoring
> function for outgoing working that way and just delegate to it in the
> constructor.
>
> Cheers,
>
> --
> Pierre-Yves David
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.mercurial-scm.org/pipermail/mercurial-devel/attachments/20160803/a1e2210a/attachment.html>