[PATCH RFC] bundle2: add a part for pointers to changegroups

Pierre-Yves David pierre-yves.david at ens-lyon.org
Tue Sep 2 14:11:17 CDT 2014



On 09/02/2014 07:21 AM, Mike Hommey wrote:
> # HG changeset patch
> # User Mike Hommey <mh at glandium.org>
> # Date 1409632006 -32400
> #      Tue Sep 02 13:26:46 2014 +0900
> # Node ID 7a91b38219fb3f775eb5a710436a0ab9baafe650
> # Parent  34de1241da516e7896394893e66c7ac8ed9f9787
> bundle2: add a part for pointers to changegroups

(you got the summary line right, well played!)

> Bundle2 opens doors to advanced features allowing to reduce load on
> mercurial servers, and improve clone experience for users on unstable or
> slow networks.
>
> For instance, it could be possible to pre-generate a bundle of a
> repository, and give a pointer to it to clients cloning the repository,
> followed by another changegroup with the remainder. For significantly
> big repositories, this could come as several base bundles with e.g. 10k
> changesets, which, combined with checkpoints, would prevent users with
> flaky networks from starting over any time their connection fails.

For other readers, this changesets does -not- cover the "checkpoint" 
concept to rescue people that live in area were internet is bad and 
unreliable (eg: San Francisco Bay Area)). It just refers to it a 
possible development.

> While the server-side support for those features doesn't exist yet, it
> is preferable to have client-side support for this early-on.

+1 on having widespread client support. It is important to have any 
clients taking advantage of the effort for setting such things on the 
server.

> Note the patch to the unit test doesn't apply without fuzz unless you
> apply "bundle2: Add tests for multiple changegroup parts" first.

By chance, I'll point in the same direction than for the previous patch. 
Those tests should go into a new tests files.

> There are a few details to figure before this can be final:
 >
> - The current patch unconditionally uses hashlib for digest computation.
>    I think it's sensible to have support for multiple digests (and to
>    check all of them if that's what the server advertizes). The obvious
>    problem is that hashlib is not supported on python 2.4. Python 2.4
>    only supports md5 and sha1 out-of-the box, so we probably should just
>    stick with those two, at least until python 2.4 support is phased out.

The bundle2 capabilities is a key plus a list. The list can be used to 
transmit the supported hash formats.

The hash themself can probably be a dedicated capability this is 
server//client wide. This could even be a top level capability since 
stuff outside of bundle2 could use that (but using a top level 
capability make it harder to access from the bundler)

> - Relatedly, it might be worthy adding the changegroup size along the
>    checksums.

+1

> - Bikeshed on the part name. I'm not entirely fond of "changegroupurl",
>    and see next point.

I can suggest "seeother-changegroup" as a way to get you to love your 
original proposal.

> - I can think of another non-simple-url way to point to some other
>    changegroup, that might be worth adding support for: a pointer to one
>    or more changesets in a given repository. So instead of downloading a
>    pre-generated bundle via http, the client would negociate a getbundle
>    with that other repo. It's not entirely obvious how this could be
>    useful, but it would allow us (Mozilla) to store, on S3, bundles of
>    "patch sets" for our try server or review board, with a reference to
>    the base it needs, which then can be applied on an existing clone of
>    e.g. mozilla-central without having to use a full bundle (it would
>    only have to get what it doesn't have) The question then becomes
>    whether to create a new part for that, or to use the same part for
>    both use cases and deal with the difference in the format.

This is a fairly interresting idea but probably deserve to be in a 
different part. We should also consider having the same kind of 
mechanism to download bundle2 archive (yes bundle2 inside a bundle2)

> diff --git a/mercurial/bundle2.py b/mercurial/bundle2.py
> --- a/mercurial/bundle2.py
> +++ b/mercurial/bundle2.py
> @@ -137,22 +137,25 @@ contains any uppercase char it is consid
>   known for a Mandatory part, the process is aborted and an exception is raised.
>   If the part is advisory and no handler is known, the part is ignored. When the
>   process is aborted, the full bundle is still read from the stream to keep the
>   channel usable. But none of the part read from an abort are processed. In the
>   future, dropping the stream may become an option for channel we do not care to
>   preserve.
>   """
>
> +import exchange

You cannot do that. exchange is importing bundle2 already so you are 
creating a cycle. (One of our test is detecting such cycle and would 
have yelled at you if ran)

>   import util
>   import struct
>   import urllib
>   import string
>   import obsolete
>   import pushkey
> +import urllib2
> +import urlparse

We should maybe start to have distinct modules for the bundle mechanism 
itself and the bundle2 parts. I'll think about it.

>
>   import changegroup, error
>   from i18n import _
>
>   _pack = struct.pack
>   _unpack = struct.unpack
>
>   _magicstring = 'HG2X'
> @@ -815,35 +818,84 @@ def handlechangegroup(op, inpart):
>       inflicted to any end-user.
>       """
>       # Make sure we trigger a transaction creation
>       #
>       # The addchangegroup function will get a transaction object by itself, but
>       # we need to make sure we trigger the creation of a transaction object used
>       # for the whole processing scope.
>       op.gettransaction()
> -    cg = changegroup.unbundle10(inpart, 'UN')
> +    cg = exchange.readbundle(op.repo.ui, inpart, None)
>       ret = changegroup.addchangegroup(op.repo, cg, 'bundle2', 'bundle2')
>       op.records.add('changegroup', {'return': ret})
>       if op.reply is not None:
>           # This is definitly not the final form of this
>           # return. But one need to start somewhere.
>           part = op.reply.newpart('b2x:reply:changegroup')
>           part.addparam('in-reply-to', str(inpart.id), mandatory=False)
>           part.addparam('return', '%i' % ret, mandatory=False)
>       assert not inpart.read()
>
> +class digestchecker(object):
> +    """File handle wrapper that additionally checks content against hash
> +    digests.
> +
> +        d = digestchecker(fh, {'md5': '...'})
> +
> +    When multiple digests are given, all of them are verified.
> +    """
> +    def __init__(self, fh, digests):
> +        self._fh = fh
> +        self._hashes = {}
> +        self._digests = dict(digests)
> +        for k in digests:
> +            import hashlib
> +            assert hasattr(hashlib, k)
> +            self._hashes[k] = getattr(hashlib, k)()
> +
> +    def read(self, len=-1):
> +        content = self._fh.read(len)
> +        for h in self._hashes.values():
> +            h.update(content)
> +        return content
> +
> +    def validate(self):
> +        for k, v in self._digests.items():
> +            if v != self._hashes[k].hexdigest():
> +                return False
> +        return True
> +
> + at parthandler('b2x:changegroupurl')
> +def handlechangegroupurl(op, inpart):
> +    """apply a changegroup on the repo, given a url to it
> +    """
> +    lines = inpart.read().splitlines()
> +    url = lines.pop(0)
> +    digests = {}
> +    for line in lines:
> +        assert ':' in line
> +        k, v = line.split(':', 1)
> +        digests[k.strip()] = v.strip()
> +
> +    parsed_url = urlparse.urlparse(url)
> +    assert parsed_url.scheme in ('http', 'https')
> +
> +    real_part = digestchecker(urllib2.urlopen(url), digests)
> +    handlechangegroup(op, real_part)
> +    if not real_part.validate():
> +        raise error.Abort('Changegroup at %s is corrupted' % url)
> +
>   @parthandler('b2x:reply:changegroup', ('return', 'in-reply-to'))
> -def handlechangegroup(op, inpart):
> +def handlereplychangegroup(op, inpart):

Good catch, should go in another changeset.

>       ret = int(inpart.params['return'])
>       replyto = int(inpart.params['in-reply-to'])
>       op.records.add('changegroup', {'return': ret}, replyto)
>
>   @parthandler('b2x:check:heads')
> -def handlechangegroup(op, inpart):
> +def handlecheckheads(op, inpart):

Good catches, should go in another changeset.

>       """check that head of the repo did not change
>
>       This is used to detect a push race when using unbundle.
>       This replaces the "heads" argument of unbundle."""
>       h = inpart.read(20)
>       heads = []
>       while len(h) == 20:
>           heads.append(h)
> diff --git a/tests/test-bundle2.t b/tests/test-bundle2.t
> --- a/tests/test-bundle2.t
> +++ b/tests/test-bundle2.t

Same feedback as for the other one, should go in another test files. The 
test plan itself sounds fine.

> @@ -71,16 +71,17 @@ Create an extension to test bundle2 API
>     >
>     > @command('bundle2',
>     >          [('', 'param', [], 'stream level parameter'),
>     >           ('', 'unknown', False, 'include an unknown mandatory part in the bundle'),
>     >           ('', 'unknownparams', False, 'include an unknown part parameters in the bundle'),
>     >           ('', 'parts', False, 'include some arbitrary parts to the bundle'),
>     >           ('', 'reply', False, 'produce a reply bundle'),
>     >           ('', 'pushrace', False, 'includes a check:head part with unknown nodes'),
> +  >           ('', 'changegroupurl', '', 'includes a b2x:changegroupurl part'),
>     >           ('r', 'rev', [], 'includes those changeset in the bundle'),],
>     >          '[OUTPUTFILE]')
>     > def cmdbundle2(ui, repo, path=None, **opts):
>     >     """write a bundle2 container on standard ouput"""
>     >     bundler = bundle2.bundle20(ui)
>     >     for p in opts['param']:
>     >         p = p.split('=', 1)
>     >         try:
> @@ -126,16 +127,18 @@ Create an extension to test bundle2 API
>     >        # advisory known part with unknown mandatory param
>     >        bundler.newpart('test:song', [('randomparam','')])
>     >     if opts['unknown']:
>     >        bundler.newpart('test:UNKNOWN', data='some random content')
>     >     if opts['unknownparams']:
>     >        bundler.newpart('test:SONG', [('randomparams', '')])
>     >     if opts['parts']:
>     >        bundler.newpart('test:ping')
> +  >     if opts['changegroupurl']:
> +  >        bundler.newpart('b2x:changegroupurl', data=opts['changegroupurl'])
>     >
>     >     if path is None:
>     >        file = sys.stdout
>     >     else:
>     >         file = open(path, 'wb')
>     >
>     >     for chunk in bundler.getchunks():
>     >         file.write(chunk)
> @@ -857,17 +860,91 @@ with reply
>     added 0 changesets with 0 changes to 2 files
>     \x00\x00\x00\x00\x003\x15b2x:reply:changegroup\x00\x00\x00\x02\x00\x02\x0b\x01\x06\x01in-reply-to2return1\x00\x00\x00\x00\x00\x1f (esc)
>     b2x:output\x00\x00\x00\x03\x00\x01\x0b\x01in-reply-to2\x00\x00\x00dadding changesets (esc)
>     adding manifests
>     adding file changes
>     added 0 changesets with 0 changes to 2 files
>     \x00\x00\x00\x00\x00\x00 (no-eol) (esc)
>
> +Test changegroup urls
> +
> +  $ hg bundle --rev '8+7+5' --base 6 ../changegroup.hg
> +  3 changesets found
> +
>     $ cd ..
> +  $ cat > httpserver.py << EOF
> +  > from SocketServer import ThreadingMixIn
> +  > import BaseHTTPServer
> +  > import os
> +  > import SimpleHTTPServer
> +  > import threading
> +  > class HTTPRequestHandler(SimpleHTTPServer.SimpleHTTPRequestHandler):
> +  >     def finish(self):
> +  >         self.server.shutdown()
> +  > class HTTPServer(ThreadingMixIn, BaseHTTPServer.HTTPServer): pass
> +  > def serve_one():
> +  >     server = HTTPServer(('localhost', $HGPORT), HTTPRequestHandler)
> +  >     server.serve_forever()
> +  > open('httpserver.pid', 'w').write('%d\n' % os.getpid())
> +  > serve_one()
> +  > EOF
> +  $ cat > wait.py << EOF
> +  > import time
> +  > while True:
> +  >     try:
> +  >         if '\n' in open('httpserver.pid', 'r').read():
> +  >             break
> +  >     except IOError:
> +  >         pass
> +  >     time.sleep(0.2)
> +  > EOF
> +  $ (cd main && hg bundle2 --changegroupurl http://localhost:$HGPORT/changegroup.hg ../changegroupurl.hg2)
> +  $ python httpserver.py 2> /dev/null &
> +  $ python wait.py
> +  $ (cd main && hg unbundle2 < ../changegroupurl.hg2)
> +  adding changesets
> +  adding manifests
> +  adding file changes
> +  added 0 changesets with 0 changes to 2 files
> +  0 unread bytes
> +  addchangegroup return: 1
> +  $ "$TESTDIR/killdaemons.py" httpserver.pid
> +  $ rm httpserver.pid
> +
> +Test with corruption check
> +
> +  $ (cd main && hg bundle2 --changegroupurl "`echo http://localhost:$HGPORT/changegroup.hg; echo md5: 44e5b1c7b57916f97f2bcc6b4bb55d0e; echo sha1: 51d32383164403798628b8bf3e066cbce451f694`" ../changegroupurl.hg2)
> +  $ python httpserver.py 2> /dev/null &
> +  $ python wait.py
> +  $ (cd main && hg unbundle2 < ../changegroupurl.hg2)
> +  adding changesets
> +  adding manifests
> +  adding file changes
> +  added 0 changesets with 0 changes to 2 files
> +  0 unread bytes
> +  addchangegroup return: 1
> +  $ "$TESTDIR/killdaemons.py" httpserver.pid
> +
> +Test with failed corruption check
> +
> +  $ (cd main && hg bundle2 --changegroupurl "`echo http://localhost:$HGPORT/changegroup.hg; echo md5: 00000000000000000000000000000000`" ../changegroupurl.hg2)
> +  $ python httpserver.py 2> /dev/null &
> +  $ python wait.py
> +  $ (cd main && hg unbundle2 < ../changegroupurl.hg2)
> +  adding changesets
> +  adding manifests
> +  adding file changes
> +  added 0 changesets with 0 changes to 2 files
> +  transaction abort!
> +  rollback completed
> +  0 unread bytes
> +  abort: Changegroup at http://localhost:$HGPORT/changegroup.hg is corrupted
> +  [255]
> +  $ "$TESTDIR/killdaemons.py" httpserver.pid
>
>   Real world exchange
>   =====================
>
>   Add more obsolescence information
>
>     $ hg -R main debugobsolete -d '0 0' 1111111111111111111111111111111111111111 `getmainid 9520eea781bc`
>     $ hg -R main debugobsolete -d '0 0' 2222222222222222222222222222222222222222 `getmainid 24b6387c8c8c`
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
>

-- 
Pierre-Yves David


More information about the Mercurial-devel mailing list