[PATCH 1 of 2 hg-bfiles] Refactored xxxstore.get method

Greg Ward greg-hg at gerg.ca
Sun Nov 22 21:10:21 CST 2009


On Thu, Nov 19, 2009 at 3:45 PM,  <david.douard at logilab.fr> wrote:
> # HG changeset patch
> # User David Douard <david.douard at logilab.fr>
> # Date 1258136363 -3600
> # Node ID f51538471e58f2568535505a8d37b87cc41a1707
> # Parent  baea47f3f1139faeda56e61d3f7963cb311da8d3
> Refactored xxxstore.get method
>
> basestore.get now implements the actual logic of validate (compute hash) and copy files to their correct destinations, using a low level recvfile method, which takes care of doing the real job of retrieveing one file from the store (redefined in each real store class).

This sounds promising.  Looks like a good start.

> +class filestream(object):
> +    '''An iterator on an open file to be used in _copy_and_hash.
> +    Close file on deletion.'''
> +
> +    def __init__(self, infile, blocksize=128*1024):
> +        self.infile = infile
> +        self.blocksize = blocksize
> +
> +    def __del__(self):
> +        self.infile.close()

Danger #1: you should not close what you did not open; in other words,
whoever allocates a resource should be responsible for closing it.

Oh.  Wait.  You're just following this anti-pattern from my code.
Ooooops.  I guess I'd better fix it.

Danger #2: it's not good to depend on __del__() being called
deterministically.  It's true today with CPython, but it's never been
true with Jython.  And who knows what the future will bring.  Much
better to have an explicit close() method... except that this class
does not allocate any resources, so has nothing to close.

> +    def __iter__(self):
> +        while True:
> +            data = self.infile.read(self.blocksize)
> +            if not data:
> +                break
> +            yield data

Now here's the good stuff.  But does this really need a class?  Can't
we refactor that pattern with a standalone generator?  [...hacks for a
few minutes....]

Yes!  How does this look:

"""
 def _copy_and_hash(infile, outfile):
     '''Read bytes from infile, write them to outfile, computing the
     SHA-1 hash of the data along the way.  Close both infile and
     outfile.  Return the binary hash.'''
     hasher = hashlib.sha1()
-    while True:
-        data = infile.read(128*1024)
-        if not data:
-            break
-
+    for data in _blockstream(infile):
         hasher.update(data)
         outfile.write(data)

-    infile.close()
     outfile.close()
"""

where _blockstream() is:

def _blockstream(infile, blocksize=128*1024):
    """Generator that yields blocks of data from infile."""
    while True:
        data = infile.read(blocksize)
        if not data:
            break
        yield data
    infile.close()

I like it.  I think I might commit this on its own; it's a nice little
refactoring that should make your patch a little smaller.

> @@ -771,15 +783,51 @@
[...]
> +            hhash = binascii.hexlify(bhash)
> +            if hhash != hash:
> +                ui.warn('%s: data corruption (expected %s, got %s)\n'
> +                        % (filename, hash, hhash))
> +                os.remove(tmpfilename)
> +                missing.append((filename, hash))

Careful!  This means your refactoring isn't strictly a refactoring, as
you're changing behaviour.  Don't get me wrong, it's a *good* change
and I'd like to see it go in, but I think it should be a separate
changeset.  Refactor first and then add functionality, if at all
possible.  (And behaviour changes require test changes; refactoring by
definition does not require touching the tests.)

> +    def recvfile(self, tmpfile, filename, hash):

I think this should be called _getfile(), since's private between
basestore and its subclasses.

> +    def recvfile(self, tmpfile, filename, hash):
> +        try:
> +            infile = open(os.path.join(self.url, filename, hash), 'rb')
> +        except IOError, err:
> +            tmpfile.close()
> +            raise error.RepoError('cannot open file')

Why translate IOError to RepoError?  In fact, why close tmpfile?  Why
not just let the exception propagate; make the contract for recvfile()
(or _getfile(), whatever we call it) say "raise IOError on error",
then it's up to basestore.get() to handle it.

Hmmm: just IOError might not be enough, as we need a way to
distinguish between "missing file" and "ssh/http server is broken".
We should keep going if one file is missing, but there is no point in
repeatedly hammering on a broken server.

I'm going to commit and push something like the above _blockstream()
patch.  Can you rebase and resend?  Sorry about the conflict, but I
like my way of factoring out the loop in _copy_and_hash() better.  ;-)

Greg


More information about the Mercurial-devel mailing list