[PATCH 2 of 2] Reorder rename operations to minimise risk of leaving repository in unknown state

Adrian Buehlmann adrian at cadifra.com
Sat Oct 3 04:51:04 CDT 2009


On 03.10.2009 08:05, Sune Foldager wrote:
> Laurens Holst wrote:
>>          temp = tempname(dst)
>>          os.rename(dst, temp)
>> +        os.rename(src, dst)
>>          os.unlink(temp)
>> -        os.rename(src, dst)
>>
> 
> Since I sorta wrote the code you're changing above and since Windows
> integrity has been pretty sensitive to this code in the various versions
> we tried in the past, I am bit worried about just changing it but... did
> you see if this actually helps anything?
> 
> If it does, I would analyse it as follows: os.rename creates, as far as
> the stupid virus scanner is concerned, a new file, which is then opened
> for scanning. The file 'src', having not been created right this
> instant, might have been scanned in that past and already closed now, so
> the rename to dst can go through. Then the unlink might fail, of course.
> 
> ...but I don't see how this code in any way prevents unlink from
> failing, which would in any case prevent the transaction from
> completing. But I guess you're mainly concerned about a corrupt dirstate
> file?

This was my impression as well. The unlink will fail anyway and
abort program execution.

The change seems to try to reduce one kind of horrible damage
to some slightly lesser kind of horrible damage -- provided it does not
create new horrible side effects we have overlooked (which is always the
danger if we change code).

Note that the damage reported in issue 1840 is most likely caused by
a specific virus scanner hard-locking files at random times.

Grauw on Wed, 23 Sep 2009 19:35:31 +0200 (in TortoiseHg issue 580):
> Either way, it is one thing for an action to fail because of concurrent
> file access. It is yet a wholly different thing if that failure causes
> my whole repository (or at least my working copy’s relation to it) to
> be trashed…

If I try to take on Laurens' (=Grauw) hat, assuming I had used such a
horrible scanner together with Mercurial, I would probably even go as
far as saying that if my file access is broken that much, then
I would probably try not to abort the program if the "only" problem is
that deleting a temporary file fails.

Maybe doing something inelegant as (comments omitted, pasting full
modified function):

def rename(src, dst):
    """forcibly rename a file"""
    try:
        os.rename(src, dst)
    except OSError, err: # FIXME: check err (EEXIST ?)

        def tempname(prefix):
            for tries in xrange(10):
                temp = '%s-%08x' % (prefix, random.randint(0, 0xffffffff))
                if not os.path.exists(temp):
                    return temp
            raise IOError, (errno.EEXIST, "No usable temporary filename found")

        temp = tempname(dst)
        os.rename(dst, temp)
        try:
            os.unlink(temp)
        except:
            print _("fatal error: failed to delete temporary '%s'") % temp
        os.rename(src, dst)

Of course, this might leave tons of temporaries sprinkled
all over the repo, if I continue to use mercurial in that hostile
environment. And I might get hit by the "No usable temporary filename
found" sooner or later, which is a hard abort again.

But at least the repository would still be usable in that
specific failure case. I could then ask on IRC or the ml what that
fatal error means, then learn that it is most likely caused by my
broken virus scanner, then shut down the scanner, make a clone of
the repo and continue with the clean clone (without the scanner or
a hopefully better scanner).

Of course, it still doesn't fix the basic problem that the scanner
is incompatible for usage with mercurial (and probably many other
software as well).









More information about the Mercurial-devel mailing list