[PATCH] largefiles: improve status performance

Tue Jun 26 16:34:14 CDT 2012

Levi Bard wrote, On 06/24/2012 02:27 PM:
> # HG changeset patch
> # User Levi Bard <levi at unity3d.com>
> # Date 1340540784 -7200
> # Node ID ffc9863d33de86841aa0d93364fa071854d04ae5
> # Parent  efd2e14f72353626355dc82465bdf23bf6416aa2
> largefiles: improve status performance
>
> Synchronize largefiles' mtimes when updating them.
> This greatly reduces the amount of superfluous largefile hashing status does.
>
> Benchmarked on a 1GB repo with 54000 normal files and 113 largefiles:
> Without largefiles: 600ms
> With largefiles - before: 2500ms
> With largefiles - after: 700ms
>
> diff -r efd2e14f7235 -r ffc9863d33de hgext/largefiles/lfcommands.py
> --- a/hgext/largefiles/lfcommands.py	Thu Jun 21 17:37:02 2012 -0500
> +++ b/hgext/largefiles/lfcommands.py	Sun Jun 24 14:26:24 2012 +0200
> @@ -479,6 +479,16 @@
>                   # recognition that such cache missing files are REMOVED.
>                   lfdirstate.normallookup(lfile)
>                   return None # don't try to set the mode
> +            else:
> +                try:
> +                    # Synchronize largefile's last modified time
> +                    # with dirstate

Shouldn't we do it the other way around and update the dirstate with the 
actual stat info from the file? That is how we usually do in 
dirstate.normal & friends. That will make the timestamps race-free and 
avoid confusing build systems. Or is your scheme also sound no matter 
what happens?

We usually try very hard to make sure that edits done in the same second 
as the update never are lost. The best (but unfortunate) option for 
largefiles might be to store the the actual timestamp in dirstate 
immediately after the update and sleep sleep until the 'now' timestamp 
has changed - at least if a significant amount of files has been updated 
and it thus would be expensive to check their content.

> +                    state, mode, size, time = lfdirstate.dmap(lfile)
> +                    if -1 != time:
> +                        # Don't force-synchronize -1 "dirty" mtime

FWIW: I don't understand that comment and what role the -1 is playing in 
this context - perhaps because dmap is leaking something that used to be 
internal to the dirstate.

> +                        os.utime(abslfile, (-1, time))

As far as I remember setting the timestamp is quite likely to fail on 
some OS's and file systems. If you go this route and update the 
timestamp then we might have to handle errors like we do in the only 
other use of os.utime.

And why this -1 for atime?

> +                except KeyError:

This is only in order to catch errors inside dmap? That's a long way to 
throw a generic exception. But ok ...

> +                    pass
>               ret = 1
>           mode = os.stat(absstandin).st_mode
>           if mode != os.stat(abslfile).st_mode:
> diff -r efd2e14f7235 -r ffc9863d33de hgext/largefiles/lfutil.py
> --- a/hgext/largefiles/lfutil.py	Thu Jun 21 17:37:02 2012 -0500
> +++ b/hgext/largefiles/lfutil.py	Sun Jun 24 14:26:24 2012 +0200
> @@ -140,6 +140,8 @@
>           return super(largefilesdirstate, self).forget(unixpath(f))
>       def normallookup(self, f):
>           return super(largefilesdirstate, self).normallookup(unixpath(f))
> +    def dmap(self, f):
> +        return self._map[f]

It seems like it would be better to do all the work here, instead of 
exposing a leaky _map getter.

Tests ...

This functionality can perhaps be tested with hg debugstate --nodates 
(for example after the first hg update in test-largefiles.t) and verify 
that a 'st' after 1 second sleep doesn't change anything in the 
debugstate output.

It could perhaps also make sense to add a debug notice every time a 
largefile is read to check the hash?

/Mads