[PATCH] issue 1286

Fri Sep 5 03:45:30 CDT 2008

On 05.09.2008 02:07, Petr Kodl wrote:
> You are correct - but I changed the line in walk to access the _foldmap 
> directly via 
> 
> _foldmap.get()
>  
> Hence the extra check to make sure it stays empty. If you always go 
> through normalize, the check is not necessary.

You are right. Got it now.

Thanks for the explanation.

> There is another aspect of the whole walk call. In Hg there are two main 
> operation modes for the walk
> 
> 1) The disk tree is walked via os.listdir  and values compared to 
> something hashed we already have
> - this is used during hg stat

Just a minor point about "something hashed we already have":

I was under the impression that Mercurial does not store hashes in
the dirstate. It only compares file times and file sizes to decide whether
it has to look into the files to check if they have changed compared
to what's in .hg/store.

See http://www.selenic.com/mercurial/wiki/index.cgi/DirState
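
To illustrate what I mean, here is a rough sketch of such a size/mtime
check. It is not Mercurial's actual code: the Entry tuple and the
maybe_changed function are names I made up for this mail, and the real
dirstate stores more than these two fields.

```python
import os
from collections import namedtuple

# Hypothetical stand-in for a dirstate entry (illustration only,
# not Mercurial's real data structure).
Entry = namedtuple("Entry", "size mtime")


def maybe_changed(path, entry):
    """Return True if the file needs a full content comparison.

    Only the size and the modification time are looked at; no hash
    of the file contents is computed or stored.
    """
    st = os.lstat(path)
    return st.st_size != entry.size or int(st.st_mtime) != entry.mtime
```

Only when this returns True would the file contents have to be read and
compared against the stored revision.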

> 2) Something we already have is walked and values compared to HDD tree 
> - this is used during e.g. hg diff, when step two does not iterate and 
> everything is resolved in step 3
>
> For the first case the number of disc accesses can be optimized to be 
> proportional to # of directories instead of number of files.
> On Win32, FindFirst/FindNext supplies the stat values, and on Linux 
> opendir seems to do a good job of caching - not sure about OSX.
> 
> In case 2 this is not an option - we walk the tree in memory in 
> alphabetical order and call lstat on every file, so
> the number of lstat calls is proportional to the number of files.
> 
> I ran some basic benchmarks with large trees looking at Hg/Bzr and based 
> on the numbers coming back it looks like
> bzr is now always using method 1 - assume you know how to walk the tree 
> fast and do the lookup in memory.
> 
> so for bzr the status and diff commands on a clean tree have very similar 
> performance characteristics while in hg there can be substantial 
> variance between the two
> 
> One potential advantage of #2 is that you let the filesystem  take care 
> of the case/unicode folding - but that is about the only one, and 
> assuming the walk always iterates the tree on disc we would always know 
> the correct file name and folding can be always done in memory without 
> further disc IO.
> 
> It would also mean that the code never has to call lstat or exists on 
> individual files - with the exception of file names typed in as command 
> line parameters - which usually means just a handful of files where the 
> check is more expensive.

Nothing to respond to on this from my side yet. Just confirming that I have
read it and that it looks interesting.
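
To check that I understood the two walk modes correctly, here is a small
sketch of both. This is my own simplified reading, not the code in
dirstate.walk: the function names are invented, "known" is just a set of
tracked file names with "/" separators, and ignore rules, case folding
and the .hg directory are left out entirely.

```python
import os


def status_by_disk_walk(root, known):
    """Mode 1: walk the tree on disk, look each name up in memory.

    Disk access is one directory listing per directory; no per-file
    lstat call is made here.
    """
    seen = set()
    unknown = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if rel in known:
                seen.add(rel)
            else:
                unknown.append(rel)
    missing = sorted(set(known) - seen)
    return sorted(unknown), missing


def status_by_known_walk(root, known):
    """Mode 2: walk the known names in sorted order, lstat each one.

    The number of lstat calls equals the number of known files.
    """
    missing = []
    lstat_calls = 0
    for rel in sorted(known):
        lstat_calls += 1
        try:
            os.lstat(os.path.join(root, *rel.split("/")))
        except OSError:
            missing.append(rel)
    return missing, lstat_calls
```

In this sketch mode 2 cannot report unknown files at all, which fits a
diff-like use, while mode 1 finds both unknown and missing files from
the directory listings alone.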