[RFC] New generic filesystem walker for mercurial
Benoit Boissinot
bboissin at gmail.com
Sat Jan 6 06:31:18 CST 2007
On 1/3/07, Emanuele Aina <faina.mail at tiscali.it> wrote:
> I've started to write a filesystem walker for mercurial, trying to
> separate the filesystem walker from the mercurial specific code.
>
> [snip]
>
> I have done some tests on a kernel repo, mostly with the filesystem
> cache full, so those are relevant only for the CPU usage (but I plan to
> do some tests with the cache emptied).
>
> Currently it is a tiny tiny bit faster than the current 'hg debugwalk'
> (1.27s instead of 1.39s on a kernel repo :), but that is probably due to
> the additional code in mercurial (option parsing, etc.).
>
Please plug your code into commands.py to do a fair comparison
(the matcher code might make some difference too). You might want to
compare strace output (especially the stat syscalls).
> Even though my main interest is not performance, I want to make sure
> there aren't any big regressions on this front.
>
> I have now some questions:
>
> - mercurial currently walks the filesystem (dirstate.py:statwalk())
> sorting the contents of each directory, visiting subdirectories in order
> and then sorting the whole results.
>
> This means that with:
> root
> `- aaa.h
> `- foo.h
> `- foo
> | `- bar.h
> | `- baz.h
> `- zzz.h
> mercurial gives:
> aaa.h
> foo.h
> foo/bar.h
> foo/baz.h
> zzz.h
>
> The point is that 'foo' (the directory) should be put before 'foo.h'
> but, as the ordering considers the whole path, it is treated as 'foo/'.
>
> Is this intentional?
Yes, as a thread explained this summer (the point is to always read
the files in the same order).
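
To illustrate the ordering (a minimal Python sketch, not the code in
dirstate.py; the variable names are made up for the example): sorting
whole paths compares them character by character, and '.' sorts just
before '/', which is why 'foo.h' ends up ahead of 'foo/bar.h'.

```python
# Paths from the example tree in the quoted message.
paths = ["zzz.h", "foo/baz.h", "foo.h", "aaa.h", "foo/bar.h"]

# Whole-path sort: '.' (0x2e) sorts just before '/' (0x2f),
# so 'foo.h' comes before 'foo/bar.h'.
by_full_path = sorted(paths)

# Sorting by path components instead would put the directory 'foo'
# (and everything under it) before the file 'foo.h'.
by_components = sorted(paths, key=lambda p: p.split("/"))

print(by_full_path)
print(by_components)
```

The first list matches the output quoted above; the second is the
alternative ordering being asked about.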
>
> If not, should this be preserved, or can I put 'foo' before 'foo.h'?
>
> Can I switch to a topdown walk, listing results for the topmost
> files before (aaa.h, zzz.h) and then descending the tree (foo/bar.h
> and foo/baz.h)?
>
> In the code I've left both walkers: it should be sufficient to switch
> from walk_inorder() to walk_topdown() for the generic filesystem
> walker.
>
inorder is preferable
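
For reference, the difference between the two strategies can be
sketched as follows. This is only an illustration: I am assuming what
walk_inorder() and walk_topdown() do from the description above, and a
nested dict stands in for the filesystem (a dict is a directory, None
is a file).

```python
def walk_inorder(tree, prefix=""):
    """Visit sorted entries, descending into a directory when it is met."""
    for name in sorted(tree):
        child = tree[name]
        if isinstance(child, dict):
            yield from walk_inorder(child, prefix + name + "/")
        else:
            yield prefix + name


def walk_topdown(tree, prefix=""):
    """Yield the files of a directory first, then descend into subdirs."""
    names = sorted(tree)
    for name in names:
        if not isinstance(tree[name], dict):
            yield prefix + name
    for name in names:
        if isinstance(tree[name], dict):
            yield from walk_topdown(tree[name], prefix + name + "/")


# The example tree from the quoted message.
tree = {
    "aaa.h": None,
    "foo.h": None,
    "foo": {"bar.h": None, "baz.h": None},
    "zzz.h": None,
}

inorder = list(walk_inorder(tree))
topdown = list(walk_topdown(tree))
print(inorder)
print(topdown)
```

Sorting either result by full path gives the same final list, so the
choice mostly affects the order in which the files are touched.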
> - the .hgignore pattern cannot be used to prune a subtree from the walk
> because ^foo/bar$ can match the foo/bar directory but we cannot skip
> it as the contained files (e.g. foo/bar/baz) are not matched by the
> pattern, right?
>
> Does anyone have any idea about how to do this (if feasible at all)?
>
We should be able to prune a subtree; I think we check whether a
directory matches before visiting it.
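
A rough sketch of what I mean, assuming the walker tests each directory
path against the ignore function before descending (this is not the
statwalk() code; walk() and the dict-based tree are made up for the
example):

```python
import re


def walk(tree, ignore, prefix=""):
    """Walk a nested dict (dict = directory, None = file), pruning ignored paths."""
    for name in sorted(tree):
        path = prefix + name
        if ignore(path):
            # Ignored file, or ignored directory: in the latter case the
            # whole subtree is skipped without looking at its contents.
            continue
        child = tree[name]
        if isinstance(child, dict):
            yield from walk(child, ignore, path + "/")
        else:
            yield path


ignore = re.compile(r"^foo/bar$").search
tree = {"foo": {"bar": {"baz": None}, "qux.c": None}, "top.c": None}

result = list(walk(tree, ignore))
print(result)
```

The pattern never matches 'foo/bar/baz' itself, but that file is not
reached because the directory 'foo/bar' is matched and skipped first.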
> Because of this we also need to do a lstat() to check for files or
> directories *before* checking the ignore patterns, so we need to lstat
> every file, even the ignored ones. Is there something to do to avoid
> this?
> I thought of checking both ignore(path) and ignore(path+'/') and if
> both return True save the call to lstat(). Could this be a good idea?
>
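
The ignore(path) / ignore(path+'/') idea quoted above could be sketched
like this. The helper name needs_lstat() is made up, and ignore is
assumed to be any callable returning a true value for ignored paths:

```python
import re


def needs_lstat(path, ignore):
    """Return False when the path is ignored whether it is a file or a directory."""
    # If both forms match, the entry is ignored either way, so its type
    # does not matter and the lstat() call can be skipped.
    return not (ignore(path) and ignore(path + "/"))


prefix_ignore = re.compile(r"^build").search
exact_ignore = re.compile(r"^foo/bar$").search

skip_possible = not needs_lstat("build", prefix_ignore)
still_needed = needs_lstat("foo/bar", exact_ignore)
print(skip_possible, still_needed)
```

An anchored pattern such as ^foo/bar$ does not match 'foo/bar/', so in
that case the lstat() is still needed.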
> - what is the purpose of the match and badmatch arguments of statwalk?
> When should they be called? On what kinds of files?
>
match is used to know if a file is ignored or not, badmatch is used to
report ignored files.
> - missing files are returned at the end of the walk, but files in the
> manifest which have changed to an unsupported filetype are returned
> as missing during the walk (due to the sorting post-processing).
> Is this intentional, or can I leave both at the end?
>
I don't think it is intentional.
> - I've thought of adding some unit tests and performance tests for the
> walker. Can I use unittest based testcases placed in tests/, or is
> there a better way of doing these things? It seems that what is in tests/
> is more oriented towards functional testing (testing hg as a program)
> than unit testing (testing the behavior of a single function/object).
>
There is at least one unit test in tests/test-doctest.py; feel free to
add more as long as they are integrated with run-tests.py.
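
As an illustration of the doctest style (a generic sketch only; the
normalize() helper is hypothetical and not part of mercurial), the test
lives in the docstring of the function it checks:

```python
import doctest


def normalize(path):
    """Collapse repeated slashes in a relative path.

    >>> normalize("foo//bar.h")
    'foo/bar.h'
    >>> normalize("aaa.h")
    'aaa.h'
    """
    while "//" in path:
        path = path.replace("//", "/")
    return path


if __name__ == "__main__":
    # Runs every docstring example in this module and reports failures.
    doctest.testmod()
```

A test script can then run the examples with doctest.testmod() on the
module that contains them.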
> To do the performance testing I've thought of taking the list of files
> from a kernel repo and then override os.listdir() and os.lstat(),
> mocking their real behavior. This way one could also check the impact
> of a cold cache by adding some delays to these functions, but I'm not sure
> how these delays should be modeled. Any ideas?
>
Matt or Chris know that kind of stuff better.
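
That said, the mocking part could look roughly like the sketch below.
It assumes a modern Python with unittest.mock; build_fake_fs() and
walk() are made-up names, and the short file list stands in for a real
kernel file list. Delays are left out, since how to model them is the
open question.

```python
import os
import stat
from types import SimpleNamespace
from unittest import mock


def build_fake_fs(files):
    """Map each directory to its entry names and each path to a fake st_mode."""
    dirs = {".": set()}
    modes = {".": stat.S_IFDIR}
    for f in files:
        parent = "."
        parts = f.split("/")
        for i, part in enumerate(parts):
            path = part if parent == "." else parent + "/" + part
            dirs[parent].add(part)
            if i < len(parts) - 1:
                dirs.setdefault(path, set())
                modes[path] = stat.S_IFDIR
            else:
                modes[path] = stat.S_IFREG
            parent = path
    return dirs, modes


def walk(directory="."):
    """Tiny walker using only os.listdir() and os.lstat()."""
    for name in sorted(os.listdir(directory)):
        path = name if directory == "." else directory + "/" + name
        if stat.S_ISDIR(os.lstat(path).st_mode):
            yield from walk(path)
        else:
            yield path


files = ["aaa.h", "foo.h", "foo/bar.h", "foo/baz.h", "zzz.h"]
dirs, modes = build_fake_fs(files)

with mock.patch("os.listdir", side_effect=lambda d: list(dirs[d])) as listdir, \
        mock.patch("os.lstat", side_effect=lambda p: SimpleNamespace(st_mode=modes[p])) as lstat:
    walked = list(walk())
    listdir_calls = listdir.call_count
    lstat_calls = lstat.call_count

print(walked, listdir_calls, lstat_calls)
```

The call counts give a cache-independent measure of how much work the
walker does, which can be compared between implementations.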
> When these questions are solved I plan to integrate the changes in
> mercurial and publish my personal branch.
>
> Do you prefer merging the unrelated repos (mercurial and walker) or
> making a clean patch to the mercurial one?
A clean patch is preferable.
regards,
Benoit