[RFC] New generic filesystem walker for mercurial

Benoit Boissinot bboissin at gmail.com
Sat Jan 6 06:31:18 CST 2007


On 1/3/07, Emanuele Aina <faina.mail at tiscali.it> wrote:
> I've started to write a filesystem walker for mercurial, trying to
> separate the filesystem walker from the mercurial specific code.
>
> [snip]
>
> I have done some tests on a kernel repo, mostly with the filesystem
> cache full, so those are relevant only for the CPU usage (but I plan to
> do some tests with the cache emptied).
>
> Currently it is a tiny, tiny bit faster than the current 'hg debugwalk'
> (1.27s instead of 1.39s on a kernel repo :), but that is probably due to
> the additional code in mercurial (option parsing, etc.).
>
Please plug your code into commands.py to do a fair comparison
(the matcher code might make some difference too). You might want to compare
strace output (especially stat syscalls).

> Even though my main interest is not performance, I want to make sure
> there aren't any big regressions on this front.
>
> I have now some questions:
>
> - mercurial currently walks the filesystem (dirstate.py:statwalk())
>    sorting the contents of each directory, visiting subdirectories in order
>    and then sorting the whole results.
>
>    This means that with:
>      root
>        `- aaa.h
>        `- foo.h
>        `- foo
>        |    `- bar.h
>        |    `- baz.h
>        `- zzz.h
>    mercurial gives:
>        aaa.h
>        foo.h
>        foo/bar.h
>        foo/baz.h
>        zzz.h
>
>    The point is that 'foo' (the directory) should be put before 'foo.h'
>    but, as the ordering considers the whole path, it is treated as 'foo/'.
>
>    Is this intentional?

Yes, as a thread explained this summer (the point is to always read
the files in the same order).
>
>    If not, should this be preserved or can I put 'foo' before 'foo.h'?
>
>    Can I switch to a topdown walk, listing results for the topmost
>    files before (aaa.h, zzz.h) and then descending the tree (foo/bar.h
>    and foo/baz.h)?
>
>    In the code I've left both walkers: it should be sufficient to switch
>    from walk_inorder() to walk_topdown() for the generic filesystem
>    walker.
>
inorder is preferable
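
For reference, the ordering in the example above falls out of plain string
comparison of whole paths: '.' (0x2e) sorts before '/' (0x2f), so 'foo.h'
comes before 'foo/bar.h'. A small sketch using the example tree; the helper
name by_components is made up for illustration and is not part of mercurial:

```python
# Paths taken from the example tree quoted above.
paths = ["zzz.h", "foo/baz.h", "aaa.h", "foo/bar.h", "foo.h"]

# Whole-path ordering: strings are compared character by character, and
# '.' (0x2e) is smaller than '/' (0x2f), so "foo.h" precedes "foo/bar.h".
full_path_order = sorted(paths)


def by_components(path):
    # Hypothetical alternative key: compare component by component, so the
    # directory "foo" sorts before the file "foo.h".
    return path.split("/")


component_order = sorted(paths, key=by_components)

print(full_path_order)
print(component_order)
```

The first list matches the output quoted above; the second is what putting
'foo' before 'foo.h' would give.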

> - the .hgignore pattern cannot be used to prune a subtree from the walk
>    because ^foo/bar$ can match the foo/bar directory but we cannot skip
>    it as the contained files (e.g. foo/bar/baz) are not matched by the
>    pattern, right?
>
>    Does anyone have an idea about how to do this (if feasible at all)?
>
We should be able to prune a subtree; I think we check whether a directory
matches before visiting it.
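
A minimal sketch of what such pruning could look like, assuming ignore is a
predicate on '/'-separated relative paths. This is not the statwalk() code;
walk_pruned is a hypothetical name, and the sketch sorts per directory
rather than by whole path:

```python
import os
import stat


def walk_pruned(root, ignore):
    """Yield relative file paths under root, skipping ignored subtrees.

    `ignore` is an assumed predicate taking a '/'-separated relative path.
    """
    def visit(reldir):
        absdir = os.path.join(root, reldir) if reldir else root
        for name in sorted(os.listdir(absdir)):
            rel = reldir + "/" + name if reldir else name
            if ignore(rel):
                # The entry matched: if it is a directory, it is never
                # opened, so everything below it is pruned from the walk.
                continue
            st = os.lstat(os.path.join(root, rel))
            if stat.S_ISDIR(st.st_mode):
                yield from visit(rel)
            elif stat.S_ISREG(st.st_mode) or stat.S_ISLNK(st.st_mode):
                yield rel

    return visit("")
```

With a pattern such as ^foo/bar$, the directory foo/bar is skipped before it
is listed, so foo/bar/baz is never seen even though the pattern does not
match it.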

>    Because of this we also need to do a lstat() to check for files or
>    directories *before* checking the ignore patterns, so we need to lstat
>    every file, even the ignored ones. Is there something to do to avoid
>    this?
>    I thought of checking both ignore(path) and ignore(path+'/') and if
>    both return True save the call to lstat(). Could this be a good idea?
>
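
The double check suggested above could be sketched roughly as follows.
make_ignore and can_skip_lstat are hypothetical names, and plain regex
search is only a simplified stand-in for .hgignore matching:

```python
import re


def make_ignore(patterns):
    """Build an ignore predicate from regex patterns (simplified stand-in)."""
    regexes = [re.compile(p) for p in patterns]

    def ignore(path):
        return any(r.search(path) for r in regexes)

    return ignore


def can_skip_lstat(ignore, path):
    # If the name is ignored both as a file ("path") and as a directory
    # ("path/"), the entry is skipped whatever lstat() would have reported.
    return ignore(path) and ignore(path + "/")


ignore = make_ignore([r"^build", r"^foo/bar$"])
print(can_skip_lstat(ignore, "build"))    # matched with and without '/'
print(can_skip_lstat(ignore, "foo/bar"))  # "foo/bar/" does not match
```

An anchored pattern like ^foo/bar$ still needs the lstat(), since only the
name without the trailing slash matches.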
> - what is the purpose of the match and badmatch arguments of statwalk?
>    When should they be called? On what kinds of files?
>

match is used to know if a file is ignored or not, badmatch is used to
report ignored files.

> - missing files are returned at the end of the walk, but files in the
>    manifest which have changed to an unsupported filetype are returned
>    as missing during the walk (due to the sorting post-processing).
>    Is this intentional, or can I leave both at the end?
>
I don't think it is intentional.

> - I've thought of adding some unit tests and performance tests for the
>    walker. Can I use unittest-based testcases placed in tests/, or is there
>    a better way of doing these things? It seems that what is in tests/
>    is more oriented towards functional testing (testing hg as a program)
>    than unit testing (testing the behavior of a single function/object).
>
There is at least one unit test in tests/test-doctest.py; feel free to
add more as long as they are integrated with run-tests.py.
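
As an illustration of the doctest style, a walker helper could carry its
tests in the docstring. splitpath is a made-up function for this example,
not something in mercurial:

```python
import doctest


def splitpath(path):
    """Split a '/'-separated relative path into its components.

    >>> splitpath("foo/bar.h")
    ['foo', 'bar.h']
    >>> splitpath("aaa.h")
    ['aaa.h']
    """
    return path.split("/")


if __name__ == "__main__":
    # Run every doctest found in this module and report any failures.
    doctest.testmod()
```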

>    To do the performance testing I've thought of taking the list of files
>    from a kernel repo and then override os.listdir() and os.lstat(),
>    mocking their real behavior. This way one could also check the impact
>    of cold cache adding some delays to these functions, but I'm not sure
>    how these delays should be modeled. Any ideas?
>
Matt or Chris know that kind of stuff better.
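
For what it is worth, the mocking part could look roughly like this.
build_tree and fake_fs are hypothetical helpers, and the fixed per-call
delay is only a placeholder assumption, not a realistic cold-cache model:

```python
import os
import stat
import time
from unittest import mock


def build_tree(paths):
    """Map each directory ('' is the root) to the sorted names it contains."""
    tree = {"": set()}
    for path in paths:
        parts = path.split("/")
        for i, name in enumerate(parts):
            tree.setdefault("/".join(parts[:i]), set()).add(name)
            if i < len(parts) - 1:
                # Every non-final component is itself a directory.
                tree.setdefault("/".join(parts[:i + 1]), set())
    return {d: sorted(names) for d, names in tree.items()}


def fake_fs(paths, delay=0.0):
    """Return (listdir, lstat) replacements backed by a list of file paths."""
    tree = build_tree(paths)
    files = set(paths)

    def listdir(path="."):
        time.sleep(delay)  # assumed constant cost per call
        key = "" if path == "." else path
        if key not in tree:
            raise FileNotFoundError(path)
        return list(tree[key])

    def lstat(path):
        time.sleep(delay)  # assumed constant cost per call
        key = "" if path == "." else path
        if key in tree:
            mode = stat.S_IFDIR | 0o755
        elif key in files:
            mode = stat.S_IFREG | 0o644
        else:
            raise FileNotFoundError(path)
        # Fields: mode, ino, dev, nlink, uid, gid, size, atime, mtime, ctime
        return os.stat_result((mode, 0, 0, 1, 0, 0, 0, 0, 0, 0))

    return listdir, lstat


listdir, lstat = fake_fs(["aaa.h", "foo.h", "foo/bar.h", "foo/baz.h", "zzz.h"])
with mock.patch.object(os, "listdir", listdir), \
        mock.patch.object(os, "lstat", lstat):
    names = os.listdir(".")
    foo_is_dir = stat.S_ISDIR(os.lstat("foo").st_mode)
```

In a real test the file list would come from the kernel repo manifest, and
the walker would run inside the with block.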

> When these questions are solved I plan to integrate the changes in
> mercurial and publish my personal branch.
>
> Do you prefer merging the unrelated repos (mercurial and walker) or
> making a clean patch to the mercurial one?

Clean patch is preferable.

regards,

Benoit

