improved autorename of addremove

Herbert Griebel herbertg at gmx.at
Tue Mar 31 08:54:27 CDT 2009


Hi,

I uploaded all the code for the improved autorename feature on bitbucket.
There are two branches:

Python only, branch autorename:
http://bitbucket.org/herb/hg/changeset/65d7c5c06e00/

Fast C code, branch autorename_c_code:
http://bitbucket.org/herb/hg/changeset/f545a7a70303/


The code is the same, except for added comments, a minor fix,
and an improved name matching algorithm.

The name matching can now correctly match a large set of identical files
when they are moved. Example:

All a.txt files and their folders are moved into folder x;
all a.txt files have the same content
(moving all files back out of x also works):

removing a.txt
removing a/a.txt
removing a/a/a.txt
removing a/b/a.txt
removing b/a.txt
removing b/a/a.txt
removing c/a.txt
removing c/a/a.txt
removing c/a/a/a.txt
removing c/a/a/b/a.txt
removing c/a/a/b/c/a.txt
removing c/a/b/a.txt
adding x/a.txt
adding x/a/a.txt
adding x/a/a/a.txt
adding x/a/b/a.txt
adding x/b/a.txt
adding x/b/a/a.txt
adding x/c/a.txt
adding x/c/a/a.txt
adding x/c/a/a/a.txt
adding x/c/a/a/b/a.txt
adding x/c/a/a/b/c/a.txt
adding x/c/a/b/a.txt
recording removal of c\a\a\b\c\a.txt as rename to x\c\a\a\b\c\a.txt (100% similar)
recording removal of c\a\a\b\a.txt as rename to x\c\a\a\b\a.txt (100% similar)
recording removal of c\a\b\a.txt as rename to x\c\a\b\a.txt (100% similar)
recording removal of c\a\a\a.txt as rename to x\c\a\a\a.txt (100% similar)
recording removal of c\a\a.txt as rename to x\c\a\a.txt (100% similar)
recording removal of b\a\a.txt as rename to x\b\a\a.txt (100% similar)
recording removal of a\b\a.txt as rename to x\a\b\a.txt (100% similar)
recording removal of a\a\a.txt as rename to x\a\a\a.txt (100% similar)
recording removal of c\a.txt as rename to x\c\a.txt (100% similar)
recording removal of b\a.txt as rename to x\b\a.txt (100% similar)
recording removal of a\a.txt as rename to x\a\a.txt (100% similar)
recording removal of a.txt as rename to x\a.txt (100% similar)
Elapsed time: 00:00:00,33  (23:00:19,54 to 23:00:19,87)
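
To illustrate the kind of name matching shown in the output above, here is a
minimal sketch. It is not the code from the branches. It assumes one plausible
heuristic: when the content is identical, prefer the pair of paths that share
the most trailing path components. The function names and the use of "/" as
the separator are assumptions made for the example.

```python
def common_suffix_len(a, b):
    """Count the trailing path components shared by two '/'-separated paths."""
    n = 0
    for x, y in zip(reversed(a.split("/")), reversed(b.split("/"))):
        if x != y:
            break
        n += 1
    return n


def pair_by_path(removed, added):
    """Greedily pair removed and added paths, longest shared suffix first.

    Assumes every removed file has the same content as every added file,
    so only the path names can be used to tell the candidates apart.
    """
    candidates = sorted(
        ((common_suffix_len(r, a), r, a) for r in removed for a in added),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_removed, used_added, pairs = set(), set(), {}
    for _score, r, a in candidates:
        if r in used_removed or a in used_added:
            continue  # each file may take part in one rename only
        pairs[r] = a
        used_removed.add(r)
        used_added.add(a)
    return pairs


removed = [
    "a.txt", "a/a.txt", "a/a/a.txt", "a/b/a.txt", "b/a.txt", "b/a/a.txt",
    "c/a.txt", "c/a/a.txt", "c/a/a/a.txt", "c/a/a/b/a.txt",
    "c/a/a/b/c/a.txt", "c/a/b/a.txt",
]
added = ["x/" + p for p in removed]
pairs = pair_by_path(removed, added)
```

Processing the longest matches first is what keeps a.txt from grabbing
x/a/a.txt: the deeper paths are paired before the shallow ones are considered.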


Again, the matching algorithm can only propose the most likely matches
based on the content and pathname of each file; it cannot guess the
user's intention. For example, suppose a->b is a 90% match and
c->d is also a 90% match. Then you most likely want
a->b and c->d, but it could also be the other way around, a->d and c->b,
for reasons other than name and content similarity. That is also
why I think the ultimate solution is not a command
line tool but a nice GUI that lets you pick the correct matches
easily, with the help of good similarity matching.
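
As a rough sketch of what such a tool could present to the user, the snippet
below ranks the rename candidates for each removed file and flags the cases
where the best two candidates are too close to call. The function names, the
threshold, the margin, and the cross scores (a->d, c->b) are all made up for
the illustration; only the two 90% figures come from the example above.

```python
def rank_candidates(scores, threshold=50):
    """Group candidate targets per removed file, best similarity first.

    scores maps (source, target) to a similarity in percent.
    """
    ranked = {}
    for (src, dst), sim in scores.items():
        if sim >= threshold:
            ranked.setdefault(src, []).append((sim, dst))
    for candidates in ranked.values():
        candidates.sort(key=lambda t: (-t[0], t[1]))
    return ranked


def ambiguous(ranked, margin=5):
    """Return the sources whose two best candidates differ by <= margin."""
    return sorted(
        src
        for src, candidates in ranked.items()
        if len(candidates) > 1 and candidates[0][0] - candidates[1][0] <= margin
    )


# Hypothetical scores: 90% for a->b and c->d, slightly less the other way.
scores = {("a", "b"): 90, ("a", "d"): 88, ("c", "d"): 90, ("c", "b"): 87}
ranked = rank_candidates(scores)
needs_review = ambiguous(ranked)
```

Here both a and c end up in needs_review, which is exactly the situation
where a human decision is more reliable than picking the top score.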

I think the biggest potential for improvement is in the name
matching; speeding up the content matching comes next. Getting better
statistics to avoid byte-by-byte compares is very hard.
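
For the special case of 100% identical files, a cheap pre-filter is possible.
The sketch below is an assumption about how that could look, not what the
branches do: it compares file sizes first and hashes only the files whose
sizes collide. It does not help with partially similar files, which is the
hard case mentioned above. File contents are passed in as bytes to keep the
example self-contained.

```python
import hashlib
from collections import defaultdict


def exact_matches(removed, added):
    """Map each removed path to the added paths with identical content.

    removed and added are dicts of path -> bytes (hypothetical inputs).
    """
    by_size = defaultdict(list)
    for path, data in added.items():
        by_size[len(data)].append(path)

    added_digest = {}  # cache so each added file is hashed at most once

    def digest_of_added(path):
        if path not in added_digest:
            added_digest[path] = hashlib.sha1(added[path]).digest()
        return added_digest[path]

    matches = {}
    for rpath, rdata in removed.items():
        same_size = by_size.get(len(rdata))
        if not same_size:
            continue  # no added file has this size, so nothing to hash
        rdigest = hashlib.sha1(rdata).digest()
        hits = sorted(p for p in same_size if digest_of_added(p) == rdigest)
        if hits:
            matches[rpath] = hits
    return matches


removed_files = {"a.txt": b"hello\n", "b.txt": b"other"}
added_files = {"x/a.txt": b"hello\n", "x/c.txt": b"hellp\n", "x/d.txt": b"zz"}
identical = exact_matches(removed_files, added_files)
```

Hashing still reads every byte of the candidate files once; what it avoids is
comparing every removed file against every added file pairwise.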

If any of the comments or explanations are confusing, please let me know.
Any comments, fixes, improvements, or patches are welcome and appreciated!



More information about the Mercurial-devel mailing list