[PATCH 00 of 13] Cleanup of the purge extension

Alexis S. L. Carvalho alexis at cecm.usp.br
Tue Mar 6 15:54:38 CST 2007

Thus spake Emanuele Aina:
> Alexis S. L. Carvalho worried:
> >In these filesystems, you can tell hg to add a file with one name, but
> >os.listdir() may return another name.
> >
> >For example, on case-insensitive filesystems:
> >
> >$ touch Foo
> >$ hg add foo
> >$ hg status
> >A foo
> >? Foo
> >
> >If hg purge removes Foo, users will get angry.
> >
> >On OS X things get even muddier, since it likes to use normalized
> >Unicode in decomposed form (or something like that):  if you create a
> >file called "é" ("e" with acute accent as a single character), it will
> >decompose it into two characters ("e" and a combining acute accent).  So
> >the return of os.listdir won't match what we have on the dirstate.
> >
> >Maybe it'd be enough to refuse to run if statwalk returns some "m"issing
> >file, but I'm not completely sure. 
> Even this is not going to be enough: in your example, doing 'hg purge'
> will erroneously delete the unknown "Foo" file, even if there is no
> missing file. :(

Aborting if statwalk returns a file with src == 'm' (a.k.a. file in
dirstate, but missing in the filesystem) would catch my example - as far
as statwalk is concerned, the file "foo" is missing.

But again: I'm not sure this would be safe enough.

> This problem is not strictly related to 'purge' but more general as it
> affects also 'status', as it is shown by your example.

status calls something like

repo.status() -> dirstate.status() -> dirstate.statwalk()

If statwalk claims a file is missing, dirstate.status explicitly
os.lstat's it and, in my example above, finds it.

So, hg status manages to find "foo", even though it also shows "Foo" as
an unknown file (which is mostly harmless, until somebody tries to e.g.
clean the tree ;) .

IOW, there are 2 different problems: hg is usually interested only in
tracked files - it can just lstat every file in this list to see if it's
on the filesystem.  OTOH, purge has a list of the files on the
filesystem and it wants to know which ones are not tracked by hg -
which, as we're seeing, can be quite a chore when there are aliases.
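
To make the asymmetry concrete, here's a toy model in plain Python.  None
of this is hg code: fs_lookup is a made-up stand-in for os.lstat on a
case-insensitive filesystem, and on_disk/tracked are just illustrative
names for the os.listdir result and the dirstate contents.

```python
# Toy model of a case-insensitive filesystem; nothing here is hg API.
on_disk = ["Foo"]   # what os.listdir() would return
tracked = ["foo"]   # what the dirstate holds after "hg add foo"


def fs_lookup(name):
    """Stand-in for os.lstat() on a case-insensitive fs: any case matches."""
    return any(name.lower() == entry.lower() for entry in on_disk)


# Direction 1 (status): ask the filesystem about each tracked name.
missing = [f for f in tracked if not fs_lookup(f)]

# Direction 2 (purge): ask the dirstate about each listed name.
unknown = [f for f in on_disk if f not in tracked]
```

In this model direction 1 finds "foo" without trouble, while direction 2
reports "Foo" as unknown, which is exactly the file purge would delete.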

> The problem could be divided in two:
> - detect case-insensitive or name mangling filesystems
>   we could maybe put a special file in .hg and, at repo object creation,
>   try to access it with a different name: for example '.hg/Foo-è',
>   accessed with '.hg/foo-è' and '.hg/Foo-e`' (in unicode)

This is a bit like util.checkfolding (which only checks case
collisions).  Right now it's used only by hg update/merge/revert.
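
A rough sketch of that kind of probe, using only the standard library.
This is not a copy of util.checkfolding; folds_case and the "CaseProbe"
prefix are names I made up, and it only detects case folding, not
Unicode mangling.

```python
import os
import tempfile


def folds_case(directory):
    """Return True if `directory` seems to be on a case-insensitive fs.

    Hypothetical helper: creates a scratch file and checks whether the
    same name with swapped case also resolves.
    """
    fd, path = tempfile.mkstemp(prefix="CaseProbe", dir=directory)
    os.close(fd)
    try:
        head, tail = os.path.split(path)
        alias = os.path.join(head, tail.swapcase())
        # On a case-sensitive fs the swapped-case name should not exist.
        return os.path.lexists(alias)
    finally:
        # Always remove the scratch file, whatever the answer was.
        os.unlink(path)
```

It needs write access to the directory, so something like .hg would be
the natural place to run it.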

> - normalize the file names
>   it can be done in the dirstate.__contains__() method, once a
>   name-mangling fs has been detected

This is harder - right now hg doesn't require e.g. UTF-8 paths, so
normalizing things could get interesting...

Also, this could be somewhat too expensive in repos with many files -
especially if the main user is hg purge.

Which doesn't mean we're perfect there - having some check on hg add
would probably be nice...
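
For what it's worth, the normalization itself is short if we pretend the
names are already decoded text.  That's a big assumption, since hg
doesn't guarantee any encoding; norm_key is a hypothetical helper, not
something in the dirstate.

```python
import unicodedata


def norm_key(name):
    """Hypothetical comparison key: NFC-compose, then fold case.

    Assumes `name` is already a decoded str, which hg does not guarantee.
    """
    return unicodedata.normalize("NFC", name).casefold()


composed = "\u00e9"      # "e" with acute accent as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# The raw strings differ, but their keys compare equal.
same_file = norm_key(composed) == norm_key(decomposed)
```

Computing such a key for every dirstate entry is where the cost in big
repos would come from.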

> >I'd really like to put at least some safety net before moving purge.py
> >to hgext.
> What kind of safety net?

For example, refusing to run if there are missing files (I think it
should be enough, if we assume that the filesystem doesn't return 2
aliases to the same file[1] on a single os.listdir - which I hope is not
much to ask...  But I wouldn't mind somebody thinking a bit more about
this).

Maybe this could be done only for name-mangling filesystems.  And we
probably could use a --force flag to allow users to shoot their feet.
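
Something along these lines, as a sketch only.  PurgeAborted and
files_to_purge are invented names, and the two lists stand in for
whatever the walk reports as missing and unknown; the real check would
have to live in purge.py.

```python
class PurgeAborted(Exception):
    """Raised when purge refuses to run (hypothetical, not an hg class)."""


def files_to_purge(missing, unknown, force=False):
    """Return the unknown files to delete, or abort if anything is missing.

    `missing` holds tracked names not found on disk; `unknown` holds
    on-disk names that are not tracked.  A missing file hints that one of
    the unknown files may be an alias of it, so refuse unless forced.
    """
    if missing and not force:
        raise PurgeAborted(
            "refusing to purge, missing files: %s" % ", ".join(sorted(missing))
        )
    return sorted(unknown)
```

With the example from the start of the thread, files_to_purge(["foo"],
["Foo"]) aborts, and only force=True lets the user delete "Foo".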

BTW, it'd probably be nice for hg purge to get some options to remove
only unknown files, ignored files or empty directories.


[1] - to be pedantic, two aliases to the same hard link.
