[PATCH 07 of 13 STABLE V4] icasefs: retry directory scan once for already invalidated cache

Thu Dec 22 14:49:01 CST 2011

On Thu, 2011-12-22 at 15:23 +0900, FUJIWARA Katsunori wrote:
> At Wed, 21 Dec 2011 18:07:44 -0600,
> Matt Mackall wrote:
> > 
> > On Fri, 2011-12-16 at 21:27 +0900, FUJIWARA Katsunori wrote:
> > > # HG changeset patch
> > > # User FUJIWARA Katsunori <foozy at lares.dti.ne.jp>
> > > # Date 1324037380 -32400
> > > # Branch stable
> > > # Node ID c7ba906160bde878957e030d01e9ed44213f3889
> > > # Parent  748694149821b3356563a7e12b51f22b1dfdf269
> > > icasefs: retry directory scan once for already invalidated cache
> > > 
> > > some hg operation (e.g.: qpush) create new files after first
> > > dirstate.walk()-ing, and it invalidates _fspathcache for fspath().
> > 
> > Ok, this was very confusing. When you say "invalidates", you don't mean
> > "actively destroys the cache to make sure the cache isn't stale" (what
> > is usually mean by "invalidate" when talking about a cache), you mean
> > "makes the cache incorrect".
> 
> Oops, sorry for confusing description.
> 
> > +    def find(p, contents):
> > +        lenp = len(p)
> > +        for n in contents:
> > +            if lenp == len(n) and normcase(n) == p:
> > +                return n
> > +        return None
> > 
> > Note that we'll eventually have to fix this because normcase can
> > actually change the length of filenames. For instance, "ı" (c4 b1) to
> > "I" (49) on NTFS and decomposition and escaping issues on HFS+.
> 
> I confirmed that creation of file named as "\xc3 \xa0" in UTF-8 can
> get decomposed "\x61 \xcc \x80" result of "os.listdir()" on Mac OS
> X. So, I think that each elements of os.listdir() result are already
> normalized except for case, and length compare is still valid.

> But I did not consider normalizing situation like "c4 b1" to "49" on
> NTFS.
> 
> Should simple compare between "p" and "normcase(n)" without length
> check be done ?
> 
> By the way, that NTFS behavior seems to depend on language
> configuration of NTOS. On my Japanese(= cp932) Windows 7:

It's so much worse than that.

First, the case-folding tables -vary- from release to release and are
stored on disk at filesystem creation time. We ignore that and just use
Python's internal database.

Second, the mapping between wide filenames and their byte display has
multiple layers. In general, the filesystem interface uses the "ANSI
code page" for that mapping, and in general the ANSI code page is not
the same as the "OEM code page" used for console display. We ignore that
too and treat everything as ANSI. Notably the command line args on the
console also arrive in ANSI (??!).

Also, characters that cannot be directly mapped from Unicode to the
current ANSI code page go through a "best fit" step where characters
like "∞" (infinity symbol) get mapped to "8" (the number 8) for display
(literally infinitely wrong). But unlike casefolding, the best fit
mapping is one way: you can't open "8" even though you can see it! And
there's basically nothing we can do about that within the constraints of
the ANSI C file APIs.

There are other horrors lurking (I count 7 independent code page
settings per process), but these are the big ones.

>     - "\u0131" ("\xc4 \xb1" in UTF-8) created by Unicode API is:
> 
>         - normalized into "\x3f" (= "?") in byte API
>           (* "no corresponded character" in cp932)

That's a bit surprising. But it turns out there's no best-fit entry for
'ı' in cp932 even though the mapping to 'i' is pretty obvious:

http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt

In other words, the best fit tables are ad-hoc and mutually
inconsistent.

>         - still "\u0131" in Unicode API 
> 
>     - "\xc4\xb1" created by byte API is:
>       (* this byte sequence is valid also in cp932)
> 
>         - kept in byte API
>         - normalized as "\uff84\uff71" in Unicode API
> 
> > > -        if dir not in _fspathcache:
> > > -            _fspathcache[dir] = os.listdir(dir)
> > > -        contents = _fspathcache[dir]
> > > +        contents = _fspathcache.get(dir, None)
> > > +        if contents is None:
> > > +            contents = os.listdir(dir)
> > > +            _fspathcache[dir] = contents
> > 
> > Not sure what this change is for. Our standard cache lookup pattern is:
> > 
> > if x not in cache:
> >   cache[x] = f(x)
> > y = cache[x]
> > 
> > So long as you don't put bogus None results in the cache, this works
> > correctly and more quickly.
> 
> Sorry, I'm not so familier Python performance characteristic, so I
> thought that my implementation could reduce looking up cost at cache
> hit from twice to once (and almost all of penalty at cache-miss was
> shared by "os.listdir()").
> 
> I'll post patches adjusted to described pattern.
> 
> > > -        lenp = len(part)
> > > -        for n in contents:
> > > -            if lenp == len(n) and normcase(n) == part:
> > > -                result.append(n)
> > > -                break
> > > -        else:
> > > -            # Cannot happen, as the file exists!
> > > -            result.append(part)
> > > +        found = find(part, contents)
> > > +        if not found:
> > > +            # retry once for the corner case: add files after dir walking
> > > +            contents = os.listdir(dir)
> > > +            _fspathcache[dir] = contents
> > > +            found = find(part, contents)
> > 
> > This seems wrong: what happens if the case of one or more directory
> > components also got changed by mq too?
> 
> Sorry for confusing description, again.
> 
> I meant that "once per directory" per "patch".
> 
> As you described, retry is done in some directory components, if:
> 
>   - one patch creates some files in separate directories, and/or
>   - patches in which one file is created are pushed
> 
> I do not have any other ideas to fix ....
> 
> ----------------------------------------------------------------------
> [FUJIWARA Katsunori]                             foozy at lares.dti.ne.jp

-- 
Mathematics is the supreme nostalgia of our time.