Request vetting a fix for my win32lfn extension

Thu Nov 8 13:13:36 CST 2012

On Thu, 2012-11-08 at 10:33 +0400, Nikolaj Sjujskij wrote:
> Den 2012-11-08 03:11:35 skrev Aaron Cohen <aaron at assonance.org>:
> 
> > Since around Mercurial 2.3, my extension has been broken, by the change  
> > to
> > move pack_dirstate to C.
> >
> > I don't do a very good job of updating and testing new versions, so
> > recently a bug was filed against it which was the first time I know if  
> > it.
> > I've now "fixed" it, but I would appreciate some help determining if my  
> > fix
> > is correct.
> >
> > Basically, win32lfn replaces python's normal os.listdir with a call to
> > FindFilesIterator. This is to allow the purpose of my extension to work  
> > (I
> > transparently convert long directory names to something like
> > \\?\long_directory_name).
> >
> > I was a little careless when I implemented this, and did not notice that
> > FindFilesIterator returns a unicode string rather than a normal string  
> > for
> > file names, which meant that I was putting unicode into the dirstate map  
> > as
> > the key for the file. This was apparently a harmless oversight.
> >
> > However, the new C code explicitly checks that the key is a python  
> > string,
> > and nothing else, which broke my extension. I have no problem with this
> > breakage, I'd just appreciate it if someone who knows a little more about
> > encoding than me, could verify that my simple fix is reasonable.
> >
> > You can see the fix here on bitbucket:
> > https://bitbucket.org/remleduff/win32lfn/changeset/3d9fa2890fa0712e3f286710d379c5d6
> >
> >  This fix is simply to convert the unicode string that I receive from the
> > Windows API function to a string by calling "filename.encode()".
> >
> > Is there anything that immediately jumps out as incorrect here?
>   It would fail for every filename with non-ASCII symbols. Probably it  
> should be `filename.encode('utf8')`, since it's internal string format in  
> Mercurial (if I'm not mistaken).

Actually, you are mistaken. Mercurial does not transcode filenames and
treats them as raw bytes. So the encoding here needs to be the so-called
filesystem encoding. This is typically the same as the ANSI code page.
And there's really no good answer for what strategy to use for
unencodable characters.

However, this is still suboptimal, as Python's encoding isn't a perfect
match for the kernel/filesystem encoding (for instance, doesn't do
"best-fit"). It'd be better to use a byte-based API. If that means using
a non-iterated API, that's not really a loss.

-- 
Mathematics is the supreme nostalgia of our time.