Request vetting a fix for my win32lfn extension

Fri Nov 9 10:37:35 CST 2012

Thanks, I switched to using ctypes so I could use the bytestring API. That
does seem to work better.

On Thu, Nov 8, 2012 at 2:13 PM, Matt Mackall <mpm at selenic.com> wrote:

> On Thu, 2012-11-08 at 10:33 +0400, Nikolaj Sjujskij wrote:
> > Den 2012-11-08 03:11:35 skrev Aaron Cohen <aaron at assonance.org>:
> >
> > > Since around Mercurial 2.3, my extension has been broken, by the change
> > > to
> > > move pack_dirstate to C.
> > >
> > > I don't do a very good job of updating and testing new versions, so
> > > recently a bug was filed against it which was the first time I know if
> > > it.
> > > I've now "fixed" it, but I would appreciate some help determining if my
> > > fix
> > > is correct.
> > >
> > > Basically, win32lfn replaces python's normal os.listdir with a call to
> > > FindFilesIterator. This is to allow the purpose of my extension to work
> > > (I
> > > transparently convert long directory names to something like
> > > \\?\long_directory_name).
> > >
> > > I was a little careless when I implemented this, and did not notice
> that
> > > FindFilesIterator returns a unicode string rather than a normal string
> > > for
> > > file names, which meant that I was putting unicode into the dirstate
> map
> > > as
> > > the key for the file. This was apparently a harmless oversight.
> > >
> > > However, the new C code explicitly checks that the key is a python
> > > string,
> > > and nothing else, which broke my extension. I have no problem with this
> > > breakage, I'd just appreciate it if someone who knows a little more
> about
> > > encoding than me, could verify that my simple fix is reasonable.
> > >
> > > You can see the fix here on bitbucket:
> > >
> https://bitbucket.org/remleduff/win32lfn/changeset/3d9fa2890fa0712e3f286710d379c5d6
> > >
> > >  This fix is simply to convert the unicode string that I receive from
> the
> > > Windows API function to a string by calling "filename.encode()".
> > >
> > > Is there anything that immediately jumps out as incorrect here?
> >   It would fail for every filename with non-ASCII symbols. Probably it
> > should be `filename.encode('utf8')`, since it's internal string format in
> > Mercurial (if I'm not mistaken).
>
> Actually, you are mistaken. Mercurial does not transcode filenames and
> treats them as raw bytes. So the encoding here needs to be the so-called
> filesystem encoding. This is typically the same as the ANSI code page.
> And there's really no good answer for what strategy to use for
> unencodable characters.
>
> However, this is still suboptimal, as Python's encoding isn't a perfect
> match for the kernel/filesystem encoding (for instance, doesn't do
> "best-fit"). It'd be better to use a byte-based API. If that means using
> a non-iterated API, that's not really a loss.
>
> --
> Mathematics is the supreme nostalgia of our time.
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://selenic.com/pipermail/mercurial-devel/attachments/20121109/cfc9632c/attachment.html>