Add a Unicode mode, but keep the bytes mode
Laurens Holst
laurens.nospam at grauw.nl
Sun Nov 6 05:09:16 CST 2011
On 5-11-2011 17:10, Victor Stinner wrote:
>> Another possibility might be to add two configuration options, one
>> describes the repository encoding and one the target encoding. Without
>> these set, Mercurial is encoding-agnostic (current behaviour), when you
>> set the repository encoding it automatically recodes filenames to your
>> local system’s encoding, or to the target encoding (if set). I think
>> this is similar to what the eol extension does.
> It is not so different from my "Unicode mode", and so it has the same
> constraints and limitations, except that it has an important advantage: it
> helps to have a smoother transition (backward compatibility) if you work in a
> homogeneous environment (e.g. only Windows with cp1252 ANSI code page). Python
> embeds most common encodings (e.g. most Windows code pages), so it can work.
>
> Being able to use latin1 (instead of UTF-8) would also help the corner case
> because all byte strings are decodable from latin1.
>
> It also avoids really having to convert the content of a repository: if the
> "new" encoding is already able to decode all filenames, you don't have to
> transcode filenames, and hashes are unchanged.
I think this is the main advantage, yes.
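To illustrate both points in Python, here is a minimal sketch. The name
recode_filename is made up for this mail and is not an existing Mercurial
function. It recodes a stored filename from a repository encoding to a
target encoding, and checks that latin1 round-trips every possible byte.

```python
def recode_filename(name: bytes, repo_encoding: str, target_encoding: str) -> bytes:
    """Recode a stored filename from the repository encoding to the target one.

    Hypothetical helper, only meant to illustrate the idea.
    """
    if repo_encoding == target_encoding:
        # Nothing to transcode: the stored bytes are returned untouched.
        return name
    return name.decode(repo_encoding).encode(target_encoding)


# A filename stored as cp1252, recoded for a UTF-8 system.
stored = "café.txt".encode("cp1252")
recoded = recode_filename(stored, "cp1252", "utf-8")

# latin1 maps each of the 256 byte values to exactly one code point,
# so any byte string can be decoded and encoded back without loss.
all_bytes = bytes(range(256))
roundtrip = all_bytes.decode("latin1").encode("latin1")
```

Note that the same is not true for every code page: a decode can fail when
the declared encoding does not cover a byte that occurs in a filename.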
The downside is that the transcoding is then something each user needs to
set manually in their configuration file, even though the project itself
should know whether its build tools are encoding-agnostic (make) or
encoding-aware (ant). So this information could just as well be stored
in the repository. To make it more complicated, also consider the case
where I switch my build system from ant to make (would you want to recode
the entire working copy? uff).
This would be particularly useful for the case of a UTF-8 repository on
Windows. On Windows, the ‘bytes’ API uses cp1252 (on most
of our Western systems), not UTF-8, and I don’t think this will ever
change, for backwards compatibility reasons. I wouldn’t even call it a
bytes API, really. If you stored the original encoding of the
repository in the repository itself, together with a transcode=true
flag, Mercurial on Windows could decide which API to use.
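Roughly, the decision could look like the sketch below. Everything in it
is an assumption for the sake of discussion: working_copy_name, its
parameters and the convention "str means Unicode API, bytes means bytes
API" are not existing Mercurial code.

```python
import sys
from typing import Optional, Union


def working_copy_name(
    stored: bytes,
    repo_encoding: Optional[str],
    transcode: bool,
    platform: str = sys.platform,
) -> Union[str, bytes]:
    """Pick the form in which a stored filename is handed to the file system.

    Hypothetical helper: a str result stands for "use the Unicode API",
    a bytes result stands for "use the bytes API, encoding-agnostic".
    """
    if platform == "win32" and transcode and repo_encoding is not None:
        # The repository declares its encoding, so the name can be decoded
        # and passed on as Unicode instead of as raw cp1252-interpreted bytes.
        return stored.decode(repo_encoding)
    # No declared encoding, no transcode flag, or not Windows: current behaviour.
    return stored
```

Without the repository encoding or the transcode flag, nothing changes
compared to today, which keeps existing repositories working as they are.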
Having such information stored in the repository (regardless of the
transcode flag) may also be useful to prevent inconsistent encodings in
the repository, and for hgweb as well.
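As an example of the consistency check I have in mind, a small sketch
(the function name undecodable is hypothetical, not something Mercurial
provides): given the declared repository encoding, list the filenames
that cannot be decoded with it.

```python
from typing import Iterable, List


def undecodable(names: Iterable[bytes], repo_encoding: str) -> List[bytes]:
    """Return the filenames that are not valid in the declared encoding."""
    bad = []
    for name in names:
        try:
            name.decode(repo_encoding)
        except UnicodeDecodeError:
            bad.append(name)
    return bad


names = [b"ok.txt", "é".encode("utf-8"), "é".encode("cp1252")]
# In a repository declared as UTF-8, only the cp1252-encoded name is rejected.
rejected = undecodable(names, "utf-8")
```

A check like this could refuse a commit, or merely warn, when a new
filename does not fit the declared encoding.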
So maybe it would be best to have a way to set the encoding on a
repository without having to convert an existing repository.
Perhaps through pushkeys? This would have the advantage of being able to
set this for the entire repo retroactively. Or else perhaps a versioned
.hgencoding file.
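Such a file does not exist today, so the format below is purely an
assumption: an ini-style file with an [encoding] section, a repository
key and a transcode key. The sketch parses it with Python's standard
configparser and validates the encoding name with codecs.lookup.

```python
import codecs
import configparser
from typing import Optional, Tuple

# Hypothetical contents of a versioned .hgencoding file.
EXAMPLE = """\
[encoding]
repository = UTF8
transcode = true
"""


def parse_hgencoding(text: str) -> Tuple[Optional[str], bool]:
    """Return (repository encoding or None, transcode flag)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section("encoding"):
        # No declaration: stay encoding-agnostic, as Mercurial is now.
        return None, False
    name = parser.get("encoding", "repository", fallback=None)
    if name is not None:
        # Normalises aliases and raises LookupError for unknown encodings.
        name = codecs.lookup(name).name
    transcode = parser.getboolean("encoding", "transcode", fallback=False)
    return name, transcode


encoding, transcode = parse_hgencoding(EXAMPLE)
```

Because the file would be versioned, the declared encoding could differ
between revisions, which is something pushkeys would avoid.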
P.S. Another thing: I may be wrong, but I seem to recall that Mercurial
uses a particular flag to open files that is available in the bytes API,
but not in the Unicode API? I’m not sure, but it is perhaps worth checking.
~Laurens
--
~~ Ushiko-san! Kimi wa doushite, Ushiko-san nan da!! ~~
Laurens Holst, developer, Utrecht, the Netherlands
Website: www.grauw.nl. Working @ www.roughcookie.com