Add a Unicode mode, but keep the bytes mode

Laurens Holst laurens.nospam at grauw.nl
Sun Nov 6 05:09:16 CST 2011


On 5-11-2011 17:10, Victor Stinner wrote:
>> Another possibility might be to add two configuration options, one
>> describes the repository encoding and one the target encoding. Without
>> these set, Mercurial is encoding-agnostic (current behaviour), when you
>> set the repository encoding it automatically recodes filenames to your
>> local system’s encoding, or to the target encoding (if set). I think
>> this is similar to what the eol extension does.
> It is not so different from my "Unicode mode", and so it has the same
> constraints and limitations, except that it has an important advantage: it
> helps to have a smoother transition (backward compatibility) if you work in a
> homogeneous environment (e.g. only Windows with cp1252 ANSI code page). Python
> embeds most common encodings (e.g. most Windows code pages), so it can work.
>
> Being able to use latin1 (instead of UTF-8) would also help the corner case
> because all byte strings are decodable from latin1.
>
> It also avoids having to actually convert the content of a repository: if the
> "new" encoding is already able to decode all filenames, you don't have to
> transcode filenames, and hashes are unchanged.

I think this is the main advantage, yes.
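
To illustrate the latin1 point, here is a minimal sketch in Python. The
helper name roundtrips is my own and not anything in Mercurial; it only
checks whether a byte string survives a decode/encode cycle unchanged,
which is what you need if filenames (and therefore hashes) must stay the
same.

```python
def roundtrips(raw: bytes, encoding: str) -> bool:
    """Return True if raw decodes under encoding and re-encodes to the same bytes."""
    try:
        return raw.decode(encoding).encode(encoding) == raw
    except UnicodeError:
        # Covers both UnicodeDecodeError and UnicodeEncodeError.
        return False


# Every possible byte value, as one byte string.
all_bytes = bytes(range(256))

# latin-1 maps each byte to exactly one code point, so this always succeeds.
latin1_ok = roundtrips(all_bytes, "latin-1")

# Python's cp1252 codec leaves a few byte values (e.g. 0x81) undefined.
cp1252_ok = roundtrips(all_bytes, "cp1252")

# A latin-1 encoded name ("caf" + 0xE9) is not valid UTF-8.
utf8_ok = roundtrips(b"caf\xe9", "utf-8")
```

So latin1 can always be used as the declared encoding without touching
the stored bytes, while UTF-8 and cp1252 only work if every existing
filename happens to be valid in them.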

The downside is that this way the transcoding is something the user needs
to set manually in their configuration file, even though the project
itself should know whether its build tools are encoding-agnostic (make)
or encoding-aware (ant). So this information could just as well be stored
in the repository. To complicate matters, also consider the case where I
switch my build system from ant to make (would you want to recode the
entire working copy? Ugh).

This would be particularly useful for the case of a UTF-8 repository on
Windows. On Windows the ‘bytes’ API interprets filenames in the ANSI code
page (cp1252 on most of our western systems), not UTF-8, and I don’t
think this will ever change, for backwards compatibility reasons. I
wouldn’t even call it a bytes API really. If you stored the origin
encoding of the repository in the repository itself, together with a
transcode=true flag, Mercurial on Windows could decide which API to use.
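
A small sketch of what goes wrong today and what the transcoding would
do, assuming a UTF-8 repository checked out on a cp1252 system. The
function recode_filename is hypothetical, written only for this example;
it is not an existing Mercurial function.

```python
def recode_filename(stored: bytes, repo_encoding: str, local_encoding: str) -> bytes:
    """Recode a stored filename from the repository encoding to the local one."""
    return stored.decode(repo_encoding).encode(local_encoding)


# "café.txt" as stored in a UTF-8 repository (U+00E9 is e with acute accent).
stored = "caf\u00e9.txt".encode("utf-8")

# Without transcoding, a cp1252 system reads the two UTF-8 bytes of the
# accented letter as two separate characters (U+00C3 and U+00A9).
misread = stored.decode("cp1252")

# With the repository encoding known, the name can be recoded to the single
# cp1252 byte 0xE9 before it is handed to the local filesystem API.
recoded = recode_filename(stored, "utf-8", "cp1252")
```

Note that the recoding step raises UnicodeEncodeError when a filename
contains characters that the local encoding cannot represent, so a real
implementation would have to decide what to do in that case.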

Having such information stored in the repository (regardless of the 
transcode flag) may also be useful to prevent inconsistent encodings in 
the repository, and for hgweb as well.

So maybe it would be best to have a way to set the repository encoding on
a repository, but without having to convert an existing repository.
Perhaps through pushkeys? This would have the advantage of being able to
set this for the entire repo retroactively. Or else perhaps a versioned
.hgencoding file.
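
To make the .hgencoding idea a bit more concrete, here is a sketch of how
such a file could look and be read. Everything about the format is an
assumption on my part: the [encoding] section and the repository and
transcode keys are invented for this example, and nothing like this
exists in Mercurial.

```python
import codecs
import configparser

# Hypothetical contents of a versioned .hgencoding file.
SAMPLE = """\
[encoding]
repository = utf-8
transcode = true
"""


def parse_hgencoding(text):
    """Return (repository_encoding, transcode) from the assumed file format."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    name = parser.get("encoding", "repository")
    # Fail early with LookupError if Python does not know this encoding.
    codecs.lookup(name)
    # Treat a missing transcode key as "do not transcode".
    transcode = parser.getboolean("encoding", "transcode", fallback=False)
    return name, transcode


settings = parse_hgencoding(SAMPLE)
```

Because the file would be versioned, switching the build system from ant
to make could be recorded as a change of the transcode value in the same
commit.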

P.S. Another thing: I may be wrong, but I seem to recall that Mercurial
uses a particular flag to open files that is available in the bytes API
but not in the Unicode API. I’m not sure, but it is perhaps worth checking.

~Laurens

-- 
~~ Ushiko-san! Kimi wa doushite, Ushiko-san nan da!! ~~
Laurens Holst, developer, Utrecht, the Netherlands
Website: www.grauw.nl. Working @ www.roughcookie.com

