Proposed strategy to port Mercurial to Python 3
Victor Stinner
victor.stinner at haypocalc.com
Wed Nov 2 17:35:53 CDT 2011
On Wednesday, 2 November 2011 at 23:08:43, you wrote:
> > something like:
> > text = b('bytes') # instead of b'bytes'
>
> Interesting, but probably not sufficient..
I didn't write that it would be sufficient, but it is required to share the
same code base for Python 2 and 3. As I wrote, porting Mercurial to Python 3
should be done step by step. Marking byte strings with b() would be a first step.
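
A minimal sketch of what such a b() helper could look like, assuming it only
ever receives ASCII or Latin-1 literals. The implementation below is an
illustration, not code taken from Mercurial:

```python
import sys

if sys.version_info[0] >= 3:
    def b(s):
        # On Python 3 a plain literal is text, so encode it to bytes.
        # latin-1 maps code points 0-255 one-to-one onto byte values.
        return s.encode('latin-1')
else:
    def b(s):
        # On Python 2 a plain literal is already a byte string.
        return s

text = b('bytes')  # instead of b'bytes'
```

The same source file then yields a byte string on both major versions without
relying on the b'' literal syntax.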
> Python 3.x bytes objects are
> crippled relative to Python 2.x str objects because they've taken away
> some of the string-oriented methods like '%'. Also, b"a"[0] = "a" on Py2
> and 97 on Py3.
There are tricks to write code that works on both Python 2 and 3, e.g.
b'abc'[0:1] == b'a' is True on Python 2 and 3. I started to use such tricks in
Mercurial.
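
The slicing trick can be sketched as follows. The iterbytes() helper is a
hypothetical name used only for this illustration:

```python
data = b'abc'

# Indexing differs: data[0] is 'a' on Python 2 but the integer 97 on Python 3.
# Slicing returns a one-byte string on both versions.
first = data[0:1]

def iterbytes(s):
    # Hypothetical helper: split a byte string into one-byte strings,
    # using slices so the result is the same on Python 2 and 3.
    return [s[i:i + 1] for i in range(len(s))]

pieces = iterbytes(data)
```

Code that compares or concatenates the result of a slice never sees an
integer, so it behaves identically on both versions.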
> > Because I don't know Mercurial, it's difficult to understand if bytes or
> > Unicode type should be used.
>
> The short answer is that Python Unicode objects (whether they're called
> str or unicode by Python) are completely unwelcome in the bulk of the
> Mercurial codebase. Please see:
>
> http://mercurial.selenic.com/wiki/EncodingStrategy
I don't see in this document why Unicode objects are unwelcome. On the
contrary, I see many cases where Unicode objects would solve real issues. The
best example is Windows. If you store filenames as Unicode, you avoid, for
example, the two following issues:
* "in some circumstances, kernel will accept non-ASCII filenames, list them as
different names, and fail to open the file under the original name"
* "Wide character encodings like Shift-JIS cause trouble here because they
make "\" byte ambiguous". By the way, I don't understand this sentence: "\" is
ambiguous in byte strings, not in Unicode strings. In Shift-JIS, the yen sign
(U+00A5, ¥) is encoded to 0x5C, which is the byte used for a backslash
(U+005C, \) in ASCII.
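
A short sketch of why the byte is ambiguous while the character is not. It
uses U+8868, a kanji whose Shift-JIS encoding ends with the byte 0x5C. This
character is chosen for illustration and is not mentioned in the wiki page:

```python
# U+8868 is a single kanji; written as an escape to stay ASCII-only.
name = u'\u8868'

# In Shift-JIS it becomes two bytes, and the second one is 0x5C.
encoded = name.encode('shift_jis')

# A byte-level search finds a "backslash" that is really half a character.
backslash_in_bytes = b'\\' in encoded

# The same search on the Unicode string finds nothing.
backslash_in_text = u'\\' in name
```

Splitting the encoded filename on the 0x5C byte would cut the character in
two, whereas splitting the Unicode string on "\" leaves it intact.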
In Python 3, we cannot avoid Unicode: Unicode is everywhere, and it works fine.
Unicode objects also solve "The encoding tracking problem": it is
important to use the right encoding for inputs and outputs, but the
encodings for inputs and outputs are well known (I can provide an exhaustive
list if you like).
> There have been several projects in this area, including a GSoC project
> last year, I suggest you read up on that.
OK, I will look at them. Do you have references to these projects? What is
their status?
Victor