Proposed strategy to port Mercurial to Python 3

Wed Nov 2 17:35:53 CDT 2011

On Wednesday 2 November 2011 23:08:43, you wrote:
> > something like:
> >     text = b('bytes')   # instead of b'bytes'
> 
> Interesting, but probably not sufficient..

I didn't write that it would be sufficient, but it is required to share the
same code base between Python 2 and 3. As I wrote, porting Mercurial to
Python 3 should be done step by step. Marking byte strings with b() would be
a first step.
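
A minimal sketch of such a b() helper, to show the idea. The name PY3 and the
choice of latin-1 are my assumptions for illustration, not existing Mercurial
code:

```python
import sys

# Hypothetical flag: True when running on Python 3 or later.
PY3 = sys.version_info[0] >= 3

if PY3:
    def b(s):
        """Encode a str literal to bytes (latin-1 maps U+0000-U+00FF 1:1)."""
        return s.encode('latin-1')
else:
    def b(s):
        """On Python 2, a str literal is already a byte string."""
        return s

text = b('bytes')   # instead of b'bytes'
```

With latin-1, each code point below 256 maps to the byte with the same value,
so escapes like '\xff' in a literal keep their byte value on Python 3.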

> Python 3.x bytes objects are
> crippled relative to Python 2.x str objects because they've taken away
> some of the string-oriented methods like '%'. Also, b"a"[0] = "a" on Py2
> and 97 on Py3.

There are tricks to write code that works on both Python 2 and 3, e.g.
b'abc'[0:1] == b'a' is True on Python 2 and 3. I have started to use such
tricks in Mercurial.
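
For example, the slicing trick can be sketched like this. The iterbytes()
helper is a hypothetical name I made up for illustration:

```python
data = b'abc'

# Indexing differs: data[0] is 'a' on Python 2 but the integer 97 on Python 3.
# Slicing returns a byte string of length 1 on both versions.
first = data[0:1]

def iterbytes(s):
    """Yield each byte of s as a length-1 byte string (hypothetical helper)."""
    for i in range(len(s)):
        yield s[i:i + 1]

parts = list(iterbytes(data))
```

Here first is b'a' and parts is a list of three length-1 byte strings on both
versions, so comparisons against byte literals behave the same way.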

> > Because I don't know Mercurial, it's difficult to understand if bytes or
> > Unicode type should be used.
> 
> The short answer is that Python Unicode objects (whether they're called
> str or unicode by Python) are completely unwelcome in the bulk of the
> Mercurial codebase. Please see:
> 
> http://mercurial.selenic.com/wiki/EncodingStrategy

I don't see in this document why Unicode objects are unwelcome. On the
contrary, I see many cases where Unicode objects would solve real issues. The
best example is Windows. If you store filenames as Unicode, you avoid, for
example, the following two issues:

 * "in some circumstances, kernel will accept non-ASCII filenames, list them as 
different names, and fail to open the file under the original name"

 * "Wide character encodings like Shift-JIS cause trouble here because they
make "\" byte ambiguous". By the way, I don't understand this sentence: "\" is
ambiguous in byte strings, not in Unicode strings. In Shift-JIS, the yen sign
(U+00A5, ¥) is encoded to 0x5C, which is also the byte used for a backslash
(U+005C, \).
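
A small sketch of why the ambiguity exists only at the byte level. It uses a
different case from the yen sign: a two-byte Shift-JIS character whose second
byte is 0x5C. The variable names are mine, and I assume Python's 'shift_jis'
codec:

```python
# U+8868 is a CJK character; in Shift-JIS it is encoded as two bytes,
# and the second byte is 0x5C, the same value as an ASCII backslash.
name = u'\u8868'
encoded = name.encode('shift_jis')

# A byte-level search finds a "backslash" that is not a path separator...
has_backslash_bytes = b'\\' in encoded

# ...while the Unicode string contains no backslash at all.
has_backslash_text = u'\\' in name
```

Code that splits the encoded bytes on b'\\' would cut this character in half,
whereas splitting the Unicode string on u'\\' leaves it intact.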

In Python 3, we cannot avoid Unicode: Unicode is everywhere, and it works fine.

Unicode objects also solve "The encoding tracking problem": it is
important to use the right encoding for inputs and outputs, but the
encodings for inputs and outputs are well known (I can provide an exhaustive
list if you like).

> There have been several projects in this area, including a GSoC project
> last year, I suggest you read up on that.

Ok, I will look at them. Do you have references to these projects? What is
their status?

Victor