/!\ This page is primarily intended for Mercurial's developers.

/!\ This page is no longer relevant but is kept for historical purposes.

Py3k Port

This document describes the current status of Mercurial's Py3k port. The work described here was developed as part of the Google Summer of Code 2010 program.

1. Summary

Last milestone: "hg manifest" runs successfully (given the manual edits linked below are applied).

Current development: documentation and improvement of the fixers to generalize the manual edits.

2. Objective and constraints

This project's objective is quite clear: to "port" Mercurial to py3k. "Port" is in quotes because this is neither a complete port nor a rewrite: we want Mercurial to run on py3k while maintaining compatibility with Python 2.x. There is an additional constraint, though: Mercurial supports Python from 2.4 onwards, which means the features introduced in 2.6 to ease the porting process can't really be used in the port. Also, refactoring the code to work in both Python 2 and 3 proved to be too much work because:

  1. It would be troublesome to write code that runs unmodified on several Python versions;
  2. It would be a maintenance hell.

Thus, we came to the conclusion that extending 2to3, the Python refactoring tool, was the way to go.

An important aspect of the approach taken is that we stick to working "from the inside out". This means we started with a port of the core C modules, moved on to the extension C modules (inotify only, currently), then removed most warnings issued by Python 2.6 in "3 mode" (a mode that issues warnings for deprecated modules and other incompatible changes), and only then started fixing the Python code.

2.1. "Design" of the port

Following the suggestion given by mpm in a message to the development list, the approach used in this port consisted of:

  1. teach 2to3 to change all strings in the source into bytestrings
  2. fix up the annoying b"A"[0] == 65 behavior
  3. make the minimum amount of other source changes to get it working under 3.x
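
Step 1 can be illustrated without 2to3 itself. The sketch below is not the project's fixer: it swaps in the standard tokenize module for the 2to3 fixer framework, and the name bytesify is made up for illustration. It only prefixes unprefixed string literals, leaves spacing to tokenize.untokenize, would also convert docstrings, and would produce invalid code for non-ASCII literals.

```python
import io
import tokenize


def bytesify(source):
    """Hypothetical helper: prefix every unprefixed string literal with b."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = tok.string
        # Literals that already carry a prefix (b, u, r, ...) are left alone.
        if tok.type == tokenize.STRING and text[0] in "'\"":
            text = "b" + text
        # Two-element tuples make untokenize() choose the spacing itself.
        out.append((tok.type, text))
    return tokenize.untokenize(out)


converted = bytesify("name = 'hg'\n")
namespace = {}
exec(converted, namespace)
print(type(namespace["name"]))
```

The converted source is valid Python, but its whitespace may differ from the input, which is one reason a real fixer works on a syntax tree instead.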

The decision in step 1 is acceptable for Mercurial's code because "There are basically no Unicode objects 'in the wild' in Mercurial. Their usage is more or less restricted to a couple transcoding function in encoding.py where they can't hurt anybody." 1

A corollary of the above is that paths and file contents should be treated as bytestrings by hg.
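
The indexing behavior mentioned in step 2 is easy to reproduce on Python 3. The snippet below is a small illustration, not code from the port; byteat is a hypothetical helper name.

```python
data = b"A"

first_as_int = data[0]      # indexing bytes yields an int on Python 3
first_as_bytes = data[0:1]  # slicing yields a length-1 bytes object


def byteat(buf, i):
    """Hypothetical helper: return the i-th byte as a bytes object."""
    # bytes([n]) builds a one-byte object from the integer value.
    return bytes([buf[i]])


print(first_as_int, first_as_bytes, byteat(b"hg", -1))
```

Code that compares buf[0] against a one-character string therefore has to be rewritten to slice, or to go through a helper of this kind.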

Another aspect of the port is related to how strings as bytes and strings as text interact. Currently, the common cases we'll have are:

 a) ui.write(repo[rev].user())  # username is transcoded to local encoding
 b) ui.write(_("abort: can't do that")) # translated and possibly transcoded
 c) ui.write('debug message') # debug messages aren't translated
 d) ui.write(repo[rev][file].data()) # raw file data

We also have many instances of:

 e) ui.write('debug message: %s\n' % somerawdata) # cases c and d
 f) ui.write(_('some message: %s\n') % somerawdata) # cases b and d

The trickiest ones are cases e) and f). For those, the decision in the port was to convert every string formatting call that could operate on bytes objects into a call to a function that can handle this kind of formatting properly, since py3k can neither format bytes nor mix bytes and unicode. Some examples follow:

$ python3
Python 3.1.2 (r312:79147, Apr  1 2010, 09:12:21) 
[GCC 4.4.3 20100316 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> "%s" % "foo"
'foo'
>>> "%s" % b"foo"
"b'foo'"
>>> b"%s" % "foo"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for %: 'bytes' and 'str'
>>> b"%s" % b"foo"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for %: 'bytes' and 'bytes'

In particular, the case "%s" % b"foo" only ran without error because b"foo".__repr__() returns "b'foo'", which is not the result we want. So, to prevent these problems at runtime, we decided to convert the formatting calls to a specialized formatter capable of handling these cases.
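
To make the idea of a specialized formatter concrete, here is a minimal sketch. It is not the formatter used in the port: the name bformat is hypothetical, only the %s directive is supported (no %%, %d or mapping keys), and encoding text arguments as UTF-8 is an assumption made for the example.

```python
def bformat(fmt, *args):
    """Hypothetical helper: substitute each %s in a bytes template."""
    pieces = fmt.split(b"%s")
    if len(pieces) - 1 != len(args):
        raise TypeError("wrong number of arguments for format")
    out = [pieces[0]]
    for arg, piece in zip(args, pieces[1:]):
        if isinstance(arg, str):
            # Assumption: text arguments are encoded as UTF-8.
            arg = arg.encode("utf-8")
        elif not isinstance(arg, bytes):
            arg = str(arg).encode("ascii")
        out.append(arg)
        out.append(piece)
    return b"".join(out)


print(bformat(b"debug message: %s\n", b"raw data"))
print(bformat(b"rev %s by %s", 7, "mpm"))
```

A fixer can then rewrite an expression such as 'debug message: %s\n' % somerawdata into a call of this shape, so the result stays a bytes object.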

3. Current Status

3.1. Milestones

  1. Port of the core C modules to py3k (done)
  2. Port of inotify's C modules to py3k (done)
  3. Removal of most of the warnings issued by python2.6 run with the -3 switch (done)
  4. Implementation of a setup.py-like script that calls 2to3 with our custom fixers (done)
  5. Implementation of a fixer that translates strings into bytestrings (done)
  6. Implementation of a fixer to handle formatting with bytes (b'%s' % 'foo') (done)
  7. Implementation of a fixer to handle module name changes not reported by 2.6 (implemented, but not applied)
  8. Fix demandimport on py3k (done in 50a4e55aa278)
  9. Incorporate the manual edits into fixers and wrappers to eliminate the need for manual changes
  10. Finish the port (catch-all rule)

3.2. Existing problems

The most important problem we have now is that changing all strings to bytestrings doesn't work. Some Python 3 APIs only accept (unicode) string arguments (like open's "mode" argument), and some "special elements" only accept (unicode) strings, like __slots__ entries, the return value of repr and keyword argument names. All of them must be of the string type.
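
The cases listed above can be checked directly on Python 3. The snippet below is only an illustration; every name in it is made up for the example.

```python
import os


def rejects_bytes(action):
    """Return True if calling action() raises TypeError."""
    try:
        action()
    except TypeError:
        return True
    return False


def open_with_bytes_mode():
    open(os.devnull, b"r").close()


def slots_with_bytes():
    type("C", (object,), {"__slots__": (b"field",)})


def kwargs_with_bytes():
    dict(**{b"key": 1})


class BytesRepr(object):
    def __repr__(self):
        return b"<BytesRepr>"


def repr_returning_bytes():
    repr(BytesRepr())


results = {
    "open mode": rejects_bytes(open_with_bytes_mode),
    "__slots__": rejects_bytes(slots_with_bytes),
    "kwargs": rejects_bytes(kwargs_with_bytes),
    "repr": rejects_bytes(repr_returning_bytes),
}
print(results)
```

Each entry is True when the interpreter refuses the bytes object, which is what a blanket string-to-bytes conversion runs into.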

3.3. Where to go from now?

I can foresee three approaches to finishing the port. I'll briefly define them and then discuss their pros and cons:

  1. Take the approach taken by other projects: refactor all of hg's strings to be defined either as bytes or unicode on a case-by-case basis;
  2. Define and use some kind of marker to mark strings that shouldn't be translated to bytes when converted by 2to3, then audit all the code marking those;
  3. Take a trial-and-error approach that consists of applying the fixers, seeing what works and what doesn't, and then:
    a. Refactoring the fixers to handle the broken cases;
    b. Monkeypatching the failing APIs to add more forgiving ones.

The main problem with option 1 is that it requires auditing the whole Mercurial source code, and it probably won't be very helpful, as most strings will be marked as bytes anyway (see the "'Design' of the port" section for why). This takes us to option 2: with it, we only mark the strings that should be understood as text in the source and let the fixers convert the rest. We might have much less work, and things might work just as well as with option 1.

I'd like to elaborate a bit more on option 2 before going to the last one: as you might have guessed, py3k enforces the separation of strings and bytes to the point of inflicting immeasurable pain: __slots__ won't take bytes entries (because members are referenced as strings), kwargs won't take bytes keys, and so on. These cases and similar ones are responsible for the many manual edits needed to get anything to work in py3k. There is also the problem that evaluating "b'bytestring' in 'text string'" raises an exception; approach 2 could help solve it.

(To implement option 2, I'd like to define a function that is a no-op, just to mark strings. Additionally, we can take an approach similar to that of sqlalchemy, which separates the Python 2 and Python 3 code using comments that are later extracted by their 2to3 implementation.)
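
A no-op marker of the kind described for option 2 could look like the sketch below. The name textstr is hypothetical, and the fixer that would recognize it is not shown; the snippet also reproduces the "in" failure mentioned earlier.

```python
def textstr(s):
    """Hypothetical no-op marker: s must remain a text string."""
    return s


# A bytes-converting fixer would skip literals wrapped in textstr(...),
# so this value is still usable as open()'s mode after conversion.
mode = textstr("rb")

# Mixing the two types in a containment test fails on Python 3.
try:
    b"bytestring" in "text string"
    mixing_fails = False
except TypeError:
    mixing_fails = True

print(mode, mixing_fails)
```

At runtime the marker costs one function call and changes nothing; its only purpose is to be visible to the conversion tool.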

And then there is option 3. It is rather nice, as we don't need to directly mess with hg's code (and don't have the risk of breaking it). But it also has its problems: running 2to3 isn't instantaneous, making the editing/testing cycle rather awkward.

So far, I've trodden the path of approach 3, using both alternatives 3.a and 3.b, but I'd like to experiment with option 2, as it might make the porting process easier when combined with option 3: we monkeypatch/convert the functions that have problems to be more generic and fix the strings where we aren't allowed to monkeypatch/convert (as when the "in" operator is used).

3.4. Source code

Most of the code developed in this project has already been imported into Mercurial's official repository, which means that pulling from it will give you updated code that is known to mostly work. Additionally, you can clone Renato Cunha's patch queue if you want to test more experimental code and code that hasn't been imported into Mercurial yet.

3.5. How to run it

Highly experimental

This page describes a highly experimental feature that isn't ready yet. It is most useful for enthusiasts that want to know the status of the port and/or who are willing to help on it.

From mercurial's source root, you can run:

python3 contrib/setup3k.py build_ext -i build_py -c -d . build_mo

This is equivalent to running "make local" in hg's source root, except that the python3 interpreter is used and the Python source code is preprocessed by 2to3 before the command exits. The command takes approximately three minutes to run on a five-year-old Athlon64 3000+.

3.6. Example execution with output

$ cd <mercurial_repo_root>
$ HGRC= HGPLAIN= python3 hg manifest -r 1
*** failed to import extension color: __import__() argument 1 must be string, not bytes
*** failed to import extension fetch: __import__() argument 1 must be string, not bytes
*** failed to import extension purge: __import__() argument 1 must be string, not bytes
*** failed to import extension convert: __import__() argument 1 must be string, not bytes
*** failed to import extension graphlog: __import__() argument 1 must be string, not bytes
*** failed to import extension bookmarks: __import__() argument 1 must be string, not bytes
*** failed to import extension gpg: __import__() argument 1 must be string, not bytes
*** failed to import extension progress: __import__() argument 1 must be string, not bytes
*** failed to import extension mq: __import__() argument 1 must be string, not bytes
*** failed to import extension hgk: __import__() argument 1 must be string, not bytes
*** failed to import extension rebase: __import__() argument 1 must be string, not bytes
*** failed to import extension pager: __import__() argument 1 must be string, not bytes
*** failed to import extension patchbomb: __import__() argument 1 must be string, not bytes
*** failed to import extension record: __import__() argument 1 must be string, not bytes
*** failed to import extension churn: __import__() argument 1 must be string, not bytes
*** failed to import extension extdiff: __import__() argument 1 must be string, not bytes
*** failed to import extension transplant: __import__() argument 1 must be string, not bytes
.hgignore
PKG-INFO
README
hg
mercurial/__init__.py
mercurial/byterange.py
mercurial/fancyopts.py
mercurial/hg.py
mercurial/mdiff.py
mercurial/revlog.py
mercurial/transaction.py
notes.txt
setup.py
tkmerge

CategoryGsoc

Py3kPort (last edited 2012-10-25 20:48:22 by mpm)