Differences between revisions 1 and 4 (spanning 3 versions)
Revision 1 as of 2017-11-24 18:26:41
Size: 7907
Editor: GregorySzorc
Comment: initial
Revision 4 as of 2017-11-24 19:05:07
Size: 11576
Editor: GregorySzorc
Comment:
Using Rust in Mercurial

This page describes the plan and status for leveraging the Rust programming language in Mercurial.

Desired End State

  • hg is a Rust binary that embeds and uses a Python interpreter when appropriate (hg is a Python script today)

  • Python code seamlessly calls out to functionality implemented in Rust
  • Fully self-contained Mercurial distributions are available (Python is an implementation detail / Mercurial sufficiently independent from other Python presence on system)
  • The customizability and extensibility of Mercurial through extensions is not significantly weakened.
  • chg functionality is rolled into hg

Problems

CRT Mismatch on Windows

Mercurial still uses Python 2.7. Python 2.7 is officially compiled with MSVC 2008 and links against msvcr90.dll. Rust and its standard library don't support MSVC 2008. They are likely linked with something newer, like MSVC 2015 or 2017.

If we want compatibility with other binary Python extensions, we need to use a Python built with MSVC 2008 and linked against msvcr90.dll.

So, our options are:

  1. Build a custom Python 2.7 distribution with modern MSVC and drop support for 3rd party binary Python 2.7 extensions.
  2. Switch Mercurial to Python 3 and build Rust code with same toolchain as Python we target.
  3. Mix the CRTs.

#1 significantly undermines Mercurial's extensibility. Plus, Python 2.7 built with anything other than MSVC 2008 isn't officially supported.

#2 is in progress. However, the timeline for officially supporting Python 3 to the point where we can transition the official distribution to it is likely too far out (2H 2018) and would hinder Rust adoption efforts.

That leaves mixing the CRTs. This would work by having the Rust components statically link a modern CRT while having Python dynamically load msvcr90.dll.

Mixing CRTs is dangerous because if you attempt to perform a multipart operation with multiple CRTs, things could blow up. e.g. if you malloc() in CRT A and free() in CRT B. Or attempt to operate on FILE instances across CRTs. More info at https://docs.microsoft.com/en-us/cpp/c-runtime-library/potential-errors-passing-crt-objects-across-dll-boundaries.

Fortunately, our exposure to the multiple CRT problem is significantly reduced because:

  • Rust and its standard library don't make heavy use of CRT primitives.
  • Memory managed by Rust and Python is already being kept separate by the Python API. In Rust speak, we won't be transferring ownership of raw pointers between Rust and Python. Python's refcounting mechanism ensures all PyObject are destroyed by Python. The only time ownership of memory crosses the bridge is when we create something in Rust and pass it to Python. But that object will be a PyObject and backing memory would have been managed with the Python APIs.

  • We shouldn't be using FILE anywhere. And I/O on an open file descriptor would likely be limited to its created context. e.g. if we open a file from Rust, we're likely not reading it from Python.

We would have to keep a close eye out for CRT objects spanning multiple CRTs. We can mitigate exposure for bad patterns by establishing static analysis rules on source code. We can also examine the produced Rust binaries for symbol references and raise warnings when unwanted CRT functions are used by Rust code.
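The symbol audit described above could be sketched roughly as follows. This is only an illustration: the function name `find_unwanted_crt_imports`, the contents of the deny-list, and the assumption that the imported symbol names have already been extracted from the Rust binary (with whatever tool the platform provides) are all hypothetical and not part of any existing Mercurial tooling.

```python
# Hypothetical deny-list: CRT functions whose state or allocations must not
# cross between two different C runtimes. The exact contents are an assumption.
UNWANTED_CRT_SYMBOLS = frozenset([
    "malloc", "calloc", "realloc", "free",   # heap ownership
    "fopen", "fclose", "fread", "fwrite",    # FILE handles
])


def find_unwanted_crt_imports(imported_symbols, denied=UNWANTED_CRT_SYMBOLS):
    """Return the sorted subset of imported symbol names that are denied.

    `imported_symbols` is any iterable of symbol names already extracted
    from the Rust binary; how they are extracted is out of scope here.
    """
    normalized = set()
    for name in imported_symbols:
        # Strip a single leading underscore, a common C symbol decoration.
        normalized.add(name[1:] if name.startswith("_") else name)
    return sorted(normalized & denied)


if __name__ == "__main__":
    sample = ["memcpy", "_malloc", "free", "GetLastError"]
    for symbol in find_unwanted_crt_imports(sample):
        print("warning: Rust binary references CRT function %s" % symbol)
```

A check like this would only raise warnings. Whether a flagged reference is actually dangerous still depends on whether the object it touches crosses the CRT boundary.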

Rust Support

Mercurial relies on other entities (like Linux distros) to package and distribute Mercurial. This means we have to consider their support for packaging programs that use Rust or else we risk losing packagers. This means we need to consider:

  • The minimum version of Rust to require
  • Whether we can use beta or nightly Rust features

For official Mercurial distributions, these considerations don't exist, as we'll be giving a binary to end-users. So this topic is all about our relationship with downstream packagers.

Packaging Overhaul Needed

If hg becomes a Rust binary and we want Mercurial to be a self-contained application, we'll need to overhaul our packaging mechanisms on all operating systems.

Distributing Python

Mercurial would need to distribute a copy of Python.

Python insists that embedded Python load a pythonXX shared library. e.g. python27.dll or libpython2.7.so.

We would also need to distribute a copy of the Python standard library (.py, .pyc, etc files). These could be distributed in flat form (hundreds of .py files) or in a zip file. (Python supports importing modules from zip files.) If we wanted to get creative, we could invent our own archive format / module loading mechanism (but this feels like unnecessary work).
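As a small illustration of the zip option, the sketch below builds a throwaway archive and imports a module from it. The archive name `stdlib.zip` and the module name `bundled_mod` are made up for the example; a real distribution would package the actual standard library.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a throwaway archive containing one module. In a real distribution
# this would be the packaged standard library; "bundled_mod" is a made-up name.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "stdlib.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("bundled_mod.py", "ANSWER = 42\n")

# Putting the archive on sys.path is enough: Python's built-in zipimport
# machinery handles the rest.
sys.path.insert(0, archive)
importlib.invalidate_caches()
bundled_mod = importlib.import_module("bundled_mod")

print(bundled_mod.ANSWER)
print(bundled_mod.__file__)  # the path points inside the zip archive
```

No custom loader is needed for this, which is part of why inventing a new archive format looks like unnecessary work.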

We can't prune the Python standard library of unused modules because Mercurial extensions may make use of any feature in the standard library. So we'll be distributing the entire Python standard library.

But the distribution of Python is not required: various packagers (like operating systems) would want Mercurial to use a Python provided to it. So our Rust hg needs to support loading a bundled Python and a Python provided to it. This can likely be controlled with build-time flags.

Windows

Mercurial could conceptually be distributed as a .zip file. That archive would contain pre-built hg.exe, pythonXX.dll, any other shared library dependencies, a copy of the Python standard library, Mercurial Python files, and any support files.

Because zip files aren't user friendly, we'd likely provide a standalone .exe or .msi installer (like we do today).

Linux

We could provide a self-contained archive file containing the hg binary, libpython2.7.so, and any other dependencies. We could also provide rpm, deb, etc packages for popular distributions. These would be self-contained and not dependent on many (any?) other packages. Our biggest concern here is libc compatibility. That can be solved by static linking, compiling against a sufficiently old (and compatible) libc, or providing distro-specific packages.

Of course, many distros will want to provide their own Mercurial package. And they will likely want Mercurial to make use of the system Python. We can and must support this.

An issue with a self-contained distribution is loading of shared libraries. Not all operating systems and loaders support loading shared libraries from a path relative to the binary. We may need to hack something together that uses dlopen() to explicitly specify which libpython2.7.so, etc to load.
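The explicit-loading idea can be sketched in Python using `ctypes`, which calls dlopen() on POSIX systems. This is only a sketch under stated assumptions: the helper names, the `lib/` directory next to the binary, and the `libexample.so` file name are all hypothetical, and a real Rust hg would do the equivalent from native code.

```python
import ctypes
import os
import sys


def bundled_library_path(executable, library_name, lib_dir="lib"):
    """Return the path of `library_name` in a directory next to `executable`.

    The `lib/` layout is an assumption made for this sketch.
    """
    # Resolve symlinks so that a symlinked launcher still finds its libraries.
    exe_dir = os.path.dirname(os.path.realpath(executable))
    return os.path.join(exe_dir, lib_dir, library_name)


def load_bundled_library(executable, library_name):
    """Load a shared library by explicit path instead of the loader's search."""
    path = bundled_library_path(executable, library_name)
    if not os.path.exists(path):
        raise OSError("bundled library not found: %s" % path)
    # On POSIX, ctypes.CDLL calls dlopen() with the path it is given.
    return ctypes.CDLL(path)


if __name__ == "__main__":
    # "libexample.so" is a placeholder file name.
    print(bundled_library_path(sys.executable, "libexample.so"))
```

Resolving the path from the binary's own location keeps the distribution relocatable without relying on loader-specific features.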

MacOS

This is very similar to Linux. We may support the native application / installer mechanism to make things more user friendly. We don't have good support for this today. So it is likely most users will rely on Homebrew or MacPorts for installation.

BSDs / Solaris / Etc

Basically the same strategy as Linux.

PyPI / pip

We support installing Mercurial via pip today. We upload a source distribution to PyPI and anyone can pip install Mercurial to install Mercurial in their Python environment. On Windows (where users can't easily compile binary Python extensions), we provide Python wheels with pre-built Mercurial binaries.

The future of pip install Mercurial with an oxidized Mercurial is less clear.

pip is tailored towards Python applications. If Mercurial is a Rust application and Python is an implementation detail, does it make sense to use pip and PyPI as a distribution channel?

pip install Mercurial is very convenient (at least for the people that have pip installed and can run it). It is certainly easier than downloading and running an installer. So unless we bake an upgrade facility into Mercurial itself, pip install Mercurial is the next best thing for upgrading after the system package manager (apt, yum, brew, port, etc).

pip install Mercurial goes through a well-defined mechanism to take the artifact it downloaded from PyPI to install it. This mechanism could be abused to facilitate the use of PyPI/pip for distributing a self-contained Mercurial distribution. e.g. the user would end up with a Rust binary in PYTHONHOME/bin/hg that loads a custom version of Python and is fully self-contained and isolated from the Python it was pip installed into. This would be super hacky. It may not even be allowed by PyPI's hosting terms of service? But we could certainly abuse pip install if we needed to.

Support for PyPy / non-CPython Pythons

There exist Python distributions beyond the official CPython distribution. PyPy is likely the one of most interest to us because of its performance advantages.

The cost to supporting non-CPython Pythons when hg is a Rust binary could be very high. That would likely significantly curtail the use of the CPython API. Instead, we'd have to do interop via ctypes or cffi or provide N ways to do interop.

It's worth noting that if Mercurial is a self-contained application, we could potentially swap out CPython for PyPy. We could go as far as to drop support for CPython completely.

Rust <=> Python Interop

Rust and Python code will need to call into each other. (Although it is anticipated that the bulk of the calling will be from Python into Rust code - at least initially.)

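There are many options for us here:

  • python27-sys, python3-sys
  • rust-cpython (https://github.com/dgrunwald/rust-cpython)
  • PyO3 (https://github.com/PyO3/pyo3)
  • milksnake (https://github.com/getsentry/milksnake) - see also https://blog.sentry.io/2017/11/14/evolving-our-rust-with-milksnake
  • Roll our own / vanilla ctypes/cffi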

python27-sys and python3-sys are low-level Rust bindings to the CPython API. Lots of unsafe {} code here.

rust-cpython and PyO3 are higher-level bindings to python27-sys and python3-sys. They are what you want to use for day-to-day Rust programming.

PyO3 is a fork of rust-cpython. It seems to be a bit nicer. But it requires Nightly Rust features.

Milksnake uses Rust's cbindgen crate to automatically generate Python cffi bindings to Rust libraries. Essentially, you write a Rust library that exports symbols and milksnake can generate a Python binding to it. There's a lot going on. But it is definitely an interesting approach. And some of the components are useful without the rest of milksnake. e.g. the idea of using cbindgen + cffi to generate low-level Python bindings. Because Milksnake uses cffi, the approach should work with both CPython and PyPy.

A major reason for adopting Rust (and C before that) is performance. We know from Mercurial's C extensions that the performance of native code is often vastly undermined by a) crossing the Python<->native boundary and b) excessive use of the Python API from native code. For example, obsolescence marker parsing is ~100x faster in C. However, once you construct PyObject for all the parsed markers, it is only 2-4x faster.
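A back-of-the-envelope calculation shows how much of the native path goes into building Python objects. The model is an assumption of this sketch, not a measurement: the pure-Python time is normalized to 1.0, and the native time is taken to be native parsing plus PyObject construction. Only the ~100x and 2-4x figures come from the text above.

```python
def object_construction_share(parse_speedup, overall_speedup):
    """Fraction of native-path time spent building Python objects.

    Model (an assumption of this sketch): pure-Python time is normalized
    to 1.0, and native time = native parsing + PyObject construction.
    """
    native_parse = 1.0 / parse_speedup      # e.g. 1/100 of the baseline
    native_total = 1.0 / overall_speedup    # e.g. 1/2 to 1/4 of the baseline
    construction = native_total - native_parse
    return construction / native_total


if __name__ == "__main__":
    for overall in (2.0, 4.0):
        share = object_construction_share(100.0, overall)
        print("overall %.0fx: %.0f%% of native time is object construction"
              % (overall, share * 100))
```

Under this model, object construction accounts for roughly 96% to 98% of the native path's time, which is why keeping data on the native side of the boundary matters more than making the parser itself faster.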

We know that using ctypes to call from Python into native code is significantly slower than binary Python extensions. Although if the number of function calls and data being transferred across the boundary is small, this difference isn't as pronounced. Rust will enable us to write more functionality in native code (we try to avoid writing C today for maintainability and security reasons). So the performance of the Python<->native bridge will be more important over time. Therefore, it seems prudent to rule out ctypes. That leaves us with extensions or CFFI.
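The bridge overhead can be observed with a rough micro-benchmark like the one below. It is a sketch, not a rigorous measurement: it calls the CPython C API function `PyObject_Size` through `ctypes.pythonapi` and compares that with the builtin `len()`, which reaches the same underlying C function. It only works on CPython, the iteration count is arbitrary, and absolute timings will vary by machine.

```python
import ctypes
import timeit

# PyObject_Size is part of the CPython C API; ctypes.pythonapi exposes it
# without needing any external library. This only works on CPython.
py_size = ctypes.pythonapi.PyObject_Size
py_size.argtypes = [ctypes.py_object]
py_size.restype = ctypes.c_ssize_t

data = list(range(10))

# Both calls reach the same underlying C function; only the bridge differs.
assert py_size(data) == len(data)

if __name__ == "__main__":
    n = 200000
    builtin_time = timeit.timeit(lambda: len(data), number=n)
    ctypes_time = timeit.timeit(lambda: py_size(data), number=n)
    print("builtin len():  %.4fs" % builtin_time)
    print("ctypes bridge:  %.4fs" % ctypes_time)
```

Because the work done per call is tiny, the difference between the two timings is dominated by the cost of crossing the bridge itself.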


CategoryNewFeatures

OxidationPlan (last edited 2022-06-25 18:04:59 by MarcosCruz)