Differences between revisions 1 and 9 (spanning 8 versions)
Revision 1 as of 2017-11-24 18:26:41
Size: 7907
Editor: GregorySzorc
Comment: initial
Revision 9 as of 2017-11-24 19:44:07
Size: 14668
Editor: GregorySzorc
Comment:
Line 5: Line 5:
 * `hg` is a Rust binary on all platforms  * `hg` is a Rust binary that embeds and uses a Python interpreter when appropriate (`hg` is a Python script today)
Line 11: Line 11:
== Current Status ==

''(last updated November 2017)''

 * No Rust in Mercurial core
 * Facebook's [[https://github.com/facebookexperimental/mononoke|Mononoke Mercurial server]] is written in Rust. There are tons of Mercurial primitives implemented in that repo.
 * Facebook's [[https://bitbucket.org/facebook/hg-experimental|hg-experimental]] repository has some Python extensions written in Rust.

== Priorities for Oxidation ==

All existing C code is a priority for oxidation because we don't like maintaining C code for safety and compatibility reasons. Existing C code includes:

 * base85 routines
 * diffing and patch application (`bdiff` and `mpatch`)
 * manifest parsing
 * revlog index
 * dirstate
 * path normalization, including case folding detection
 * chg

In addition, the following would be good candidates for oxidation:

 * All revlog I/O (reading is more important than writing)
 * Working directory I/O (extracting content from revlogs/store and writing to filesystem)
 * bundle2 reading and writing
 * changelog reading
 * revsets
 * All filesystem I/O (allows us to use Windows APIs and properly handle filenames on Windows)
Line 12: Line 41:
Line 14: Line 42:
Line 31: Line 58:
Mixing CRTs is dangerous because if you attempt to perform a multipart operation with multiple CRTs, things could blow up. e.g. if you `malloc()` in CRT A and `free()` in CRT B. Or attempt to operate on `FILE` instances across CRTs. More info at https://docs.microsoft.com/en-us/cpp/c-runtime-library/potential-errors-passing-crt-objects-across-dll-boundaries. Mixing CRTs is dangerous because if you attempt to perform a multipart operation with multiple CRTs, things could blow up. e.g. if you `malloc()` in CRT A and `free()` in CRT B. Or attempt to operate on `FILE` instances across CRTs. More info at https://docs.microsoft.com/en-us/cpp/c-runtime-library/potential-errors-passing-crt-objects-across-dll-boundaries. See also https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/crt-alphabetical-function-reference for a full list of CRT functions.
Line 41: Line 68:
=== Rust Support ===
Mercurial relies on other entities (like Linux distros) to package and distribute Mercurial. This means we have to consider their support for packaging programs that use Rust or else we risk losing packagers. This means we need to consider:

 * The minimum version of Rust to require
 * Whether we can use beta or nightly Rust features

For official Mercurial distributions, these considerations don't exist, as we'll be giving a binary to end-users. So this topic is all about our relationship with downstream packagers.
Line 42: Line 76:
Line 46: Line 79:
Line 58: Line 90:
Line 64: Line 95:
Line 72: Line 102:
Line 76: Line 105:
Line 80: Line 108:
Line 85: Line 112:
`pip` is tailored towards Python applications. If Mercurial is a Rust application and Python is an implementation detail, does it make sense to use `pip` and PyPI as a distribution channel?
Line 91: Line 118:
=== Support for PyPy / non-CPython Pythons ===
There exist Python distributions beyond the official CPython distribution. PyPy is likely the one of most interest to us because of its performance advantages.

The cost to supporting non-CPython Pythons when `hg` is a Rust binary could be very high. That would likely significantly curtail the use of the CPython API. Instead, we'd have to do interop via `ctypes` or `cffi` or provide N ways to do interop.

It's worth noting that if Mercurial is a self-contained application, we could potentially swap out CPython for PyPy. We could go as far as to drop CPython support completely.

=== Rust <=> Python Interop ===
Rust and Python code will need to call into each other. (Although it is anticipated that the bulk of the calling will be from Python into Rust code - at least initially.)

There are many options for us here.

 * python27-sys, python3-sys
 * [[https://github.com/dgrunwald/rust-cpython|rust-cpython]]
 * [[https://github.com/PyO3/pyo3|PyO3]]
 * [[https://github.com/getsentry/milksnake|milksnake]] - see also https://blog.sentry.io/2017/11/14/evolving-our-rust-with-milksnake
 * Roll our own / vanilla ctypes/cffi

`python27-sys` and `python3-sys` are low-level Rust bindings to the CPython API. Lots of `unsafe {}` code here.

`rust-cpython` and `PyO3` are higher-level bindings to `python27-sys` and `python3-sys`. They are what you want to use for day-to-day Rust programming.

PyO3 is a fork of rust-cpython. It seems to be a bit nicer. But it requires Nightly Rust features.

Milksnake uses Rust's `cbindgen` crate to automatically generate Python `cffi` bindings to Rust libraries. Essentially, you write a Rust library that exports symbols and milksnake can generate a Python binding to it. There's a lot going on. But it is definitely an interesting approach. And some of the components are useful without the rest of milksnake. e.g. the idea of using `cbindgen` + `cffi` to generate low-level Python bindings. Because Milksnake uses `cffi`, the approach should work with both CPython and PyPy.

A major reason for adopting Rust (and C before that) is performance. We know from Mercurial's C extensions that the benefit of native code is often vastly undermined by a) crossing the Python<->native boundary and b) excessive use of the Python API from native code. For example, obsolescence marker parsing is ~100x faster in C. However, once you construct a `PyObject` for each of the parsed markers, it is only 2-4x faster.

We know that using `ctypes` to call from Python into native code is significantly slower than binary Python extensions. Although if the number of function calls and data being transferred across the boundary is small, this difference isn't as pronounced. Rust will enable us to write more functionality in native code (we try to avoid writing C today for maintainability and security reasons). So the performance of the Python<->native bridge will be more important over time. Therefore, it seems prudent to rule out `ctypes`. That leaves us with extensions or CFFI.

=== Reconciling `hg` with Rust extensions ===

Initially, `hg` will be a minimal Rust binary that embeds a Python interpreter. It simply tells the interpreter to invoke Mercurial's `main()` function. In this world, other Rust functionality is likely loaded via shared libraries or Python extensions. In other words, we have multiple Rust ''contexts'' running from different binaries (an executable and a shared library). The executable handles very early process activity. The shared library handles business logic.

Over time, we'll likely want to expand the role of Rust for early process activity. For example, we'll need to implement some command line processing in Rust for `chg` functionality. We may also want to implement config file loading (we need to rewrite the config parser anyway to facilitate writing back config changes). And, if we could load a repo from disk and maybe even implement performance critical commands (like `hg status`) from pure Rust, this would likely be a massive performance win. (Although we have to consider how this will interact with extensibility.)

What this means is that we'll have multiple Rust binaries holding Mercurial state. This feels brittle. Ideally we'd have a single Rust binary. If Python needed to call into native/Rust code, it would get those symbols from the parent `hg` binary instead of from a shared library. It is unclear how this would work. It is obviously possible to resolve the address of a symbol in the current binary. But existing "call native code" mechanisms in Python seem to assume that symbols are coming from loaded libraries, not the current executable. This may require modifications to `cffi` or some custom code to generate the Python ''bindings'' to executable-local symbols.

Using Rust in Mercurial

This page describes the plan and status for leveraging the Rust programming language in Mercurial.

Desired End State

  • hg is a Rust binary that embeds and uses a Python interpreter when appropriate (hg is a Python script today)

  • Python code seamlessly calls out to functionality implemented in Rust
  • Fully self-contained Mercurial distributions are available (Python is an implementation detail / Mercurial sufficiently independent from other Python presence on system)
  • The customizability and extensibility of Mercurial through extensions is not significantly weakened.
  • chg functionality is rolled into hg

Current Status

(last updated November 2017)

  • No Rust in Mercurial core
  • Facebook's Mononoke Mercurial server is written in Rust. There are tons of Mercurial primitives implemented in that repo.

  • Facebook's hg-experimental repository has some Python extensions written in Rust.

Priorities for Oxidation

All existing C code is a priority for oxidation because we don't like maintaining C code for safety and compatibility reasons. Existing C code includes:

  • base85 routines
  • diffing and patch application (bdiff and mpatch)

  • manifest parsing
  • revlog index
  • dirstate
  • path normalization, including case folding detection
  • chg

In addition, the following would be good candidates for oxidation:

  • All revlog I/O (reading is more important than writing)
  • Working directory I/O (extracting content from revlogs/store and writing to filesystem)
  • bundle2 reading and writing
  • changelog reading
  • revsets
  • All filesystem I/O (allows us to use Windows APIs and properly handle filenames on Windows)

Problems

CRT Mismatch on Windows

Mercurial still uses Python 2.7. Python 2.7 is officially compiled with MSVC 2008 and links against msvcr90.dll. Rust and its standard library don't support MSVC 2008. They are likely linked with something newer, like MSVC 2015 or 2017.

If we want compatibility with other binary Python extensions, we need to use a Python built with MSVC 2008 and linked against msvcr90.dll.

So, our options are:

  1. Build a custom Python 2.7 distribution with modern MSVC and drop support for 3rd party binary Python 2.7 extensions.
  2. Switch Mercurial to Python 3 and build Rust code with same toolchain as Python we target.
  3. Mix the CRTs.

#1 significantly undermines Mercurial's extensibility. Plus, Python 2.7 built with anything other than MSVC 2008 isn't officially supported.

#2 is in progress. However, the timeline for officially supporting Python 3 to the point where we can transition the official distribution to it is likely too far out (2H 2018) and would hinder Rust adoption efforts.

That leaves mixing the CRTs. This would work by having the Rust components statically link a modern CRT while having Python dynamically load msvcr90.dll.

Mixing CRTs is dangerous because if you attempt to perform a multipart operation with multiple CRTs, things could blow up. e.g. if you malloc() in CRT A and free() in CRT B. Or attempt to operate on FILE instances across CRTs. More info at https://docs.microsoft.com/en-us/cpp/c-runtime-library/potential-errors-passing-crt-objects-across-dll-boundaries. See also https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/crt-alphabetical-function-reference for a full list of CRT functions.

Fortunately, our exposure to the multiple CRT problem is significantly reduced because:

  • Rust and its standard library don't make heavy use of CRT primitives.
  • Memory managed by Rust and Python is already being kept separate by the Python API. In Rust speak, we won't be transferring ownership of raw pointers between Rust and Python. Python's refcounting mechanism ensures all PyObject instances are destroyed by Python. The only time ownership of memory crosses the bridge is when we create something in Rust and pass it to Python. But that object will be a PyObject and its backing memory will have been managed with the Python APIs.

  • We shouldn't be using FILE anywhere. And I/O on an open file descriptor would likely be limited to its created context. e.g. if we open a file from Rust, we're likely not reading it from Python.

We would have to keep a close eye out for CRT objects spanning multiple CRTs. We can mitigate exposure for bad patterns by establishing static analysis rules on source code. We can also examine the produced Rust binaries for symbol references and raise warnings when unwanted CRT functions are used by Rust code.
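
The binary check described above could be as simple as comparing the symbol names a binary references against a denylist. The following is a minimal sketch, not an existing Mercurial tool: the denylist contents and the function name flag_crt_symbols are hypothetical, and it assumes the list of referenced symbols has already been extracted by some external tool.

```python
# Hypothetical denylist of CRT functions whose objects must not cross CRTs.
UNWANTED_CRT_SYMBOLS = frozenset([
    "malloc", "free", "realloc",
    "fopen", "fclose", "fread", "fwrite",
])


def flag_crt_symbols(referenced_symbols, denylist=UNWANTED_CRT_SYMBOLS):
    """Return the sorted list of referenced symbols that are on the denylist."""
    return sorted(set(referenced_symbols) & set(denylist))


# Example input: symbol names as they might be reported for a built binary.
referenced = ["CreateFileW", "malloc", "WriteFile", "fopen"]
for name in flag_crt_symbols(referenced):
    print("warning: unwanted CRT function referenced: %s" % name)
```

A check like this only raises warnings. Deciding which CRT functions are actually dangerous in our usage would still need human review.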

Rust Support

Mercurial relies on other entities (like Linux distros) to package and distribute Mercurial. This means we have to consider their support for packaging programs that use Rust or else we risk losing packagers. This means we need to consider:

  • The minimum version of Rust to require
  • Whether we can use beta or nightly Rust features

For official Mercurial distributions, these considerations don't exist, as we'll be giving a binary to end-users. So this topic is all about our relationship with downstream packagers.

Packaging Overhaul Needed

If hg becomes a Rust binary and we want Mercurial to be a self-contained application, we'll need to overhaul our packaging mechanisms on all operating systems.

Distributing Python

Mercurial would need to distribute a copy of Python.

Python insists that embedded Python load a pythonXX shared library. e.g. python27.dll or libpython2.7.so.

We would also need to distribute a copy of the Python standard library (.py, .pyc, etc files). These could be distributed in flat form (hundreds of .py files) or in a zip file. (Python supports importing modules from zip files.) If we wanted to get creative, we could invent our own archive format / module loading mechanism (but this feels like unnecessary work).
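
As a small illustration of the zip option, Python will import a module straight out of an archive once that archive is on sys.path. This is only a sketch: the module name hgzipdemo and the helper import_from_zip are made up for the example, and a real distribution would ship a prebuilt archive rather than create one at runtime.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def import_from_zip(module_name, source):
    """Write `source` into a fresh zip archive and import it from there."""
    tmpdir = tempfile.mkdtemp()
    archive = os.path.join(tmpdir, "stdlib_demo.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr(module_name + ".py", source)

    # Putting the archive itself on sys.path enables the zip importer.
    sys.path.insert(0, archive)
    try:
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
    finally:
        sys.path.remove(archive)


mod = import_from_zip("hgzipdemo", "GREETING = 'loaded from zip'\n")
print(mod.GREETING)
print(mod.__file__)  # path inside the archive
```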

We can't prune the Python standard library of unused modules because Mercurial extensions may make use of any feature in the standard library. So we'll be distributing the entire Python standard library.

But the distribution of Python is not required: various packagers (like operating systems) would want Mercurial to use a Python provided to it. So our Rust hg needs to support loading a bundled Python and a Python provided to it. This can likely be controlled with build-time flags.

Windows

Mercurial could conceptually be distributed as a .zip file. That archive would contain pre-built hg.exe, pythonXX.dll, any other shared library dependencies, a copy of the Python standard library, Mercurial Python files, and any support files.

Because zip files aren't user friendly, we'd likely provide a standalone .exe or .msi installer (like we do today).

Linux

We could provide a self-contained archive file containing the hg binary, libpython2.7.so, and any other dependencies. We could also provide rpm, deb, etc packages for popular distributions. These would be self-contained and not dependent on many (any?) other packages. Our biggest concern here is libc compatibility. That can be solved by static linking, compiling against a sufficiently old (and compatible) libc, or providing distro-specific packages.

Of course, many distros will want to provide their own Mercurial package. And they will likely want Mercurial to make use of the system Python. We can and must support this.

An issue with a self-contained distribution is loading of shared libraries. Not all operating systems and loaders may support loading of binary-relative shared libraries. We may need to hack something together that uses dlopen() to explicitly specify which libpython2.7.so, etc to load.
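
To make the explicit-load idea concrete, here is a rough sketch of resolving a library path relative to the executable and loading it by full path. It is written in Python with ctypes (whose CDLL uses dlopen() on POSIX) purely for illustration; the real logic would live in the Rust hg binary. The "lib" subdirectory layout, the library name and both function names are assumptions.

```python
import ctypes
import os


def bundled_library_path(executable, name, subdir="lib"):
    """Return the path of `name` in a directory next to the executable.

    The "<exe dir>/lib/<name>" layout is a hypothetical choice.
    """
    base = os.path.dirname(os.path.realpath(executable))
    return os.path.join(base, subdir, name)


def load_bundled(executable, name):
    """Load the bundled library by explicit path, or return None if absent."""
    path = bundled_library_path(executable, name)
    if not os.path.exists(path):
        return None
    # An explicit path bypasses the loader's normal search order.
    return ctypes.CDLL(path)


print(bundled_library_path("/opt/hg/bin/hg", "libexample.so"))
```

Returning None when the bundled library is missing leaves room for falling back to a Python provided by the system.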

MacOS

This is very similar to Linux. We may support the native application / installer mechanism to make things more user friendly. We don't have good support for this today. So it is likely most users will rely on Homebrew or MacPorts for installation.

BSDs / Solaris / Etc

Basically the same strategy as Linux.

PyPI / pip

We support installing Mercurial via pip today. We upload a source distribution to PyPI and anyone can pip install Mercurial to install Mercurial in their Python environment. On Windows (where users can't easily compile binary Python extensions), we provide Python wheels with pre-built Mercurial binaries.

The future of pip install Mercurial with an oxidized Mercurial is less clear.

pip is tailored towards Python applications. If Mercurial is a Rust application and Python is an implementation detail, does it make sense to use pip and PyPI as a distribution channel?

pip install Mercurial is very convenient (at least for the people that have pip installed and can run it). It is certainly easier than downloading and running an installer. So unless we bake an upgrade facility into Mercurial itself, pip install Mercurial is the next best thing for upgrading after the system package manager (apt, yum, brew, port, etc).

pip install Mercurial goes through a well-defined mechanism to take the artifact it downloaded from PyPI to install it. This mechanism could be abused to facilitate the use of PyPI/pip for distributing a self-contained Mercurial distribution. e.g. the user would end up with a Rust binary in PYTHONHOME/bin/hg that loads a custom version of Python and is fully self-contained and isolated from the Python it was pip installed into. This would be super hacky. It may not even be allowed by PyPI's hosting terms of service? But we could certainly abuse pip install if we needed to.

Support for PyPy / non-CPython Pythons

There exist Python distributions beyond the official CPython distribution. PyPy is likely the one of most interest to us because of its performance advantages.

The cost to supporting non-CPython Pythons when hg is a Rust binary could be very high. That would likely significantly curtail the use of the CPython API. Instead, we'd have to do interop via ctypes or cffi or provide N ways to do interop.

It's worth noting that if Mercurial is a self-contained application, we could potentially swap out CPython for PyPy. We could go as far as to drop CPython support completely.

Rust <=> Python Interop

Rust and Python code will need to call into each other. (Although it is anticipated that the bulk of the calling will be from Python into Rust code - at least initially.)

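There are many options for us here.

  • python27-sys, python3-sys
  • rust-cpython
  • PyO3
  • milksnake - see also https://blog.sentry.io/2017/11/14/evolving-our-rust-with-milksnake
  • Roll our own / vanilla ctypes/cffi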

python27-sys and python3-sys are low-level Rust bindings to the CPython API. Lots of unsafe {} code here.

rust-cpython and PyO3 are higher-level bindings to python27-sys and python3-sys. They are what you want to use for day-to-day Rust programming.

PyO3 is a fork of rust-cpython. It seems to be a bit nicer. But it requires Nightly Rust features.

Milksnake uses Rust's cbindgen crate to automatically generate Python cffi bindings to Rust libraries. Essentially, you write a Rust library that exports symbols and milksnake can generate a Python binding to it. There's a lot going on. But it is definitely an interesting approach. And some of the components are useful without the rest of milksnake. e.g. the idea of using cbindgen + cffi to generate low-level Python bindings. Because Milksnake uses cffi, the approach should work with both CPython and PyPy.

A major reason for adopting Rust (and C before that) is performance. We know from Mercurial's C extensions that the benefit of native code is often vastly undermined by a) crossing the Python<->native boundary and b) excessive use of the Python API from native code. For example, obsolescence marker parsing is ~100x faster in C. However, once you construct a PyObject for each of the parsed markers, it is only 2-4x faster.
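
One common way to limit the object construction cost is to keep parsed data in its raw form and only build Python objects for the entries that are actually requested. The sketch below shows the idea in pure Python. It is an illustration, not something this plan commits to: the two-integer record layout is hypothetical (it is not the real obsolescence marker format) and the names parse_eager and LazyRecords are invented here.

```python
import struct

# Hypothetical fixed-size record: two big-endian unsigned 32-bit integers.
RECORD = struct.Struct(">II")


def parse_eager(data):
    """Build one Python tuple per record up front."""
    return [RECORD.unpack_from(data, off)
            for off in range(0, len(data), RECORD.size)]


class LazyRecords(object):
    """Keep the raw buffer; build a tuple only when a record is requested."""

    def __init__(self, data):
        self._data = data

    def __len__(self):
        return len(self._data) // RECORD.size

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        return RECORD.unpack_from(self._data, index * RECORD.size)


data = b"".join(RECORD.pack(i, i * 2) for i in range(1000))
eager = parse_eager(data)   # 1000 tuples exist immediately
lazy = LazyRecords(data)    # no tuples exist until indexed
print(eager[10] == lazy[10])
```

Whether this helps depends on how many entries callers end up touching. If every record is read anyway, the lazy form saves nothing.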

We know that using ctypes to call from Python into native code is significantly slower than binary Python extensions. Although if the number of function calls and data being transferred across the boundary is small, this difference isn't as pronounced. Rust will enable us to write more functionality in native code (we try to avoid writing C today for maintainability and security reasons). So the performance of the Python<->native bridge will be more important over time. Therefore, it seems prudent to rule out ctypes. That leaves us with extensions or CFFI.
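
The per-call overhead of ctypes is easy to measure on a given machine. The sketch below times the same operation (getting the length of a bytes object) through ctypes and through the built-in len(). It is CPython specific because it uses ctypes.pythonapi to reach PyObject_Size. No timings are quoted here because they vary by machine and Python version.

```python
import ctypes
import timeit

# PyObject_Size from the CPython C API, called through ctypes.
py_size = ctypes.pythonapi.PyObject_Size
py_size.argtypes = [ctypes.py_object]
py_size.restype = ctypes.c_ssize_t

data = b"x" * 64


def via_ctypes():
    """Length via a ctypes foreign function call."""
    return py_size(data)


def via_builtin():
    """Length via the built-in, which is implemented in C."""
    return len(data)


if __name__ == "__main__":
    n = 100000
    print("ctypes :", timeit.timeit(via_ctypes, number=n))
    print("builtin:", timeit.timeit(via_builtin, number=n))
```

Both functions return the same value, so any difference in the timings is call overhead rather than work done.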

Reconciling `hg` with Rust extensions

Initially, hg will be a minimal Rust binary that embeds a Python interpreter. It simply tells the interpreter to invoke Mercurial's main() function. In this world, other Rust functionality is likely loaded via shared libraries or Python extensions. In other words, we have multiple Rust contexts running from different binaries (an executable and a shared library). The executable handles very early process activity. The shared library handles business logic.

Over time, we'll likely want to expand the role of Rust for early process activity. For example, we'll need to implement some command line processing in Rust for chg functionality. We may also want to implement config file loading (we need to rewrite the config parser anyway to facilitate writing back config changes). And, if we could load a repo from disk and maybe even implement performance critical commands (like hg status) from pure Rust, this would likely be a massive performance win. (Although we have to consider how this will interact with extensibility.)

What this means is that we'll have multiple Rust binaries holding Mercurial state. This feels brittle. Ideally we'd have a single Rust binary. If Python needed to call into native/Rust code, it would get those symbols from the parent hg binary instead of from a shared library. It is unclear how this would work. It is obviously possible to resolve the address of a symbol in the current binary. But existing "call native code" mechanisms in Python seem to assume that symbols are coming from loaded libraries, not the current executable. This may require modifications to cffi or some custom code to generate the Python bindings to executable-local symbols.
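
For what it's worth, on POSIX systems ctypes can look up symbols that are visible in the running process: ctypes.CDLL(None) passes NULL to dlopen(), which yields a handle for the main program and the libraries already loaded into it. The sketch below resolves getpid that way. It is POSIX only, and lookup_in_process is a name made up for this example. Whether symbols from a Rust hg executable would be visible like this is an open question, since it depends on them being exported in the executable's dynamic symbol table.

```python
import ctypes
import os


def lookup_in_process(name):
    """Resolve `name` among symbols already loaded into this process.

    POSIX only: CDLL(None) corresponds to dlopen(NULL).
    """
    handle = ctypes.CDLL(None)
    func = getattr(handle, name)
    func.restype = ctypes.c_int
    return func


getpid = lookup_in_process("getpid")
print(getpid() == os.getpid())
```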


CategoryNewFeatures

OxidationPlan (last edited 2022-06-25 18:04:59 by MarcosCruz)