Note:

This page is primarily intended for developers of Mercurial.

Automatic code formatting/file content rewriting

Status: Implemented

Main proponents: DannyHooper hooper@google.com

$/!\$ This is documenting a completed project, but has not been updated to represent changes made while implementing. The documentation below is likely out of date.

Google has an internal extension that runs our standard code formatters (clang-format, gofmt, etc.) on modified code in any set of non-public commits. It amends those commits with formatting fixes in a manner that requires no evolution, merging, or conflict resolution, either by the user or internally (assuming no descendants are excluded and thus left orphaned by the operation). There is also the option to run the formatters on uncommitted changes in the working directory. At a high level, we refer to this feature as 'hg fix'. We run a simple command and all of our curly brackets end up in the right place before we push our drafts. This seems like rich ground for a generic, configurable extension that amends commits with your desired file content changes.

Contents

Automatic code formatting/file content rewriting

Goal

Some projects care enough about code formatting that they run standard auto-formatters on their code so nobody can argue about the result. Formatters change over time, and are sometimes not used consistently, so it's also useful to only apply the formatters to lines of code that change in a given changeset. This avoids introducing unrelated formatting adjustments to your changeset.

Some developers use editors that will do the formatting for them. Some developers run the formatters manually on the working directory. It may not be easy to do this in a way that only targets modified lines of code. At any rate, it can require various manual steps. We can eliminate those by providing an hg command that finds and formats everything you care about. We can even amend your existing commits without you having to check them out. We can also do this without getting you into nasty conflict resolution, as I will describe later.

You could also imagine that it would be useful to do things analogous to code formatting, like sorting the lines in a file. Even if all we want to do is format code, your project probably disagrees with mine about which formatters and which settings to use. So it's important to make all this very configurable.

The overall vision for this feature is therefore something like this:

[formatters]
cpp-formatter=clang-format --style=Google
cpp-fileset=**.cpp or **.hpp
sorted-formatter=sort
sorted-fileset=my_alphabetical_list_of_things.txt

$ hg format -r 'draft()'
formatted in aac87c62: foo.cpp
formatted in e42747ac: foo.cpp
formatted in e42747ac: foo.hpp
formatted in c194a06e: my_alphabetical_list_of_things.txt
amended 3 revisions with formatting changes

At which point three of your commits now have successors that mirror the topology of the predecessors. The successors have well formatted code in their diffs, whereas the predecessors do not.

How to handle formatting just the modified lines of files, when desired, is one of the more nuanced aspects of the problem. Naming of commands, config sections, and so on, is important but separate from the technical aspects I will focus on below.

Detailed description

I'll generally keep the theme of "running code formatters," but you can understand this to mean any sort of tool that modifies file content for you to make your life easier.

This is an attempt to distill the many lessons learned from implementing something similar to the proposed extension at Google. Let DannyHooper know if any section needs more background or details.

How to execute formatters

It is basically necessary to run a code formatter once for every file of interest in every revision of interest. We will take a revset from the user (or else use a sensible default revset), and optionally a file pattern. Each matching file that changes in each changeset must be considered, except for content-less files like removals or symlinks. There are special cases like renames where you could do fewer formatter runs, but it's not worth the additional complexity because most formatters are fast enough that a user will wait for them once, and we can parallelize all formatter executions.

It is useful to run various different code formatters on the same codebase. To run code formatters, you must have some interface to each of them. They generally differ in the way they accept whatever inputs they have in common. The most universal way to call a code formatter is to subprocess it; building and linking many formatters into your executable would be painful and unportable, if at all possible. The most consistent interface when subprocessing is that they can accept file content on standard input and provide the formatted file content on standard output. This sometimes relies on filesystem features like /dev/stdin. Only some formatters know how to format just specific lines of the file, and they generally have different command line syntax for doing so.

So, we need a way to generate an argv for a given formatter and given knowledge of what lines should be formatted. Some formatters will also care about the name of the file, which is lost because we're using pipes as our standard interface. Some will accept the assumed filename as a command line flag. Some formatters will make decisions based on the content of the file, so for simplicity we must pass any file that seems like it might be usefully modified by a given formatter. This can be handled by filesets in the config.

Our current implementation essentially consists of a function ConstructShellCommand(assumed_filename, modified_line_ranges) that we implement once for each formatter we want to support. This isn't as easy to configure as you might hope, but it is flexible enough to handle unusual requirements for certain formatters.

For a generic extension, we would want something like the hgrc example seen above. It would allow basic control of which files are passed to which formatters, and what executables are run with what flags. It could be extended slightly to specify how to pass flags for modified line ranges and assumed filenames using string substitution:

[formatters]
cpp-formatter=/usr/bin/clang-format --style=Google
cpp-linerange=--lines=$firstline:$lastline
cpp-filename=--assume-filename=$filename
cpp-fileset=**.cpp or **.hpp

The exclusion of the linerange or filename configs would mean no such arguments are added for that formatter.

Some possibilities to keep in mind are:

Meta-formatters, which would be subprocessed by hg, but then do their own implementation of generating shell commands and running them.
- This is an escape route for users who can't get enough configurability out of whatever formatter configuration mechanism hg provides. They can just configure every canged file to be passed into their meta-formatter.
Wrapper scripts can easily be written by users calling formatters with aberrant CLIs. These could create temp files if necessary.
Formatters, meta-formatters, and wrapper scripts could potentially accept some form of multiple-file input, and then manage concurrency for themselves. There would need to be a config option for specifying the input format to each formatter (single file on stdin, multiple files in some serialized format on stdin, or any other variations).

Code formatters like to read config files

If we don't run formatters in place on files in a fully checked out working directory, they won't be able to read config files that would exist in such a working directory. This is a problem for which I see no simple solution, since versioning your formatter configuration along with your code seems useful. However, if we are to run code formatters in parallel to reduce command-line latency, it is useful to keep the code in memory instead of checking out to disk the complete tree for each needed revision. Whether the performance/configurability trade off is acceptable probably depends on who you're asking. To get around making a working copy of the entire repository for every revision, you would need to encode some definition of the subset of files that might be needed to read the formatter's configuration. In common cases a fileset would probably work.

Dependency management is terrible

Code formatters are dependencies. They come from many disparate places. Bundling many of them with an extension would be crazy for various reasons (e.g. licensing, portability, version skew).

Users will generally have to tailor their formatter configurations to match their expected development environment. We could possibly provide some generic default configurations that will, e.g., run clang-format on your files that look like C/C++ if the right executable can be found. This would be similar to providing prefab configs for merge and diff tools. Users who want a lot of control can check the tools and hgrc into their repository to get consistent results across machines.

How to amend a set of revisions

After all of the files have been formatted in memory, we can use memctx to commit all of the successors in topological order. There's some room for committing the first successor before all the other formatters have completed, but the implementation complexity is probably not worth the latency reduction unless you're using a very slow formatter and your repo is very slow to commit to.

Every time we commit a formatted successor, we set its parents to be either the parents of its predecessor, or the successors of those parents, if we've just created those. Every connected subgraph of formatted successors will then internally have the same topology as the predecessors, and any parents that point outside the subgraph will be the same as those of the predecessors.

We can easily avoid amending any commits that need no formatting changes by iterating over the changes instead of the set of commits we passed through the formatters. Formatting a revset should be a no-op if the formatters don't think anything should change.

What to do if the command would create orphans

What if we have a chain of two commits (B and C), and we only format the parent? B' succeeds B, and the child C will be orphaned.

   C (orphaned because we amended B)
B' |
|  B (obsolete because we formatted/amended it)
| /
|/
A

This is a problem because now the formatting changes from B to B' can conflict with the changes from B to C if we try to evolve/rebase C into C'. The user could resolve this, but they might make mistakes. If they're smart, they'll just accept the C side of the conflict, then format C' once they have that. This is awkward, especially because the merge doesn't concern any meaningful changes in the code anyway. It is best to provide a UI that discourages users from formatting B without also formatting C at the same time.

But how? We have a few options, which will appeal differently to different teams:

Automatically format C in addition to B, to make the problem impossible to reach
Abort because of the presence of C, to make the problem impossible to reach
Silently ignore C and let it become an orphan, because users are always right
Give a warning with a reasonable cleanup command to copy-paste after the fact
Provide easy to use commands/aliases that use revsets where this won't happen, don't recommend the risky manual stuff

At Google, we're trying the last two things in combination.

How much output should the user see?

This is largely a matter of preference, and the union of everything everybody wants would be very verbose, redundant, and noisy. Here are some things that might be useful to output:

The commit hash for any commits that are...
- Considered for formatting
- Actually amended due to formatting changes in them
- Actually amended due to their parents being amended
- Not affected
The names of any files that are...
- Considered for formatting
- Actually modified
- Not affected
What formatters are being run on which files, so that maybe the user can complain about the formatter's choice of indentation instead of blaming it on the hg extension
A summary of how many commits/files were modified
Error messages from formatters that fail due to syntax errors or otherwise

It is useful to make much of this information available through templates so that users can pick and choose. Templates also allow administrators to add information like "file your formatting complaints at http://foo so we can fix them in our dev environment". We can embrace the spirit of "make everything a template" from the beginning with this command.

How to compute what lines to format

Some formatters, like gofmt, will always format the whole file. That's okay. For formatters that can format just a specific set of lines in the file, we want to support that for reasons given previously.

One of the important goals of this proposal is to reduce busy work for the user. We can automate formatting files across many commits, but it's not a benefit if they then have to evolve/rebase all of the successors to rebuild their original topology. This is a process that is at best tedious, and at worst error-inducing because of the many conflicting chunks that would result from a simplistic implementation.

Assume you format only the lines that have changed at each revision relative to its parent, and always merge rebased children by accepting the child successor version. If you are formatting a chain of two commits, you will find that any formatting fixes in the parent commit's successor that do not overlap changed lines in the child commit will be effectively reverted in the child's successor. This is because those line numbers won't have been passed to the invocation of the formatter that was used to generate the successor of the child.

You could maybe use some sort of merge to address this issue. It is expensive to do and requires logic to handle resolution of the many provably meaningless conflicts that will result from formatting overlapping line ranges in the two commits. It turns out, however, that this is not necessary.

After some discussion at Google, our solution is to create the successors in the correct topology to begin with. There is no subsequent evolve/rebase step required. To make sure that no formatting changes are forgotten in descendant commits, we simply format all changes since a common ancestor whenever we format a changeset. For every connected subgraph of changesets that is being formatted, we take the nearest common ancestor as the revision to diff against when computing the changed line number ranges. Those ranges are passed to the formatters.

There is actually a need to diff against multiple common ancestors in cases involving merge commits. The solution is to perform multiple diffs against the ancestors and take the union of the sets of changed line numbers.

We've internally referred to these ancestors as "baserevs" so far. Here is a somewhat representative example of determining these baserevs for some commits that are to be formatted:

5      To be formatted;          baserevs = {A}
|
| 4    To be formatted; a merge; baserevs = {A, B, C, D}
| |\
| | 3  To be formatted; a root/merge;  baserevs = {C, D}
| | |\
| | | D
| | |
| | C
| |
| 2    To be formatted; a merge; baserevs = {A, B}
|/|
| B
|
1      To be formatted; a root;  baserevs = {A}
|
A      (A, B, C, D are not being formatted, but are parents of to-be-formatted commits)

In this example, where all numbered revisions are to be formatted, when formatting revision 2:

Any file that has changed between revision 2 and revision A will be checked
Any file that has changed between revision 2 and revision B will be checked
Other files are ignored, unless the user specified their paths on the command line
Line numbers that changed between revision 2 and revision A will be checked
Line numbers that changed between revision 2 and revision B will be checked
Other line numbers are left unmodified if the formatting tool supports this, unless the user overrides this with a flag like --format-all-lines.
The successor of revision 2 will be another merge commit with parents B and succ(1)

Many of these details are, of course, unimportant for formatters that cannot use the changed line numbers information to do partial formatting. It is however a strong requirement for our own usage of clang-format and others at Google.

A less obvious reason this is superior to a "format, rebase, format, rebase, ..." approach is that we can run all formatter invocations in parallel, because there are no data dependencies. We have to start worrying more about limiting concurrency to avoid memory exhaustion and to best utilize the available cores.

Roadmap

Have internal implementation of most of this at Google
Write up this wiki page describing the approach
Get some community feedback
Generic implementation
Review patch (https://phab.mercurial-scm.org/D2897)

CategoryDeveloper CategoryNewFeatures

AutomaticFormattingPlan