File Censorship Plan
DVCS users occasionally commit and publish sensitive data like passwords, private keys, and personally identifying information. "Censorship" will remove the sensitive data so future clones receive tombstone data instead.
Non-goals include removing changesets due to sensitive commit messages, removing manifests due to sensitive file names, and proactively removing sensitive file data from existing clones.
As mentioned above, private data such as passwords or private keys can be unwittingly committed to source control, as well as legally sensitive data such as personally identifying information. While one can (and should) change passwords that are published, legal requirements can require PII to be removed from the source control system so it will no longer be shared. Data and software licenses can also require such removal after the license expires.
In a DVCS like Mercurial, hashes demonstrate historical integrity by including parent hashes along with content (see MerkleTree). One can always rewrite each piece of history going back to the introduction of sensitive data, but every descendant of the rewritten revision receives a new hash. If enough published commits are based upon a commit containing sensitive file data, rewriting history may be prohibitively expensive. For example, the expiration of data/software licenses may require several years of history to be rewritten.
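The hash chaining described above can be illustrated with a simplified model. This is a sketch, not Mercurial's actual code: it assumes a node hash is the SHA-1 of the two sorted parent hashes followed by the revision text, and the name node_hash and the sample file contents are made up for illustration.

```python
import hashlib

NULLID = b"\0" * 20  # stands in for a missing parent


def node_hash(text, p1=NULLID, p2=NULLID):
    """Simplified model: hash the sorted parent hashes, then the content."""
    a, b = sorted((p1, p2))
    digest = hashlib.sha1(a + b)
    digest.update(text)
    return digest.digest()


# Original history: rev0 leaks a secret, rev1 is an unrelated later change.
rev0 = node_hash(b"password=hunter2\n")
rev1 = node_hash(b"unrelated change\n", p1=rev0)

# Rewriting rev0 changes its hash, which forces a new hash for rev1 too,
# even though the content of rev1 is byte-for-byte identical.
rev0_rewritten = node_hash(b"password=\n")
rev1_rewritten = node_hash(b"unrelated change\n", p1=rev0_rewritten)

print(rev1.hex())
print(rev1_rewritten.hex())
```

Because each hash depends on its parents, every revision built on top of the sensitive one must be rewritten as well, which is what makes the rewrite expensive for long-lived history.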
If rewriting history is unpalatable, at present the owner of the repository must manually excise the data from the file's history and accept that the hash of that file will be unverifiable. Done blindly, any file revisions which are stored as a "delta" based on the offending file data (directly or transitively through a chain of deltas) will be unreadable after the base content is removed. Those revisions must be rewritten as well. No generally-available tools exist for performing this delicate surgery.
The repository owner may continue committing to the heads of the repository, but attempts to view the repository at any changeset containing the sensitive file data will fail due to the hash mismatch (examples: hg update, hg diff, hg annotate). "hg verify" will fail due to the hash mismatch as well. New clones of such a tainted repository will not receive the excised data and will inherit the limitations of the original repo. Existing clones which do not yet have a copy of the data will behave similarly after pulling.
Existing clones of the repository which include the offending data are unaffected by modifications to the original repository's history; there is no general means through which the original could "reach out" and remove data from all clones. These existing clones will therefore remain fully functional. They will successfully interoperate with the original except when sending or receiving revisions of the affected file based on the excised revision. In that case interoperability fails due to the use of deltas in revision exchange.
As seen with the original data removal, deltas require agreement on a file revision's content. Depending on the repositories, revisions might successfully transfer, abort transfer due to hash mismatches, or silently corrupt the receiving repository in the worst case. This last possibility stems from a "fast-path" optimization possible when adding exchanged deltas to revlogs, and demonstrates that Mercurial itself must provide some native support to make removing file content generally safe in practice.
3. Design Highlights
Individual file revisions may be censored. When requested by a user, a censored revision is presented as an empty file if it can be verified. Censored file revisions have non-empty data called a tombstone: metadata subject to verification, padded to match the size of the censored data.
Users may configure a verification policy based on the expected tombstone contents; for example, a policy using a shared GPG key could verify tombstones containing GPG signatures. The default policy will be "abort", which always fails verification. Another built-in policy, "ignore", will always pass verification.
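One way to picture the policy mechanism is a lookup table from policy name to a verification callable. The sketch below is purely illustrative: the names POLICIES, CensoredNodeError and read_censored are hypothetical and do not describe an actual Mercurial API. Only the "abort" and "ignore" behaviors come from the plan itself.

```python
class CensoredNodeError(Exception):
    """Raised when a tombstone fails the configured verification policy."""


def policy_abort(tombstone):
    # Default policy from the plan: verification always fails.
    return False


def policy_ignore(tombstone):
    # Built-in permissive policy from the plan: verification always passes.
    return True


# Hypothetical registry; an extension could add e.g. a GPG-based policy.
POLICIES = {"abort": policy_abort, "ignore": policy_ignore}


def read_censored(tombstone, policy_name="abort"):
    """Return the user-visible content of a censored revision."""
    verify = POLICIES[policy_name]
    if not verify(tombstone):
        raise CensoredNodeError("censored revision failed verification")
    return b""  # a verified censored revision is presented as an empty file
```

With this shape, read_censored(tombstone, "ignore") yields an empty file, while the default "abort" policy raises and so stops any operation that touches the censored revision.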
Exchange risk is largely mitigated by a new rule enforced by any Mercurial which natively supports censorship: a delta based on a censored revision must trivially replace the entire base text. A conforming delta will apply correctly regardless of whether or not the base is censored, thanks to the tombstone's padding. This rule enables censor-aware Mercurial to emit valid deltas any client can use and reject deltas that it cannot itself use.
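The role of the padding can be shown with a small sketch. It assumes a delta hunk consists of a 12-byte header of three big-endian 32-bit integers (start, end, length) followed by the replacement bytes, which matches the 12-byte header described later in this plan. The helper names and sample contents are hypothetical.

```python
import struct


def apply_single_hunk(base, delta):
    """Apply a one-hunk delta: replace base[start:end] with the hunk data."""
    start, end, length = struct.unpack(">lll", delta[:12])
    data = delta[12:12 + length]
    return base[:start] + data + base[end:]


def full_replacement_delta(base_size, new_text):
    """Build the trivial delta that replaces the whole base text."""
    return struct.pack(">lll", 0, base_size, len(new_text)) + new_text


original = b"-----BEGIN PRIVATE KEY-----\n"
# The tombstone is padded to the same length as the censored data.
tombstone = b"censored".ljust(len(original), b"\0")

conforming = full_replacement_delta(len(original), b"clean contents\n")
# Same result whether or not the receiver has censored the base.
print(apply_single_hunk(original, conforming))
print(apply_single_hunk(tombstone, conforming))

# A partial delta depends on the base content, so the results diverge.
partial = struct.pack(">lll", 0, 5, 3) + b"new"
print(apply_single_hunk(original, partial) == apply_single_hunk(tombstone, partial))
```

Since the conforming delta spans bytes 0 through the base size, only the length of the base matters, and the padding guarantees that both sides agree on that length.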
An extra safeguard is introduced to the censorship operation, to reduce the impact of the revlog "fast-path" which skips verifying exchanged deltas. When a file revision is censored and is present in any topological heads, a new blank revision of the file is added to the filelog, capping the censored file node. Then, to each head which contains the newly-censored file node, we add a cap child changeset that modifies the file to use the new blank revision. This makes a censor-unaware Mercurial clone less likely to produce "fast-path" deltas that would corrupt a third censor-unaware clone.
4. Implementation Details
4.1. Filelog Format
A censored filelog entry has three twists:
- A revlog index entry flag bit is set, so censored nodes are efficiently identifiable by censor-aware Mercurial.
- The metadata section has a "censored" key added, with the tombstone base-64 encoded as the value.
- The revision data is padded to be the same uncompressed length as the censored revision data.
Tombstone data is opaque and may take any form desired by the censoring user. Some repositories might offer a plain-text justification for the censorship, others might provide a link to a web address with details. Still others might store a GPG-signed justification message, so the signature can be verified.
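A censored entry along these lines could be assembled as follows. This is a sketch under stated assumptions: it relies on filelog metadata being a block of "key: value" lines delimited by "\x01\n" markers, and it assumes NUL bytes as the padding, which the plan does not specify. The function name build_censored_text is hypothetical, as is the choice to raise an error when the tombstone does not fit.

```python
import base64

META_MARKER = b"\x01\n"  # delimiter around the filelog metadata block


def build_censored_text(tombstone, original_size):
    """Return replacement revision data of exactly original_size bytes."""
    value = base64.b64encode(tombstone)
    meta = META_MARKER + b"censored: " + value + b"\n" + META_MARKER
    if len(meta) > original_size:
        # Assumption: a tombstone larger than the censored data is rejected.
        raise ValueError("tombstone does not fit in the censored revision")
    # Assumption: pad with NUL bytes up to the original uncompressed length.
    return meta.ljust(original_size, b"\0")


entry = build_censored_text(b"Removed at legal request", 200)
print(len(entry))
```

Keeping the uncompressed length identical is what allows full-replacement deltas computed against the original data to apply unchanged against the tombstone.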
The index entry flag bit is set in two circumstances: by the act of censoring a revision, and when censor-aware Mercurial receives a censored revision from a peer. In most exchange scenarios, the full censor tombstone will be materialized and the revlog will know to set the appropriate flag bit. In fast-pathed scenarios without full delta decoding, the first 128 bytes of the delta will be inspected for the addition of the "censored:" metadata key, and if found the flag bit will be set.
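The fast-path inspection might look like the sketch below. The exact byte pattern to search for is an assumption, as are the function and constant names; only the 128-byte window and the "censored:" key come from the plan. The delta layout (12-byte header, then data) is the same assumption used for the exchange rule.

```python
import struct

CENSOR_KEY = b"censored:"
INSPECT_WINDOW = 128  # bytes of the delta examined on the fast path


def delta_adds_censored_key(delta):
    """Heuristic: does the start of the delta introduce the censor key?"""
    return CENSOR_KEY in delta[:INSPECT_WINDOW]


# A delta whose payload is a padded tombstone entry.
payload = b"\x01\ncensored: UmVtb3ZlZA==\n\x01\n".ljust(64, b"\0")
tombstone_delta = struct.pack(">lll", 0, 64, len(payload)) + payload

# An ordinary delta appending a line to an empty file.
ordinary_delta = struct.pack(">lll", 0, 0, 6) + b"hello\n"

print(delta_adds_censored_key(tombstone_delta))
print(delta_adds_censored_key(ordinary_delta))
```

A substring test like this can misfire on an ordinary file that happens to contain the key near its start, which is one reason to treat the check as a stopgap rather than a robust signal.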
Censor-aware Mercurial clients take extra steps during exchange to conform to and enforce the rule: a delta based on a censored revision must trivially replace the entire base text.
To enforce the new rule, for each incoming revision with a non-null base, we check if that base is censored. If it is, we look up the uncompressed size of the censored revision, $BASESIZE. We then unpack the changegroup header of the incoming delta, whose 12 bytes must encode: (0, $BASESIZE, $REVSIZE). If this is not the case, we abort the exchange. Otherwise, we complete receiving the changegroup, then verify there are no additional changegroups for that file. If additional changegroups follow, we abort the exchange.
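The receive-side check can be sketched as below. It assumes the 12-byte header is three big-endian 32-bit integers (start, end, length), and it interprets "no additional changegroups for that file" as "no bytes follow the single full-replacement hunk"; both are assumptions made for illustration, and the names are hypothetical.

```python
import struct


class CensoredBaseError(Exception):
    """Raised when a delta against a censored base breaks the exchange rule."""


def check_delta_against_censored_base(delta, base_size):
    """Validate an incoming delta whose base is censored.

    Returns the size of the new revision if the delta conforms.
    """
    if len(delta) < 12:
        raise CensoredBaseError("delta too short to contain a header")
    start, end, rev_size = struct.unpack(">lll", delta[:12])
    if (start, end) != (0, base_size):
        raise CensoredBaseError("delta does not replace the entire base")
    if len(delta) != 12 + rev_size:
        # Trailing bytes would mean further hunks or truncated data.
        raise CensoredBaseError("unexpected data after replacement hunk")
    return rev_size


good = struct.pack(">lll", 0, 40, 5) + b"fresh"
print(check_delta_against_censored_base(good, base_size=40))
```

In a real exchange a failed check would abort the whole transfer; here that is modeled by raising CensoredBaseError.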
To conform to the new rule, for each outgoing revision, we check if its delta base is censored. If it is, we send the trivial delta replacing the entire base with the entire new revision. This delta is guaranteed to apply even if the recipient has not censored the delta base.
5. Unsupported Exchange Scenarios
There is at least one identified exchange involving old Mercurial clients which could result in repository corruption:
Original repo R has C changesets. File F has N <= C revisions. R is maintained with censor-aware Mercurial.
- R is cloned by an old client, creating repo Y.
- Repo R censors file F at the Nth revision. This adds a "capstone" revision to F, N+1, linked to a new changeset C+1.
- R is cloned by an old client using "hg clone -r C", creating repo Z.
- If Z receives changes from Y or vice-versa, they might corrupt each other's filelog for F.
6. Testing Plan
This feature does affect backwards compatibility and will need to be tested across older versions of Mercurial. In particular, the supported exchanges must be tested between a censor-aware Mercurial client and earlier versions of Mercurial from the 1.x, 2.x and 3.x development lines.
To test against all requisite Mercurial clients, a custom test runner script will build a local hg at each tag in the hg repo (excluding 0.x tags) and verify that the supported exchange scenarios work without extensions or other client modifications. The script will test each hg version both with and without C extensions. This test might not be committed permanently as it will not be useful after censorship is released; regardless it will be shared with the mercurial-devel mailing list.
7. Future Improvements
- The "gpgsig" extension distributed with core Mercurial should provide a gpgsig censorship policy which attempts to verify signatures in tombstone data.
- If a changegroup format is designed which allows for transmitting a revision's revlog flag bits, this would simplify the identification of incoming tombstones. In particular it would obviate the brittle receive-side check for the "censored:" metadata key.