Formalizing Interfaces

Fri Aug 11 15:44:18 EDT 2017

> On Aug 11, 2017, at 11:33, Durham Goode <durham at fb.com> wrote:
> 
> 
> 
>> On 8/10/17 10:32 PM, Gregory Szorc wrote:
>> I've wanted to hack on overhauling revlogs and storage for several
>> months now. The more I think about how to go about it and the reasons
>> why I want to do it (mainly performance and scaling), the more I think a
>> more drastic departure from revlogs and the current storage "backend" is
>> needed. There are some great properties of revlogs and the
>> direct-addressable model of the current store, don't get me wrong. But
>> we can only scale that model so far (as I'm sure anyone from Facebook or
>> Google will tell you).
>> 
>> Anyway, overhauling storage is a daunting proposition. It should be
>> terrifying (because it is).
>> 
>> I started thinking about how we could facilitate extreme experimentation
>> on things like say swapping out a new storage backend. Perhaps even with
>> one implemented in Rust :)
>> 
>> Durham did a lot of work on the manifest code a few months back.
>> Essentially, he successfully decoupled the API of the manifest from the
>> low-level implementation of a revlog. It even abstracts away flat versus
>> tree manifests. It's great stuff. I was reminded of his work when I
>> started trying to do something similar for revlogs.
>> 
>> Long story short, one thing led to another and I had this crazy idea
>> that it would be a good idea to declare formal interfaces for important
>> constructs, like the changelog, manifests, file history, peers, and
>> quite possibly the repo object itself. If we did this and somehow
>> enforced the interface, it would be possible to swap out implementations
>> as needed. For example, if the classic store with 1 file/revlog per
>> tracked path doesn't scale for you because you have >1M files, then you
>> can swap in a store that uses "pack files," remote storage, etc. If a
>> classic filesystem-based working directory doesn't fit the bill, perhaps
>> you swap in one that is virtual filesystem aware.
>> 
>> I wanted to experiment with this idea with something that is easier to
>> reason about than storage. That led me to the "peer" classes (peer.py,
>> httppeer.py, sshpeer.py, etc). I've just submitted a series where I
>> formalize the peer interface using abstract base classes. Read the
>> commit messages for https://phab.mercurial-scm.org/D332
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__phab.mercurial-2Dscm.org_D332&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=nuarHzhP1wi1T9iURRCj1A&m=fikFAFyD9ByWfnQNkcDvHQA7rzOzSpj4nHfviqXvX5Y&s=jC6vqaHho7Al5tlhfi0BPKiUFMXa0-dccg0aVeTtpPs&e=>
>> and https://phab.mercurial-scm.org/D339
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__phab.mercurial-2Dscm.org_D339&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=nuarHzhP1wi1T9iURRCj1A&m=fikFAFyD9ByWfnQNkcDvHQA7rzOzSpj4nHfviqXvX5Y&s=sHB98Ga9MIwu4aVTNeAFZqa2TcOQG3MXwYAeApNjjiY&e=>
>> and the commits in that stack for more details.
>> 
>> Part of developing that series uncovered a number of minor bugs. And, I
>> think the end result is a peer API that is more easily understood and
>> easier to hack on. So, I think there is merit to the approach for code
>> maintainability reasons alone.
>> 
>> But I want to dream bigger and apply this to more significant primitives
>> (like storage).
>> 
>> If we adopt formal interfaces for important constructs, I'd like to see
>> support in the test harness for swapping in alternate implementations of
>> these things. Jun recently added #testcases syntax to .t tests so we
>> could run multiple variations of the same test. I'd like do something
>> similar at the entire test suite level. e.g. `run-tests.py
>> --changelog=sqlite` or `run-tests.py --store=leveldb`. Or more
>> realistically, `run-tests.py --peers http,ssh` would run the test suite
>> using both the http and ssh peer implementations. I could see that
>> culminating with tests naturally dividing themselves into low-level unit
>> tests (interface implementation specific) and higher-level, generic
>> integration tests. If we "code to the interface," it should be possible
>> to swap in a brand new implementation of something like a peer protocol
>> or changelog and it will pass the integration tests. We could even have
>> dummy, super fast implementations to facilitate hacking on things like
>> frontend features to help reduce the edit-test cycle.
>> 
>> Formal interfaces will facilitate extreme experimentation without the
>> traditional fragility that a dynamic language like Python introduces.
>> They will allow us to more clearly define boundaries between components.
>> This will make it vastly easier to refactor and do things like rewrite
>> large components in Rust. It would also make it *much* easier to
>> implement "hgit" (using the Mercurial CLI to interface with a Git
>> repository, advanced features like revlogs and all). (Call me crazy, but
>> I'd love to ship this feature as part of Mercurial to help entice new
>> users.)
>> 
>> Regardless of whether we go all in on formal interfaces, there's an
>> interesting idea 2 paragraphs back: tests that are implementation
>> agnostic. Today, we end up duplicating test functionality for minor
>> variations. e.g. http vs ssh vs local peer interactions. bundle1 vs
>> bundle2. I'd really like to move many tests to Jun's #testcases feature
>> because it will allow us to achieve higher test coverage while writing
>> fewer tests. I think it would be worthwhile to figure out how to
>> consolidate as many tests as possible so they can test behavior, not a
>> specific implementation. A side-benefit of doing this is we'll uncover
>> areas where implementations vary in behavior. This will help squash bugs
>> and produce a more consistent user interface and experience, because
>> every time we see e.g. different behavior between things performing the
>> same role, we'll ask ourselves why the discrepancy.
>> 
>> Anyway, this is probably the craziest Mercurial idea I've had in a
>> while. I'm not sure how much of it is realistic. But I'd certainly like
>> to establish some formality around interfaces for core components to
>> facilitate code comprehension, refactoring, and testing. The work I just
>> submitted on the peer API seems to show potential. I'm just not sure if
>> we can achieve some of the more ambitious goals with the approach I've
>> taken in that series. I'd love to hear what others think.
>> 
>> Gregory
> 
> I'm all for more formalized interfaces between layers. We've introduced a relatively formal interface for our remotefilelog/treemanifest storage and it has proven extremely helpful. For instance:
> 
> - It allows composing arbitrary stores, as long as they all meet the interface. We can union a series of stores so it can look in local caches first, then local storage, then memcache, then ssh remotely.
> - We can even insert a fake "store" that converts old formats to new formats. For instance, we have a store where when you ask for a manifest, it reads a flat manifest and turns it into a treemanifest and saves it. By putting that store at the end of the chain of stores we can look at normal tree manifest storage first, and transparently convert flat manifests to trees on demand as the user asks.
> - The same composability has let us implement bundlerepo more efficiently. Instead of layering the bundle revlogs on top of normal revlogs at the filesystem byte level, we just insert a bundle store at the beginning of the series of stores.
> - It has let us implement repack algorithms that can transparently operate on top of revlogs. So we coded the repack algorithm without knowledge of revlogs, plug in a revlog store, and repack magically outputs pack files. This could be used for many kinds of format changes.

Oh wow - I had no idea all this stuff existed! I really need to spend a day pouring over your repo. It should go without saying that this looks pretty cool!

> 
> It has been pretty straight forward and easy to do for manifests and filelogs, because they are mostly random access key->revision lookups. The changelog will be a bit harder because it has very different access patterns that are currently optimized for revlogs.

> 
> I think a basic interface could be put in place for those manifest and filelog storage without too much effort. The main place our current interface leaks is around 1) writes, and 2) changegroup creation. We don't have a generic interface for writes yet. Changegroups are very related to revlogs, and we haven't come up with a good abstraction that yet either.

Yeah. As I was brainstorming revlog futures, one of the things I kept dwelling on was the unique requirements of changelog vs manifest vs files. This led to expanding scope to stores which led to this thread :)

I agree that writes and transactions are difficult to model. What I'd really like to do is split read-only from mutability. I had an RFC series a few weeks ago where I explored splitting localrepo into immutable and mutable components. I think that subconsciously fed into me doing this interface work. Changegroups are also hard.

> 
> This might be a good place to start using python type hints too.

Agree. But until we require source transformation to run on Python 2.7, we may have to wait :/

> 
> 
> Examples of our abstraction can be seen here:
> https://bitbucket.org/facebook/hg-experimental/src/e61e2670306ee36eaece26188412ac82c97f2d1c/remotefilelog/contentstore.py
> 
> unioncontentstore shows a store that unions other stores
> remotefilelogcontentstore implements the same interface but looks for data remotely
> manifestrevlogstore implements the same interface but wraps treemanifest relovgs
> 
> That's the data store abstraction (i.e. getting revision text). There's also an abstraction for history storage (node -> p1, p2, linknode, copyinfo)

I've also been thinking about how to separate fulltext from index data. I'd love to see this formalized in core, as this is critical for non-revlog file storage. I'll take a loot at your interfaces for inspiration!

Thanks for the reply and info. It's very helpful!