Manifest Refactor

Thu Jul 14 18:38:55 EDT 2016

On Tue, Jul 12, 2016 at 3:03 PM, Durham Goode <durham at fb.com> wrote:
> We'll be looking at moving to tree manifests as our source of truth over the
> next few months, and one problem area is the fact that the manifest class is
> not well factored for this usecase. This one class is the collection of all
> manifests, the accessor for information about individual manifests, and the
> storage format (revlog).
>
> Before I do a bunch of work, I wanted to run my proposal for breaking up the
> manifest by you guys:
>
> 1. Add a "manifestlog" class that represents the collection of all
> root-level manifests (i.e. what commits point to; not any sub-trees)  It's
> basically what repo.manifest would return, and mainly consist of "get" and
> "add" apis that return and accept manifest instances.

As Augie said elsewhere in this thread, there's more than "get"
(actually called "read") and "add" in the current API. Where would you
put the dirlog() function that's currently on manifest? It may be
useful to include a list of the classes you imagine and their methods
(but skip the uninteresting ones), so we can see clearly where each
method goes.

I'm very much for renaming manifest to manifestlog. I've meant to do
that for a long time myself. It will of course mean some work for
extensions, but that's life.

> It would be
> responsible for caching recently used manifests, and potentially serving up
> the right kind of manifest when demanded (ex: during our transition from
> flat manifests to tree manifests, we may want to allow loading both, and
> this class would multiplex them). It would have no, or very little,
> knowledge about revlogs/storage.
>
> 2. Make the "manifest" class represent a single instance of a manifest (it
> would point at other instances of "manifest" for sub-trees).  From a
> consumers point of view, when they do 'repo.manifest.get(node)' they will
> receive a manifest instance and they should be able to not care how it's
> implemented.  It would expose apis like 'children', 'walk',
> 'get(fileordirname)', 'parents', 'linkrev', etc.
>
> The specific implementation of the manifest instance could use whatever
> storage scheme it wants.  For example, in the normal vanilla manifest, it
> would look much like manifestdict does today, with no knowledge of revlogs
> (you just pass text to the constructor). In a tree world, each instance in
> the tree could have knowledge of its own backing revlog, and be able to
> construct new instances as someone recurses down.

That sounds very much like it already is. Do you mean to simply define
a common super class for manifestdict and treemanifest? I think that
super class would be pretty empty (but e.g. "def keys(self): return
list(self.iterkeys())" could live there). It's what statically typed
languages would do, of course. I'm not familiar enough with Python to
say what's more Pythonic.

One of the incarnations of the treemanifest class had a reference to
the dirlog as you suggested. I'm sorry, but I don't remember why I
gave up on that version in favor of the current version where it
instead has a callback.

>
> 3. Add a "manifestrevlog" class that inherits from revlog. This is the
> actual ondisk storage.  Ideally "manifest" instances would just call simple
> read and write apis (and not depend on revlog implementation details), so we
> could in theory replace the revlog storage with something else (packed
> revlogs, lookaside to memcache, whatever) without having to rewrite the
> actual manifest business logic.

As Augie said, you'll probably need to support reading deltas here too.

Btw, I have replace some of the calls to readdelta() for treemanifest
by readshallowdelta() to avoid calling into _slowreaddelta(). There's
one more call in changegroup guarded by server.validate config that
could probably be changed. The last caller is
context._manifestdelta(), which could possibly be changed to return
the full manifest when using tree manifests, because it looks like
it's not iterated over. Only lookup is needed (which is fine with
lazily loaded tree manifests). The reason I even mention that is
because I had otherwise considered having another lazily loading delta
version of tree manifests. But as I write this, I think that may not
be necessary. So, you can probably just ignore this paragraph :-)

>
>
> Breaking the manifest into these three parts (collection, instance, storage)
> should make it easier to mix and match manifest implementations and storage
> schemes, without rewriting lots of logic.
>
> For thing that do take heavy dependencies on it being a revlog (like
> push/pull/changegroup), they will be able to reach around the abstractions
> and talk directly to the revlog when necessary. And future storage
> implementations will either have to do the same or find a common API that
> can allow changegroups to be created/received for both storages.
>
>
> Thoughts? Concerns? Is renaming the collection class (which is the primary
> interface for how the rest of mercurial interacts with the manifest) from
> manifest to manifestlog a bad idea?  I could rename the instance concept to
> manifestctx or something instead.
>