Manifest Refactor

Durham Goode durham at fb.com
Tue Jul 12 22:03:36 UTC 2016


We'll be looking at moving to tree manifests as our source of truth over 
the next few months, and one problem area is that the manifest class is 
not well factored for this use case. This one class is simultaneously 
the collection of all manifests, the accessor for information about 
individual manifests, and the storage format (revlog).

Before I do a bunch of work, I wanted to run my proposal for breaking up 
the manifest by you guys:

1. Add a "manifestlog" class that represents the collection of all 
root-level manifests (i.e. what commits point to; not any sub-trees). 
It's basically what repo.manifest would return, and would mainly consist 
of "get" and "add" APIs that return and accept manifest instances.  It 
would be responsible for caching recently used manifests, and 
potentially serving up the right kind of manifest when demanded (e.g. 
during our transition from flat manifests to tree manifests, we may want 
to allow loading both, and this class would multiplex them). It would 
have no, or very little, knowledge about revlogs/storage.

2. Make the "manifest" class represent a single instance of a manifest 
(it would point at other instances of "manifest" for sub-trees).  From a 
consumer's point of view, when they do 'repo.manifest.get(node)' they 
will receive a manifest instance and should not need to care how it's 
implemented.  It would expose APIs like 'children', 'walk', 
'get(fileordirname)', 'parents', 'linkrev', etc.

The specific implementation of the manifest instance could use whatever 
storage scheme it wants.  For example, in the normal vanilla manifest, 
it would look much like manifestdict does today, with no knowledge of 
revlogs (you just pass text to the constructor). In a tree world, each 
instance in the tree could have knowledge of its own backing revlog, and 
be able to construct new instances as someone recurses down.

3. Add a "manifestrevlog" class that inherits from revlog. This is the 
actual on-disk storage.  Ideally "manifest" instances would just call 
simple read and write APIs (and not depend on revlog implementation 
details), so we could in theory replace the revlog storage with 
something else (packed revlogs, lookaside to memcache, whatever) without 
having to rewrite the actual manifest business logic.
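
As a rough illustration of the tree case in point 2, where each instance 
constructs sub-tree instances as someone recurses down, here is a minimal 
sketch. Everything in it is hypothetical: the TreeManifest name, the 
'readdir' callable (standing in for whatever per-directory storage read 
is used), and the fake node strings are assumptions for illustration, not 
existing Mercurial code.

```python
class TreeManifest(object):
    """One directory level of a tree manifest (hypothetical sketch).

    `readdir` is an assumed callable: (dirpath, node) -> {name: (node, isdir)}.
    """

    def __init__(self, readdir, dirpath, node):
        self._readdir = readdir
        self._dirpath = dirpath
        self._node = node
        self._entries = None   # directory listing, loaded on first access
        self._children = {}    # name -> TreeManifest, built on demand

    def _load(self):
        if self._entries is None:
            self._entries = self._readdir(self._dirpath, self._node)
        return self._entries

    def children(self):
        return sorted(self._load())

    def get(self, name):
        """Return a file node, or a child TreeManifest for a directory."""
        node, isdir = self._load()[name]
        if not isdir:
            return node
        if name not in self._children:
            self._children[name] = TreeManifest(
                self._readdir, self._dirpath + name + "/", node)
        return self._children[name]

    def walk(self):
        """Yield full paths of all files, recursing into sub-trees."""
        for name in self.children():
            entry = self.get(name)
            if isinstance(entry, TreeManifest):
                for path in entry.walk():
                    yield path
            else:
                yield self._dirpath + name


# Fake storage: two directory levels keyed by (dirpath, node).
DATA = {
    ("", "root1"): {"README": ("f1", False), "src": ("tree2", True)},
    ("src/", "tree2"): {"main.py": ("f2", False)},
}
loaded = []  # records which directories were actually read


def readdir(dirpath, node):
    loaded.append(dirpath)
    return DATA[(dirpath, node)]


root = TreeManifest(readdir, "", "root1")
top_level = root.children()   # reads only the root directory
loaded_after_top = list(loaded)
all_files = list(root.walk())  # now "src/" is read as well
```

The point of the sketch is that listing the root only touches the root's 
storage, and sub-trees are read when a caller actually descends into them.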


Breaking the manifest into these three parts (collection, instance, 
storage) should make it easier to mix and match manifest implementations 
and storage schemes, without rewriting lots of logic.
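
To make the split concrete, here is a minimal sketch of how the three 
parts might fit together for the flat case. It is only an illustration 
under assumptions: MemoryStore is a hypothetical in-memory stand-in for 
the storage layer (not a revlog), the class and method names are made up, 
and the text format is a simplified "path NUL filenode" line per file.

```python
import hashlib


class MemoryStore(object):
    """Storage layer stand-in (hypothetical): maps node -> manifest text."""

    def __init__(self):
        self._texts = {}

    def read(self, node):
        return self._texts[node]

    def write(self, text):
        node = hashlib.sha1(text.encode("utf-8")).hexdigest()
        self._texts[node] = text
        return node


class Manifest(object):
    """Instance layer: one manifest built from text, unaware of storage."""

    def __init__(self, text=""):
        self._files = {}
        for line in text.splitlines():
            path, filenode = line.split("\0")
            self._files[path] = filenode

    def get(self, path):
        return self._files.get(path)

    def set(self, path, filenode):
        self._files[path] = filenode

    def walk(self):
        return sorted(self._files)

    def text(self):
        return "".join("%s\0%s\n" % (p, self._files[p])
                       for p in sorted(self._files))


class ManifestLog(object):
    """Collection layer: get/add of Manifest instances plus a small cache."""

    def __init__(self, store):
        self._store = store
        self._cache = {}

    def get(self, node):
        if node not in self._cache:
            self._cache[node] = Manifest(self._store.read(node))
        return self._cache[node]

    def add(self, manifest):
        text = manifest.text()
        node = self._store.write(text)
        # Cache a fresh copy so later edits to `manifest` don't leak in.
        self._cache[node] = Manifest(text)
        return node


mlog = ManifestLog(MemoryStore())
m = Manifest()
m.set("README", "a" * 40)
m.set("src/main.py", "b" * 40)
node = mlog.add(m)
```

Swapping MemoryStore for a different backend would not require touching 
Manifest, which is the property the proposal is after.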

For things that do depend heavily on it being a revlog (like 
push/pull/changegroup), they will be able to reach around the 
abstractions and talk directly to the revlog when necessary. Future 
storage implementations will either have to do the same or find a common 
API that allows changegroups to be created/received for both storages.


Thoughts? Concerns? Is renaming the collection class (which is the 
primary interface for how the rest of mercurial interacts with the 
manifest) from manifest to manifestlog a bad idea?  I could rename the 
instance concept to manifestctx or something instead.


