[PATCH 3 of 6] treemanifest: create treemanifest class

Mon Mar 16 17:21:11 CDT 2015

On Wed, Mar 11, 2015 at 1:59 PM Augie Fackler <raf at durin42.com> wrote:

> On Tue, Mar 10, 2015 at 09:23:35PM -0700, Martin von Zweigbergk wrote:
> > # HG changeset patch
> > # User Martin von Zweigbergk <martinvonz at google.com>
> > # Date 1425931076 25200
> > #      Mon Mar 09 12:57:56 2015 -0700
> > # Node ID 34ad94f42d39e9ad3e7a7b8695f419bc2ff8b37f
> > # Parent  5671d3b1883e8fdce48e43aabce6ca5c86f6ea70
> > treemanifest: create treemanifest class
> >
> > There are a number of problems with large and flat manifests. Copying
> > from http://mercurial.selenic.com/wiki/ManifestShardingPlan:
> >
> >  * manifest too large for RAM
> >
> >  * manifest resolution too much CPU (long delta chains)
> >
> >  * committing is slow because entire manifest has to be hashed
> >
> >  * impossible for narrow clone to leave out part of manifest as all is
> >    needed to calculate new hash
> >
> >  * diffing two revisions involves traversing entire subdirectories
> >    even if identical
> >
> > This is a first step in a series introducing a manifest revlog per
> > directory.
> >
> > This change adds boolean configuration option
> > experimental.treemanifest. When the option is enabled, manifests are
> > parsed into a new tree data structure with one tree node per
> > directory. At this point, it is just a different data structure in
> > memory; there is still just a single manifest revlog on disk.
> >
> > The code is not yet optimized. Enabling the option makes most or all
> > operations slower. Once we start storing manifest revlogs for every
> > directory, it should be possible to make many of these operations much
> > faster. The fastdelta() optimization has been explicitly disabled (and
> > not implemented) for the treemanifests.
> >
> > Tests can be run using the new data structure by switching the config
> > option default in localrepo._applyrequirements(). All tests pass.
> >
> > diff -r 5671d3b1883e -r 34ad94f42d39 mercurial/localrepo.py
> > --- a/mercurial/localrepo.py  Tue Mar 10 16:26:13 2015 -0700
> > +++ b/mercurial/localrepo.py  Mon Mar 09 12:57:56 2015 -0700
> > @@ -326,6 +326,9 @@
> >          manifestcachesize = self.ui.configint('format',
> 'manifestcachesize')
> >          if manifestcachesize is not None:
> >              self.svfs.options['manifestcachesize'] = manifestcachesize
> > +        usetreemanifest = self.ui.configbool('experimental',
> 'treemanifest')
> > +        if usetreemanifest is not None:
> > +            self.svfs.options['usetreemanifest'] = usetreemanifest
> >
> >      def _writerequirements(self):
> >          reqfile = self.vfs("requires", "w")
> > diff -r 5671d3b1883e -r 34ad94f42d39 mercurial/manifest.py
> > --- a/mercurial/manifest.py   Tue Mar 10 16:26:13 2015 -0700
> > +++ b/mercurial/manifest.py   Mon Mar 09 12:57:56 2015 -0700
> > @@ -311,22 +311,240 @@
> >                     + content for start, end, content in x)
> >      return deltatext, newaddlist
> >
> > +def _splittopdir(f):
> > +    if '/' in f:
> > +        dir, subpath = f.split('/', 1)
> > +        return dir + '/', subpath
> > +    else:
> > +        return '', f
> > +
> > +class treemanifest(object):
>
> This class can (and probably should) use _lazymanifest directly,
> rather than proxying through the higher-level dict-like abstraction.
>

I mentioned on #IRC, but I should update this thread too (tl;dr: I'll stick
with dicts):

It turned out that the tests I ran actually got a little *slower* with
_lazymanifest. I think it's because the two main benefits of _lazymanifest
-- fast access to a few entries, and linear iteration -- are not used by
the treemanifest code. While parsing, the treemanifest code iterates over
all the entries in order to populate the tree. And on iteration, because of
how it currently keeps directories and files separate, it needs to sort the
paths within each directory again anyway. With _lazymanifest lookups (and
updates) go from being constant time in the size of the directory (direct
contents) to being logarithmic. I'm guessing this small cost is normally
outweighed by the gain in reduced parsing (when looking up a few elements),
and sorted-ness (when iterating), but when we don't take advantage of those
points, it instead becomes a little slower.

So, in summary, it seems like the dicts are the way to go. Let me know if
you disagree. Maybe there was something I missed. Otherwise I'll continue
on that path.

I'll send a V2 soon. It has been rebased (thereby gaining hasdir() and
dirs() methods), but is otherwise very similar to this version.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://selenic.com/pipermail/mercurial-devel/attachments/20150316/928294b8/attachment.html>