[PATCH RFC] revlog/changegroup: use callbacks for populating efiles

Sat Jul 11 17:26:31 CDT 2015

On Sat, Jul 11, 2015 at 1:17 PM, Matt Mackall <mpm at selenic.com> wrote:

> On Sat, 2015-07-11 at 11:26 -0700, Gregory Szorc wrote:
> > On Fri, Jul 10, 2015 at 5:03 PM, Matt Mackall <mpm at selenic.com> wrote:
> >
> > > On Fri, 2015-07-10 at 16:46 -0700, Gregory Szorc wrote:
> > > > # HG changeset patch
> > > > # User Gregory Szorc <gregory.szorc at gmail.com>
> > > > # Date 1436571961 25200
> > > > #      Fri Jul 10 16:46:01 2015 -0700
> > > > # Node ID bca84aa6ea80b386a01c245e7f4e4ebdb720cd8c
> > > > # Parent  648323f41a89619d9eeeb7287213378c340866c8
> > > > revlog/changegroup: use callbacks for populating efiles
> > >
> > > Probably needs to use the flush() thingy that the nearby censor code is
> > > using.
> > >
> >
> > Calling flush() makes the error go away!
> >
> > While my SSD equipped MBP didn't appear to exhibit a slowdown unbundling
> > mozilla-central, flushing file descriptors after applying each changeset
> > feels a bit excessive and it makes me uncomfortable due to potential
> > performance implications. I just don't think an extra I/O related syscall
> > should be required for this feature.
>
> The flush is only a flush from the userspace file buffers to kernel-side
> buffers, it won't actually force I/O. It'll still be lazily written by
> the kernel's caching layer at roughly the same rate as without the
> buffering because the userspace buffers are tiny (4k) relative a modern
> system's disk cache. But yes, it is a syscall and syscalls aren't free.
>
> > But you have a stronger grasp on these things than me. Is calling flush()
> > after every changeset add acceptable? Keep in mind this will be the only
> > behavior for addchangegroup(), regardless of whether progress bars are
> > displayed.
>
> Well, it can be conditional on there being a callback.
>

Right. But for changegroup.addchangegroup(), there will always be a
callback so this will be the default behavior for many cases.

>
> In the slightly bigger picture, we might want to make the revlog object
> aware of the file handles it might need to flush so that rather than
> explicitly flushing on the write side, the read side can flush only if
> needed. Something like:
>
> for f in self._flushonread:
>     f.flush()
>
> ..with the appropriate try/finally around the writers.
>
> In the even bigger still picture, we might want to make it optionally
> possible for revlog to hold onto file handles for the object's lifetime
> to reduce syscall count and filesystem lookup overhead. Historically we
> don't do this because Windows.
>

Looking at things a bit more, my original patch series also added
additional flushes because buildtext() inside revlog._addrevision() does
flushing and my "return text" patch made the call to buildtext()
unconditional.

Tracing the system calls when applying a bunch of changesets, I'm seeing
what I think is "odd" behavior. .hg/store/00changelog.i.a is opened,
fstat()d twice, lseek()d twice, read() a few times, maybe write()n, and
closed. This cycle repeats over and over. I'm still trying to grok what is
happening. But it certainly doesn't feel very efficient. I suspect it has
something to do with the flushing, which happens every couple of changesets
inside buildtext(). I'm pretty sure this corresponds to starting/ending a
delta chain.

This is the first time I've dug into the bowels of revlogs. It's a bit
overwhelming. On the bright side, I think I'm starting to identify some
potential performance improvements!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://selenic.com/pipermail/mercurial-devel/attachments/20150711/9cd3d59d/attachment.html>