[PATCH] changelog: load pending file directly

Gregory Szorc gregory.szorc at gmail.com
Sun May 14 02:20:21 EDT 2017


On Sat, May 13, 2017 at 8:36 PM, Jun Wu <quark at fb.com> wrote:

> Excerpts from Gregory Szorc's message of 2017-05-13 18:02:05 -0700:
> > On Sat, May 13, 2017 at 5:11 PM, Jun Wu <quark at fb.com> wrote:
> >
> > > Looks good to me. I like the clean up.
> > >
> > > Excerpts from Gregory Szorc's message of 2017-05-13 16:29:50 -0700:
> > > > [...]
> > > >      @storecache('00changelog.i')
> > >
> > > storecache needs to include '00changelog.i.a'. But that belongs to
> another
> > > patch.
> > >
> >
> > I thought about this when I wrote the patch. Semantically, yes
> @storecache
> > should include 00changelog.i.a. However, I don't think it actually needs
> it
> > because:
> >
> > * 00changelog.i.a should be immutable for the lifetime of its existence
> > * when the transaction closes successfully, 00changelog.i is updated,
> > invalidating the cache
> > * when the transaction aborts, file cache is purged via
> > repo.invalidate(clearfilecache=True) and the transaction context is
> over,
> > so the next .changelog will load 00changelog.i
> >
> > There's also the issue of a reader invalidating its .changelog by an
> > independent writer starting a transaction. That feels sub-optimal.
>
> Right. When I think about it more, it's not needed.
>
> I'm more interested in stateful chg case [1]. At the same time, it's
> impossible to have 2 processes running 2 transactions, so it'd be fine.
>
> [1]: https://bitbucket.org/quark-zju/hg-draft/commits/branch/chg-stateful
>
> > FWIW, I had a series a year or so ago where I added a context manager
> that
> > when active basically said "I am a read-only operation." It aggressively
> > cached some things, leading to significant perf wins for some operations.
> > If I had infinite time, I'd hack on that a bit more and refactor the
> > localrepo class so you needed to obtain a specific class/type
> representing
> > a mutable repository, probably via a context manager based API that also
> > handled locks and transactions. If done right, most consumers could
> assume
> > a read-only repo and aggressively cache by default (rather than today,
> > where we check for freshness on e.g. every .changelog attribute access.
>
> For changelog, the filecache layer should be a no-op (no stat syscalls)
> because we will use "unfi.__dict__['changelog']". repoview.changelog seems
> to be problematic, although it looks cheap, but I can see it becomes a
> problem if "repo.changelog" is used in a loop.
>

IIRC marmoute optimized it significantly. But it still isn't as cheap as it
could be. I think it should be an attribute lookup by default because it
isn't like the changelog is changing all the time.


>
> Instead of marking part of the operations "read-only", I wonder if we can
> do
> the opposite - repo is immutable by default, write operations need to be
> marked explicitly. We already have repo.transaction, which provides the
> information already.
>

I agree that the repo should be read-only by default and that some kind of
function call is necessary to transform it into mutable/slow mode.


>
> So I think repoview could be something like:
>
>   class repoview(object):
>       def __init__(self):
>           ....
>           self.changelog = self._clcache = self._getchangelog()
>
>       def _getchangelog(self):
>           # the same as repoview.changelog today
>
>       def transaction(self):
>           self.changelog = property(self._getchangelog)
>           def restorechangelog(self):
>               self.changelog = self._clcache
>           unfi.transaction(releasefn=restorechangelog)
>
> That seems to be able to achieve the perf win and is easier.
>
> > Related to this idea, I want to add a context manager API to the
> localrepo
> > class to indicate intended access patterns. For example, the optimization
> > strategy for reading revlogs varies significantly depending on if you are
> > reading forward or reverse and whether you are accessing all revisions
> or a
> > subset. If we had a hint of the anticipated access pattern, we could
> > optimize I/O and caching patterns appropriately. Unlike a
>
> That's true. Given the fact that revlog has so many callers I wonder if
> there is an automatic way to do that so caller does not need to change.
> A naive way is to record (access_direction, count), where "direction" is -1
> or 1 depending on the rev being accessed compared to the last rev. Count
> increases when direction does not change but another rev is accessed.
>
> In this way, we could set a threshold for "count" and change caching
> strategy dynamically, if direction is backwards, cache the delta base
> revision, otherwise, cache whatever revision accessed like today.
>

Yes, that's certainly possible (and is something I've considered
implementing). However, we do need hints to go to the next level. For
example, if you know you will be reading most/all revisions, you can do
things like fetch all the data and decompress all chunks parallel. That's
more efficient than going back and forth between I/O and processing.
python-zstandard has APIs for parallel decompression BTW. If the right
revlog APIs/hints are in place, revlog read performance can speed up
drastically. Of course, reading *everything* doesn't scale to infinity. But
you can set batch size to something ridiculously large like 200 MB to get
the benefits on 99% of revlogs.


>
> > "read-only/mutable repo context manager," a perf-centric "anticipated
> > access patterns context manager" is something I may do in 4.3.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.mercurial-scm.org/pipermail/mercurial-devel/attachments/20170513/45920c23/attachment.html>


More information about the Mercurial-devel mailing list