Making chg stateful

Jun Wu quark at fb.com
Thu Feb 2 10:52:34 EST 2017


Excerpts from Simon Farnsworth's message of 2017-02-02 15:25:45 +0000:
> On 02/02/2017 09:34, Jun Wu wrote:
> <snip good bits>
> 
> > What's the API?
> >
> >   (This is an existing implementation detail. I'm open to any ideas)
> >
> >   For example, let's say we want to preload the changelog index (code
> >   simplified so it does not take care of all corner cases).
> >
> >   First, tell chg how to hash and load something:
> >
> >       from mercurial.chgserver import repopreload
> >
> >       @repopreload('index')
> >       def foopreloader(repo):
> >           # use the size of changelog.i as the hash. note: the hash function
> >           # must be very fast.
> >           hash = repo.svfs.stat('00changelog.i').st_size
> >           # tell chg about the current hash. if hash matches, the generator
> >           # function stops here.
> >           yield hash
> >
> >           # if hash mismatches, load the changelog in a slower way.
> >           with repo.svfs('00changelog.i') as f:
> >               data = f.read()
> >               hash = len(f)
> >               index = _buildindex(data)
> >               index.partialmatch('ffff') # force build the radix tree
> >               # tell chg about the loading result and the hash that
> >               # absolutely matches the result.
> >               yield index, hash
> >
> >   Then, repo.chgcache['index'] becomes available in worker processes. When
> >   initializing the changelog index, try to use the chg cache:
> >
> >       # inside changelog's revlog.__init__:
> >       # note: repo.chgcache is empty for non-chg cases, a fallback is needed
> >       self.index = repo.chgcache.get('index') or _loadindex(...)
> >
> 
> Can we ensure that chgcache.get(key) always gets an answer (i.e. no 
> fallback needed)? My concern is divergence between the fallback and 
> chgcache versions of the data fetch.

It's diverged intentionally, mainly because 1:

 1. The chg version could have more internal states to preload. Take the
    changelog index for example, chg would like to force pre-populating the
    partialmatch radix tree, and other indexes we will have in the future.
 2. If chg is not used (ex. Windows), it's not necessary to calculate the
    hash, although that's cheap.

I think chg preload function could call the vanilla load function, plus some
extra work. That will ensure consistency.

> It seems like it should be trivial to implement, too - repo.chgcache for 
> the non-chg case would do:
> 
>      g = preloadfunctable[name](self._prepo)
>      hash = next(g)
>      content, hash = next(g)
>      return g
> 
> knowing that computing hash is cheap.
> 
> This would mean that you'd just do:
>      self.index = repo.chgcache.get('index')
> and if chg can speed you up, it does, while if it's not in use, you get 
> the old slow path. It also means that chg and hg will give the same 
> results, and that there's pressure to use a cheap hash for invalidations 
> (because the repo is effectively always hashed.
> 
> >   The API is the simplest that I can think of, while being also reasonably
> >   flexible (for example, we can add additional steps like forcing building
> >   the radix tree etc). But I'm open to suggestions.
> >
> > Some implementation details
> >
> >   (This is the part that I want feedback the most)
> >
> >   1) IPC between chg master and forked worker
> >
> >      This is how the worker tells the master about what (ex. repo paths) to
> >      preload.
> >
> >      I think it does not need to be 100% reliable. So I use shared memory in
> >      the hacky prototype [2]. Pipes are reliable and may notify master
> >      quicker, while have risks of blocking. shm seems much easier to
> >      implement and I think the latency of master getting the information is
> >      not a big deal. I prefer shm, but other ideas are welcomed.
> >
> 
> Did you try O_NONBLOCK pipes, out of interest?

That could be an option. The ugly part is "write"ing more than PIPE_BUF size
could success with partial content written. So it requires special logic to
deal with boundaries, maybe by adding '\0's at both ends.

> Having said that, given that the worker is forked from the master, shm 
> seems perfectly reasonable to me - the worker and master will, by 
> definition, be running the same code.
> 
> However, rather than coding it ourselves, I'd be inclined to use 
> multiprocessing (https://docs.python.org/2/library/multiprocessing.html ) 
> to provide the shared memory primitives you need.

I actually checked multiprocessing before. It requires you to use its
special fork() implementation instead of just "os.fork()" so its fork hook
gets executed. It has a complex abstraction about fork hooks, raw pointers,
serialization, etc. and does not actually support Windows well. Given the
unnecessary complexity and overhead of multiprocessing, I think a 50-line
implementation is better.

> >   2) Side-effect-free repo
> >
> >      This is the most "painful" part for the preloading framework to be
> >      confident.
> >
> >      The preload function has a signature that takes a repo object. But chg
> >      could not provide a real repo object. So it's best-effort.
> >
> >      On the other hand, most preload functions only need to hash file stats,
> >      i.e. just use repo.vfs, repo.svfs etc. They do not need a full-featured
> >      repo object.
> >
> >      Therefore I think the choices are:
> >
> >       a) Provide a repo object that is: localrepository - side effects
> >          (maximum compatibility)
> >       b) Build a fresh new repo object that has only the minimal part, ex.
> >          vfs, svfs etc. (less compatible)
> >       c) Do not provide a repo object. Just provide "repo path".
> >          (move the burden to the preload function writers)
> >
> >      I currently prefer a). The plan is to move part of "localrepository" to
> >      a side-effect free "baserepository" and use "baserepository" in the
> >      background preloading thread.
> >
> 
> I like this plan.
> 
> >   3) Side effect of extensions
> >
> >      The new chg framework will not run uisetup()s. Where the preloading
> >      framework does sometimes depend on some side effects of extensions'
> >      uisetup()s. For example, the *manifest extension could change greatly
> >      about what the manifest structure so the default manifest preloading
> >      won't work as expected.
> >
> >      I think there are different ways to address this:
> >
> >       a) Add a top-level "chgsetup()" which only gets called by chg, and is
> >          meant to be side-effect free.
> >
> >          Extensions could wrap chg's pseudorepo object, and also register
> >          its own preloading functions.
> >
> >          This is actually pretty clean. But I'm not sure whether "chgsetup"
> >          is a good name or not, or if a top-level function is a good idea in
> >          general.
> >
> >       b) Have a config option to force chg to load extensions as it does
> >          today.
> >
> >          The problems are uisetup will accept wrong ui objects, which just
> >          confuses developers (and is the motivation of the ongoing
> >          refactoring). And it probably makes chg's logic more complex than
> >          ideal.
> >
> >      I cannot think of other good ideas on this. In this situation, I prefer
> >      a).
> >
> 
> Is there a good reason to call the "chgsetup()" function only from chg, 
> or can we arrange for it to be called with or without chg in use?
>
> Again, I'm thinking about the risk of divergent code paths - if it's 
> possible to have the stateless parts in chgsetup(), and the stateful 
> parts in extsetup()/uisetup(), and all functions called in both cases 
> (just in different processes if you're running chg).

This is a valid concern. But we can't have the cake and eat it. I'll think
about this more later.

> > The rough plan
> >
> >   I'd like to get the preloading API done. And then add logic to preload
> >   various things in another hgext. Preloading changelog index could be done
> >   with confidence. While other things could be a bit risky as extensions are
> >   unpredictable. Therefore a config option per repo to control what to
> >   preload is necessary.
> >


More information about the Mercurial-devel mailing list