Making chg stateful

Fri Feb 3 04:33:32 EST 2017

To be clear, the whole preloading framework needs design decisions made in
different places. I'll go through them one by one:

1) What does the worker talk to the master via IPC?

  Current: worker sends "possible_dirty_repo: $REPO_PATH" after runcommand

    It's coupled with the (pseudo) repo object. But the master could preload
    things in correct order - first changelog, then phase, obsstore, etc.

  What if: worker sends "possible_dirty_index: $INDEX_PATH",
  "possible_dirty_obsstore $OBSSTORE_PATH" etc.

    It looks more flexible at first sight. However the dependency issue
    could be tricky to solve - for example, the obsstore requires changelog
    (and revset).  It's much harder to preload obsstore without a pseudo
    repo object with repo.changelog, repo.revs available.

  Therefore, probably just send the repo path.

2) Preload function signature (used by master's background thread).

  This function is to hash and load things. It runs in the master, and is
  basically a generator (so the "loading" part could be skipped on cache
  hit, and it's easier to make content and hash_with_content consistent, if
  they share states):

      @preloadregistar(something)
      def preloadfunc(...):
          yield quick_hash
          yield content, hash_with_content

  Problem: What's "..." ? It could be "repopath" as that's exactly what the
  worker sends to us. But that will make the preload function more
  complicated to be correct - repo has "requirements", etc and the index is
  not always simply path.join(repo_path, 'store', '00changelog.i').

  Similar to 1, if we decouple from the repo object, index preloading would
  be fine:

      @preload('index')
      def preloadindex(indexpath):
          ....

  But things with dependency will have trouble:

      @preload('obsstore')
      def preloadobsstore(obsstorepath):
          # how to access repo.revs('....') ?

  Therefore, the preload function needs a repo (even it's "pseudo") to be
  convenient to implement.

  What to yield may be changed, see "4)".

3) Where to store preloaded results (in master) ?

  The storage can only be a global state. Currently it's a two-level dict,
  repo_path is the 1st level key, while "name"s (ex. "changelog",
  "obsstore") are the 2nd level.

  I think the global state could be either public or private.

4) Where to get preloaded results (in worker) ?

  repo.chgcache (or repo.whatever) mentioned previously is just a shortcut
  to "globalstate[repo.path]". In the real world,
  globalstate[repo.path]['index'], globalstate[repo.path]['obsstore'], etc.
  will be accessed.

  Note: when they are accessed, the framework should know how to calculate
  the latest hash and detect cache invalidation.

  If we want to decouple from repo, we may want to swap the 2 keys of
  "globalstate", ex. globalstate['index'][indexpath]. That requires the
  function to yield the indexpath out so the framework knows where to check:

      @preload('index')
      def preloadindex(repo):
          yield indexpath
          yield hash
          yield content, hash

  This is helpful if something is shared. Like a shared repo etc. Otherwise,
  it may make the code a bit longer than necessary. So maybe we make what to
  "yield" typed:

      @preload('index')
      def preloadindex(repo):
          yield preload.key(indexpath) # optional, default to repo.root
          yield preload.checkhash(hash)
          yield preload.update(content, hash)

  This also looks more flexible - we may extend the preloading framework by
  adding more contents.

Excerpts from Yuya Nishihara's message of 2017-02-03 00:45:22 +0900:
> On Thu, 2 Feb 2017 09:34:47 +0000, Jun Wu wrote:
> > Perf Numbers
> > 
> >   I wrote a hacky prototype [2] that shows significant improvements on
> >   various commands in our repo:
> >   
> >                                Before   After (in seconds)
> >       chg bookmark             0.40     0.08
> >       chg log -r . -T '{node}' 0.50     0.08
> >       chg sl                   0.71     0.27
> >       chg id                   0.71     0.37
> >   
> >   And hg-committed (with 12M obsstore) could benefit from it too mainly
> >   because the obsstore becomes preloaded:
> >   
> >                                Before   After
> >       chg bookmark             0.56     0.06
> >       chg log -r . -T '{node}' 0.56     0.07
> >       chg log -r . -p          0.86     0.08
> >       chg sl                   0.84     0.21
> >       chg id                   0.54     0.06
> >   
> >   So I think it's nice to get a formal implementation upstreamed. It's also
> >   easier to solve the perf issues individually like building the hidden
> >   bitmap, building an serialization format for the radix tree, etc.
> >   Although we can also build them later.
> 
> Nice. I want this perf number today. :)
> 
> > So what state do we store?
> > 
> >   {repopath: {name: (hash, content)}}. For example:
> > 
> >     cache = {'/home/foo/repo1': {'index': ('hash', changelogindex),
> >                                  'bookmarks': ('hash', bookmarks),
> >                                  .... },
> >              '/home/foo/repo2': { .... }, .... }
> > 
> >   The main ideas here are:
> >     1) Store the lowest level objects, like the C changelog index.
> >        Because higher level objects could be changed by extensions in
> >        unpredictable ways. (this is not true in my hacky prototype though)
> >     2) Hash everything. For changelog, it's like the file stat of
> >        changelog.i. There must be a strong guarantee that the hash matches
> >        the content, which could be challenging, but not impossible. I'll
> >        cover more details below.
> > 
> >   The cache is scoped by repo to make the API simpler/easy to use. It may
> >   be interesting to have some global state (like passing back the extension
> >   path to import them at runtime).
> 
> Regarding this and "2) Side-effect-free repo", can or should we design the API
> as something like a low-level storage interface? That will allow us to not
> make repo/revlog know too much about chg.
> 
> I don't have any concrete idea, but that would work as follows:
> 
>  1. chg injects an object to select storage backends
>     e.g. repo.storage = chgpreloadable(repo.storage, cache)
>  2. repo passes it to revlog, etc.
>  3. revlog uses it to read indexfile, where in-memory cache may be returned
>     e.g. storage.parserevlog(indexfile)
> 
> Perhaps, this 'storage' object is similar to the one you call 'baserepository'.
> 
> >   Then, repo.chgcache['index'] becomes available in worker processes. When
> >   initializing the changelog index, try to use the chg cache:
> > 
> >       # inside changelog's revlog.__init__:
> >       # note: repo.chgcache is empty for non-chg cases, a fallback is needed
> >       self.index = repo.chgcache.get('index') or _loadindex(...)
> > 
> >   The API is the simplest that I can think of, while being also reasonably
> >   flexible (for example, we can add additional steps like forcing building
> >   the radix tree etc). But I'm open to suggestions.
> > 
> > Some implementation details
> > 
> >   (This is the part that I want feedback the most)
> > 
> >   1) IPC between chg master and forked worker
> > 
> >      This is how the worker tells the master about what (ex. repo paths) to
> >      preload.
> > 
> >      I think it does not need to be 100% reliable. So I use shared memory in
> >      the hacky prototype [2]. Pipes are reliable and may notify master
> >      quicker, while have risks of blocking. shm seems much easier to
> >      implement and I think the latency of master getting the information is
> >      not a big deal. I prefer shm, but other ideas are welcomed.
> 
> I like using pipes and it won't be complex so long as an IPC message is short
> enough to write atomically (< ~4k). But pipe-vs-shm is merely an implementation
> detail, so we can switch them later if needed.
> 
> >   2) Side-effect-free repo
> > 
> >      This is the most "painful" part for the preloading framework to be
> >      confident.
> > 
> >      The preload function has a signature that takes a repo object. But chg
> >      could not provide a real repo object. So it's best-effort.
> > 
> >      On the other hand, most preload functions only need to hash file stats,
> >      i.e. just use repo.vfs, repo.svfs etc. They do not need a full-featured
> >      repo object.
> > 
> >      Therefore I think the choices are:
> > 
> >       a) Provide a repo object that is: localrepository - side effects
> >          (maximum compatibility)
> >       b) Build a fresh new repo object that has only the minimal part, ex.
> >          vfs, svfs etc. (less compatible)
> >       c) Do not provide a repo object. Just provide "repo path".
> >          (move the burden to the preload function writers)
> > 
> >      I currently prefer a). The plan is to move part of "localrepository" to 
> >      a side-effect free "baserepository" and use "baserepository" in the
> >      background preloading thread.
> 
> (see the comment above)
> 
> >   3) Side effect of extensions
> > 
> >      The new chg framework will not run uisetup()s. Where the preloading
> >      framework does sometimes depend on some side effects of extensions'
> >      uisetup()s. For example, the *manifest extension could change greatly
> >      about what the manifest structure so the default manifest preloading
> >      won't work as expected.
> > 
> >      I think there are different ways to address this:
> > 
> >       a) Add a top-level "chgsetup()" which only gets called by chg, and is
> >          meant to be side-effect free.
> >          
> >          Extensions could wrap chg's pseudorepo object, and also register
> >          its own preloading functions.
> > 
> >          This is actually pretty clean. But I'm not sure whether "chgsetup"
> >          is a good name or not, or if a top-level function is a good idea in
> >          general.
> > 
> >       b) Have a config option to force chg to load extensions as it does
> >          today.
> > 
> >          The problems are uisetup will accept wrong ui objects, which just
> >          confuses developers (and is the motivation of the ongoing
> >          refactoring). And it probably makes chg's logic more complex than
> >          ideal.
> > 
> >      I cannot think of other good ideas on this. In this situation, I prefer
> >      a).
> 
> (a) seems fine, if chgsetup() is 100% optional so the extension works flawlessly
> without it (i.e. goes through uncached path.)