Making chg stateful

Thu Feb 2 10:45:22 EST 2017

On Thu, 2 Feb 2017 09:34:47 +0000, Jun Wu wrote:
> Perf Numbers
> 
>   I wrote a hacky prototype [2] that shows significant improvements on
>   various commands in our repo:
>   
>                                Before   After (in seconds)
>       chg bookmark             0.40     0.08
>       chg log -r . -T '{node}' 0.50     0.08
>       chg sl                   0.71     0.27
>       chg id                   0.71     0.37
>   
>   And hg-committed (with 12M obsstore) could benefit from it too mainly
>   because the obsstore becomes preloaded:
>   
>                                Before   After
>       chg bookmark             0.56     0.06
>       chg log -r . -T '{node}' 0.56     0.07
>       chg log -r . -p          0.86     0.08
>       chg sl                   0.84     0.21
>       chg id                   0.54     0.06
>   
>   So I think it's nice to get a formal implementation upstreamed. It's also
>   easier to solve the perf issues individually like building the hidden
>   bitmap, building an serialization format for the radix tree, etc.
>   Although we can also build them later.

Nice. I want this perf number today. :)

> So what state do we store?
> 
>   {repopath: {name: (hash, content)}}. For example:
> 
>     cache = {'/home/foo/repo1': {'index': ('hash', changelogindex),
>                                  'bookmarks': ('hash', bookmarks),
>                                  .... },
>              '/home/foo/repo2': { .... }, .... }
> 
>   The main ideas here are:
>     1) Store the lowest level objects, like the C changelog index.
>        Because higher level objects could be changed by extensions in
>        unpredictable ways. (this is not true in my hacky prototype though)
>     2) Hash everything. For changelog, it's like the file stat of
>        changelog.i. There must be a strong guarantee that the hash matches
>        the content, which could be challenging, but not impossible. I'll
>        cover more details below.
> 
>   The cache is scoped by repo to make the API simpler/easy to use. It may
>   be interesting to have some global state (like passing back the extension
>   path to import them at runtime).

Regarding this and "2) Side-effect-free repo", can or should we design the API
as something like a low-level storage interface? That will allow us to not
make repo/revlog know too much about chg.

I don't have any concrete idea, but that would work as follows:

 1. chg injects an object to select storage backends
    e.g. repo.storage = chgpreloadable(repo.storage, cache)
 2. repo passes it to revlog, etc.
 3. revlog uses it to read indexfile, where in-memory cache may be returned
    e.g. storage.parserevlog(indexfile)

Perhaps, this 'storage' object is similar to the one you call 'baserepository'.

>   Then, repo.chgcache['index'] becomes available in worker processes. When
>   initializing the changelog index, try to use the chg cache:
> 
>       # inside changelog's revlog.__init__:
>       # note: repo.chgcache is empty for non-chg cases, a fallback is needed
>       self.index = repo.chgcache.get('index') or _loadindex(...)
> 
>   The API is the simplest that I can think of, while being also reasonably
>   flexible (for example, we can add additional steps like forcing building
>   the radix tree etc). But I'm open to suggestions.
> 
> Some implementation details
> 
>   (This is the part that I want feedback the most)
> 
>   1) IPC between chg master and forked worker
> 
>      This is how the worker tells the master about what (ex. repo paths) to
>      preload.
> 
>      I think it does not need to be 100% reliable. So I use shared memory in
>      the hacky prototype [2]. Pipes are reliable and may notify master
>      quicker, while have risks of blocking. shm seems much easier to
>      implement and I think the latency of master getting the information is
>      not a big deal. I prefer shm, but other ideas are welcomed.

I like using pipes and it won't be complex so long as an IPC message is short
enough to write atomically (< ~4k). But pipe-vs-shm is merely an implementation
detail, so we can switch them later if needed.

>   2) Side-effect-free repo
> 
>      This is the most "painful" part for the preloading framework to be
>      confident.
> 
>      The preload function has a signature that takes a repo object. But chg
>      could not provide a real repo object. So it's best-effort.
> 
>      On the other hand, most preload functions only need to hash file stats,
>      i.e. just use repo.vfs, repo.svfs etc. They do not need a full-featured
>      repo object.
> 
>      Therefore I think the choices are:
> 
>       a) Provide a repo object that is: localrepository - side effects
>          (maximum compatibility)
>       b) Build a fresh new repo object that has only the minimal part, ex.
>          vfs, svfs etc. (less compatible)
>       c) Do not provide a repo object. Just provide "repo path".
>          (move the burden to the preload function writers)
> 
>      I currently prefer a). The plan is to move part of "localrepository" to 
>      a side-effect free "baserepository" and use "baserepository" in the
>      background preloading thread.

(see the comment above)

>   3) Side effect of extensions
> 
>      The new chg framework will not run uisetup()s. Where the preloading
>      framework does sometimes depend on some side effects of extensions'
>      uisetup()s. For example, the *manifest extension could change greatly
>      about what the manifest structure so the default manifest preloading
>      won't work as expected.
> 
>      I think there are different ways to address this:
> 
>       a) Add a top-level "chgsetup()" which only gets called by chg, and is
>          meant to be side-effect free.
>          
>          Extensions could wrap chg's pseudorepo object, and also register
>          its own preloading functions.
> 
>          This is actually pretty clean. But I'm not sure whether "chgsetup"
>          is a good name or not, or if a top-level function is a good idea in
>          general.
> 
>       b) Have a config option to force chg to load extensions as it does
>          today.
> 
>          The problems are uisetup will accept wrong ui objects, which just
>          confuses developers (and is the motivation of the ongoing
>          refactoring). And it probably makes chg's logic more complex than
>          ideal.
> 
>      I cannot think of other good ideas on this. In this situation, I prefer
>      a).

(a) seems fine, if chgsetup() is 100% optional so the extension works flawlessly
without it (i.e. goes through uncached path.)