Making chg stateful

Sat Feb 4 23:31:29 EST 2017

On Fri, 3 Feb 2017 20:03:18 +0000, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2017-02-04 00:11:22 +0900:
> > On Thu, 2 Feb 2017 16:56:11 +0000, Jun Wu wrote:
> > > Excerpts from Yuya Nishihara's message of 2017-02-03 00:45:22 +0900:
> > > > On Thu, 2 Feb 2017 09:34:47 +0000, Jun Wu wrote:
> > > > > So what state do we store?
> > > > > 
> > > > >   {repopath: {name: (hash, content)}}. For example:
> > > > > 
> > > > >     cache = {'/home/foo/repo1': {'index': ('hash', changelogindex),
> > > > >                                  'bookmarks': ('hash', bookmarks),
> > > > >                                  .... },
> > > > >              '/home/foo/repo2': { .... }, .... }
> > > > > 
> > > > >   The main ideas here are:
> > > > >     1) Store the lowest level objects, like the C changelog index.
> > > > >        Because higher level objects could be changed by extensions in
> > > > >        unpredictable ways. (this is not true in my hacky prototype though)
> > > > >     2) Hash everything. For changelog, it's like the file stat of
> > > > >        changelog.i. There must be a strong guarantee that the hash matches
> > > > >        the content, which could be challenging, but not impossible. I'll
> > > > >        cover more details below.
> > > > > 
> > > > >   The cache is scoped by repo to make the API simpler/easy to use. It may
> > > > >   be interesting to have some global state (like passing back the extension
> > > > >   path to import them at runtime).
> > > > 
> > > > Regarding this and "2) Side-effect-free repo", can or should we design the API
> > > > as something like a low-level storage interface? That will allow us to not
> > > > make repo/revlog know too much about chg.
> > > > 
> > > > I don't have any concrete idea, but that would work as follows:
> > > > 
> > > >  1. chg injects an object to select storage backends
> > > >     e.g. repo.storage = chgpreloadable(repo.storage, cache)
> > > >  2. repo passes it to revlog, etc.
> > > >  3. revlog uses it to read indexfile, where in-memory cache may be returned
> > > >     e.g. storage.parserevlog(indexfile)
> > > >
> > > > Perhaps, this 'storage' object is similar to the one you call 'baserepository'.
> > > 
> > > I'm not sure if I get the idea (probably not). What does the implementation
> > > in the master server look like?
> > 
> > I was just thinking about how to hack the real repo object without introducing
> > much mess. Perhaps the master server wouldn't be that different from your idea.
> > 
> > > It feels more like "repo.chgcache" to me and the difference is that the
> > > vanilla hg will be changed to access objects via it (so the interface looks
> > > more consistent).
> > 
> > Yeah, it might be like repo.chgcache.
> > 
> > Since we shouldn't pass repo to revlog (it's a layering violation), I think
> > we'll need a thin wrapper for chgcache anyway.
> 
> I mentioned this in the second mail, "4) Where to get preloaded results (in
> worker)", we could just expose some kind of global state, like a
> "globalcache" module.

Does that mean low-level objects will access the global cache directly?
That seems ugly (but maybe I'm biased, as I'm really allergic to global data).
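
For comparison, a minimal sketch of the injected-storage alternative (no
global state) might look like the following. Everything here is hypothetical
and not taken from the prototype: the plainstorage class, the constructor
arguments of chgpreloadable, and the stat-based stand-in for the hash. The
cache passed in is the per-repo {name: (hash, content)} mapping from the
proposal.

```python
import os


def stathash(path):
    """Cheap stand-in for a content hash: (size, mtime_ns) of the file."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


class plainstorage(object):
    """Hypothetical default backend: always reads from disk."""

    def parserevlog(self, indexfile):
        with open(indexfile, 'rb') as fp:
            return fp.read()  # placeholder for the real index parser


class chgpreloadable(object):
    """Hypothetical wrapper that consults a per-repo cache first."""

    def __init__(self, storage, cache):
        self._storage = storage
        self._cache = cache  # {name: (hash, content)} for one repo

    def parserevlog(self, indexfile):
        h = stathash(indexfile)
        entry = self._cache.get(indexfile)
        if entry is not None and entry[0] == h:
            return entry[1]  # hash matches, reuse the preloaded content
        # Cache miss or stale entry: fall back to the wrapped backend.
        content = self._storage.parserevlog(indexfile)
        self._cache[indexfile] = (h, content)
        return content
```

With something like this, revlog would only see an object with a
parserevlog() method, and only chg would know that a cache sits behind it.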

> > > Things to consider:
> > > 
> > >   a) Objects being preloaded have dependencies, e.g. the obsstore depends on
> > >      changelog and other things. The preload functions run in a defined
> > >      order.
> > 
> > Maybe dependencies could be passed as arguments?
> 
> Ideally, these expensive calculations (e.g. obsstore) could be moved to the
> index object. In reality, that requires too much work, and obsstore
> preloading requires a subset of "repo", including "repo.revs".
> 
> It's possible to decouple obsstore preloading from the repo object, but that
> could be a lot of work too.

My opinion for obsstore is that the most costly part would be populating 100k+
objects from file, and the other costly parts could be mitigated by some higher-
level cache in repoview.py.

But I think this topic was discussed thoroughly between you and pyd before.
I don't intend to bring it up again.

> > >   b) The index file is not always a single file, depending on "vfs".
> > 
> > Yes. vfs could be owned by storage/chgcache class.
> > 
> > >   c) The user may want to control what to preload. For example, if they have
> > >      an incompatible manifest, they could make changelog preloaded, but not
> > >      manifest.
> > 
> > No idea about (c).
> > 
> > >   d) Users can add other preloading items easily, not just the
> > >      predefined ones.
> > 
> > So probably we'll need an extensible table of preloadable items.
> 
> If you check my prototype code, it's using a registrar to collect all
> @preload functions.

Yes. I wanted to say we would need this kind of abstraction anyway.
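
As a rough illustration of such a table, covering points (a), (c) and (d)
above, here is a minimal sketch of a registrar that passes dependencies as
arguments. It is not the prototype's code: the preloadregistrar class, the
"requires" parameter, the "enabled" filter and the two placeholder preload
functions are all assumptions made for this example.

```python
class preloadregistrar(object):
    """Hypothetical table of preloadable items, filled by a decorator."""

    def __init__(self):
        self._table = {}  # name -> (dependency names, function)

    def __call__(self, name, requires=()):
        def decorator(func):
            self._table[name] = (tuple(requires), func)
            return func
        return decorator

    def run(self, enabled=None):
        """Run preload functions in dependency order.

        'enabled' optionally restricts which items are preloaded; the
        dependencies of an enabled item are always computed as well.
        """
        results = {}
        visiting = set()

        def compute(name):
            if name in results:
                return results[name]
            if name in visiting:
                raise ValueError('dependency cycle at %r' % name)
            visiting.add(name)
            requires, func = self._table[name]
            # Dependencies are computed first and passed as arguments.
            args = [compute(dep) for dep in requires]
            visiting.discard(name)
            results[name] = func(*args)
            return results[name]

        for name in sorted(self._table if enabled is None else enabled):
            compute(name)
        return results


preload = preloadregistrar()


@preload('changelog')
def preloadchangelog():
    return ['rev0', 'rev1']  # placeholder for a parsed index


@preload('obsstore', requires=['changelog'])
def preloadobsstore(changelog):
    return {'markers': len(changelog)}  # placeholder computation
```

Calling preload.run() computes every registered item, while
preload.run(enabled=['changelog']) would skip obsstore, which is one way a
user could opt out of preloading an incompatible component.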