Making chg stateful
Yuya Nishihara
yuya at tcha.org
Thu Feb 2 10:45:22 EST 2017
On Thu, 2 Feb 2017 09:34:47 +0000, Jun Wu wrote:
> Perf Numbers
>
> I wrote a hacky prototype [2] that shows significant improvements on
> various commands in our repo:
>
> Before After (in seconds)
> chg bookmark 0.40 0.08
> chg log -r . -T '{node}' 0.50 0.08
> chg sl 0.71 0.27
> chg id 0.71 0.37
>
> And hg-committed (with 12M obsstore) could benefit from it too mainly
> because the obsstore becomes preloaded:
>
> Before After
> chg bookmark 0.56 0.06
> chg log -r . -T '{node}' 0.56 0.07
> chg log -r . -p 0.86 0.08
> chg sl 0.84 0.21
> chg id 0.54 0.06
>
> So I think it's nice to get a formal implementation upstreamed. It's also
> easier to solve the perf issues individually like building the hidden
> bitmap, building an serialization format for the radix tree, etc.
> Although we can also build them later.
Nice. I want this perf number today. :)
> So what state do we store?
>
> {repopath: {name: (hash, content)}}. For example:
>
> cache = {'/home/foo/repo1': {'index': ('hash', changelogindex),
> 'bookmarks': ('hash', bookmarks),
> .... },
> '/home/foo/repo2': { .... }, .... }
>
> The main ideas here are:
> 1) Store the lowest level objects, like the C changelog index.
> Because higher level objects could be changed by extensions in
> unpredictable ways. (this is not true in my hacky prototype though)
> 2) Hash everything. For changelog, it's like the file stat of
> changelog.i. There must be a strong guarantee that the hash matches
> the content, which could be challenging, but not impossible. I'll
> cover more details below.
>
> The cache is scoped by repo to make the API simpler/easy to use. It may
> be interesting to have some global state (like passing back the extension
> path to import them at runtime).
Regarding this and "2) Side-effect-free repo", can or should we design the API
as something like a low-level storage interface? That will allow us to not
make repo/revlog know too much about chg.
I don't have any concrete idea, but that would work as follows:
1. chg injects an object to select storage backends
e.g. repo.storage = chgpreloadable(repo.storage, cache)
2. repo passes it to revlog, etc.
3. revlog uses it to read indexfile, where in-memory cache may be returned
e.g. storage.parserevlog(indexfile)
Perhaps, this 'storage' object is similar to the one you call 'baserepository'.
> Then, repo.chgcache['index'] becomes available in worker processes. When
> initializing the changelog index, try to use the chg cache:
>
> # inside changelog's revlog.__init__:
> # note: repo.chgcache is empty for non-chg cases, a fallback is needed
> self.index = repo.chgcache.get('index') or _loadindex(...)
>
> The API is the simplest that I can think of, while being also reasonably
> flexible (for example, we can add additional steps like forcing building
> the radix tree etc). But I'm open to suggestions.
>
> Some implementation details
>
> (This is the part that I want feedback the most)
>
> 1) IPC between chg master and forked worker
>
> This is how the worker tells the master about what (ex. repo paths) to
> preload.
>
> I think it does not need to be 100% reliable. So I use shared memory in
> the hacky prototype [2]. Pipes are reliable and may notify master
> quicker, while have risks of blocking. shm seems much easier to
> implement and I think the latency of master getting the information is
> not a big deal. I prefer shm, but other ideas are welcomed.
I like using pipes and it won't be complex so long as an IPC message is short
enough to write atomically (< ~4k). But pipe-vs-shm is merely an implementation
detail, so we can switch them later if needed.
> 2) Side-effect-free repo
>
> This is the most "painful" part for the preloading framework to be
> confident.
>
> The preload function has a signature that takes a repo object. But chg
> could not provide a real repo object. So it's best-effort.
>
> On the other hand, most preload functions only need to hash file stats,
> i.e. just use repo.vfs, repo.svfs etc. They do not need a full-featured
> repo object.
>
> Therefore I think the choices are:
>
> a) Provide a repo object that is: localrepository - side effects
> (maximum compatibility)
> b) Build a fresh new repo object that has only the minimal part, ex.
> vfs, svfs etc. (less compatible)
> c) Do not provide a repo object. Just provide "repo path".
> (move the burden to the preload function writers)
>
> I currently prefer a). The plan is to move part of "localrepository" to
> a side-effect free "baserepository" and use "baserepository" in the
> background preloading thread.
(see the comment above)
> 3) Side effect of extensions
>
> The new chg framework will not run uisetup()s. Where the preloading
> framework does sometimes depend on some side effects of extensions'
> uisetup()s. For example, the *manifest extension could change greatly
> about what the manifest structure so the default manifest preloading
> won't work as expected.
>
> I think there are different ways to address this:
>
> a) Add a top-level "chgsetup()" which only gets called by chg, and is
> meant to be side-effect free.
>
> Extensions could wrap chg's pseudorepo object, and also register
> its own preloading functions.
>
> This is actually pretty clean. But I'm not sure whether "chgsetup"
> is a good name or not, or if a top-level function is a good idea in
> general.
>
> b) Have a config option to force chg to load extensions as it does
> today.
>
> The problems are uisetup will accept wrong ui objects, which just
> confuses developers (and is the motivation of the ongoing
> refactoring). And it probably makes chg's logic more complex than
> ideal.
>
> I cannot think of other good ideas on this. In this situation, I prefer
> a).
(a) seems fine, if chgsetup() is 100% optional so the extension works flawlessly
without it (i.e. goes through uncached path.)
More information about the Mercurial-devel
mailing list