[PATCH 1 of 6 V2] obsstore: add a 'cachekey' method

Gregory Szorc gregory.szorc at gmail.com
Wed Jun 7 16:04:02 EDT 2017


On Wed, Jun 7, 2017 at 10:27 AM, Jun Wu <quark at fb.com> wrote:

> Excerpts from Pierre-Yves David's message of 2017-06-07 18:04:22 +0100:
> >
> > On 06/07/2017 03:57 PM, Jun Wu wrote:
> > > Excerpts from Pierre-Yves David's message of 2017-06-07 15:27:41 +0100:
> > >> I'm not advocating for introducing dead code. The obsstore indexes
> > >> (radix-link) are a cache layer on top of the obsstore. As a cache,
> > >> the indexes need a cache key and associated logic for incremental
> > >> updates.
> > >
> > > The problem I see for your cache-key approach is that it has to have
> > > length information to be able to do an incremental update. So it's
> > > not a typical cache key.
> >
> > I'm not sure what you mean here. The cache key used to validate the
> > cache and detect changes is only:
> >
> >    <size-of-obsstore, hash-from-content>
> >
> > This is based on the same things as the cache key you use for the
> > indexes (the size of the obsstore), with an additional content check
> > (a hash) in my proposal.
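
(For concreteness, a minimal sketch of how such a size+hash key could be
computed; the function name and sampling scheme are illustrative, not the
actual implementation:)

    import hashlib
    import os

    def obsstorecachekey(path, upto=None, samplesize=200):
        """Return (upto, hash) identifying the first `upto` bytes of
        the obsstore file at `path` (hypothetical helper).

        Only the last `samplesize` bytes before `upto` are hashed, so
        the key stays cheap to compute and is unaffected by later
        appends.
        """
        if upto is None:
            upto = os.stat(path).st_size
        with open(path, 'rb') as f:
            f.seek(max(0, upto - samplesize))
            sample = f.read(min(samplesize, upto))
        return upto, hashlib.sha1(sample).hexdigest()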
>
> I would not call that a "cache key". I think the common understanding of
> a "cache key" is that it only supports an Eq operation and does not
> support PartialEq.
>
> A UUID (or whatever ID) would be a cleaner concept: if the ID changes,
> invalidate the whole cache.
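
(A rough sketch of that ID-based scheme; the sidecar file name and
helpers are hypothetical:)

    import os
    import uuid

    IDFILE = 'obsstore.id'  # hypothetical sidecar file

    def storeid(storedir):
        """Read the obsstore's identity, creating one if missing. Any
        code that rewrites the obsstore other than by appending must
        also write a fresh ID."""
        path = os.path.join(storedir, IDFILE)
        try:
            with open(path, 'rb') as f:
                return f.read()
        except FileNotFoundError:
            newid = uuid.uuid4().bytes
            with open(path, 'wb') as f:
                f.write(newid)
            return newid

    def cachestillvalid(cachedid, storedir):
        """If the ID changed, throw the whole cache away; otherwise a
        cached length says where incremental reading resumes."""
        return cachedid == storeid(storedir)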
>
> > When this series was written, directly obtaining markers from an
> > (old, new) offset range of the obsstore was not possible. The "number
> > of obsmarkers" used by the abstract cache is there to work around
> > this. One of the goals of having the code in core was to follow up
> > with a rework of the obsmarkers reading API to allow such partial
> > reads and drop the "number of markers" hack. A comment points this
> > out in the code I sent.
> >
> > I see that your indexes series does all the refactoring that I needed
> > (mostly patches 5, 6 and 12). Thank you for that.
> >
> > (Note: evolve grew a hacky version of this a few weeks back and is no
> > longer using the "number of markers" in practice:
> >
> > https://www.mercurial-scm.org/repo/evolve/file/85cabd631a1b/hgext3rd/evolve/obscache.py#l130
> > )
> >
> > In addition, I'm not sure what you mean by "not typical". Most caches
> > in Mercurial have "tiprev" as part of their cache key and use it to
> > compute their incremental update.
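
(For comparison, the usual shape of that pattern, assuming a hypothetical
cache object; only `repo.changelog` is real Mercurial API:)

    def updatefromtip(repo, cache):
        # `cache.tiprev`/`cache.tipnode` record what the cache was
        # last computed against (illustrative attribute names).
        cl = repo.changelog
        if cache.tiprev >= len(cl) or cl.node(cache.tiprev) != cache.tipnode:
            cache.clear()       # history was rewritten: full rebuild
            cache.tiprev = -1
        for rev in range(cache.tiprev + 1, len(cl)):
            cache.add(rev)      # incremental part of the update
        cache.tiprev = len(cl) - 1
        cache.tipnode = cl.node(cache.tiprev)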
> >
> > > I'd prefer the UUID + length approach. We have "length" implemented
> > > today.
> >
> > The UUID approach requires breaking append-onlyness or using multiple
> > files. It is less friendly to transactions (including wide transaction
> > rollback) and races get annoying to avoid.
>
> Could you give an example of how a race could happen, and why it is not
> append-only if the ID is stored in a separate file?
>
> I think the only real problematic case is old clients.
>
> > The append-only approach does not suffer from these drawbacks, making
> > it more appealing.
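
(A sketch of the incremental update the append-only property enables,
reusing the hypothetical obsstorecachekey() from above; parsemarkers()
stands in for the partial-read API discussed earlier:)

    import os

    def updateobscache(path, cachedkey, parsemarkers):
        """Catch a cache up with an append-only obsstore.

        `cachedkey` is the (size, hash) recorded at the last update.
        Returns None when a full rebuild is needed, otherwise the
        markers appended since the cache was last written.
        """
        oldsize, oldhash = cachedkey
        newsize = os.stat(path).st_size
        # The old content must still be a prefix of the current file.
        if (newsize < oldsize
                or obsstorecachekey(path, upto=oldsize)[1] != oldhash):
            return None                # store was rewritten: rebuild
        with open(path, 'rb') as f:
            f.seek(oldsize)
            data = f.read(newsize - oldsize)   # only appended bytes
        return parsemarkers(data)      # may decode to an empty list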
> >
> > >> What I need here is to have this logic "standard" and reusable by
> > >> other caches. I can split the obsstore-specific part from the
> > >> changelog part and make it available, either used by the obscache
> > >> or by the indexes.
> > >>
> > >> note: the current indexes cache key is fragile (file length check
> > >> only) and would benefit from the stronger cachekey+logic available
> > >> in this series.
> > >
> > > I'm not sure checking 200 bytes (in your current approach) is not
> > > fragile.
> >
> > It is less fragile. Obsmarkers created in the same repo have high odds
> > of having the same size; the byte hashing catches content changes. We
> > can increase the number of bytes we read a bit to reinforce it if this
> > turns out to be too low.
>
> But it is still problematic. Do you agree?
>
> > See my other email for ideas on how to make it more complete:
> >
> > https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-June/099317.html


This thread and the larger conversation around obsolescence perf
improvements are massive.

While it appears that radixlink and obscache are complementary proposals,
they both aim to achieve similar end results. It isn't clear to me that
both of them will be needed.

Given:

a) the radixlink work has use cases outside of obsolescence
b) there are reports from Facebook that obscache isn't huge-repo friendly
and is therefore not as ready as radixlink
c) it appears that the obscache feature can be implemented out of core
(presumably in the evolve extension) more easily than radixlink can
d) obscache currently only seems to provide incremental performance wins
over radixlink, at the cost of non-trivial complexity, and at least one
reviewer has already expressed hesitations (
https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-June/099040.html
)

I favor tabling this obscache series and focusing on landing radixlink.
Best case, we do that, obsolescence scales well enough, we're done, and
we can use radixlink for other things. Worst case, it doesn't pan out and
we rip it out of obsolescence once something better comes along.
Realistically, it lands a perf win today, frees up Jun :), unblocks other
radixlink work, and Pierre-Yves later demonstrates how things could be
even better, we listen to him, and we accept his patches, possibly before
the 4.3 release.

I hate saying "no" (really: "just put this on hold for a week or two") to
someone and playing kingmaker. But I feel the debate is causing too much
"stop energy" and is demotivating everyone. I prefer we fall forward and
make some progress, even if it isn't perfect. And to maximize the chances
of forward progress, I think we need to focus on one solution to
obsolescence performance at a time. I perceive the implementation with the
most value to be radixlink because it can be used outside obsolescence and
has already demonstrated merit over obscache on huge repos.