State of pending series related to hidden/obsolete storage

Gregory Szorc gregory.szorc at gmail.com
Wed May 24 07:01:46 UTC 2017


There are a handful of in flight series related to storage and performance
of hidden/obsolete data:

1. Durham adds a "hidden" store/cache [1] and teaches things to update it
2. Pierre-Yves adds a cache for obsolescence data [2]
3. Pierre-Yves refactors some algorithms related to visibility to yield
massive speedups [3]
4. Jun applies decent algorithms and C to make obsstore suck less [4]

While the series are mostly unrelated in terms of code, they all are in the
same problem space. I'd like to use this thread to hash out their
relationships before we queue anything so (most) everyone is on the same
page and we don't introduce extra complexity, redundantly solve the same
problems, etc.

While I haven't looked at the patches in extreme detail, the Pierre-Yves
series to improve existing algorithms (mostly in repoview.py) [3] seems
very reasonable and non-controversial. Timings in the commit messages
reveal 75-150x speedups for some operations with time reductions of tens or
even >100 milliseconds. Assuming there isn't something fundamentally wrong
with the implementation that's making it so fast, it seems like an obvious
win. Dropping the cache could be contentious. But unless someone has a repo
demonstrating a significant win with the cache, I'm inclined to jettison
the complexity.

The other series are a bit more... complicated.

Pierre-Yves's obsolescence cache series [2] gave birth to Jun's RFC series
[4] after the two of them were discussing the caching strategy. I agree
with Jun that Pierre-Yves's proposed approach has inherent scaling limits.
I also think that with [3] and [2], we may have bought another year or two
for obsolescence scaling, even at Facebook scale. I'd *really* like to see
numbers from Facebook's repo to prove we need additional complexity,
especially if C code is involved. I'd also kinda like to see some more
attempts made to make the existing C code faster. For example, current
obsstore parsing is slow partly because we instantiate PyObjects for
each marker, which are themselves composed of other PyObjects. I suspect we
can make obsstore "parsing" 10-100x faster by avoiding PyObject overhead
and implementing intelligent APIs in non-Python C. Similar observations
apply to revlog indexes. No, we won't get to the ideal end state without
better data structures and potentially caches. But I think we're in a
"perfect is the enemy of good" situation, especially since we don't all
agree on what exactly the final state should be. That's a long-winded way of
saying "I realize we'll eventually have to implement something like Jun's
radix tree solution, but until I see data showing it is a pressing
performance concern, we should defer the work."
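To illustrate the kind of win I mean, here is a minimal sketch in pure
Python (using a made-up fixed-width record layout, NOT the real obsstore
format, and invented names throughout): instead of eagerly materializing a
tuple of PyObjects per marker, keep the raw bytes and decode a record only
when it is actually indexed.

```python
import struct

# Toy fixed-width record format (hypothetical, not the real obsstore
# wire format): 20-byte precursor hash, 20-byte successor hash, 4-byte
# flags field.
RECORD = struct.Struct(">20s20sI")

def eager_parse(data):
    """Materialize every record up front: one tuple plus three PyObjects
    per marker, the allocation pattern criticized above."""
    return [RECORD.unpack_from(data, off)
            for off in range(0, len(data), RECORD.size)]

class LazyMarkers:
    """Keep the raw bytes; decode a record only on access.

    'Parsing' a store of N markers allocates O(1) PyObjects instead of
    O(N); a C implementation could go further and answer queries without
    creating PyObjects at all."""

    def __init__(self, data):
        self._view = memoryview(data)

    def __len__(self):
        return len(self._view) // RECORD.size

    def __getitem__(self, i):
        return RECORD.unpack_from(self._view, i * RECORD.size)

# Usage: two toy markers; only the one we index is ever decoded.
data = RECORD.pack(b"a" * 20, b"b" * 20, 1) + \
       RECORD.pack(b"c" * 20, b"d" * 20, 2)
markers = LazyMarkers(data)
```

The point is not this particular layout but the API shape: expose lookups
over the raw store rather than a list of decoded objects.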

That brings us to Durham's series [1]. As Pierre-Yves pointed out, the
series does a lot of things, some of which are also implemented by
Pierre-Yves's own series (such as more efficient resolution of hidden
changesets). You therefore have to look at Durham's series as a set of new
features and concepts for managing visibility. Before, obsolescence was the
sole source of truth. Afterwards, visibility is its own concept and
obsolescence flows into it. I like this model because it makes the
visibility mechanism more flexible and sets us up for future experiments. I
also don't think it will interfere with evolve too much - at least as long
as the APIs are designed correctly.

If you rolled back time and un-invented obsolescence markers, I'd like to
think that what Durham is attempting to do is how visibility would have
been rolled out in Mercurial. If you wanted to preserve append-only storage
by avoiding stripping, you would declare a way to hide and unhide
changesets. This would consist of some kind of store and an internal API to
tell the visibility subsystem when changesets are changing state. If in
this world we suddenly realized that you could record relationships between
changesets (obsolescence markers), we would have obsolescence markers flow
into the existing visibility mechanism and influence it as appropriate. In
other words, in an alternate reality where obsolescence didn't exist,
Durham built a Mercurial feature we would all be elated to have because it
opened all kinds of new possibilities. But because we have an existing
concept of visibility derived solely from obsolescence markers, it looks a
bit awkward being introduced today.
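The model described above can be sketched in a few lines. All names here
are invented for illustration and are not Durham's or Mercurial's actual
API: visibility is its own store with hide/unhide operations, and
obsolescence markers merely feed into it as one input among several.

```python
class VisibilityStore:
    """Hypothetical store tracking explicitly hidden revisions.

    Visibility is the source of truth here; nothing in this class knows
    about obsolescence."""

    def __init__(self):
        self._hidden = set()

    def hide(self, revs):
        """Internal API callers use when changesets change state."""
        self._hidden.update(revs)

    def unhide(self, revs):
        self._hidden.difference_update(revs)

    def hidden(self):
        return frozenset(self._hidden)

def apply_obsolescence(store, markers):
    """Let obsolescence flow into visibility: the precursor of each
    (precursor, successor) marker gets hidden, rather than visibility
    being derived from the markers directly."""
    store.hide(prec for prec, succ in markers)

# Usage: an explicit hide and an obsolescence-driven hide coexist.
store = VisibilityStore()
store.hide([2])                       # explicit hide, no markers involved
apply_obsolescence(store, [(5, 6)])   # toy marker: rev 5 obsoleted by 6
```

The design point is the direction of the dependency: obsolescence depends
on visibility's API, never the reverse, which is what leaves room for
other visibility sources later.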

In summary, both of Pierre-Yves's series seem reasonable to me. Let's make
reading/computing obsolescence and derived visibility as fast as it needs
to be. I could do with less caching. But if caching is needed, it is a
necessary evil. And being a cache, we can remove it any time should a
better solution come along.

Durham's series is a bit contentious, yes. But I argue it is retroactively
inventing infrastructure that should have been in place to support
obsolescence from the beginning. I have concerns about the current
implementation, such as its half-cache/half-store solution and apparent
redundancy between APIs to update visibility and write obsolescence markers
(surely we can unify that somehow). Formalizing the concept of visibility
outside of obsolescence is an important first step to unlocking a number of
goals for various parties. If done right, I see it having minimal impact on
evolve development. And I'm pretty confident Pierre-Yves will let us know
when it isn't being done right and when there will be interaction issues
between generic visibility and evolve.

That's a long way of saying "I generally like what I see and I'm ready to
review/bikeshed some code." Are others on the same page?

[1]
https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-May/098069.html
[2]
https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-May/098257.html
[3]
https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-May/098448.html
[4]
https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-May/098330.html