[PATCH 5 of 6] hidden: add hiddenset and updatevisibility logic

Sun Jun 4 13:15:57 UTC 2017

Thanks for your reply. It clarified multiple aspect of your goals and 
constraints (eg: changegroup loading time). That discussion seems on 
good track to me.

On 05/23/2017 12:38 AM, Durham Goode wrote:
> On 5/22/17 2:38 AM, Pierre-Yves David wrote:
>> On 05/19/2017 05:25 AM, Gregory Szorc wrote:
>>> On Thu, May 18, 2017 at 11:23 AM, Durham Goode <durham at fb.com
>>> <mailto:durham at fb.com>> wrote:
>>>
>>>     # HG changeset patch
>>>     # User Durham Goode <durham at fb.com <mailto:durham at fb.com>>
>>>     # Date 1495129620 25200
>>>     #      Thu May 18 10:47:00 2017 -0700
>>>     # Node ID 0979ce9d40803103441b6df1ebefafe80bb9a057
>>>     # Parent  2aef3c4be6ba81e9d2453760515517df6fc2b816
>>>     hidden: add hiddenset and updatevisibility logic
>>
>> TL;DR; This series seems to try to address multiple distinct problem.
>> Multiple of them having much simpler solution (some already available).
>> I think we should clarify our goals here. The result will probably be to
>> focus the effort on adding a simple "local-hidding" mechanism using the
>> existing hiding API.
>>
>> --
>>
>> [longer reply]
>>
>> Having looked at the full series, I've the feeling that there are many
>> distinct topic that this new class tries to address. I've gathered the
>> following:
>>
>>  * hidden API usable by extension,
>>  * speed improvement (using write time computation),
>>  * an API to get notified of relevant changes,
>>  * Local hiding solution,
>>  * unified storage for hiding.
>>
>> There might be other I missed. Durham can you clarify your goals if
>> needed?
>
> The overall goal here is to refactor hiding to be in an isolated part of
> the code base, so we can reason about it and iterate on it more easily.

After a refactoring from Martin and me, the code related to hiding is 
contained in a handful of functions at the top of 
"mercurial/repoview.py" file. Is this isolated enough for your need? 
Should we move them into a small hidden.py module?

> All the goals you list become easier to implement if we have a good
> abstraction for hiding.

(I'm not sure why, but lets skip that part)

> Part of the internal goal of this change is to put pieces in core that
> would let us do hash preservation within Facebook, so we could get rid
> of the inhibit extension.

Last month, Jun and I had good a discussion about a possible way to get 
preservation of the user visible hash while preserving all the existing 
evolution features. It was promising and would allow Facebook to stay on 
the same feature set as the rest of the world.

Latest status was: "Jun is about to write a plan page." What happened to 
that idea?

>> Trying to tackle that many different problem at the same time worries
>> me. Addressing each of them independently should be simpler, in
>> addition, solving issue at a lower level bring benefit to other parts of
>> the code. For example, skipping obs-store loading just needed a straight
>> forward cache, easy to update iteratively. Having that cache in now
>> speeding up all operations that rely on the "obsolete-set" information,
>> and the cache logic can be reused for other related data.
>
>> My short opinion on all this topics are:
>>
>>  * API: extensible API already exists so I'm a bit confused,
>>  * Speed: simpler solutions in review,
>>  * notification: seems redundant with 'tr.changes' and less robust.
>>  * local-hiding: Good idea, go for it (consider using ranges, not roots)
>>  * unifying storage: seems unnecessary and dangerous.
>>
>> Longer opinion available below. I've also included technical comment in
>> the patch itself.
>>
>> API for extensions
>> ------------------
>>
>> Quoting durham email:
>>
>>> This has the nice property that all logic for propagating hiddenness is
>>> contained to one function, and all the consumers just have to report
>>> what
>>> commits they touched. If an external extension wants to modify how
>>> hiding is
>>> computed, it can just wrap this one function.
>>
>> I'm a bit confused about that part of your description, because the
>> current implementation offer a small surface API already.
>>
>> There are two functions "hideablerevs" and "_getdynamicblockers"[1].
>>
>> * "hideablerevs": return the revision we would like to see hidden
>>   (eg: obsolete revision)
>>
>> * "_getdynamicblockers": return revision that must be visible
>>   (eg: bookmarks)
>>
>> Extensions can already overwrite these two functions and augment/alter
>> their return to get any wished behavior.
>
> The piece that is missing from the current API is infrastructure to
> listen for changes that affect hiding at write time. This includes being
> notified of a change, as well as having apis available to incrementally
> update hiding during that change and a pluggable place to store the new
> hidden computation results.

(The "listen for changes" API seems covered by tr.changes and you 
address that part later, so skipping that bit)

After rework, the hidden computation (still in Python) is within the 
millisecond range, with "O(len(mutable))" complexity. With such timing 
we seems in the clear for a long time and we can probably skip the write 
time caching of the final hidden set.

Remains you goal to implement a different methods for computing what is 
hidden. (If I got it right, it is based on the sequence of event instead 
of the on-disck) state of the repository).

With the current state of the API, the simplest way to get there seems 
to feed tr.changes to a custom logic updating a custom dedicated 
storage. Then transmit the content of that storage through 
'hideablerevs()' (replacing the native obsolescence based) and let the 
rest of the hidden computation do its duty (in 1 ms).

Having that running is fairly independent and does not seems to requires 
large rework of the hidden computation machinery.

> This series doesn't remove computehidden() as an extension point, it
> just provides the ability to hook into hiding at more locations (write
> time and storage time, in addition to the existing read time).

That parts confuses me a bit. If the goal is to simplify the hidding API 
and computation, why are we aiming at adding more hook points?

>> Speed
>> -----
>>
>> The current approach for computing hidden can scale well. Actually, I've
>> two series on the list that speed it up down to 1ms order of magnitude.
>> The first one[2] is adding a small cache to retrieve the set of obsolete
>> changeset without loading the obsstore. The second one[3] is updating
>> the old computehidden code to only performs O(len(visible-mutable))
>> python operations.
>>
>> If our goal is to provide local-only hiding, we can get there faster,
>> with simpler code, by simply feeding the "hideablerevs" function.
>>
>
> I eventually want every read operation to be O(number of changes you're
> asking about).  If I run 'hg log -r .' and we go perform ancestor
> queries on commits that are two months old, then that's too slow. We're
> at the point where we're beginning to think about how to avoid loading
> even parts of the changelog .i during standard read operations.

While "O(1)" sound nice, in practice O(mutable) should be fast enough 
even of gigantic repository like yours (since the number of locally 
created changeset is much smaller than the size of the repository).

Obsolete + hidden computation can easily be kept in the 1 ms range. On 
On a repository your size, loading the changelog or even computing 
phases will be some orders of magnitude slower than 1ms. I'm curious 
about your current timing for changelog loading and and phase 
computation in the repos where they get problematic. Can you share some?

It seems like you already have something in mind to avoid loading the 
whole changelog. I'm a bit curious about what it is. Off the top of my 
head I would preserve the nodeâ†’rev indexes to disk somehow and read the 
changelog data in a lazier way. Can you share you actual plan here?

If the time spend computing the phases become an issue, I can see a 
couple of straightforward way to bring this to O(mutable). One way  is 
to naively write the current phases data for each revs directly to disk. 
(and update it incrementally). Another way would be a small 
on-disk-format changes for phases  to that is store "ranges" not just 
"roots". both solution above can bring the computation back to 
O(mutable) with an acceptable timing.

Making this improvement to phases should be simpler than your plan to 
introduce a unified hidden cache (phase change looks like a strict 
subset of the global hidden cache idea). In addition improving phases 
would benefit multiple other parts of the codebase.

With all the improvement above, the hidden computation, including 
changelog and phases access should be in the millisecond range and I 
don't think we'll have to care about it for a while.

What do you think?

> This series doesn't get us to O(1) hidden checks, but it puts in place
> the hidden abstraction that would let us eventually get there. I don't
> think we should have to solve O(1) lookups for all the inputs to hidden
> before we can get O(1) hidden lookup time.
>
>> Notification API
>> ----------------
>>
>> There are a similar efforts already started around the
>> 'transactions.changes' dictionary. It aims at tracking all changes
>> happening within a transaction. If we needs to track changes, I
>> recommend reusing that logic instead of duplicating the effort (adding
>> more odds for bugs and maintenance load).
>>
>> The 'tr.changes' approach the low level function adding data recording
>> these changes. This is lower level than the approach in this patch and
>> will be more robust as We need to update the one function adding data,
>> instead of all the consumer of that function. We just fixed a bug from
>> this lack of collaboration last week[3], so it seems important to keep
>> it simple here.
>
> tr.changes might be the appropriate spot for this. I can definitely look
> into that as a replacement for hidden.updatevisibility().

Thanks!, there are not many data recorded in tr.changes right now, but 
adding the missing one should be simple.

>> unified storage for hiding
>> --------------------------
>>
>> Given we can extract and compute hidden information from obsolescence in
>> the 1ms range, I don't see the advantage of unifying the storage. On the
>> contrary, hidden properties from obsolescence is a critical part of
>> changeset-evolution. Mixing the storage will makes it harder to
>> guarantee this property. So better keeping them separated.
>
> As we've seen, there are differing opinions on the future of hiding
> commits in Mercurial and this patch opens doors to some options without
> breaking evolve. Having this conceptual distinction in core will help
> give people more insight into and confidence in the various options
> around hiding so we can have a more productive discussion next time.

Since we already have the conceptual (and in the API) distinction 
between obsolescence and hidden computation, I'm not sure what you mean 
here. (An alternative/extension just needs to provide it local and storage.

Cheers,

-- 
Pierre-Yves David