[PATCH 1 of 6 V2] obsstore: add a 'cachekey' method

Sat Jun 3 19:24:59 EDT 2017

On 06/03/2017 11:23 PM, Jun Wu wrote:
> Excerpts from Pierre-Yves David's message of 2017-06-03 12:12:55 +0200:
>> Indexes are definitely a good thing and we will want some at some point.
>>
>> However, the radix tree series on the list an early RFC[1] while this
>> obscache series is a port from the evolve extension of code in the
>> running in many places for over a month.
>
> I think I'm nearly completing it. As a side effect I did many small cleanups
> in obsolete.py
>
>> There are various things to clarify and adjust from that RFC series (eg:
>> collaboration with the transaction) before we can consider taking it.
>>
>> So to me the way forward here is to get the existing working solution in
>> now and incorporates the other one later. As a bonus point, they will
>> have fairly similar logic for cache validation and incremental upgrade.
>> So work from obscache will benefit the work on the indexes. If the
>> obscache happens to be "obsoleted" by something new and better (eg:
>> radix tree?), we can drop the obscache at that point. This just happens
>> to the "hiddencache" which got recently dropped since hidden computation
>> got much faster.
>>
>> In addition, it is unclear, we'll want to drop the obscache once the the
>> indexes land. Timing wise, the RFC version (using some C), compute the
>> obsolete set is 45.8ms vs 1.2ms (no extra C) for the obscaches[2]. This
>
> That 44ms data is outdated. My current version is even 10ms faster than
> obscache for "hg id". I didn't expect it to be, though.

There is likely something wrong with your timing is you get a 10ms speed 
up… since the obscache computation is about 1ms. How are you running 
your testing? Do you have your code available somewhere?

(note: you probably have an outdated version on obscache too, I shaved 
it a bit in the past week (1.58ms → 0.95ms))

> If the perf difference is within +/-20ms, I think it's fair to say this
> 500-ish lines obscache implementation seem over complicated.

The obscache logic itself is < 100 lines the rest are cache key + 
incremental upgrade logic common to other things. The changelog/revision 
part is pretty standard and we could factor it out with other similar 
user (but is small enough than the value is small). The obsstore related 
cache key and upgrade is doing something similar to what the indexes 
approach needs so there should be no significant difference in size and 
complexity.

>> is a big save on the startup time.In the same repository (no extension
>> but evolve-obscache), hg version take 80ms seconds. 44ms is over 50% of
>> that. For a more practical command `hg diff -r .^` is 170ms, so tens of
>> millisecond will be tens of percent slow down in impact for fast command
>> (and we have a small amount of obsolete revisions: about 6K).
>> The obscache is using revision indexing, this mean no rev to node
>> conversion and a more direct data access. To compute the set of obsolete
>> changeset (part of startup) this meant it will remains significantly
>> faster than an indexes approach. We independently want the indexes
>> approach to solve other issue.
>>
>> Finally, there are other evolve related area were we will want caches,
>> this series install the necessary infrastructure to easily add such
>> caches. For example obsolescence-markers-discovery is an area were we
>> know we'll need caching using this kind of infrastructure to detect and
>> perform update. In addition, various troubles[3] are a bit complex to
>> compute, I expect us to want to keep them in a cache too. Some of that
>> logic can even be reused for the radix tree implementation.
>
> As I said in another thread, I think a better long term direction is to make
> related revset lazy. People usually run `hg log -r "small-set"`, they won't
> care about things not being outputted. Today when I run "hg log -r .", hg
> checks 5k+ commits for obsolete revset, I don't think we want to keep that
> behavior forever.

Lazyness is nice, but vastely premature here in my opinion. We have all 
the computation fitting in 1 milliseconds. And the computation is 
"O(len(not public))" so it going to scale well even of massive 
repository. So the days were we'll needs to replace it with something 
more complex. Above you mention being ready to dismiss speed up in the 
10ms range to save hundred of lines, so a whole rework of an area in the 
1ms range really seems unnecessary.

> If we go that "lazy" direction, the cache approach seems unnecessary in
> general.
>
> I'm not sure about obsolescence-markers-discovery. But I do have some ideas
> in that area too. I guess we have to see the actual logic in that area to
> judge whether dualsourcecache is worthwhile or not.

Feel free to have a look at the code in the evolve extension.

Cheers,

-- 
Pierre-Yves David