Computed Index

Status: Project

Main proponents: Pierre-YvesDavid

/!\ This is a speculative project and does not represent any firm decisions on future behavior.

Add an intermediate layer of data storage, computed from the source of truth, but with stronger guarantees than the current caches.

/!\ This page has light content to start the discussion; some of the implementation details are more advanced than what is described here.

1. Goal

1.1. The problem

Currently there are two main places where we record data: the store and the cache. The store is the source of truth, and the cache holds data computed from the store that may or may not be present and relevant to the current state of the repository. We never guarantee that the data in the cache is valid, which means we can freely add new cache entries and update their formats without requiring a repository format upgrade and new entries in requires. However, this also means that each cache file needs to implement its own validation mechanism, and that some important caches can remain out of date.

This last part is problematic for multiple reasons:

* For some repositories, some data are just too expensive to recompute in full when the cache is not up to date. For example, invalid branchmap and tag node caches can take minutes to recompute on some large repositories. Read-only operations typically do not update the cache on disk, so they pay this cost on each invocation, making the repository practically unusable.

* The most common cache validation key (tiprev, tipnode) is flawed, so a cache can appear valid without being valid.

* The items needed to validate a cache can lead to increased storage (e.g. branchcache) or extra computation time (e.g. branchmap).
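
As a toy illustration of the second point, the sketch below shows one way a cache keyed only on (tiprev, tipnode) can look valid while serving stale data. `ToyRepo`, its `hidden` set and `TipKeyedCache` are hypothetical names invented for this example, not Mercurial APIs, and the scenario (derived data that depends on state other than the tip) is an assumption about how the key can fail, not a description of a specific bug.

```python
class ToyRepo:
    """Hypothetical stand-in for a repository; not Mercurial's API."""

    def __init__(self, nodes):
        self.nodes = list(nodes)  # revision number -> node id
        self.hidden = set()       # revisions excluded from the current view

    def tip_key(self):
        tiprev = len(self.nodes) - 1
        return (tiprev, self.nodes[tiprev])

    def visible_count(self):
        """The 'expensive' derived value that we want to cache."""
        return sum(1 for rev in range(len(self.nodes)) if rev not in self.hidden)


class TipKeyedCache:
    """Cache validated only by (tiprev, tipnode)."""

    def __init__(self):
        self.key = None
        self.value = None

    def get(self, repo):
        # Recompute only when the tip key changed.
        if self.key != repo.tip_key():
            self.value = repo.visible_count()
            self.key = repo.tip_key()
        return self.value


repo = ToyRepo(["a1", "b2", "c3"])
cache = TipKeyedCache()
fresh = cache.get(repo)  # computed from scratch
repo.hidden.add(1)       # repository state changes, but the tip does not
stale = cache.get(repo)  # key still matches, so the old value is served
```

Here `stale` no longer matches `repo.visible_count()`, yet the cache has no way to notice, because nothing it checks has changed.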

1.2. The Proposed Solution

We introduce a third space: the "Index" space. The index contains data fully derived from the store, but guaranteed to be in sync at the end of each transaction. Each change to the index needs to come with an associated change to requires, to make sure clients keep it up to date.

Note: since the data will be spread across multiple files, we will still need some way to validate that we read consistent data (all from the same transaction). However, the mechanism can be much simpler.
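
One possible shape for such a check, as a minimal sketch: each index file is tagged with the identifier of the transaction that wrote it, and a reader rejects any set of files whose tags disagree. The JSON layout, the `txn` field and both function names are assumptions made for this example; the plan does not specify a format.

```python
import json


def write_index_file(path, txn_id, payload):
    """Write one index file, tagged with the transaction that produced it."""
    with open(path, "w") as fp:
        json.dump({"txn": txn_id, "data": payload}, fp)


def read_consistent(paths):
    """Return {path: data} if every file comes from the same transaction.

    Raises ValueError when the files disagree, so the caller can retry.
    """
    loaded = {}
    for path in paths:
        with open(path) as fp:
            loaded[path] = json.load(fp)
    txn_ids = {entry["txn"] for entry in loaded.values()}
    if len(txn_ids) != 1:
        raise ValueError("index files from different transactions: %r" % sorted(txn_ids))
    return {path: entry["data"] for path, entry in loaded.items()}
```

Compared with per-cache validation keys, the reader only compares one identifier per file and never has to inspect the store.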

2. Detailed description

We want to add an index and a windex directory, with the associated vfs. Some of the existing caches could be migrated there (list TBD). Some of the new features we write could go there directly.

We want to use append-only-friendly storage as much as possible, since this makes transactional consistency easier. Having extra data (from an in-progress or later transaction) at the end of a file can be harmless if properly detected. This is also a good opportunity to introduce a repository-wide identifier for the current state of the repository.
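
A minimal sketch of how trailing data can be made harmless, assuming the reader learns the committed size of each file from somewhere trustworthy (for example, metadata written at the end of the transaction). The fixed-size record format and the function names are hypothetical.

```python
import os
import struct

RECORD = struct.Struct(">I")  # hypothetical fixed-size record: one 32-bit integer


def append_records(path, values):
    """Append records and return the file size after the write."""
    with open(path, "ab") as fp:
        for value in values:
            fp.write(RECORD.pack(value))
    return os.path.getsize(path)


def read_records(path, committed_size):
    """Read only the bytes covered by the last committed transaction.

    Anything past committed_size is treated as belonging to an in-progress
    or later transaction and is ignored.
    """
    with open(path, "rb") as fp:
        data = fp.read(committed_size)
    if len(data) < committed_size:
        raise ValueError("file is shorter than the committed size")
    usable = len(data) - len(data) % RECORD.size
    return [RECORD.unpack_from(data, off)[0] for off in range(0, usable, RECORD.size)]
```

Because writers only ever append, a reader holding an older committed size still sees a valid prefix of the file.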

Some of the data currently in caches could move directly into the revlog indexes.

If we use more append-only files, we need good handling of strip and rollback.
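
For the rollback half, append-only files allow a simple approach, sketched here under the assumption that file sizes are recorded before a transaction starts. Both function names are hypothetical. A strip that removes data from the middle of a file would need more than truncation and is not covered by this sketch.

```python
import os


def snapshot_sizes(paths):
    """Record the current size of each append-only file before a transaction."""
    return {path: os.path.getsize(path) for path in paths}


def rollback(sizes):
    """Truncate each file back to its recorded size, dropping appended data."""
    for path, size in sizes.items():
        with open(path, "r+b") as fp:
            fp.truncate(size)
```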

3. Roadmap

* indexvfs and windexvfs
* having a "pointer file" atomically updated by the transaction to get a consistent view of the repository
* investigate current caches that could become indexes
  - either as new files in index
  - or directly in the changelog indexes
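
The "pointer file" item could look roughly like the sketch below, which replaces the file in a single rename so readers see either the old state or the new one. The JSON content and the function names are assumptions for illustration; the plan does not define what the pointer file contains.

```python
import json
import os
import tempfile


def write_pointer(path, state):
    """Replace the pointer file with a new state description in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".pointer-")
    try:
        with os.fdopen(fd, "w") as fp:
            json.dump(state, fp)
            fp.flush()
            os.fsync(fp.fileno())
        # The rename is atomic on POSIX: readers never see a partial file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_pointer(path):
    """Return the state recorded by the last completed write_pointer call."""
    with open(path) as fp:
        return json.load(fp)
```

A transaction would append to its data files first and call `write_pointer` last, so the pointer never refers to data that has not been written yet.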

4. See Also

MMapPlan

CategoryDeveloper CategoryNewFeatures

Revision history: revision 2 as of 2019-12-16 18:29:28 (editor: Pierre-YvesDavid); revision 3 as of 2020-01-11 14:03:49 (editor: aayjaychan, comment: spelling and editing).