Wire protocol futures

Sat Sep 1 05:44:54 UTC 2018

I don't know much about the wire protocol, so this is mostly for my own
understanding...

On Fri, 31 Aug 2018 18:47:34 -0400, Gregory Szorc  
<gregory.szorc at gmail.com> wrote:

> Another problem that seemingly becomes simpler is large file storage. I
> argue that largefiles and LFS today is effectively a hack to facilitate
> non-partial clones despite the presence of large files. We store and
> transfer flagged large files specially. But if your method of accessing
> files data is through a dedicated "get file data" command, when you squint
> hard enough you realize that this is logically very similar to "all files
> are using largefiles/LFS." This leads to questions like "if we have a
> dedicated 'get file data' API, why do we need a special store / endpoint
> for large files?" And if we communicate the sizes of files before file data
> is retrieved or don't transfer revision data over a size threshold unless
> the client asks, this puts clients in the driver's seat about whether to
> fetch large files revisions. We could implement all the benefits of
> largefiles / LFS without it having to be a feature that repositories and
> servers opt in to! i.e. clients could dynamically apply special storage
> settings on large file revisions as they see fit.

Interesting thought!  How does downloading specific revisions on demand
interact with the append-only nature of filelogs and (IIUC) entries being
stored as deltas against the parent?  IOW, the client requests file at 100
(so it stores a full snapshot), and later file at 50, and then maybe file
at 75.  Is this all rebuilt on the client side, or is it totally dependent
on an alternate file storage?  Presumably this sort of thing is handled in
RFL (I've never used it), but would it scale to truly large files?
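
To make the question concrete, here is a minimal sketch of a client-side
store that keeps one full snapshot per node, so revisions can arrive in any
order (100, then 50, then 75) without an append-only delta chain.  The
OnDemandFileStore class, its fetch callback, and make_fake_server are
hypothetical names for illustration, not Mercurial or RFL APIs.  The only
part taken from Mercurial is the node computation: SHA-1 over the two
sorted parent nodes followed by the text.

```python
import hashlib

NULLID = b"\0" * 20


def hg_node(text, p1=NULLID, p2=NULLID):
    """SHA-1 over the sorted parent nodes plus the revision text."""
    a, b = sorted([p1, p2])
    s = hashlib.sha1(a + b)
    s.update(text)
    return s.digest()


class OnDemandFileStore:
    """Hypothetical store: full snapshots keyed by node, no delta chain."""

    def __init__(self, fetch):
        # fetch(node) -> (text, p1, p2); stands in for a "get file data"
        # wire protocol command (an assumption, not a real API).
        self._fetch = fetch
        self._snapshots = {}
        self.fetched = []  # order in which nodes were pulled from the server

    def read(self, node):
        if node not in self._snapshots:
            text, p1, p2 = self._fetch(node)
            if hg_node(text, p1, p2) != node:
                raise ValueError("integrity check failed")
            self._snapshots[node] = text
            self.fetched.append(node)
        return self._snapshots[node]


def make_fake_server(texts):
    """Build a linear history in memory; returns (node -> data, node list)."""
    server, order, parent = {}, [], NULLID
    for text in texts:
        node = hg_node(text, parent)
        server[node] = (text, parent, NULLID)
        order.append(node)
        parent = node
    return server, order


server, nodes = make_fake_server([b"v50\n", b"v75\n", b"v100\n"])
store = OnDemandFileStore(server.__getitem__)
# Request the newest revision first, then older ones, as in the question.
for n in (nodes[2], nodes[0], nodes[1]):
    store.read(n)
```

Keying by node sidesteps the ordering problem, but it gives up delta
compression entirely, which is exactly the scaling worry for truly large
files.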

> Let's talk about hashes.
>
> Mercurial uses SHA-1 for content indexing. We know we want to transition
> off of SHA-1 eventually due to security weaknesses. One of the areas
> affected by that is the wire protocol. Changegroups use a fixed-width 20
> byte field to hold node values. That means we need to incur some kind of BC
> break in order to not use SHA-1 over the wire protocol. That's either
> truncating a longer hashing algorithm output to 20 bytes or expanding the
> fixed-width field to accommodate a different hash (likely 32 bytes). Either
> way, it requires a BC break because old clients would barf if they saw data
> with the new format.
>
> In addition, Mercurial has 2 ways to store manifests: flat and tree.
> Unfortunately, any given repository can only use a single manifest type at
> a time. If you switch manifest formats, you change the manifest node
> referenced in the changeset and that changes the changeset hash.
>
> The traditional way we've thought about this problem is incurring some kind
> of flag day. A server/repo operator makes the decision to one day
> transition to a new format that hashes differently. Clients start pulling
> the new data for all new revisions. Every time we talk about this, we get
> uncomfortable because it is a painful transition to inflict.
>
> I think we can do better.
>
> One of the ideas I'm exploring in the new wire protocol is the idea of
> "hash namespaces." Essentially, the server's capabilities will advertise
> which hash flavors are supported. Example hash flavors could be
> "hg-sha1-flat" for flat manifests using SHA-1 and "hg-blake2b-tree" for
> tree manifests using blake2b. When a client makes a request, that request
> will be associated with a "hash namespace" such that any nodes referenced
> by that command are in the requested "hash namespace."
>
> This feature, if implemented, would allow a server/repository to index and
> serve data under multiple hashing methodologies simultaneously. For
> example, pushes to the repository would be indexed under SHA-1 flat, SHA-1
> tree, blake2b flat, and blake2b tree. Assuming the server operator opts
> into this feature, new clones would use whatever format is
> supported/recommended at that time. Existing clones would continue to
> receive SHA-1 flat manifests. New clones would receive blake2b tree
> manifests. No forced transition flag day would be required. Server
> operators could choose to keep around support for legacy formats for as
> long as they deemed necessary. And the "changesetdata" command I'm
> proposing could allow querying the hashes for other namespaces, allowing
> clients to map between hashes.

Would .hgtags and .hgsubstate be rewritten on the fly, like convert does?
(Maybe .hgsubstate can't be, because those are really subrepo hashes,
unless we take the approach that the subrepo clone inherits its parent's
selection.)  If yes, then maybe the same mechanism can be used to rewrite
any hashes in the commit message and extras?
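
As a rough illustration of what I mean by rewriting on the fly, assuming
the server can hand out a mapping between nodes in two hash namespaces
(perhaps via the proposed "changesetdata" command), a .hgtags rewrite
might look like the sketch below.  rewrite_hgtags and the nodemap dict are
hypothetical names; the only fact relied on is that each .hgtags line is a
40-character hex node, a space, and a tag name.  The new-namespace node is
assumed to be 32 bytes (64 hex digits), one of the options mentioned above.

```python
import hashlib


def rewrite_hgtags(data, nodemap):
    """Rewrite "<hexnode> <tag>" lines using nodemap (old hex -> new hex).

    Lines whose node is not in the map are left untouched.
    """
    out = []
    for line in data.splitlines():
        # Split on the first space only, so tag names may contain spaces.
        hexnode, sep, tag = line.partition(" ")
        out.append(nodemap.get(hexnode, hexnode) + sep + tag)
    return "\n".join(out) + ("\n" if data.endswith("\n") else "")


# Made-up nodes for illustration only; real nodes hash parents + content.
old = hashlib.sha1(b"changeset 1").hexdigest()
new = hashlib.blake2b(b"changeset 1", digest_size=32).hexdigest()
nodemap = {old: new}

tags = old + " v1.0\n" + "f" * 40 + " unmapped\n"
rewritten = rewrite_hgtags(tags, nodemap)
```

Hashes in commit messages would be harder than this, since they are
free-form text and often abbreviated, so a lookup by full hex node would
not be enough there.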