Unbound size of discovery

Tue Jul 1 13:03:57 CDT 2014

On 7/1/14, 10:35 AM, Augie Fackler wrote:
> Pontifications inline. I'm mostly trying to braindump the thinking I've done about this in the past for others to think over.
>
> On Jun 30, 2014, at 7:20 PM, Gregory Szorc <gregory.szorc at gmail.com> wrote:
>
>> The size of the wire protocol payload for discovery requests and responses is proportional to the number of heads in the peer repositories. For esoteric repositories, such as Mozilla's Try repository which grows to over 10,000 heads before it is reset, we can see discovery response payloads grow to over 1 MB! We've also brushed up against default HTTP server limits. Mozilla has hit both HTTP header size and count limits due to x-hgarg-n headers during discovery. Fortunately, we operate our own servers, so we can increase the limits. But sometimes there is a load balancer or security device between your Mercurial server and your users (e.g. EC2 - although I'm not sure ELB imposes such limits).
>
> We have, in the past, considered a POST-based fallback for when the heads list gets *really* huge - we've avoided doing this so far because the protocol currently makes it really easy to configure ACLs: you allow reads via GET, writes via POST, and you're done.
>
> Another option that occurs to me is that we could have the client send the sha1(''.join(sorted(heads))) instead of the full heads list when it gets really big. Not sure if that'd have enough of the right properties, but I think it would?

That's an interesting idea, although it is susceptible to the same kind
of unbounded growth problem. However, I'd like to think that heads get
merged or obsoleted over time and there won't be unbounded growth in
practice (at least assuming an asymmetric client that doesn't maintain a
full clone of the remote).
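
To make the digest idea concrete, here is a minimal sketch in Python. It
assumes heads are handled as 20-byte binary node IDs; the function name
heads_digest and the example nodes are made up for illustration and are not
part of Mercurial's wire protocol.

```python
import hashlib


def heads_digest(heads):
    """Return a SHA-1 hex digest over the sorted head node IDs.

    `heads` is an iterable of binary node IDs (bytes). Sorting first makes
    the digest independent of the order in which heads are listed.
    """
    return hashlib.sha1(b"".join(sorted(heads))).hexdigest()


# Hypothetical example: the same three heads, listed in different orders.
local_heads = [bytes([i]) * 20 for i in (3, 1, 2)]
remote_heads = [bytes([i]) * 20 for i in (1, 2, 3)]

same = heads_digest(local_heads) == heads_digest(remote_heads)
print(same)
```

The request shrinks to a single fixed-size digest, but both sides still
have to enumerate and hash every head, and a mismatch says nothing about
which heads differ.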

>> This kind of unbounded growth is not good for scalability and performance. It may rule out Mercurial as a solution for you.
>>
>> One idea I had was to limit returned heads to only public changesets.
>
> That's an interesting idea. Perhaps the try servers could be configured to accept pushes of draft changes but not advertise them back out to clients?
>
> I can't imagine thousands of heads is an overly common use-case. Would it be better if we could build some kind of dedicated "try server mode" that would store pushes as overlays somehow, and then expose each try run under a unique URL for pulling back out? I've thought about something like that in the past for things like code review series in a code review server.

We have the following use cases for "giga-headed" repos:

1) Try server
2) Code review
3) Code collaboration (push all your feature heads to a central location 
to share)

These are all really the same thing modulo push and post-push side-effects.

Our developers insist on having an HTTP endpoint for referencing pushed 
changesets. hgweb should "just work."

In all our use cases, we care about clients pulling individual heads. It
is extremely rare for the entire repository to be cloned or pulled,
although I'd like to support full cloning to enable "offline" bulk
processing.

I've considered writing an extension that "forks" the incoming 
changegroup bundle and persists it to a file/bundle. You could maintain 
a mapping of head/changeset to bundle file and make a custom 
localrepository class that applies an overlay when necessary. I recall 
dismissing this idea after mpm somewhere (I can't remember where) said 
that Mercurial should scale to thousands of heads out of the box. (Or 
maybe that's just what I interpreted him as saying.)
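
For the record, the head/changeset-to-bundle mapping could look roughly like
the sketch below. Everything in it is hypothetical: the BundleIndex class,
the JSON index file, and the bundle paths are assumptions for illustration,
and it deliberately leaves out the part where a custom localrepository class
would apply the overlay.

```python
import json
import os
import tempfile


class BundleIndex:
    """Hypothetical map from changeset hex node to the bundle file holding it."""

    def __init__(self, path):
        self.path = path
        self._map = {}
        if os.path.exists(path):
            with open(path) as fh:
                self._map = json.load(fh)

    def record(self, nodes, bundle_path):
        """Remember that every node in `nodes` lives in `bundle_path`."""
        for node in nodes:
            self._map[node] = bundle_path
        # Write to a temporary file and rename so a crash mid-write
        # does not leave a truncated index behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump(self._map, fh)
        os.replace(tmp, self.path)

    def lookup(self, node):
        """Return the bundle path for `node`, or None if it is not in an overlay."""
        return self._map.get(node)


# Hypothetical usage: one push containing two changesets.
index_dir = tempfile.mkdtemp()
index = BundleIndex(os.path.join(index_dir, "bundle-index.json"))
index.record(["a" * 40, "b" * 40], "bundles/push-0001.hg")

# A fresh instance reads the persisted mapping back from disk.
reloaded = BundleIndex(index.path)
print(reloaded.lookup("a" * 40))
```

A lookup miss would mean the changeset is in the main store (or unknown),
so the overlay only needs to be consulted for pushed heads.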