Unbounded size of discovery

Augie Fackler raf at durin42.com
Tue Jul 1 12:35:34 CDT 2014


Pontifications inline. I'm mostly trying to braindump the thinking I've done about this in the past for others to think over.

On Jun 30, 2014, at 7:20 PM, Gregory Szorc <gregory.szorc at gmail.com> wrote:

> The size of the wire protocol payload for discovery requests and responses is proportional to the number of heads in the peer repositories. For esoteric repositories, such as Mozilla's Try repository, which grows to over 10,000 heads before it is reset, we can see discovery response payloads grow to over 1 MB! We've also brushed up against default HTTP server limits: Mozilla has hit both HTTP header size and count limits due to x-hgarg-n headers during discovery. Fortunately, we operate our own servers, so we can raise those limits. But sometimes there is a load balancer or security device between your Mercurial server and your users (e.g. EC2, although I'm not sure ELB imposes such limits).
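
For context on why the headers blow up: the HTTP client urlencodes the
command arguments and chunks them into numbered x-hgarg-N headers sized
to fit the server's advertised header limit. A simplified sketch of the
scheme (not the actual httppeer code, and the helper name is made up):

    import urllib.parse

    def encodeargheaders(args, headersize=1024):
        """Sketch of how wire command arguments become x-hgarg-N
        headers. args maps argument names to encoded values, e.g.
        {'heads': '<40-hex-node> <40-hex-node> ...'}."""
        headers = {}
        encoded = urllib.parse.urlencode(sorted(args.items()))
        # Leave room for the header name and separators on each line.
        budget = headersize - len('x-hgarg-999: \r\n')
        for i in range(0, len(encoded), budget):
            headers['x-hgarg-%d' % (i // budget + 1)] = encoded[i:i + budget]
        return headers

At 10,000 heads that's roughly 41 bytes of encoded argument per head, so
~400 KB spread across hundreds of headers at a 1024-byte budget. No
wonder intermediaries object.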

We have, in the past, considered a POST-based fallback for when the heads list gets *really* huge. We've avoided doing it so far because the current protocol makes it trivial to configure ACLs: you allow reads via GET, writes via POST, and you're done.
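
To make the ACL point concrete: the write check can live in a dumb layer
in front of hgweb that only looks at the request method, whereas a
POST-based discovery fallback would force that layer to actually parse
commands. A hypothetical WSGI sketch (names made up):

    def methodacl(app, allowwrite):
        """Hypothetical middleware for the classic split: anyone may
        GET (read), only authorized users may POST (write)."""
        def wrapped(environ, start_response):
            if environ['REQUEST_METHOD'] == 'POST' and not allowwrite(environ):
                start_response('403 Forbidden',
                               [('Content-Type', 'text/plain')])
                return [b'push access denied\n']
            return app(environ, start_response)
        return wrapped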

Another option that occurs to me: the client could send sha1(''.join(sorted(heads))) instead of the full heads list once it gets really big. I'm not sure that has all of the right properties, but I think it does?
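
Concretely, something like this (hypothetical, there's no such wire
command today):

    import hashlib

    def headsdigest(heads):
        """Digest of a heads list; sorting makes it order-insensitive.
        heads: iterable of 20-byte binary node ids (both sides just
        have to agree on the representation)."""
        return hashlib.sha1(b''.join(sorted(heads))).hexdigest()

    # If the client's digest matches the digest of the server's own
    # heads, both sides have identical heads and discovery finishes in
    # one tiny round trip; otherwise fall back to the full exchange.
    def headsmatch(localheads, remotedigest):
        return headsdigest(localheads) == remotedigest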

> This kind of unbounded growth is not good for scalability and performance. It may rule out Mercurial as a solution for you.
> 
> One idea I had was to limit returned heads to only public changesets.

That's an interesting idea. Perhaps the try servers could be configured to accept pushes of draft changesets but not advertise those heads back out to clients?
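
Roughly, an extension could swap out the 'heads' wire command, something
like this sketch (untested, and wireproto is an internal API that moves
around, so treat it as pseudocode):

    # publicheads.py: only advertise heads of public changesets, so
    # draft try heads stay invisible to discovery.
    from mercurial import wireproto

    def _publicheads(repo, proto):
        nodes = [repo[r].node() for r in repo.revs('heads(public())')]
        return wireproto.encodelist(nodes) + '\n'

    def extsetup(ui):
        # Replace the stock 'heads' command; the empty string is its
        # (empty) argument spec.
        wireproto.commands['heads'] = (_publicheads, '')

A fancier version could read the revset from a config knob instead of
hardcoding heads(public()), which gets at your config-defined revset
idea below.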

I can't imagine thousands of heads is an especially common use case. Would it be better if we could build some kind of dedicated "try server mode" that stores pushes as overlays somehow, and then exposes each try run under a unique URL for pulling back out? I've thought about something like that in the past for things like code review series in a code review server.

> Another is to allow servers to execute a config-defined revset as part of calculating the returned heads. Either approach would likely result in clients sending some redundant changeset data to the remote. But for certain scenarios (such as Mozilla's Try, where nearly every head stems from a public changeset), the redundancy should be negligible.
> 
> I've also had other, crazier ideas, such as having the client skip heads entirely and go straight to querying for the existence of ancestors of the pushed changeset(s).
> 
> Perhaps these modes of operation could be influenced by a capability: e.g. if a remote advertises its heads count, the client can decide whether classical full-heads-based discovery is appropriate.
> 
> Before I get too far down the rabbit hole, I was curious what solutions have been considered/attempted for dealing with this "discovery bloat."

I don't know that much has been done in this area in particular.
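
If someone wanted to experiment with the capability idea, the
client-side decision might look something like this ('headcount' is an
invented capability; nothing advertises it today):

    def pickdiscovery(peer, threshold=1000):
        """Pick a discovery strategy from a hypothetical 'headcount=N'
        capability, falling back to classic full-heads discovery when
        the remote doesn't advertise one."""
        for cap in peer.capabilities():
            if cap.startswith('headcount='):
                if int(cap.split('=', 1)[1]) > threshold:
                    # digest comparison, ancestor queries, sampling, ...
                    return 'sampling'
                break
        return 'fullheads'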
