largefiles: still confused about store vs cache on the client

Benjamin Pollack benjamin at bitquabit.com
Mon Oct 24 15:18:23 CDT 2011


On 10/24/2011 6:15 AM, Na'Tosha Bard wrote:
> 2011/10/23 Greg Ward <greg at gerg.ca>
>
>     Hi all --
>
>     The ensuing thread got us somewhere, and I think the patches sent by
>     Benjamin as a result helped. But I'm still confused about a rather
>     fundamental point: on the client, why do we need *both* a user cache
>     (currently ~/.cache/largefiles) *and* a local store (.hg/largefiles)?
>
>     The server-side is fairly clear: we must have a complete and canonical
>     store containing every revision of every large file in history. That
>     is what .hg/largefiles is for *on the server* (right?). And there is
>     no need for a cache on the server, because no one has a working dir on
>     the server. (And if they did, I suppose you could just take large file
>     revs straight from the store.)
>

There's actually still strong value in having a server cache: it
comes when you and I and Na'Tosha all have our own repositories on a 
server, which are not even necessarily related, that share a couple of 
gigabytes of largefiles.  (Note that this doesn't have to be very 
contrived; we may be working on three different video games in a 
franchise that end up sharing almost nothing from a code standpoint, but
do share many of the same assets.)  In this case, having the server 
configured to use a global cache can dramatically cut down disk space 
usage: the largefiles would never be hardlinked automatically amongst 
the repositories' stores, but can trivially be hardlinked from the server
cache, if available.  The benefits in an environment like Kiln On Demand 
are even stronger: with the server cache, two thousand people on two 
thousand different accounts can all decide that they just have to upload 
the entire Ubuntu 11.10 ISO, but we're only out 650 MB of disk space.
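
To make the hardlinking idea concrete, here is a minimal sketch. It is
not the actual largefiles code: the function names and the directory
layout (one file per SHA-1 digest, directly under each directory) are
assumptions for illustration only. It shows unrelated repository stores
sharing one copy of the bytes by hardlinking from a common cache, and
falling back to a plain copy where hardlinks are not possible.

```python
import hashlib
import os
import shutil


def sha1_of(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, read in chunks to bound memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def add_to_store(src, store_dir, cache_dir):
    """Hypothetical helper: place src in store_dir via the shared cache.

    The file is copied into cache_dir once (keyed by its digest), and
    each repository store then gets a hardlink to that cached copy.
    """
    digest = sha1_of(src)
    os.makedirs(store_dir, exist_ok=True)
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, digest)
    stored = os.path.join(store_dir, digest)
    if not os.path.exists(cached):
        shutil.copyfile(src, cached)
    if not os.path.exists(stored):
        try:
            os.link(cached, stored)  # no extra disk space used
        except OSError:
            # Different filesystem, or hardlinks unsupported: copy instead.
            shutil.copyfile(cached, stored)
    return digest
```

Calling add_to_store() for the same asset from two unrelated
repositories yields the same digest both times, so the second call
finds the cached copy and only creates another link to it.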

>     More importantly, the very meaning of .hg/largefiles appears to be
>     inconsistent from reading hgext/largefiles/design.txt: on the server,
>     it contains every revision of every largefile ("complete and
>     canonical"). But on the client, it's just a subset of that. So ...
>     it's ... like ... a cache. Except it's not called a cache; that's what
>     ~/.cache/largefiles is. Huh?
>     [snip]
>     Why not .hg/lfoutgoing?
>

It's fuzzy.  At least one repository in largefiles, somewhere, should 
contain all largefiles.  If that's your copy, then .hg/largefiles is the 
store.  If it's not, it's the repository's cache.  I guess there's an 
ideological purity argument to be made for using two different 
directories (which I guess magically get switched in if you either "hg 
clone --all" or happen to eventually get all the largefiles), but I'm 
personally okay with the current setup.

> I think the fundamental thing you are missing here is that it is quite 
> possible for a user to have multiple clones that share the same set of 
> largefiles.  If there is a team that uses branch-by-cloning, this is 
> almost *certainly* the case.  Our team does, and I'm sure there are 
> still others -- which will continue to be the case until either 
> feature-branching-by-named-branches is no longer discouraged or 
> bookmarks are actually supported in the real world (which means by 
> hosting solutions, continuous integration solutions, etc).
>
> By storing a copy of all of the largefiles in a local cache somewhere, 
> the user, when they make a new branch clone, or update to a revision 
> that needs one of the largefiles that is used by another clone, they 
> can simply copy it out of the cache, rather than re-download it, thus 
> saving bandwidth (which is one of the goals of this extension anyway).

Indeed, this is exactly the purpose.  I can't speak for Na'Tosha, but 
one of our repositories (in fact, the earliest one we began dogfooding 
kbfiles on) is a 100 MB Mercurial repository with about four gigabytes 
of largefiles.  Yes, we can handle a user re-downloading four gigabytes 
from our server...but why bother if they don't have to?  For the common 
case, the local cache means that re-cloning that repo takes about ten 
seconds instead of fifteen minutes.  That's a huge win for our designers.
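
The lookup order that makes this work can be sketched roughly as
follows. This is an illustration under stated assumptions, not the
extension's real implementation: fetch_largefile() and the download
callable are hypothetical names, and files are assumed to be stored one
per digest. The point is only that the network is the last resort,
after the repository's own .hg/largefiles and the per-user cache.

```python
import os
import shutil


def fetch_largefile(digest, repo_store, user_cache, download):
    """Hypothetical helper: make digest available in repo_store.

    `download(digest, dest_path)` is an assumed callable that fetches
    the file from the server and writes it to dest_path.
    Returns where the file came from: "store", "cache" or "download".
    """
    stored = os.path.join(repo_store, digest)
    if os.path.exists(stored):
        return "store"  # this clone already has it

    os.makedirs(repo_store, exist_ok=True)
    cached = os.path.join(user_cache, digest)
    if os.path.exists(cached):
        source = "cache"  # another clone fetched it earlier
    else:
        os.makedirs(user_cache, exist_ok=True)
        download(digest, cached)  # only now touch the network
        source = "download"

    try:
        os.link(cached, stored)
    except OSError:
        # Hardlink not possible (e.g. different filesystem): copy.
        shutil.copyfile(cached, stored)
    return source
```

With this ordering, the first clone pays for the download and every
later clone on the same machine is served from the user cache.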

--Benjamin

