largefiles: still confused about store vs cache on the client
Benjamin Pollack
benjamin at bitquabit.com
Mon Oct 24 15:18:23 CDT 2011
On 10/24/2011 6:15 AM, Na'Tosha Bard wrote:
> 2011/10/23 Greg Ward <greg at gerg.ca>
>
> Hi all --
>
> The ensuing thread got us somewhere, and I think the patches sent by
> Benjamin as a result helped. But I'm still confused about a rather
> fundamental point: on the client, why do we need *both* a user cache
> (currently ~/.cache/largefiles) *and* a local store (.hg/largefiles)?
>
> The server-side is fairly clear: we must have a complete and canonical
> store containing every revision of every large file in history. That
> is what .hg/largefiles is for *on the server* (right?). And there is
> no need for a cache on the server, because no one has a working dir on
> the server. (And if they did, I suppose you could just take large file
> revs straight from the store.)
>
There's actually still strong value in having a server cache: it
comes when you and I and Na'Tosha all have our own repositories on a
server, which are not even necessarily related, that share a couple of
gigabytes of largefiles. (Note that this doesn't have to be very
contrived; we may be working on three different video games in a
franchise that end up sharing almost nothing from a code standpoint, but
do share many of the same assets.) In this case, having the server
configured to use a global cache can dramatically cut down disk space
usage: the largefiles would never be hardlinked automatically amongst
the repositories' stores, but can trivially be hardlinked from the server
cache, if available. The benefits in an environment like Kiln On Demand
are even stronger: with the server cache, two thousand people on two
thousand different accounts can all decide that they just have to upload
the entire Ubuntu 11.10 ISO, but we're only out 650 MB of disk space.
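
To make the hardlinking idea concrete, here is a rough sketch of a content-addressed cache shared by several repository stores. It is only an illustration under assumptions: the function names and directory layout are hypothetical and are not the largefiles extension's actual code.

```python
import hashlib
import os
import shutil


def sha1_of(path):
    """Hash a file's contents in chunks so large files need not fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def add_to_store(src, store_dir, cache_dir):
    """Place src in store_dir under its hash, sharing the bytes via cache_dir.

    Hypothetical helper: the first caller copies the file into the shared
    cache; every store then gets a hardlink to that single cached copy.
    """
    digest = sha1_of(src)
    os.makedirs(store_dir, exist_ok=True)
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, digest)
    stored = os.path.join(store_dir, digest)
    if not os.path.exists(cached):
        shutil.copyfile(src, cached)
    if not os.path.exists(stored):
        try:
            os.link(cached, stored)  # same inode, no extra disk space
        except OSError:
            # Fall back to a plain copy (e.g. cache on another filesystem).
            shutil.copyfile(cached, stored)
    return stored
```

With this layout, two unrelated repositories that add the same file end up with store entries pointing at one copy on disk, as long as the cache and the stores live on the same filesystem.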
> More importantly, the very meaning of .hg/largefiles appears to be
> inconsistent from reading hgext/largefiles/design.txt: on the server,
> it contains every revision of every largefile ("complete and
> canonical"). But on the client, it's just a subset of that. So ...
> it's ... like ... a cache. Except it's not called a cache; that's what
> ~/.cache/largefiles is. Huh?
> [snip]
> Why not .hg/lfoutgoing?
>
It's fuzzy. At least one repository in largefiles, somewhere, should
contain all largefiles. If that's your copy, then .hg/largefiles is the
store. If it's not, it's the repository's cache. I guess there's an
ideological purity argument to be made for using two different
directories (which I guess magically get switched in if you either "hg
clone --all" or happen to eventually get all the largefiles), but I'm
personally okay with the current setup.
> I think the fundamental thing you are missing here is that it is quite
> possible for a user to have multiple clones that share the same set of
> largefiles. If there is a team that uses branch-by-cloning, this is
> almost *certainly* the case. Our team does, and I'm sure there are
> still others -- which will continue to be the case until either
> feature-branching-by-named-branches is no longer discouraged or
> bookmarks are actually supported in the real world (which means by
> hosting solutions, continuous integration solutions, etc).
>
> By storing a copy of all of the largefiles in a local cache somewhere,
> the user, when they make a new branch clone, or update to a revision
> that needs one of the largefiles that is used by another clone, they
> can simply copy it out of the cache, rather than re-download it, thus
> saving bandwidth (which is one of the goals of this extension anyway).
Indeed, this is exactly the purpose. I can't speak for Na'Tosha, but
one of our repositories (in fact, the earliest one we began dogfooding
kbfiles on) is a 100 MB Mercurial repository with about four gigabytes
of largefiles. Yes, we can handle a user re-downloading four gigabytes
from our server...but why bother if they don't have to? For the common
case, the local cache means that re-cloning that repo takes about ten
seconds instead of fifteen minutes. That's a huge win for our designers.
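
As a rough sketch of why the second clone is cheap, the lookup can be thought of as "repository store first, then user cache, then the network". The names below (`find_largefile`, the `download` callback) are hypothetical and chosen for illustration; this is an assumption about the general shape, not the extension's real implementation.

```python
import os
import shutil


def find_largefile(digest, store_dir, cache_dir, download):
    """Return (path, source) for the largefile with the given hash.

    Hypothetical lookup order:
      1. the repository's own store (e.g. .hg/largefiles)
      2. the per-user cache shared by all of the user's clones
      3. the server, via the caller-supplied download(digest, dest_path)
    """
    stored = os.path.join(store_dir, digest)
    if os.path.exists(stored):
        return stored, "store"

    os.makedirs(store_dir, exist_ok=True)
    cached = os.path.join(cache_dir, digest)
    if os.path.exists(cached):
        # Another clone already fetched this file: no network traffic.
        shutil.copyfile(cached, stored)
        return stored, "cache"

    # Not available locally: fetch once and remember it for other clones.
    os.makedirs(cache_dir, exist_ok=True)
    download(digest, cached)
    shutil.copyfile(cached, stored)
    return stored, "download"
```

Only the first clone pays for the download; later clones that need the same revision of the file are served from the user cache.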
--Benjamin