largefiles: still confused about store vs cache on the client

Greg Ward greg at gerg.ca
Mon Oct 24 19:55:16 CDT 2011


On Mon, Oct 24, 2011 at 6:15 AM, Na'Tosha Bard <natosha at unity3d.com> wrote:
>> But back on the client, where I pull and push and update and commit,
>> what purpose does .hg/largefiles serve? Having a local cache is
>> obviously a good thing, although it's not essential. (I never got
>> around to implementing caching with bfiles, and we've lived without
>> it. It wastes bandwidth and increases network uptime requirements, but
>> our LAN at work is fast and reliable. And our biggest bfile is ~30 MB:
>> peanuts by game developer standards.)
>>
>> More importantly, the very meaning of .hg/largefiles appears to be
>> inconsistent from reading hgext/largefiles/design.txt: on the server,
>> it contains every revision of every largefile ("complete and
>> canonical"). But on the client, it's just a subset of that. So ...
>> it's ... like ... a cache. Except it's not called a cache; that's what
>> ~/.cache/largefiles is. Huh?
[...]
>
> I think the fundamental thing you are missing here is that it is quite
> possible for a user to have multiple clones that share the same set of
> largefiles.  If there is a team that uses branch-by-cloning, this is almost
> *certainly* the case.  Our team does, and I'm sure there are still others --
> which will continue to be the case until either
> feature-branching-by-named-branches is no longer discouraged or bookmarks
> are actually supported in the real world (which means by hosting solutions,
> continuous integration solutions, etc).
>
> By storing a copy of all of the largefiles in a local cache somewhere, a
> user who makes a new branch clone, or updates to a revision that needs a
> largefile already used by another clone, can simply copy it out of the
> cache rather than re-downloading it, thus saving bandwidth (which is one
> of the goals of this extension anyway).

Huh? I never questioned the utility of a local cache on the
user's machine. I think it's a great idea. And Benjamin makes a
pretty good case for a system-wide cache on the server.

What I am questioning is why we have a cache in
~/.cache/largefiles *and* a <something> in .hg/largefiles. I know
what a cache is: you trade in local disk space (cheap) and get
back time and bandwidth (expensive): good deal. You can nuke
~/.cache/largefiles if local space is tight. You don't have to
back up ~/.cache. Etc.

But I don't entirely understand what the <something> in
.hg/largefiles is. Lemme quote Benjamin before continuing:

> It's fuzzy. At least one repository using largefiles, somewhere,
> should contain all largefiles. If that's your copy, then
> .hg/largefiles is the store. If it's not, it's the repository's
> cache. I guess there's an ideological purity argument to be
> made for using two different directories (which I guess
> magically get switched in if you either "hg clone --all" or
> happen to eventually get all the largefiles), but I'm
> personally okay with the current setup.

Ahhh, OK, I think I'm starting to get it: you've gone and
decentralized the "central store" idea from bfiles. Nice! I
think.

Core tenet of DVCS: "all repositories are equal, but some are more
equal than others". With bfiles, there was one central store from
which everyone downloaded, but largefiles now says that "all stores
are equal, but some are more equal than others". Right?

The obvious downside is that someone has to make damn sure that
there is at least one complete and canonical store, *or* has to
explicitly decide "we don't care about revs from waaay back
then".

As for caching on the server:

> There's actually still strong value in having a server cache: it
> comes when you and I and Na'Tosha all have our own repositories on a
> server, which are not even necessarily related, that share a couple
> of gigabytes of largefiles. (Note that this doesn't have to be very
> contrived; we may be working on three different video games in a
> franchise that end up sharing almost nothing from a code standpoint,
> but do share many of the same assets.) In this case, having the
> server configured to use a global cache can dramatically cut down
> disk space usage: the largefiles would never be hardlinked
> automatically amongst the repositories stores, but can trivially be
> hardlinked from the server cache, if available. The benefits in an
> environment like Kiln On Demand are even stronger: with the server
> cache, two thousand people on two thousand different accounts can
> all decide that they just have to upload the entire Ubuntu 11.10
> ISO, but we're only out 650 MB of disk space.

Yup, that makes a lot of sense for a hosting service. (Aside: do you
also save the 1,999 x 650 MB of bandwidth from all those redundant
uploads?)
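For what it's worth, the server-side mechanism I imagine is
something like this (again, invented names, just a sketch):

    import os

    def store_largefile(repo_store, server_cache, sha, receive_upload):
        dst = os.path.join(repo_store, sha)
        cached = os.path.join(server_cache, sha)
        if os.path.exists(cached):
            # Some other repo already has this blob: hardlink it
            # from the cache and spend zero new disk space.
            os.link(cached, dst)
            return
        receive_upload(sha, dst)   # actually spend disk and bandwidth
        os.link(dst, cached)       # seed the server-wide cache

And if the protocol lets the client ask "do you already have
<sha>?" before pushing the bytes, the redundant uploads in my
aside would be saved by the same check.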

*But* the downside here is that by sticking largefiles from unrelated
repos into one big cache, it's hard to purge the largefiles for a repo
you no longer care about. It would be nice if we had the option of
splitting up the largefile cache by repo -- and I don't mean "clone as
branch" repo, I mean distinct projects with unrelated repos.

Greg

