A thought on subrepos

Tue Apr 19 21:39:20 CDT 2011

On 04/14/11 at 03:00P, Matt Mackall wrote:
> It seems many projects with subrepos are structured like:
> 
> app/ <- main repo
>  lib/ <- a subrepo
> 
> This is perhaps the most obvious way to do things, but is not really
> ideal. A better way is:
> 
> build/ <- main repo
>  app/ <- subrepo
>  lib/ <- subrepo
> 
> For starters, this does away with most of the "I didn't mean to
> recursively commit" issues as commits at the top-level will be much less
> common.

This does do away with most of those issues.  I still don't like the recursive commit
behavior, because even in this case committing in the main repository and committing
in the subrepo are semantically two different things:

* Sub repo: Add API method Foo.
* Main repo: Take advantage of the new Foo API method.

I can't think of a case where one commit message is perfect for both commits.  Maybe
there is one, but I think in *most* cases it makes more sense to have two separate
commits with separate messages.

I'm not going to fight the backwards-compatibility battle of changing the behavior of
'hg commit' though.  That's how it's worked for a while now, so I'll deal with it.

> This also greatly lowers the degree of dependence between app/ and lib/,
> but still gives you the ability to commit and tag coherent combinations
> of app and lib.
> 
> A general statement of this approach is: "if a repo contains real code,
> it shouldn't contain subrepos."

You're right, at the pure-Mercurial, philosophical level.  But from what I've found,
subrepos aren't perfect (or usable enough for new Mercurial users) in practice yet.

Here are the problems I've encountered myself, or heard from other people:

First, no one works like this.  Or at least: very few people do this.  People hear
"SUBrepos" and immediately jump to the first structure you mentioned: a SUBrepo as
a SUBdirectory of their project.

If this is the "ideal" subrepo structure it needs to be documented on the subrepo
wiki page, and preferably in any tutorial/guide/help-file/whatever that mentions
subrepos.  Or at least the most popular ones.

This documentation should explain why this way of working is better, and why the
other way can't possibly work.

It has to do this because this way is simply more work for repository maintainers.
You have another repo to keep track of, make public, and commit in.  You also need to
explain what the different repos are (and how to use them) to your users.

Not only is there the extra work of maintaining another repository, but the directory
structure is now dictated by your workflow and not what's most natural for your
project.  Maybe your particular environment prefers library Foo to be at /project/foo
instead of /project/../foo.  Now you've got to work around this in your build
scripts (or whatever).

The other main problem, and the main reason I don't use subrepos much myself, is that
subrepos still feel fragile.  Especially when using non-Mercurial subrepos, it's not
too difficult to get into a state where 'hg update --clean REV' aborts with an error.

To me, 'hg update --clean REV' means "go to REV, dammit, I don't care what you throw
away to get there." If there's another command I should be using to say "just go to
this rev, working directory and subrepos be damned" then it's not obvious.

I'd write some test cases for this if I had the free time, but right now I don't.
I just know that I've encountered this in the very few times I've used subrepos (my
dotfiles repo, mainly).

Another problem is that switching subrepo paths is a manual operation at this point in
time.  Here's a real-life example:

* I add syntastic.vim as a subrepo to my dotfiles repository.
* Everything's fantastic for a month.
* I find a bug in Syntastic, fix it, fork/push it on/to BitBucket/GitHub/whatever.
* I commit in my dotfiles repo, which records the new revision hash of Syntastic.
* Now I need to update the subrepo path by hand in all checkouts of my dotfiles,
  otherwise 'hg update' aborts with an error about not being able to find the
  appropriate revision.

Updating the path in .hgsub works for new checkouts, but not for already existing
ones.
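
For Mercurial subrepos, the by-hand fix in an existing checkout could be scripted
roughly like this.  It is only a minimal sketch, assuming the stale location is the
[paths] default entry in the subrepo's own .hg/hgrc.  repoint_hg_subrepo is
a hypothetical helper, not part of Mercurial, and rewriting the file with configparser
drops any comments it contained.  Git subrepos would need a different approach.

```python
import configparser
from pathlib import Path


def repoint_hg_subrepo(subrepo_root, new_source):
    """Set [paths] default in <subrepo_root>/.hg/hgrc and return the old value.

    Assumption: the checkout remembers its source only through this entry.
    """
    hgrc = Path(subrepo_root) / ".hg" / "hgrc"

    parser = configparser.RawConfigParser()
    parser.optionxform = str  # keep option names exactly as written
    parser.read(hgrc)  # a missing file is silently treated as empty

    if not parser.has_section("paths"):
        parser.add_section("paths")
    old_source = parser.get("paths", "default", fallback=None)
    parser.set("paths", "default", new_source)

    # Note: this rewrites the whole file, so comments in hgrc are lost.
    with open(hgrc, "w") as fh:
        parser.write(fh)
    return old_source
```

Running something like repoint_hg_subrepo("vim/bundle/syntastic", NEW_URL) once per
checkout would replace the manual edit; both arguments here are placeholders.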

For purely "library" subrepos the path changes could be automated, but for subrepos
where you're working in both the subrepo and the main repo at the same time automated
changes could get annoying.

The last thing that bugs me about subrepos is that they turn 'hg update' into
a non-local operation.  I seem to recall us talking about directory caching on IRC --
is this something that's still of interest?

For DVCS subrepos I can see a solution through caching:

* I pull some changesets.
* Mercurial looks at .hgsub in each changeset I just pulled.
* If there's a new remote path for a subrepo, run
  'hg/git init .hg/cache/subrepos/[encoded subrepo path]/[hash of remote subrepo target url]'
* Mercurial runs 'hg/git pull/fetch -R [cached subrepo path] REV' for each
  REV+remote-subrepo-target-path in .hgsub in any pulled rev.

That would cache any needed subrepo revisions locally, so 'hg update' could become
purely local again.
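
The cache layout described above can be sketched as follows.  This is an illustration
under stated assumptions, not how Mercurial does anything today: parse_hgsub and
cache_dir are hypothetical helpers, URL-quoting is just one possible choice for the
"encoded subrepo path", SHA-1 is one possible choice for the hash, and the parser
only handles plain "path = source" lines with an optional [git] or [svn] prefix.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import quote

CACHE_ROOT = PurePosixPath(".hg/cache/subrepos")  # location proposed in the text


def parse_hgsub(text):
    """Parse .hgsub content into {subrepo path: (kind, source)}."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        path, sep, source = line.partition("=")
        if not sep:
            continue  # not a "path = source" line
        path, source = path.strip(), source.strip()
        kind = "hg"
        if source.startswith("["):
            # e.g. "[git]git://host/repo.git" or "[svn]http://host/repo"
            kind, _, source = source[1:].partition("]")
        entries[path] = (kind, source)
    return entries


def cache_dir(subrepo_path, source):
    """Return [cache root]/[encoded subrepo path]/[hash of remote source]."""
    encoded = quote(subrepo_path, safe="")  # "/" becomes "%2F"
    digest = hashlib.sha1(source.encode("utf-8")).hexdigest()
    return str(CACHE_ROOT / encoded / digest)


def cacheable(kind):
    """Only DVCS subrepos fit this scheme; Subversion is left out."""
    return kind in ("hg", "git")
```

Hashing the remote URL means that a changed source in .hgsub gets a fresh cache
directory instead of colliding with the old one.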

That sounds nice, but everything goes to hell once you add in Subversion subrepos.
I'm not sure how to handle those.  Even so, I still think it would be worth it to
implement git/hg subrepo caching because it would help in a large number of cases.

TL;DR version of this longer-than-intended email: Matt's right about this workflow,
but if we want to get people to work this way we need to document it thoroughly and
fix some of the pain points about working with subrepos.

-- 
Steve Losh