[PATCH 1 of 3] pull: add --subrepos flag

Angel Ezquerra angel.ezquerra at gmail.com
Fri Feb 22 00:28:44 CST 2013

On Thu, Feb 21, 2013 at 5:55 AM, Matt Harbison <matt_harbison at yahoo.com> wrote:
> Angel Ezquerra wrote:
>> On Wed, Feb 20, 2013 at 6:57 AM, Matt
>> Harbison<matt_harbison at yahoo.com>  wrote:
>>> On Sun, 17 Feb 2013 13:19:16 +0100, Angel Ezquerra wrote:
>>>> # HG changeset patch # User Angel
>>>> Ezquerra<angel.ezquerra at gmail.com> # Date 1360519226 -3600 # Node
>>>> ID abbd26cca35280fb8f784b3f2c02eef71696c47b # Parent
>>>> 55b9b294b7544a6a144f627f71f4b770907d5a98 pull: add --subrepos
>>>> flag
>>>> The purpose of this new flag is to ensure that you are able to
>>>> update to any incoming revision without requiring any network
>>>> access. The idea is to make sure that the repository is
>>>> self-contained after doing hg pull --subrepos, as long as it
>>>> already was self-contained before the pull).
>>>> When the --subrepos flag is enabled, pull will also pull (or
>>>> clone) all subrepos that are present on the current revision and
>>>> those that are referenced by any of the incoming revisions.
>>> I haven't gotten a chance to really play with this yet, so I'm
>>> going more off the comments here- I apologize if these answers
>>> should be obvious, but I'm not familiar enough with some of the
>>> code.
>>> - Is there an easy way to tell if the repo is/was self contained?
>>> (Maybe incoming -S?)
>> No there is not. I don't think incoming -S would do the trick since
>> that would just tell you if there are _new_ incoming revisions on
>> some of the _current_ subrepos. A repo is "self-contained" if it is
>> possible to update to any of its revisions withing requiring a pull
>> of one or more of its subrepos.
>> I don't know of any existing mercurial command that would be able to
>> give you that information.
>>> - Is the 'self-contained' bit to limit overhead on each pull, or is
>>> there another reason this can't ensure the result is self
>>> contained?  'Push' and 'outgoing -S' recognize (almost) everything
>>> going in the other direction, so it might be nice to have the same
>>> capability with a form of pull.  (I may have found a push bug that
>>> I haven't gotten back to yet.)
>> I'm not sure I understand what you mean.
> Consider this (contrived) case:
>   1) push a repo and subrepo to remote (remote is now self contained)
>   2) strip or rollback the remote subrepo
>   3) a top level local repo push will repopulate the remote subrepo
> Now reverse it:
>   1) pull a repo and subrepo from remote (local is self contained)
>   2) strip or rollback the local subrepo
>   3) nothing is incoming top level, so the subrepo isn't repopulated (if
> you've updated to a working dir without that subrepo).
> I'm not sure if there's a less contrived case, or if this matters too
> much.  I guess I was just wondering aloud about the symmetry between
> push and pull -S (this is certainly much better than it was).

I don't think this is a very big deal because "pull --subrepos" also
performs a pull on all the repositories that are found on the current
revision. I think this would fix your contrived case.

The only corner case that would not be properly handled by this would
be the case in which pull --subrepos cloned a _new_ subrepo that is
not found on the current revision (i.e. that is introduced on one of
the cloned revisions) and then you striped some revisions from that
"future subrepo". In that case would would still be able to get the
missing revisions as you would do today, i.e. by updating to the
offending revision. You could also use the onsub extension (which is
shipped with TortoiseHg).

BTW, the more I use the onsub extension the more I am convinced that
it should be distributed with mercurial, and even that the onsub
command itself should become a built-in command.

That being said, if we did want to also handle this very small corner
case there are at least a couple of possible solutions:

1. make the --subrepos flag take a revset that would tell it which
revisions to look for subrepos to pull.
2. make --subrepos also look for subrepos on the descendants of the
current revision.

#1 would be nice _if_ we could also skip the revset to get the
behavior that the current patch proposes. However I don't think that
is possible to have a flag that can optionally take a parameter.

#2... maybe that is too much overhead for such a small corner case?

Additionally, I wonder if it would make sense to change the current
pull behavior a little, by checking if any of the subrepos that are
present on the _current_ parent revision point to a revision that does
not exist, and if that is the case fulling from them _up to that
revision_? I think the parent pull time is an excellent time to do
this kind of stuff since you already know that you have an internet

> I realize you can't do such a thing without crawling most of the
> history.  Are there large public repositories that use subrepos?  I'm
> wondering what the performance hit would be.  (It's easy for me to think
> something is a good idea when I only have small repos and wouldn't
> notice the hit.)

We have repositories with thousands of files and more than 20
subrepos. I don't know if you would consider that big...

> FWIW, largefiles works the same way- if you don't clone or pull with
> --all-largefiles, there's no single command to go back and get the files
> for all revisions that are not incoming.  That leaves the user wondering
> if they really can disconnect from the central repo.

That is true. It would be nice to be able to do that. However there is
a difference between largefiles and subrepos. Largefiles tries to only
get the files that you need, when you need them _by design_. Subrepos
is not meant (IMHO) to do that (or at least it is not necessary for
subrepos to fulfill its main purpose, which is to track "submodules").

>> I don't think you (we?) must give too much importance to this
>> "self-contained" concept. It is just a way for me to explain the
>> purpose of the patch, and specially to explain why we must look for
>> subrepos on all the new incoming revisions, and why we cannot just
>> limit ourselves to pulling the subrepos on the current revisions
>> (short answer: because new subrepos may appear on the new, incoming
>> revisions).
>> My patch explicitly says that hg pull -S will only make your subrepo
>> self-contained if it was already self-contained before. This is in
>> order to avoid having to look for subrepos on all the repo history,
>> rather than just looking for subrepos on the incoming revision (and
>> the current one).
>>> - The full subrepo gets pulled, even revs not committed to the
>>> parent?  I think that's a good thing, because regularly get burned
>>> when I 'pull -u' the tree to another machine and then go to apply
>>> the rest of a patch queue to the subrepo.
>> Yes. It is perhaps not optimal but I think it is simpler. In
>> addition if different parent repo revisions point to different
>> revisions on a subrepo there is no way for us to tell which of those
>> subrepo revisions is the one that is closes to tip, or which ones
>> are ancestors of the other ones, etc. As a result we would need to
>> perform as many pulls on a given repo as the number of different
>> revisions of that subrepo that were referenced on the parent repo.
>> That is complex and slow, so it is much simpler and possibly faster
>> (in some cases at least) to just pull all revisions from each
>> subrepo.
> OK, I misread the code- I thought each subrepo was getting a pull at
> each revision, which I figured would be slow.  I attached a test patch
> below- there's nothing special about it, but it helped me with my pull
> and outgoing changes (some comments probably still reflect this), so I
> changed that to pull and incoming to test your patch.
> - I think I see a double pull of a subrepo (search for "hg pull -S -r 3").

You are right. There is a problem with the current version of the
patch. It will clone the same repo multiple times if the same subrepo
refers to different revisions. I will resend with a fix.

BTW, where do you think the tests for this new feature should go?
Should it go into one of the test-pull*.t files or in one of the
test-subrepo*.t files? Do you think the test that you propose in that
patch would be enough to test this new feature? Maybe it could be made
a bit simpler?

> - I wonder if the code in this patch can be leveraged to make incoming
> print all of the stuff 'pull -S' will grab in the future.

In theory that would be certainly possible. At least we could run the
same "look for subrepos to pull" code and show the list of subrepos
that would be pulled and then run incoming on them. The main problem
is that incoming already has a --subrepos flag, which does something
slightly different... I'm not sure that could be changed...

>>> I'll try to experiment with this some in the next few days.  I ran
>>> into issues with what I'm working on (push, outgoing) with deeply
>>> nested subrepos, and also when a parent locks in an earlier subrepo
>>> version.  I wonder if deeply nested subrepos will be a problem here
>>> since hgsubrepo.pull() doesn't walk its subrepos and pull them.
>> I must confess that I have not tried that too much. We should
>> definitely do this recursively. That being said I hope to get some
>> feedback on the current version that I sent to the list first.
> Sorry, I got crossed up on that too.  hgsubrepo.pull() ends up calling
> _repo.pull(), so it does recurse.

You are right. That's nice! :-)

> The test below indicates that clone won't
> recurse- it reminds that an update is needed.  (Maybe clone needs a -S too
> as part of these changes?  If you aren't walking the history, I
> don't see a way around that because nothing is incoming after a clone,
> so you won't see subrepos of a cloned subrepo.)

I guess that would make sense in another patch.

> The other thing worth a test is largefiles- the --all-largefiles
> option isn't passed to subrepos, so as it stands, 'pull -S' won't let
> you really disconnect, because largefiles in subrepos won't be cached. That
> can be fixed later.

I agree. We do not really use largefiles (yet) so we have not come
across this (yet) but I'm sure this is something we would like to
improve as well.

One last thing, did you have time to have a look at the code of the
patch series itself? I plan to send an slighly changed version of the
patch series that will only change the first patch to fix the issue
you found (the fix is minimal so the code should be almost the same).
Do you have any comments?

Thanks for looking into this.


More information about the Mercurial-devel mailing list