[PATCH 1 of 3] pull: add --subrepos flag

Matt Harbison matt_harbison at yahoo.com
Sun Feb 24 02:20:52 CST 2013


Angel Ezquerra wrote:
> On Thu, Feb 21, 2013 at 5:55 AM, Matt Harbison <matt_harbison at yahoo.com> wrote:
>> Angel Ezquerra wrote:
>>> On Wed, Feb 20, 2013 at 6:57 AM, Matt
>>> Harbison <matt_harbison at yahoo.com> wrote:
>>>> On Sun, 17 Feb 2013 13:19:16 +0100, Angel Ezquerra wrote:
>>>>
>>>>> # HG changeset patch
>>>>> # User Angel Ezquerra <angel.ezquerra at gmail.com>
>>>>> # Date 1360519226 -3600
>>>>> # Node ID abbd26cca35280fb8f784b3f2c02eef71696c47b
>>>>> # Parent  55b9b294b7544a6a144f627f71f4b770907d5a98
>>>>> pull: add --subrepos flag
>>>>>
>>>>> The purpose of this new flag is to ensure that you are able to
>>>>> update to any incoming revision without requiring any network
>>>>> access. The idea is to make sure that the repository is
>>>>> self-contained after doing hg pull --subrepos (as long as it
>>>>> already was self-contained before the pull).
>>>>>
>>>>> When the --subrepos flag is enabled, pull will also pull (or
>>>>> clone) all subrepos that are present on the current revision and
>>>>> those that are referenced by any of the incoming revisions.
>>>> I haven't gotten a chance to really play with this yet, so I'm
>>>> going more off the comments here- I apologize if these answers
>>>> should be obvious, but I'm not familiar enough with some of the
>>>> code.
>>>>
>>>> - Is there an easy way to tell if the repo is/was self contained?
>>>> (Maybe incoming -S?)
>>>
>>> No there is not. I don't think incoming -S would do the trick since
>>> that would just tell you if there are _new_ incoming revisions on
>>> some of the _current_ subrepos. A repo is "self-contained" if it is
>>> possible to update to any of its revisions without requiring a pull
>>> of one or more of its subrepos.
>>>
>>> I don't know of any existing mercurial command that would be able to
>>> give you that information.
>>>
>>>> - Is the 'self-contained' bit to limit overhead on each pull, or is
>>>> there another reason this can't ensure the result is self
>>>> contained?  'Push' and 'outgoing -S' recognize (almost) everything
>>>> going in the other direction, so it might be nice to have the same
>>>> capability with a form of pull.  (I may have found a push bug that
>>>> I haven't gotten back to yet.)
>>>
>>> I'm not sure I understand what you mean.
>>
>> Consider this (contrived) case:
>>
>>    1) push a repo and subrepo to remote (remote is now self contained)
>>    2) strip or rollback the remote subrepo
>>    3) a top level local repo push will repopulate the remote subrepo
>>
>> Now reverse it:
>>
>>    1) pull a repo and subrepo from remote (local is self contained)
>>    2) strip or rollback the local subrepo
>>    3) nothing is incoming top level, so the subrepo isn't repopulated (if
>> you've updated to a working dir without that subrepo).
>>
>> I'm not sure if there's a less contrived case, or if this matters too
>> much.  I guess I was just wondering aloud about the symmetry between
>> push and pull -S (this is certainly much better than it was).
>
> I don't think this is a very big deal because "pull --subrepos" also
> performs a pull on all the repositories that are found on the current
> revision. I think this would fix your contrived case.
>
> The only corner case that would not be properly handled by this would
> be the case in which pull --subrepos cloned a _new_ subrepo that is
> not found on the current revision (i.e. that is introduced on one of
> the incoming revisions) and then you stripped some revisions from that
> "future subrepo". In that case you would still be able to get the
> missing revisions as you would do today, i.e. by updating to the
> offending revision. You could also use the onsub extension (which is
> shipped with TortoiseHg).
>
> BTW, the more I use the onsub extension the more I am convinced that
> it should be distributed with mercurial, and even that the onsub
> command itself should become a built-in command.

I'd be fine with that.  I haven't used it, but from what I've read, it seems 
like a useful crutch to do things that aren't currently possible because they 
aren't well defined, like tagging all repos or opening a branch.  But ideally 
there wouldn't be a need for it with commands that accept a -S.  The principle 
of least surprise says those commands can handle subrepos already.

> That being said, if we did want to also handle this very small corner
> case there are at least a couple of possible solutions:
>
> 1. make the --subrepos flag take a revset that would tell it which
> revisions to look for subrepos to pull.

Why not use the -r option?  But really, I think in this case 'pull -S' == 'pull 
-S -r all()', so I'm not sure if that helps.

> 2. make --subrepos also look for subrepos on the descendants of the
> current revision.
>
> #1 would be nice _if_ we could also skip the revset to get the
> behavior that the current patch proposes. However I don't think it
> is possible to have a flag that can optionally take a parameter.

Agreed.  A special case parameter isn't nice.

> #2... maybe that is too much overhead for such a small corner case?

Not sure.  That's why I was asking if there are large repos to benchmark on :-). 
The other thing I'm wondering is, since we seem to have various caches, maybe 
caching subrepo info would speed things up?  I have no idea what form this would 
take, or even how the existing caches are invalidated, how they detect corruption, 
etc.  The construct I've got in my push patches is:

      f = repo.wjoin('.hgsubstate')
      revs = repo.revs('(adds(%s) or modifies(%s)) and ::(%ln)', f, f, revs)

I'm not sure if this helps cut down overhead- you do have to traverse the whole 
repo, but you don't have to process the .hgsubstate file for revisions where the 
subrepos don't change.  (Though you mentioned needing the latest path for each 
subrepo, and the path lives in .hgsub, which can change without affecting 
.hgsubstate, so this may not be useful anyway.)
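
For what it's worth, here's a rough, untested sketch of how I picture that revset 
being used to collect the subrepo states referenced by a set of incoming 
changesets (the helper name and the assumption that 'revs' holds the incoming 
node ids are mine, not anything in your patch):

      # hypothetical helper: map each subrepo path to the set of revisions
      # recorded for it in .hgsubstate, looking only at the revisions in the
      # ancestry of 'revs' (assumed to be a list of node ids) that touch it
      def subrepostates(repo, revs):
          f = repo.wjoin('.hgsubstate')
          touched = repo.revs('(adds(%s) or modifies(%s)) and ::(%ln)', f, f, revs)
          states = {}
          for rev in touched:
              ctx = repo[rev]
              # each .hgsubstate line is '<40-char node> <subrepo path>'
              for line in ctx['.hgsubstate'].data().splitlines():
                  node, path = line.split(' ', 1)
                  states.setdefault(path, set()).add(node)
          return states

Something along those lines would give pull -S (or incoming -S) the list of 
subrepos and revisions it needs without reading .hgsubstate at every revision.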


> Additionally, I wonder if it would make sense to change the current
> pull behavior a little, by checking if any of the subrepos that are
> present on the _current_ parent revision point to a revision that does
> not exist, and if that is the case pulling from them _up to that
> revision_? I think the parent pull time is an excellent time to do
> this kind of stuff since you already know that you have an internet
> connection...

Can you show a short test case or pseudo test case of this?  I'm not sure what 
you're suggesting here.  Wouldn't the existing pull bring in all of the changes 
to the tip of each subrepo now?  Why back off on that without a -r option?
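
Just to check that I follow, in shell terms I'm picturing something like this 
(hypothetical paths; 'sub' stands for a subrepo whose revision recorded in the 
parent's .hgsubstate isn't present locally):

      # revision of 'sub' recorded by the current parent revision
      $ node=$(hg cat -r . .hgsubstate | grep ' sub$' | cut -d' ' -f1)
      # what I think you're proposing: pull only up to the recorded revision
      $ hg -R sub pull -r "$node"
      # versus what pulling the subrepo does today: everything up to tip
      $ hg -R sub pull

If that's the idea, it seems to tie into the -r question above.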

>> I realize you can't do such a thing without crawling most of the
>> history.  Are there large public repositories that use subrepos?  I'm
>> wondering what the performance hit would be.  (It's easy for me to think
>> something is a good idea when I only have small repos and wouldn't
>> notice the hit.)
>
> We have repositories with thousands of files and more than 20
> subrepos. I don't know if you would consider that big...

Not sure.  I'm assuming that the number of commits is the governing factor when 
walking the history?  But that count is almost certainly going to be larger than 
in the toy repos I make, or even the real ones I use at work (still fewer than 
1000 commits in the subrepo, and probably fewer than 200 in the parent).  Just 
looking for something to run benchmarks on.

>> FWIW, largefiles works the same way- if you don't clone or pull with
>> --all-largefiles, there's no single command to go back and get the files
>> for all revisions that are not incoming.  That leaves the user wondering
>> if they really can disconnect from the central repo.
>
> That is true. It would be nice to be able to do that. However there is
> a difference between largefiles and subrepos. Largefiles tries to only
> get the files that you need, when you need them _by design_. Subrepos
> is not meant (IMHO) to do that (or at least it is not necessary for
> subrepos to fulfill its main purpose, which is to track "submodules").

I agree with you about subrepos, but the --all-largefiles flag is a new(ish) 
feature that lets you override that original largefiles design, so you can 
disconnect from the central repo.  (So a very similar motivation for this.)  I 
was going to put in a command to download missing largefiles, but never got 
around to it.  But I don't think a new command would be an option for subrepos 
if we don't get this right, because it really is just a pull.  If we can change 
the pull command in the future should the need arise, then it may not be that 
big of a deal now.  And I suspect the only hindrance to adding it later will be 
if there's a big performance drop.

>>> I don't think you (we?) must give too much importance to this
>>> "self-contained" concept. It is just a way for me to explain the
>>> purpose of the patch, and especially to explain why we must look for
>>> subrepos on all the new incoming revisions, and why we cannot just
>>> limit ourselves to pulling the subrepos on the current revision
>>> (short answer: because new subrepos may appear on the new, incoming
>>> revisions).
>>>
>>> My patch explicitly says that hg pull -S will only make your repo
>>> self-contained if it was already self-contained before. This is in
>>> order to avoid having to look for subrepos on all the repo history,
>>> rather than just looking for subrepos on the incoming revisions (and
>>> the current one).
>>>
>>>> - The full subrepo gets pulled, even revs not committed to the
>>>> parent?  I think that's a good thing, because I regularly get burned
>>>> when I 'pull -u' the tree to another machine and then go to apply
>>>> the rest of a patch queue to the subrepo.
>>>
>>> Yes. It is perhaps not optimal but I think it is simpler. In
>>> addition if different parent repo revisions point to different
>>> revisions on a subrepo there is no way for us to tell which of those
>>> subrepo revisions is the one that is closest to tip, or which ones
>>> are ancestors of the other ones, etc. As a result we would need to
>>> perform as many pulls on a given subrepo as the number of different
>>> revisions of that subrepo that were referenced on the parent repo.
>>> That is complex and slow, so it is much simpler and possibly faster
>>> (in some cases at least) to just pull all revisions from each
>>> subrepo.
>>
>> OK, I misread the code- I thought each subrepo was getting a pull at
>> each revision, which I figured would be slow.  I attached a test patch
>> below- there's nothing special about it, but it helped me with my push
>> and outgoing changes (some comments probably still reflect this), so I
>> changed that to pull and incoming to test your patch.
>>
>> - I think I see a double pull of a subrepo (search for "hg pull -S -r 3").
>
> You are right. There is a problem with the current version of the
> patch. It will clone the same repo multiple times if the same subrepo
> refers to different revisions. I will resend with a fix.
>
> BTW, where do you think the tests for this new feature should go?
> Should it go into one of the test-pull*.t files or in one of the
> test-subrepo*.t files? Do you think the test that you propose in that
> patch would be enough to test this new feature? Maybe it could be made
> a bit simpler?
>

I assumed it would go in test-subrepo*.t, so anyone hacking on subrepos doesn't 
have to find it in an obscure test, but I can see the same argument from someone 
working on pull().  Probably what tips it to test-subrepo*.t though is that there 
is some existing subrepo infrastructure that doesn't exist in test-pull*.t.

I'm not sure that it's comprehensive, and there's probably some duplication.  I 
put it together to e.g. test 'push -r 2' then 'push -r 4', and make sure the 
grandchild repo code path got exercised.  Since you aren't doing anything with 
-r, there may be duplicate pull tests.  The incoming tests probably aren't 
relevant either, though I think they should agree on what they are doing eventually.

Those tests also didn't exercise the subrepo pull path (the dests were empty, so 
clone was used).  I marked various potential issues in there with 'XXX'.
One thing I did specifically was to skew the rev values between the parent and 
child, so parent rev 1 doesn't lock in subrepo rev 1.  This makes it easier to 
test with the -r option (which I realize you aren't doing now, but might be 
nice to have in the future).
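
For illustration, the skewing amounts to something like this (a stripped down 
sketch with made up repo names, and with the expected test output omitted- not 
the actual test I attached):

      $ hg init main
      $ hg init main/sub
      $ echo a > main/sub/a
      $ hg -R main/sub commit -Am 'sub rev 0'
      $ echo b >> main/sub/a
      $ hg -R main/sub commit -m 'sub rev 1'
      $ echo 'sub = sub' > main/.hgsub
      $ hg -R main commit -Am 'main rev 0'

At this point main rev 0 records sub rev 1 in .hgsubstate, so a parent revision 
never lines up with the same subrepo revision by accident when testing -r.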

>> - I wonder if the code in this patch can be leveraged to make incoming
>> print all of the stuff 'pull -S' will grab in the future.
>
> In theory that would be certainly possible. At least we could run the
> same "look for subrepos to pull" code and show the list of subrepos
> that would be pulled and then run incoming on them. The main problem
> is that incoming already has a --subrepos flag, which does something
> slightly different... I'm not sure that could be changed...

I think it just recurses into the subrepo and does an incoming on it.  If it 
does something else, it's too subtle for me to have noticed.

I've probably stated bits and pieces of this over the course of a week or so, 
but just to explicitly put it all together, I think in an ideal world we would 
have pull/incoming and push/outgoing agree on the same options, with each group 
being the reverse of the other.  It may be fantasy, but:

    incoming       pull          operation
    <no arg>       <no arg>      current behavior, parent repo only
    xxx [1]        -u            current behavior, pull subrepo on update
    -S             -S            (this patch) pull parent + all subrepos to tip
    -S -r <rev>    -S -r <rev>   (future) pull parent to rev, revs of subrepos
                                 used up to that rev

[1] It doesn't seem sensible to add -u to incoming for this, so leaving no 
equivalent form is probably best.

    outgoing       push          operation
    <no arg> [2]   <no arg>      current, parent only and all respectively
    -S             <no arg> [3]  current, push parent + all subrepos to tip
    -S -r <rev>    -r <rev>      (future) push parent to rev, revs of subrepos
                                 used up to that rev

[2] It seems buggy to me that the no arg forms disagree, since outgoing's help 
says "These are the changesets that would be pushed if a push was requested." 
But that ship has probably already sailed.

[3] There is no form of push that takes -S

So the last two lines in each table align in functionality and (roughly) in 
arguments.


>>>> I'll try to experiment with this some in the next few days.  I ran
>>>> into issues with what I'm working on (push, outgoing) with deeply
>>>> nested subrepos, and also when a parent locks in an earlier subrepo
>>>> version.  I wonder if deeply nested subrepos will be a problem here
>>>> since hgsubrepo.pull() doesn't walk its subrepos and pull them.
>>>
>>> I must confess that I have not tried that too much. We should
>>> definitely do this recursively. That being said I hope to get some
>>> feedback on the current version that I sent to the list first.
>>
>> Sorry, I got crossed up on that too.  hgsubrepo.pull() ends up calling
>> _repo.pull(), so it does recurse.
>
> You are right. That's nice! :-)
>
>> The test below indicates that clone won't
>> recurse- it reminds you that an update is needed.  (Maybe clone needs a -S too
>> as part of these changes?  If you aren't walking the history, I
>> don't see a way around that because nothing is incoming after a clone,
>> so you won't see subrepos of a cloned subrepo.)
>
> I guess that would make sense in another patch.
>
>> The other thing worth a test is largefiles- the --all-largefiles
>> option isn't passed to subrepos, so as it stands, 'pull -S' won't let
>> you really disconnect, because largefiles in subrepos won't be cached. That
>> can be fixed later.
>
> I agree. We do not really use largefiles (yet) so we have not come
> across this (yet) but I'm sure this is something we would like to
> improve as well.
>
> One last thing, did you have time to have a look at the code of the
> patch series itself? I plan to send a slightly changed version of the
> patch series that will only change the first patch to fix the issue
> you found (the fix is minimal so the code should be almost the same).
> Do you have any comments?

I did, but I'm not as well versed in this area of code- that's why I've kept the 
conversation fairly high level.  I was able to follow your commit message and it 
looks like the code does what it says.  Just curious- is specifying None as the 
revision the right thing to do in pull()?  AFAICT, None gives you the working 
directory... so does that change the results based on what the subrepo is 
currently updated to?

> Thanks for looking into this.
>
> Angel
>


