[PATCH] subrepo: do not push "clean" subrepos when the parent repo is pushed

Angel Ezquerra angel.ezquerra at gmail.com
Thu Feb 14 02:39:55 CST 2013


On Thu, Feb 14, 2013 at 2:31 AM, Matt Mackall <mpm at selenic.com> wrote:
> On Thu, 2013-02-14 at 01:06 +0100, Angel Ezquerra wrote:
>> # HG changeset patch
>> # User Angel Ezquerra <angel.ezquerra at gmail.com>
>> # Date 1360795816 -3600
>> # Node ID 26276460d54aecdeb107c82c4e3f2ca7c0c6a8b3
>> # Parent  55b9b294b7544a6a144f627f71f4b770907d5a98
>> subrepo: do not push "clean" subrepos when the parent repo is pushed
>>
>> A clean subrepo is defined as one that has not had its dirstate, bookmarks or
>> phases modified.
>>
>> This patch works by adding a "clean" method to subrepos. In the case of
>> mercurial subrepos, this method calculates a "stamp" (i.e. a set of file hashes)
>> of the repository state at the time of push with a similar "stamp" that was
>> stored on a file when the subrepo was cloned or pushed to a given remote target.
>> If the stamps match the subrepo has no changes that must be pushed to the target
>> repository and thus the push can be skipped.
>>
>> Note that we calculate the stamp file by calculating hashes for several key
>> repository files, such as the dirstate, the bookmarks file and the phaseroots
>> file. This means that our "clean" detection is not perfect, in the sense that
>> if the working directory has been updated to a different revision we will
>> assume that the subrepo is not clean. However, if we update to another revision
>> and back to the original revision the clean() method will correctly detec the
>> subrepo as being clean.

> Why is the dirstate interesting? I would posit that we're only
> interested in things that we'd push or pull. Dirstate being modified
> should not force us to push.

I think I may have misunderstood you when we discussed this during the
sprint. I thought you mentioned that we should check the direstate,
the bookmarks and the phaseroots.

I don't know the mercurial store format well enough to tell which
files we need to hash. Would hashing "store/00changelog.i" and
"store/00manifest.i" be enough? An alternative would be to hash the
list of repository heads and consider the repo unclean if that
changed?

>> Also note that a subrepo being "clean" is not the opposite of it being "dirty".
>> A subrepo is dirty if it updated to a different revision that the one that is
>> pointed to by the subrepo parent or if its working directory is not clean. This
>> is a different concept.
>
> Ok, let's give it a different name. How about storeclean() so it's
> explicit it's about the store?

OK, as long as bookmarks and phases can be considered part of the store?

> I think this needs to be (at least) a few different patches:
>
> 1) introduce the helper functions
> 2) record clean state on clone/pull
> 3) check clean state on push

Will do.

>> diff --git a/mercurial/subrepo.py b/mercurial/subrepo.py
>> --- a/mercurial/subrepo.py
>> +++ b/mercurial/subrepo.py
>> @@ -300,6 +300,16 @@
>>
>>  class abstractsubrepo(object):
>>
>> +    def clean(self):
>> +        """
>> +        returns true if the repository has not changed since it was last
>> +        cloned or pulled.
>> +        Note that this is very different and definitely not the opposite
>> +        of the repository being "dirty", which is related to having changes
>> +        on the working directory or the current revision.
>> +        """
>> +        return False
>> +
>>      def dirty(self, ignoreupdate=False):
>>          """returns true if the dirstate of the subrepo is dirty or does not
>>          match current stored state. If ignoreupdate is true, only check
>> @@ -426,6 +436,73 @@
>>          self._repo.ui.setconfig('ui', '_usedassubrepo', 'True')
>>          self._initrepo(r, state[0], create)
>>
>> +    def clean(self, path):
>> +        """
>> +        returns true if the repository has not changed since it was last
>> +        cloned or pulled.
>> +        Note that this is very different and definitely not the opposite
>> +        of the repository being "dirty", which is related to having changes
>> +        on the working directory or the current revision.
>> +        """
>
> Duplicate docs.
>
>> +        return self._calcrepostamp(path) == self._readrepostamp(path)
>
> This is suboptimal, especially if any of the files are large. It'd be
> better to be able to break after we find the first changed file.

Perhaps _calcrepostamp could be a generator. clean() (i.e. cleanstore)
could then just check each individual line.

>> +    def _getfilestamp(self, filename):
>> +        data = ''
>> +        if os.path.exists(filename):
>> +            fd = open(filename)
>> +            data = fd.read()
>> +            fd.close()
>> +        return util.sha1(data).hexdigest()
>
> Most of this doesn't want to be member functions. We'll need it for git.

Sure, although I know even less about git's internals than mercurial's
to be able to know what should be hashed inside the .git folder.

Should these be kept in subrepos.py or perhaps moved to
mercurial.util? Also _calcrepostamp could be moved out as well by
making the "file list" a parameter.

>> +    def _calcrepostamp(self, remotepath):
>> +        '''calculate a unique "stamp" for the current repository state
>> +
>> +        This method is used to to detect when there are changes that may
>> +        require a push to a given remote path.'''
>> +        filelist = ('dirstate', 'bookmarks', 'store/phaseroots')
>> +        stamp = ['# %s\n' % remotepath]
>> +        lock = self._repo.lock()
>> +        try:
>> +            for relname in filelist:
>> +                absname = os.path.normpath(self._repo.join(relname))
>> +                stamp.append('%s = %s\n' % (absname, self._getfilestamp(absname)))
>> +        finally:
>> +            lock.release()
>> +        return stamp
>> +
>> +    def _getstampfilename(self, remotepath):
>> +        '''get a unique filename for the remote repo stamp'''
>> +        fname = util.sha1(remotepath).hexdigest()
>> +        return self._repo.join(os.path.join('stamps', fname))
>
> Probably don't want a 40-character name here. This should go
> in .hg/cache/ on hg repos and have a less generic name than stamp.

OK, I'll just keep the first 12 hash characters.

>> +    def _readrepostamp(self, remotepath):
>> +        '''read an existing remote repository stamp'''
>> +        stampfile = self._getstampfilename(remotepath)
>> +        if not os.path.exists(stampfile):
>> +            return ''
>> +        fd = open(stampfile, 'r')
>> +        stamp = fd.readlines()
>> +        fd.close()
>> +        return stamp
>> +
>> +    def _updaterepostamp(self, remotepath):
>> +        '''
>> +        Calc the current repo stamp saving it into a remote repo stamp file
>> +        Each remote repo requires its own stamp file, because a subrepo may
>> +        be clean versus a given remote repo, but not versus another.
>> +        '''
>> +        # save it to the clean file
>> +        # We should lock the repo
>> +        stampfile = self._getstampfilename(remotepath)
>> +        # [FIXME] should lock the repo? it is already locked by _calcrepostamp
>
> No, the lock should be in the callers of _calcrepostamp.

OK. How would that work on the case of git repos? Is it possible to
"lock" a git repo?

Matt, thanks for the detailed review!

Angel


More information about the Mercurial-devel mailing list