Quick way to find added/modified/removed across revisions?

Matt Mackall mpm at selenic.com
Tue Jul 1 12:24:13 CDT 2008


On Tue, 2008-07-01 at 17:42 +0200, Jesper Noehr wrote:
> On Jul 1, 2008, at 4:53 PM, Matt Mackall wrote:
> >
> > On Tue, 2008-07-01 at 15:03 +0200, Jesper Noehr wrote:
> >> Hi list,
> >>
> >> I'm trying to sort out a quick way to figure out how many files were
> >> added/modified/removed across a change. I want to display these
> >> numbers on the shortlog page together with the date, description and
> >> author. First try was calling repo.status(), which is a very heavy
> >> operation. Next, I tried to read the manifest of the cset, together
> >> with the manifest of the parent, and do some cheap comparison on
> >> those. This turned out also to be extremely expensive (20 seconds for
> >> 25 revisions!)
> >
> > [...]
> >
> >> My profiler tells me that most of the time is spent in
> >> zlib.decompress (to read the manifest from a compressed file, I
> >> guess), and there's also a lot of load on revlog.py:chunk and
> >> manifest.py:parse.
> >
> > As you seem to be reading the manifest in the forward direction, it
> > should be caching most of this operation. I would expect the mpatch
> > code to show up most prominently.
> >
> > In my tests (long ago), manifests tend to be large files with many
> > small changes. So the deltas end up being very small and quick to
> > decompress, while moving data around to apply the deltas tends to
> > dominate. For example, if we make 1000 single file changes to a 1M
> > manifest, we've got to do at least 1G of memcpy to reconstruct them
> > all, but probably less than 1M of uncompress.
> >
> > The verify command has a trick to avoid reconstructing the entire
> > manifest which should speed things up: reading only the delta text.
> > Each changeset has a files field which shows all files changed.
> > Compare that with the data from manifest.readdelta() and you should
> > be able to figure out which ones were added (new entry), modified
> > (changed entry), or removed (listed in changeset but not in manifest
> > delta).
> 
> .readdelta() sure is promising, much faster than reading the entire
> manifest. The problem I'm having now is how to read the results of it.
> For example, say that I just grab one rev and these are the results:
> 
> delta: {'django/trunk/django/db/models/sql/query.py':
>         '\x98E\x90\xd8\xaa\xe6\xfc"\x98P\xc4\t\xd7\x15\xab\xb1Ecr\xcb'}
> files: ['django/trunk/django/db/models/sql/query.py']
> 
> How do I figure out what happened to the file in this instance? Sorry
> if I'm asking a stupid question here.

Well, that tells us that query.py was either modified or added. To know
which, we have to know whether query.py was in the parent's manifest:

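    # manifest: dict of the parent's manifest (filename -> node),
    # updated in place so it is ready for the next changeset
    added, modified, removed = [], [], []
    for f in files:
        if f not in delta:
            # listed in the changeset but absent from the delta
            removed.append(f)
            del manifest[f]
        elif f not in manifest:
            # new entry
            added.append(f)
            manifest[f] = delta[f]
        else:
            # entry already existed in the parent
            modified.append(f)
            manifest[f] = delta[f]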
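
For anyone who wants to try the logic without a repository at hand, here
is a minimal standalone sketch of the same loop. It assumes plain dicts
stand in for the parent manifest and for the result of readdelta(); the
function name classify_changes and the sample file names and node values
are made up for illustration and are not part of Mercurial.

```python
def classify_changes(files, delta, parent_manifest):
    """Split a changeset's files into added, modified and removed.

    files           -- file names listed in the changeset
    delta           -- {name: node} entries from the manifest delta
    parent_manifest -- {name: node} for the parent; updated in place so
                       it can be reused for the next changeset
    """
    added, modified, removed = [], [], []
    for f in files:
        if f not in delta:
            # Listed in the changeset but absent from the delta.
            removed.append(f)
            # pop() instead of del: tolerate a name that is already gone.
            parent_manifest.pop(f, None)
        elif f not in parent_manifest:
            # New entry: the parent did not know this file.
            added.append(f)
            parent_manifest[f] = delta[f]
        else:
            # Entry existed in the parent and now has a new node.
            modified.append(f)
            parent_manifest[f] = delta[f]
    return added, modified, removed


# Hypothetical sample data, not taken from a real repository.
parent = {"a.py": b"\x01", "b.py": b"\x02"}
delta = {"a.py": b"\x03", "c.py": b"\x04"}
files = ["a.py", "b.py", "c.py"]

added, modified, removed = classify_changes(files, delta, parent)
print(len(added), len(modified), len(removed))
```

Because the dict is updated in place, walking the revisions in forward
order only needs one full manifest read at the start; every later
changeset can be classified from its files list and delta alone.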



> 
> Jesper
-- 
Mathematics is the supreme nostalgia of our time.


