I have a medium sized Mercurial repo (for a working copy, 9k files and 33MB .hg). During testing I am replacing around 30 files (300MB) in the repository with a data sets that is over all 900MB large. This dramatically slows down both "hg id" and "hg status". ktrace gives a 300MB trace, so it seems like it is reading the whole files instead of just stat(2) them to see that the file size changed.
The relation between the two commands is simple: "hg id" does a status() call to check if anything has changed in the working directory. If I read correctly dirstate.status, here is the behavior: for each file in working_dir: st = stat(fn) if st.size >= 0 and (sizechange(st, fn) or modechange(st, fn)): modified.append(fn) elif timechange(st, fn): checklater.append(fn) else: clean.append(fn) And only when checklater contains files, the actual data is read and loaded into memory to compare it to what's stored in .hg. So the current code is supposed to be doing what you're suggesting: if the size has changed, mark the file as modified and check the next file. Data should only be loaded when size has not changed. Maybe you could tell us more about the state of your working directory when you're performing that status/id call? What is the total size of files that have unchanged stat() filesize since last commit, for example? :)
@nickdumz I think joerg was implying that the size had changed.
@nicdumz see profile here: http://www.netbsd.org/~joerg/hg.profile
Can you check if this helps, and maybe give us an updated profile after the change? Thank you :) diff --git a/mercurial/filelog.py b/mercurial/filelog.py --- a/mercurial/filelog.py +++ b/mercurial/filelog.py @@ -53,14 +53,27 @@ if self.renamed(node): return len(self.read(node)) + # XXX if "\1\n" in self.read(node), this returns (size+4) return revlog.revlog.size(self, rev) def cmp(self, node, text): - """compare text with a given file revision""" + """compare text with a given file revision - # for renames, we have to go the slow way - if text.startswith('\1\n') or self.renamed(node): + returns True if text is different than what is stored. + """ + + t = text + if text.startswith('\1\n'): + t = '\1\n\1\n' + text + + samehashes = not revlog.revlog.cmp(self, node, t) + if samehashes: + return False + + # renaming a file produces a different hash, even if the data + # remains unchanged. Check if it's the case (slow): + if self.renamed(node): t2 = self.read(node) return t2 != text - return revlog.revlog.cmp(self, node, text) + return True
@joerg: I think I know the problem. Can you do: hg debugstate and see what is the status of the big modified files ?
Relevant part of the output is attached. From the IRC discussion, it might be a good idea if class localrepository exposes a function to tell if there are any filters active. If that is not the case, the call to data() for pushing it down to cmp() in localrepo.py:1062 can be short cut.
There are a bunch of unanswered questions here and references to an IRC discussion. Not enough info to figure out current state of this issue. Setting to need-eg.
@nicdumz Profile is attached. It seems to help, so I will run with it for a while and see if the issue is resolved.
@nicdumz With the patch, the size of the trace is down to 2MB, which seems acceptable given the repository and tree size.
Great, we'll push this soon. I'm still investigating how to give more hints to status() with ctx.size() comparisons, but encode/decode filters and extensions such as keyword do not make this too easy. There is room for improvement imho.
After more testing, the patch doesn't address the issue of "hg status" after "hg update". Attached is a test case.
Yes, true, this script is quite slow (rewriting it a bit so anyone can run it): """ hg init test1 hg init test2 cd test1; dd if=/dev/zero of=test seek=1024k bs=100 count=1; hg ci -Am test dd if=/dev/zero of=test seek=1024k bs=200 count=1 cd ../test2; hg pull ../test1; hg up; echo foo > test2; hg ci -Am test2 cd ../test1; hg pull ../test2; hg up; time hg stat; time hg stat """ It's interesting because 1.5.3 takes about 100ms for this when 1.6 takes 1.8s With the (experimental and broken) local patch I have using ctx.size in localrepo.status, time goes down to 0.8s. More work needed! I also have to check where this slowness was introduced.
Just bisect'ed this. This comes from a manifestmerge change: The first bad revision is: changeset: 11466:ad27428c59ce branch: stable user: Matt Mackall <mpm@selenic.com> date: Wed Jun 30 17:34:20 2010 -0500 summary: update: synchronize permissions in the dirstate (issue1473) $ hg diff -c11466 diff --git a/mercurial/merge.py b/mercurial/merge.py --- a/mercurial/merge.py +++ b/mercurial/merge.py @@ -182,7 +182,9 @@ rflags = fmerge(f, f, f) a = ma.get(f, nullid) if n == m2[f] or m2[f] == a: # same or local newer - if m1.flags(f) != rflags: + # is file locally modified or flags need changing? + # dirstate flags may need to be made current + if m1.flags(f) != rflags or n[20:]: act("update permissions", "e", f, rflags) elif n == a: # remote newer act("remote is newer", "g", f, rflags) I dont understand how it affects status for now; if the change is correct; or why it is slower. But the difference for the last status call from msg13084 is quite clear: * before: 0.26user 0.02system 0:00.30elapsed * after: 1.52user 0.31system 0:01.85elapsed
Unfortunately, that's a real bug fix. But that suggests the issue is related to the execute bit being seen as changed by dirstate and that localrepo.status is instead needlessly checking the file _contents_.
It is checking the file contents in localrepo.status, because dirstate returns it in "lookup" list. Expensive read is done during the: f not in ctx1 or ctx2.flags(f) != ctx1.flags(f) or ctx1[f].cmp(ctx2[f].data()) check, since ctx2[f] if a fileworkingctx and .data() reads the file. Why is it in lookup mode? The fix in r11466 has the consequence of calling normallookup() in merge.recordupdates In fact, I think that the bugfix we're talking of interacts badly with workingctx-generated manifests. In the case of our test, the manifest line for test will be: '\\gC\x95Wlm\tE\xe8d\x01{\xce\x15!\xa8\xdf\x11\\m' Note the "m" flag that was appended here in workingctx._manifest, because status("test") = "m". So we're back to one of my first questions when I looked at the 11466:ad27428c59ce fix: _what_ should be stored in manifest, after the node id, in manifest_line[20:] ? mercurial.merge seems to expect permissions/executable flags, whereas workingctx._manifest stores status-like flags. What should really be here? I tried to look around, but couldn't really find the answer. A quick addition to http://mercurial.selenic.com/wiki/Manifest would be very much welcome ;) (I also think that I'll end up adding some http://mercurial.selenic.com/wiki/Flags kind of page (you-name-it), because I always confuse dirstate flags, permission flags, status flags, etc, etc... But that's off-topic :)
Don't be confused by the way manifests are stored in manifest.py and generated by workingctx. manifest.py is effectively storing two columns: file hash and file flags. They get split apart at read time - you'll never see a hash internally with an 'x' or 'l' flag. workingctx is simply amending the file hash so that it won't compare as equal to the hash of the parent. This is cheaper than calculating a completely new hash. It's a little bogus to do this in the cases where only that permissions change but the contents haven't, but status isn't giving us enough info about what's actually changed.
@joerg: after http://mercurial.selenic.com/hg/hg/rev/4147a292c508 the particular test script we were considering should be twice faster. Can you let us know if there's anything else we could do?
--- Bug imported by bugzilla@serpentine.com 2012-05-12 09:10 EDT --- This bug was previously known as _bug_ 2273 at http://mercurial.selenic.com/bts/issue2273 Imported an attachment (id=1424) Imported an attachment (id=1425) Imported an attachment (id=1426)