Bug 2273 - "hg id" and "hg status" do too much IO
Summary: "hg id" and "hg status" do too much IO
Status: RESOLVED FIXED
Alias: None
Product: Mercurial
Classification: Unclassified
Component: Mercurial (show other bugs)
Version: unspecified
Hardware: All All
: normal bug
Assignee: Bugzilla
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-07-04 16:22 UTC by Joerg Sonnenberger
Modified: 2012-05-13 04:59 UTC (History)
6 users (show)

See Also:
Python Version: ---


Attachments
(34 bytes, application/octet-stream)
2010-07-05 12:25 UTC, Joerg Sonnenberger
Details
(34 bytes, application/octet-stream)
2010-07-07 16:18 UTC, Joerg Sonnenberger
Details
(34 bytes, application/octet-stream)
2010-07-12 11:31 UTC, Joerg Sonnenberger
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Joerg Sonnenberger 2010-07-04 16:22 UTC
I have a medium sized Mercurial repo (for a working copy, 9k files and 33MB
.hg). During testing I am replacing around 30 files (300MB) in the
repository with a data sets that is over all 900MB large. This dramatically
slows down both "hg id" and "hg status".

ktrace gives a 300MB trace, so it seems like it is reading the whole files
instead of just stat(2) them to see that the file size changed.
Comment 1 Nicolas Dumazet 2010-07-04 20:29 UTC
The relation between the two commands is simple: "hg id" does a status()
call to check if anything has changed in the working directory.

If I read correctly dirstate.status, here is the behavior:
for each file in working_dir:
  st = stat(fn)
  if st.size >= 0 and (sizechange(st, fn) or modechange(st, fn)):
    modified.append(fn)
  elif timechange(st, fn):
    checklater.append(fn)
  else:
    clean.append(fn)

And only when checklater contains files, the actual data is read and loaded
into memory to compare it to what's stored in .hg.

So the current code is supposed to be doing what you're suggesting: if the
size has changed, mark the file as modified and check the next file. Data
should only be loaded when size has not changed.

Maybe you could tell us more about the state of your working directory when
you're performing that status/id call?
What is the total size of files that have unchanged stat() filesize since
last commit, for example? :)
Comment 2 Benoit Boissinot 2010-07-05 00:02 UTC
@nickdumz

I think joerg was implying that the size had changed.
Comment 3 Benoit Boissinot 2010-07-05 00:06 UTC
@nicdumz

see profile here: http://www.netbsd.org/~joerg/hg.profile
Comment 4 Nicolas Dumazet 2010-07-05 04:54 UTC
Can you check if this helps, and maybe give us an updated profile after the
change?

Thank you :)

diff --git a/mercurial/filelog.py b/mercurial/filelog.py
--- a/mercurial/filelog.py
+++ b/mercurial/filelog.py
@@ -53,14 +53,27 @@
         if self.renamed(node):
             return len(self.read(node))
 
+        # XXX if "\1\n" in self.read(node), this returns (size+4)
         return revlog.revlog.size(self, rev)
 
     def cmp(self, node, text):
-        """compare text with a given file revision"""
+        """compare text with a given file revision
 
-        # for renames, we have to go the slow way
-        if text.startswith('\1\n') or self.renamed(node):
+        returns True if text is different than what is stored.
+        """
+
+        t = text
+        if text.startswith('\1\n'):
+            t = '\1\n\1\n' + text
+
+        samehashes = not revlog.revlog.cmp(self, node, t)
+        if samehashes:
+            return False
+
+        # renaming a file produces a different hash, even if the data
+        # remains unchanged. Check if it's the case (slow):
+        if self.renamed(node):
             t2 = self.read(node)
             return t2 != text
 
-        return revlog.revlog.cmp(self, node, text)
+        return True
Comment 5 Benoit Boissinot 2010-07-05 08:54 UTC
@joerg:

I think I know the problem. Can you do: hg debugstate and see what is the 
status of the big modified files ?
Comment 6 Joerg Sonnenberger 2010-07-05 12:25 UTC
Relevant part of the output is attached.

From the IRC discussion, it might be a good idea if class localrepository
exposes a function to tell if there are any filters active. If that is not
the case, the call to data() for pushing it down to cmp() in
localrepo.py:1062 can be short cut.
Comment 7 Matt Mackall 2010-07-07 16:08 UTC
There are a bunch of unanswered questions here and references to an IRC
discussion. Not enough info to figure out current state of this issue.
Setting to need-eg.
Comment 8 Joerg Sonnenberger 2010-07-07 16:18 UTC
@nicdumz

Profile is attached. It seems to help, so I will run with it for a while and
see if the issue is resolved.
Comment 9 Joerg Sonnenberger 2010-07-12 08:17 UTC
@nicdumz

With the patch, the size of the trace is down to 2MB, which seems acceptable
given the repository and tree size.
Comment 10 Nicolas Dumazet 2010-07-12 08:20 UTC
Great, we'll push this soon.

I'm still investigating how to give more hints to status() with ctx.size()
comparisons, but encode/decode filters and extensions such as keyword do not
make this too easy. There is room for improvement imho.
Comment 11 Joerg Sonnenberger 2010-07-12 11:31 UTC
After more testing, the patch doesn't address the issue of "hg status" after
"hg update". Attached is a test case.
Comment 12 Nicolas Dumazet 2010-07-12 17:29 UTC
Yes, true, this script is quite slow (rewriting it a bit so anyone can run it):

"""
hg init test1
hg init test2

cd test1; dd if=/dev/zero of=test seek=1024k bs=100 count=1; hg ci -Am test
dd if=/dev/zero of=test seek=1024k bs=200 count=1
cd ../test2; hg pull ../test1; hg up; echo foo > test2; hg ci -Am test2
cd ../test1; hg pull ../test2; hg up; time hg stat; time hg stat
"""

It's interesting because 1.5.3 takes about 100ms for this when 1.6 takes 1.8s

With the (experimental and broken) local patch I have using ctx.size in
localrepo.status, time goes down to 0.8s. More work needed!
I also have to check where this slowness was introduced.
Comment 13 Nicolas Dumazet 2010-07-12 19:53 UTC
Just bisect'ed this. This comes from a manifestmerge change:

The first bad revision is:
changeset:   11466:ad27428c59ce
branch:      stable
user:        Matt Mackall <mpm@selenic.com>
date:        Wed Jun 30 17:34:20 2010 -0500
summary:     update: synchronize permissions in the dirstate (issue1473)

$ hg diff -c11466
diff --git a/mercurial/merge.py b/mercurial/merge.py
--- a/mercurial/merge.py
+++ b/mercurial/merge.py
@@ -182,7 +182,9 @@
             rflags = fmerge(f, f, f)
             a = ma.get(f, nullid)
             if n == m2[f] or m2[f] == a: # same or local newer
-                if m1.flags(f) != rflags:
+                # is file locally modified or flags need changing?
+                # dirstate flags may need to be made current
+                if m1.flags(f) != rflags or n[20:]:
                     act("update permissions", "e", f, rflags)
             elif n == a: # remote newer
                 act("remote is newer", "g", f, rflags)


I dont understand how it affects status for now; if the change is correct;
or why it is slower.

But the difference for the last status call from msg13084 is quite clear:
* before: 0.26user 0.02system 0:00.30elapsed
* after: 1.52user 0.31system 0:01.85elapsed
Comment 14 Matt Mackall 2010-07-13 16:03 UTC
Unfortunately, that's a real bug fix.

But that suggests the issue is related to the execute bit being seen as
changed by dirstate and that localrepo.status is instead needlessly checking
the file _contents_.
Comment 15 Nicolas Dumazet 2010-07-16 01:07 UTC
It is checking the file contents in localrepo.status, because dirstate
returns it in "lookup" list.

Expensive read is done during the:

f not in ctx1 or ctx2.flags(f) != ctx1.flags(f) or ctx1[f].cmp(ctx2[f].data())

check, since ctx2[f] if a fileworkingctx and .data() reads the file.

Why is it in lookup mode?
The fix in r11466 has the consequence of calling normallookup() in
merge.recordupdates




In fact, I think that the bugfix we're talking of interacts badly with
workingctx-generated manifests.
In the case of our test, the manifest line for test will be:
  '\\gC\x95Wlm\tE\xe8d\x01{\xce\x15!\xa8\xdf\x11\\m'
Note the "m" flag that was appended here in workingctx._manifest, because
status("test") = "m".

So we're back to one of my first questions when I looked at the
11466:ad27428c59ce fix: _what_ should be stored in manifest, after the node
id, in manifest_line[20:] ?
mercurial.merge seems to expect permissions/executable flags, whereas
workingctx._manifest stores status-like flags. What should really be here? 

I tried to look around, but couldn't really find the answer. A quick
addition to http://mercurial.selenic.com/wiki/Manifest would be very much
welcome ;)

(I also think that I'll end up adding some
http://mercurial.selenic.com/wiki/Flags kind of page (you-name-it), because
I always confuse dirstate flags, permission flags, status flags, etc, etc...
But that's off-topic :)
Comment 16 Matt Mackall 2010-07-16 07:12 UTC
Don't be confused by the way manifests are stored in manifest.py and
generated by workingctx.

manifest.py is effectively storing two columns: file hash and file flags.
They get split apart at read time - you'll never see a hash internally with
an 'x' or 'l' flag.

workingctx is simply amending the file hash so that it won't compare as
equal to the hash of the parent. This is cheaper than calculating a
completely new hash.

It's a little bogus to do this in the cases where only that permissions
change but the contents haven't, but status isn't giving us enough info
about what's actually changed.
Comment 17 Nicolas Dumazet 2010-10-14 22:59 UTC
@joerg: after http://mercurial.selenic.com/hg/hg/rev/4147a292c508 the
particular test script we were considering should be twice faster.

Can you let us know if there's anything else we could do?
Comment 18 Bugzilla 2012-05-12 09:10 UTC

--- Bug imported by bugzilla@serpentine.com 2012-05-12 09:10 EDT  ---

This bug was previously known as _bug_ 2273 at http://mercurial.selenic.com/bts/issue2273
Imported an attachment (id=1424)
Imported an attachment (id=1425)
Imported an attachment (id=1426)