Bug 5801 - pure-python manifest code produces corrupt unsorted manifests
Summary: pure-python manifest code produces corrupt unsorted manifests
Status: RESOLVED FIXED
Alias: None
Product: Mercurial
Classification: Unclassified
Component: Mercurial (show other bugs)
Version: default branch
Hardware: PC Mac OS
: normal bug
Assignee: Bugzilla
URL:
Keywords: regression
: 6052 (view as bug list)
Depends on:
Blocks:
 
Reported: 2018-02-22 11:57 UTC by Augie Fackler
Modified: 2019-11-13 01:02 UTC (History)
5 users (show)

See Also:
Python Version: ---


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Augie Fackler 2018-02-22 11:57 UTC
The only steps to reproduce I have right now involve a repo from timeless that's a conversion of hg-git, but I can reliably build a corrupt commit with a merge of two revisions if I use HGMODULEPOLICY=py. If I use HGMODULEPOLICY=c, the problem doesn't happen.

This does appear to be rare, and the resulting manifest can't even be checked out so it's obvious *right away* something is borked, but we should still address this promptly, probably with backports to older versions...

Here's the traceback:

Traceback (most recent call last):
  File "/opt/hg/bin/hg", line 41, in <module>
    dispatch.run()
  File "/opt/hg/lib/python/mercurial/dispatch.py", line 88, in run
    status = (dispatch(req) or 0) & 255
  File "/opt/hg/lib/python/mercurial/dispatch.py", line 183, in dispatch
    ret = _runcatch(req)
  File "/opt/hg/lib/python/mercurial/dispatch.py", line 324, in _runcatch
    return _callcatch(ui, _runcatchfunc)
  File "/opt/hg/lib/python/mercurial/dispatch.py", line 332, in _callcatch
    return scmutil.callcatch(ui, func)
  File "/opt/hg/lib/python/mercurial/scmutil.py", line 154, in callcatch
    return func()
  File "/opt/hg/lib/python/mercurial/dispatch.py", line 314, in _runcatchfunc
    return _dispatch(req)
  File "/opt/hg/lib/python/mercurial/dispatch.py", line 917, in _dispatch
    cmdpats, cmdoptions)
  File "/opt/hg/lib/python/mercurial/dispatch.py", line 674, in runcommand
    ret = _runcommand(ui, options, cmd, d)
  File "/opt/hg/lib/python/mercurial/dispatch.py", line 925, in _runcommand
    return cmdfunc()
  File "/opt/hg/lib/python/mercurial/dispatch.py", line 914, in <lambda>
    d = lambda: util.checksignature(func)(ui, *args, **strcmdopt)
  File "/opt/hg/lib/python/mercurial/util.py", line 1228, in check
    return func(*args, **kwargs)
  File "/opt/hg/lib/python/mercurial/util.py", line 1228, in check
    return func(*args, **kwargs)
  File "/opt/hg/lib/python/hgext/mq.py", line 3590, in mqcommand
    return orig(ui, repo, *args, **kwargs)
  File "/opt/hg/lib/python/mercurial/util.py", line 1228, in check
    return func(*args, **kwargs)
  File "/opt/hg/lib/python/mercurial/commands.py", line 1963, in export
    opts=patch.diffallopts(ui, opts))
  File "/opt/hg/lib/python/mercurial/cmdutil.py", line 1555, in export
    repo, ctx, match, switch_parent, rev, seqno, write, opts)
  File "/opt/hg/lib/python/mercurial/cmdutil.py", line 1499, in _exportsingle
    for chunk, label in patch.diffui(repo, prev, node, match, opts=diffopts):
  File "/opt/hg/lib/python/mercurial/patch.py", line 2486, in difflabel
    for chunk in func(*args, **kw):
  File "/opt/hg/lib/python/mercurial/patch.py", line 2339, in diff
    losedatafn=losedatafn, prefix=prefix, relroot=relroot, copy=copy,
  File "/opt/hg/lib/python/mercurial/patch.py", line 2396, in diffhunks
    changes = repo.status(ctx1, ctx2, match=match)
  File "/opt/hg/lib/python/mercurial/localrepo.py", line 2055, in status
    listsubrepos)
  File "/opt/hg/lib/python/mercurial/context.py", line 362, in status
    listunknown)
  File "/opt/hg/lib/python/mercurial/context.py", line 131, in _buildstatus
    mf2 = self._buildstatusmanifest(s)
  File "/opt/hg/lib/python/mercurial/context.py", line 109, in _buildstatusmanifest
    return self.manifest()
  File "/opt/hg/lib/python/mercurial/context.py", line 189, in manifest
    return self._manifest
  File "/opt/hg/lib/python/mercurial/util.py", line 966, in __get__
    result = self.func(obj)
  File "/opt/hg/lib/python/mercurial/context.py", line 547, in _manifest
    return self._manifestctx.read()
  File "/opt/hg/lib/python/mercurial/manifest.py", line 1471, in read
    self._data = manifestdict(text)
  File "/opt/hg/lib/python/mercurial/manifest.py", line 423, in __init__
    self._lm = _lazymanifest(data)
ValueError: Manifest lines not in sorted order.

(from running `hg export` on a thusly corrupt revision)
Comment 1 Augie Fackler 2018-02-22 12:11 UTC
I *think* this might have been introduced in 1cbf144fd8a1. If I change the line

        if len(changes) < 1000:

in manifest's fastdelta(self, base, changes) method to be always-true with an "or True", I no longer get the corruption.

Adding Durham since he wrote the code and it also seems likely to be a risk for FB's large repo.
Comment 2 Durham Goode 2018-02-22 12:54 UTC
Does HGMODULEPOLICY=py mean that mdiff.textdiff runs a python version?  Seems like that would be the likely place for the bug.  Perhaps you could compare the fastdelta result with the mdiff.textdiff result and when they are different you can get the inputs to mdiff.textdiff to narrow down the repro.
Comment 3 Augie Fackler 2018-02-22 13:45 UTC
It does, but I just wrote a unit test using these exact texts and mdiff.textdiff is exonerated. If I dump the base and arraytext values to files in /tmp, the base values match between pure and c, but the arraytext values differ.

I'm technically on vacation today, but I'll probably keep poking at this because it's a baffling problem.
Comment 4 Bugzilla 2018-03-07 00:00 UTC
Bug marked urgent for 10 days, bumping
Comment 5 Bugzilla 2018-03-17 00:00 UTC
Bug marked urgent for 10 days, bumping
Comment 6 Bugzilla 2018-03-28 00:00 UTC
Bug marked urgent for 10 days, bumping
Comment 7 Bugzilla 2018-04-08 00:00 UTC
Bug marked urgent for 10 days, bumping
Comment 8 Bugzilla 2018-04-19 00:00 UTC
Bug marked urgent for 11 days, bumping
Comment 9 Bugzilla 2018-04-30 00:00 UTC
Bug marked urgent for 10 days, bumping
Comment 10 Kevin Bullock 2018-05-05 14:11 UTC
(In reply to Augie Fackler from comment #3)
Did you ever make any headway on this?

Downgrading to "normal" since it doesn't seem to be a widespread issue.
Comment 11 Bugzilla 2018-10-03 00:01 UTC
Bug was inactive for 150 days, archiving
Comment 12 Augie Fackler 2018-10-03 09:35 UTC
Still a thing that needs fixed.
Comment 13 Bugzilla 2019-03-03 00:01 UTC
Bug was inactive for 150 days, archiving
Comment 14 Augie Fackler 2019-03-04 10:02 UTC
Still an important issue.
Comment 15 Matt Harbison 2019-05-20 23:06 UTC
I bisected it back to here with pure on Windows, with a repo from timeless:

changeset:   30042:d24e03da24b5
user:        Maciej Fijalkowski <fijall@gmail.com>
date:        Mon Sep 12 13:37:14 2016 +0200
summary:     lazymanifest: write a more efficient, pypy friendly version of lazymanifest

It seems that what's happening is the first byte of some filenames are being overwritten with NUL:

$ hg -R ../../timeless_corruption/openj9 debugindex -m | tail -n 4
  7712    7716 70ba4155c5fa a9dfc50dd9e7 a3e82536e5ee
  7713    7717 eaef44749065 70ba4155c5fa 000000000000
  7714    7718 8a3514d0a092 eaef44749065 000000000000
  7715    7719 c698c9d83e22 8a3514d0a092 000000000000

$ hg -R ../../timeless_corruption/openj9 debugdata -m c698c9d83e22 | sed 's/\x0/*/g' | grep -n -A3 -B3 '^*'
1601-jcl/src/jdk.jcmd/share/classes/openj9/tools/attach/diagnostics/target/TargetDiagnosticsProviderImpl.java*94b507ba985856bf727dd30abee97fcb9016c984
1602-jcl/src/jdk.jcmd/share/classes/openj9/tools/attach/diagnostics/tools/JStack.java*676f860c8cdb91b62ed8c528abbf012c1a644e79
1603-jcl/src/jdk.jcmd/share/classes/openj9/tools/attach/diagnostics/tools/Jps.java*6151a513600ed5967f4f712ad001e8a95e65ef3e
1604:*cl/src/jdk.jcmd/share/classes/openj9/tools/attach/diagnostics/tools/Jstack.java*06bfc368388e7acd8cdb9af586eb61207530ce6b
1605-jcl/src/jdk.jlink/share/classes/jdk/tools/jlink/internal/plugins/GenerateJLIClassesPlugin.java*32511e395dd618fbef840a027c67f80fb5a89d15
1606-jcl/src/jdk.management/share/classes/com/ibm/lang/management/AvailableProcessorsNotificationInfo.java*8b66c4e8fd370c7c5d29c0dbb47f6d32aabeca23
1607-jcl/src/jdk.management/share/classes/com/ibm/lang/management/CpuLoadCalculationConstants.java*d95c95f886a80cb337acd17432ea07d03edeb495
--
7227-test/functional/Java8andUp/src/org/openj9/test/attachAPI/TestConstants.java*ebfa9130bc93ab296d40cbaf02ee3d53cda08260
7228-test/functional/Java8andUp/src/org/openj9/test/attachAPI/TestJStack.java*5ed2794399576ff742f1f28cbe60543b9023e72a
7229-test/functional/Java8andUp/src/org/openj9/test/attachAPI/TestJps.java*e5078daeb42f2ba5c8d8af252bf6ef30506aa515
7230:*est/functional/Java8andUp/src/org/openj9/test/attachAPI/TestJstack.java*fc38375ee5d956d6f417e1a98462db7b946f23d9
7231-test/functional/Java8andUp/src/org/openj9/test/attachAPI/TestLoadAgent.java*84e943730bbc2764df0ab67722b8111764e238bf
7232-test/functional/Java8andUp/src/org/openj9/test/attachAPI/TestLoadAgentLibrary.java*c9485f7a0887aafef6d524ecac2c0121771b07df
7233-test/functional/Java8andUp/src/org/openj9/test/attachAPI/TestLoadAgentPath.java*53a39a6d85b1671ca2fa98d511f0c899d10b041f

The filenames look otherwise properly sorted.
Comment 16 Matt Harbison 2019-05-20 23:34 UTC
(In reply to Augie Fackler from comment #1)

Not sure if I hit a different case, but neither True nor False in that conditional work for me on d24e03da24b5 or stable (ce5f1232631f).
Comment 17 Matt Harbison 2019-05-22 17:37 UTC
I checked the repo at work that hit this problem (unfortunately I can't share it), and this is almost certainly a combination of a lot of changes and renames in the mix.  Dumping the bad manifest showed the same stray NUL over some file names that timeless has.  100% of those bad names were the source of a move from the parent commit, and should have been dropped.  They weren't, and the filelog hash corresponded to what was in the parent manifest.  Some moved files did have their source properly removed though.

The `else` case in the conditional that Augie referenced in comment #1 is a straight up text diff, ignoring the list of removed files that got passed in, so it is relying on the fact that the deleted entries really are removed from the text.  _lazymanifest.text() is how this is created, and that calls _lazymanifest._compact().  Apply this diff, and behold the strangeness:

diff --git a/mercurial/manifest.py b/mercurial/manifest.py
--- a/mercurial/manifest.py
+++ b/mercurial/manifest.py
@@ -301,7 +301,13 @@ class _lazymanifest(object):
                 if end_cut != -1:
                     end_cut += 1
                 offset += end_cut - cur
-                l.append(self.data[last_cut:end_cut])
+
+                data = self.data[last_cut:end_cut]
+                if data[0] == '\x00' or '\n\x00' in data:
+                    print('adding bogus value from position[%d]==%d (%s)' % (i, cur, data))
+                elif data.count('\n') > 1:
+                    print('adding multiline position[%d]==%d (%s)' % (i, cur, data))
+                l.append(data)
             else:
                 while i < len(self.positions) and self.positions[i] < 0:
                     cur = self.positions[i]

That produces output that seems to show multiple lines being added at once:

    adding multiline position[1834]==96009 (ZModem/Src/zmr.c db4a010599d04d3824aade3bf1d632d37d2802ae
    ZModem/Src/zms.c 0cbe0742951eb444f4a761f40abe9cfd0c1e636d
    )

And in some cases, the deleted file entries are getting sucked in:

    adding bogus value from position[1725]==88875 (Usnet/Include/telnet.h 158a60d934b79be80af5311677524e509db59157
    Usnet/Include/uconsts.h f74a94d5be5f60399aae27801810426e915fba90
     snet/License.txt bd0e1dc9c722ed3b7d67e1e85d20550b91034951
    Usnet/Netsrc/SConscript 7145df3b66bd53771e0dd72ed27d18577d0ab177
    Usnet/Netsrc/arp.c 3d257f14797129c1d9783dcdb3b4dd92f5d99cd2
    Usnet/Netsrc/bootp.c d32912a8b066588604f446729e2631767b543fc5
    )

I'm not sure what this function is trying to do- __delitem__() is removing self.position entries, so I'd expect this to be a straight traversal.  That `while True` loop is doing something though, because if I drop it, the commit hangs.

There are no merges in the repo, and nothing exotic about it other than a lot of files and it was originally converted from bzr.  I'm triggering this with an hg -> hg conversion.
Comment 18 Matt Harbison 2019-05-22 20:57 UTC
(In reply to Matt Harbison from comment #16)

I double checked that hardcoding the conditional doesn't fix timeless's openj9 rebase issue.  But it is the same problem where NUL bytes are slipping into the text for removed files.  It's just getting there a different way, via this:

https://www.mercurial-scm.org/repo/hg/file/2338bdea4474/mercurial/manifest.py#l1511
Comment 19 HG Bot 2019-05-28 10:00 UTC
Fixed by https://mercurial-scm.org/repo/hg/rev/0546ead39a7e
Matt Harbison <matt_harbison@yahoo.com>
manifest: avoid corruption by dropping removed files with pure (issue5801)

Previously, removed files would simply be marked by overwriting the first byte
with NUL and dropping their entry in `self.position`.  But no effort was made to
ignore them when compacting the dictionary into text form.  This allowed them to
slip into the manifest revision, since the code seems to be trying to minimize
the string operations by copying as large a chunk as possible.  As part of this,
compact() walks the existing text based on entries in the `positions` list, and
consumed everything up to the next position entry.   This typically resulted in
a ValueError complaining about unsorted manifest entries.

Sometimes it seems that files do get dropped in large repos- it seems to
correspond to there being a new entry that would take the same slot.  A much
more trivial problem is that if the only changes were removals, `_compact()`
didn't even run because `__delitem__` doesn't add anything to `self.extradata`.
Now there's an explicit variable to flag this, both to allow `_compact()` to
run, and to avoid searching the manifest in cases where there are no removals.

In practice, this behavior was mostly obscured by the check in fastdelta() which
takes a different path that explicitly drops removed files if there are fewer
than 1000 changes.  However, timeless has a repo where after rebasing tens of
commits, a totally different path[1] is taken that bypasses the change count
check and hits this problem.

[1] https://www.mercurial-scm.org/repo/hg/file/2338bdea4474/mercurial/manifest.py#l1511

(please test the fix)
Comment 20 Augie Fackler 2019-05-28 10:05 UTC
Very impressive work Matt! I've gone through this a couple of times and never figured out the defect.
Comment 21 Matt Harbison 2019-05-28 11:04 UTC
(In reply to Augie Fackler from comment #20)

Thanks!

The various manifest corruption errors on the thg bug tracker had me really concerned it was a C extension problem, so I’m glad this got nailed down.  (Although I’m still slightly concerned because those were all Windows users, and none of the hg and thg packages should be using pure on Windows.  And timeless was pretty sure he was using cext on macOS the other day, though I couldn’t repeat it on Windows and the repo at work that gave me problems works fine with cext.)
Comment 22 Bugzilla 2019-06-05 00:00 UTC
Bug was set to TESTING for 7 days, resolving
Comment 23 Matt Harbison 2019-11-13 01:02 UTC
*** Bug 6052 has been marked as a duplicate of this bug. ***