The only steps to reproduce I have right now involve a repo from timeless that's a conversion of hg-git, but I can reliably build a corrupt commit with a merge of two revisions if I use HGMODULEPOLICY=py. If I use HGMODULEPOLICY=c, the problem doesn't happen. This does appear to be rare, and the resulting manifest can't even be checked out so it's obvious *right away* something is borked, but we should still address this promptly, probably with backports to older versions... Here's the traceback: Traceback (most recent call last): File "/opt/hg/bin/hg", line 41, in <module> dispatch.run() File "/opt/hg/lib/python/mercurial/dispatch.py", line 88, in run status = (dispatch(req) or 0) & 255 File "/opt/hg/lib/python/mercurial/dispatch.py", line 183, in dispatch ret = _runcatch(req) File "/opt/hg/lib/python/mercurial/dispatch.py", line 324, in _runcatch return _callcatch(ui, _runcatchfunc) File "/opt/hg/lib/python/mercurial/dispatch.py", line 332, in _callcatch return scmutil.callcatch(ui, func) File "/opt/hg/lib/python/mercurial/scmutil.py", line 154, in callcatch return func() File "/opt/hg/lib/python/mercurial/dispatch.py", line 314, in _runcatchfunc return _dispatch(req) File "/opt/hg/lib/python/mercurial/dispatch.py", line 917, in _dispatch cmdpats, cmdoptions) File "/opt/hg/lib/python/mercurial/dispatch.py", line 674, in runcommand ret = _runcommand(ui, options, cmd, d) File "/opt/hg/lib/python/mercurial/dispatch.py", line 925, in _runcommand return cmdfunc() File "/opt/hg/lib/python/mercurial/dispatch.py", line 914, in <lambda> d = lambda: util.checksignature(func)(ui, *args, **strcmdopt) File "/opt/hg/lib/python/mercurial/util.py", line 1228, in check return func(*args, **kwargs) File "/opt/hg/lib/python/mercurial/util.py", line 1228, in check return func(*args, **kwargs) File "/opt/hg/lib/python/hgext/mq.py", line 3590, in mqcommand return orig(ui, repo, *args, **kwargs) File "/opt/hg/lib/python/mercurial/util.py", line 1228, in check return func(*args, **kwargs) File "/opt/hg/lib/python/mercurial/commands.py", line 1963, in export opts=patch.diffallopts(ui, opts)) File "/opt/hg/lib/python/mercurial/cmdutil.py", line 1555, in export repo, ctx, match, switch_parent, rev, seqno, write, opts) File "/opt/hg/lib/python/mercurial/cmdutil.py", line 1499, in _exportsingle for chunk, label in patch.diffui(repo, prev, node, match, opts=diffopts): File "/opt/hg/lib/python/mercurial/patch.py", line 2486, in difflabel for chunk in func(*args, **kw): File "/opt/hg/lib/python/mercurial/patch.py", line 2339, in diff losedatafn=losedatafn, prefix=prefix, relroot=relroot, copy=copy, File "/opt/hg/lib/python/mercurial/patch.py", line 2396, in diffhunks changes = repo.status(ctx1, ctx2, match=match) File "/opt/hg/lib/python/mercurial/localrepo.py", line 2055, in status listsubrepos) File "/opt/hg/lib/python/mercurial/context.py", line 362, in status listunknown) File "/opt/hg/lib/python/mercurial/context.py", line 131, in _buildstatus mf2 = self._buildstatusmanifest(s) File "/opt/hg/lib/python/mercurial/context.py", line 109, in _buildstatusmanifest return self.manifest() File "/opt/hg/lib/python/mercurial/context.py", line 189, in manifest return self._manifest File "/opt/hg/lib/python/mercurial/util.py", line 966, in __get__ result = self.func(obj) File "/opt/hg/lib/python/mercurial/context.py", line 547, in _manifest return self._manifestctx.read() File "/opt/hg/lib/python/mercurial/manifest.py", line 1471, in read self._data = manifestdict(text) File "/opt/hg/lib/python/mercurial/manifest.py", line 423, in __init__ self._lm = _lazymanifest(data) ValueError: Manifest lines not in sorted order. (from running `hg export` on a thusly corrupt revision)
I *think* this might have been introduced in 1cbf144fd8a1. If I change the line if len(changes) < 1000: in manifest's fastdelta(self, base, changes) method to be always-true with an "or True", I no longer get the corruption. Adding Durham since he wrote the code and it also seems likely to be a risk for FB's large repo.
Does HGMODULEPOLICY=py mean that mdiff.textdiff runs a python version? Seems like that would be the likely place for the bug. Perhaps you could compare the fastdelta result with the mdiff.textdiff result and when they are different you can get the inputs to mdiff.textdiff to narrow down the repro.
It does, but I just wrote a unit test using these exact texts and mdiff.textdiff is exonerated. If I dump the base and arraytext values to files in /tmp, the base values match between pure and c, but the arraytext values differ. I'm technically on vacation today, but I'll probably keep poking at this because it's a baffling problem.
Bug marked urgent for 10 days, bumping
Bug marked urgent for 11 days, bumping
(In reply to Augie Fackler from comment #3) Did you ever make any headway on this? Downgrading to "normal" since it doesn't seem to be a widespread issue.
Bug was inactive for 150 days, archiving
Still a thing that needs fixed.
Still an important issue.
I bisected it back to here with pure on Windows, with a repo from timeless: changeset: 30042:d24e03da24b5 user: Maciej Fijalkowski <fijall@gmail.com> date: Mon Sep 12 13:37:14 2016 +0200 summary: lazymanifest: write a more efficient, pypy friendly version of lazymanifest It seems that what's happening is the first byte of some filenames are being overwritten with NUL: $ hg -R ../../timeless_corruption/openj9 debugindex -m | tail -n 4 7712 7716 70ba4155c5fa a9dfc50dd9e7 a3e82536e5ee 7713 7717 eaef44749065 70ba4155c5fa 000000000000 7714 7718 8a3514d0a092 eaef44749065 000000000000 7715 7719 c698c9d83e22 8a3514d0a092 000000000000 $ hg -R ../../timeless_corruption/openj9 debugdata -m c698c9d83e22 | sed 's/\x0/*/g' | grep -n -A3 -B3 '^*' 1601-jcl/src/jdk.jcmd/share/classes/openj9/tools/attach/diagnostics/target/TargetDiagnosticsProviderImpl.java*94b507ba985856bf727dd30abee97fcb9016c984 1602-jcl/src/jdk.jcmd/share/classes/openj9/tools/attach/diagnostics/tools/JStack.java*676f860c8cdb91b62ed8c528abbf012c1a644e79 1603-jcl/src/jdk.jcmd/share/classes/openj9/tools/attach/diagnostics/tools/Jps.java*6151a513600ed5967f4f712ad001e8a95e65ef3e 1604:*cl/src/jdk.jcmd/share/classes/openj9/tools/attach/diagnostics/tools/Jstack.java*06bfc368388e7acd8cdb9af586eb61207530ce6b 1605-jcl/src/jdk.jlink/share/classes/jdk/tools/jlink/internal/plugins/GenerateJLIClassesPlugin.java*32511e395dd618fbef840a027c67f80fb5a89d15 1606-jcl/src/jdk.management/share/classes/com/ibm/lang/management/AvailableProcessorsNotificationInfo.java*8b66c4e8fd370c7c5d29c0dbb47f6d32aabeca23 1607-jcl/src/jdk.management/share/classes/com/ibm/lang/management/CpuLoadCalculationConstants.java*d95c95f886a80cb337acd17432ea07d03edeb495 -- 7227-test/functional/Java8andUp/src/org/openj9/test/attachAPI/TestConstants.java*ebfa9130bc93ab296d40cbaf02ee3d53cda08260 7228-test/functional/Java8andUp/src/org/openj9/test/attachAPI/TestJStack.java*5ed2794399576ff742f1f28cbe60543b9023e72a 7229-test/functional/Java8andUp/src/org/openj9/test/attachAPI/TestJps.java*e5078daeb42f2ba5c8d8af252bf6ef30506aa515 7230:*est/functional/Java8andUp/src/org/openj9/test/attachAPI/TestJstack.java*fc38375ee5d956d6f417e1a98462db7b946f23d9 7231-test/functional/Java8andUp/src/org/openj9/test/attachAPI/TestLoadAgent.java*84e943730bbc2764df0ab67722b8111764e238bf 7232-test/functional/Java8andUp/src/org/openj9/test/attachAPI/TestLoadAgentLibrary.java*c9485f7a0887aafef6d524ecac2c0121771b07df 7233-test/functional/Java8andUp/src/org/openj9/test/attachAPI/TestLoadAgentPath.java*53a39a6d85b1671ca2fa98d511f0c899d10b041f The filenames look otherwise properly sorted.
(In reply to Augie Fackler from comment #1) Not sure if I hit a different case, but neither True nor False in that conditional work for me on d24e03da24b5 or stable (ce5f1232631f).
I checked the repo at work that hit this problem (unfortunately I can't share it), and this is almost certainly a combination of a lot of changes and renames in the mix. Dumping the bad manifest showed the same stray NUL over some file names that timeless has. 100% of those bad names were the source of a move from the parent commit, and should have been dropped. They weren't, and the filelog hash corresponded to what was in the parent manifest. Some moved files did have their source properly removed though. The `else` case in the conditional that Augie referenced in comment #1 is a straight up text diff, ignoring the list of removed files that got passed in, so it is relying on the fact that the deleted entries really are removed from the text. _lazymanifest.text() is how this is created, and that calls _lazymanifest._compact(). Apply this diff, and behold the strangeness: diff --git a/mercurial/manifest.py b/mercurial/manifest.py --- a/mercurial/manifest.py +++ b/mercurial/manifest.py @@ -301,7 +301,13 @@ class _lazymanifest(object): if end_cut != -1: end_cut += 1 offset += end_cut - cur - l.append(self.data[last_cut:end_cut]) + + data = self.data[last_cut:end_cut] + if data[0] == '\x00' or '\n\x00' in data: + print('adding bogus value from position[%d]==%d (%s)' % (i, cur, data)) + elif data.count('\n') > 1: + print('adding multiline position[%d]==%d (%s)' % (i, cur, data)) + l.append(data) else: while i < len(self.positions) and self.positions[i] < 0: cur = self.positions[i] That produces output that seems to show multiple lines being added at once: adding multiline position[1834]==96009 (ZModem/Src/zmr.c db4a010599d04d3824aade3bf1d632d37d2802ae ZModem/Src/zms.c 0cbe0742951eb444f4a761f40abe9cfd0c1e636d ) And in some cases, the deleted file entries are getting sucked in: adding bogus value from position[1725]==88875 (Usnet/Include/telnet.h 158a60d934b79be80af5311677524e509db59157 Usnet/Include/uconsts.h f74a94d5be5f60399aae27801810426e915fba90 snet/License.txt bd0e1dc9c722ed3b7d67e1e85d20550b91034951 Usnet/Netsrc/SConscript 7145df3b66bd53771e0dd72ed27d18577d0ab177 Usnet/Netsrc/arp.c 3d257f14797129c1d9783dcdb3b4dd92f5d99cd2 Usnet/Netsrc/bootp.c d32912a8b066588604f446729e2631767b543fc5 ) I'm not sure what this function is trying to do- __delitem__() is removing self.position entries, so I'd expect this to be a straight traversal. That `while True` loop is doing something though, because if I drop it, the commit hangs. There are no merges in the repo, and nothing exotic about it other than a lot of files and it was originally converted from bzr. I'm triggering this with an hg -> hg conversion.
(In reply to Matt Harbison from comment #16) I double checked that hardcoding the conditional doesn't fix timeless's openj9 rebase issue. But it is the same problem where NUL bytes are slipping into the text for removed files. It's just getting there a different way, via this: https://www.mercurial-scm.org/repo/hg/file/2338bdea4474/mercurial/manifest.py#l1511
Fixed by https://mercurial-scm.org/repo/hg/rev/0546ead39a7e Matt Harbison <matt_harbison@yahoo.com> manifest: avoid corruption by dropping removed files with pure (issue5801) Previously, removed files would simply be marked by overwriting the first byte with NUL and dropping their entry in `self.position`. But no effort was made to ignore them when compacting the dictionary into text form. This allowed them to slip into the manifest revision, since the code seems to be trying to minimize the string operations by copying as large a chunk as possible. As part of this, compact() walks the existing text based on entries in the `positions` list, and consumed everything up to the next position entry. This typically resulted in a ValueError complaining about unsorted manifest entries. Sometimes it seems that files do get dropped in large repos- it seems to correspond to there being a new entry that would take the same slot. A much more trivial problem is that if the only changes were removals, `_compact()` didn't even run because `__delitem__` doesn't add anything to `self.extradata`. Now there's an explicit variable to flag this, both to allow `_compact()` to run, and to avoid searching the manifest in cases where there are no removals. In practice, this behavior was mostly obscured by the check in fastdelta() which takes a different path that explicitly drops removed files if there are fewer than 1000 changes. However, timeless has a repo where after rebasing tens of commits, a totally different path[1] is taken that bypasses the change count check and hits this problem. [1] https://www.mercurial-scm.org/repo/hg/file/2338bdea4474/mercurial/manifest.py#l1511 (please test the fix)
Very impressive work Matt! I've gone through this a couple of times and never figured out the defect.
(In reply to Augie Fackler from comment #20) Thanks! The various manifest corruption errors on the thg bug tracker had me really concerned it was a C extension problem, so I’m glad this got nailed down. (Although I’m still slightly concerned because those were all Windows users, and none of the hg and thg packages should be using pure on Windows. And timeless was pretty sure he was using cext on macOS the other day, though I couldn’t repeat it on Windows and the repo at work that gave me problems works fine with cext.)
Bug was set to TESTING for 7 days, resolving
*** Bug 6052 has been marked as a duplicate of this bug. ***