Analysis of issue 1327

Matt Mackall mpm at selenic.com
Fri Nov 6 13:29:39 CST 2009


On Fri, 2009-11-06 at 17:07 +0100, Sune Foldager wrote:
> This mail describes the issue of file revlogs not being a subgraph of
> the changelog, and explores several solutions for rectifying, or
> mitigating, the problem. (The problem has been reported as issue1327
> in terms of merge being wrong, but this mail describes its root
> cause.)
> 
> Since this is a rather long mail, and we do hope that people who care
> about Mercurial's inner workings actually want to read it through to the
> end, we provide a TOC to make the mail more manageable (hopefully). The
> sections are in all caps so it should be relatively painless to search
> for them.
> 
> 1. BACKGROUND
> 2. SOLUTIONS
> 2.1. FIXING THE FILE REVLOG
> 2.1.1. NON-BREAKING METHOD: USING EXISTING METADATA CAPABILITIES
> 2.1.2. INTRODUCING A NEW REVLOG SUB-FORMAT SUPPORTING METADATA AND SALT
> 2.1.3. THE FUTURE: OTHER FORMATS
> 2.2. FIXING USES OF THE FILE REVLOG
> 3. SUMMARY
> A. WHAT WE'RE GONNA DO, IN-HOUSE
> 
> Each subsection of the solutions will also present some advantages and
> disadvantages of the individual solution, and we will try to make some
> coherent recommendation in the summary. So grab some cookies and a
> hot drink. :o)
> 
> 
> 1. BACKGROUND
> 
> In short, the following very simple script illustrates what is wrong:
> 
> hg init a
> cd a
> echo foo > foo
> hg ci -Amfoo
> echo bar > foo
> hg ci -mbar
> hg backout -r tip -mbackout
> hg up 0
> echo bar > foo
> hg ci -mbar2
> 
> This produces the following changelog graph:
> 
> @  3: bar2
> |
> | o  2: backout
> | |
> | o  1: bar
> |/
> o  0: foo
> 
> Notice in particular that 3 is not an ancestor of 2. If we, however,
> look at the file revlog it produces the following graph:
> 
> o  2: backout
> |
> o  1: bar
> |
> o  0: foo
> 
> Oops.
> 
> The reason this happens is that a change's node is computed as
> "hash(text, p1, p2)", where (p1, p2) are the filelog's parents. Since
> the text and the parents are identical for the change to foo in the
> changelog's revisions 1 and 3, the latter is treated as the exact
> same commit as the former, which it is not.

You're glossing over an important detail here, which is linkrevs. For
our readers, I'll expand on it. Filelogs only know about file contents,
and from the viewpoint of file contents and their parent revisions,
filelogs are in fact as correct as they can be given the information
they have available to them.

Linkrevs connect filelog entries "up the tree" to the earliest changeset
they appear in. In your graphs, the strings you're showing are the
changeset description associated with the linkrev for that revision. 
So again, your filelog graph is correct as far as that goes.
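
To make that concrete, here's a minimal sketch of walking a filelog
and its linkrevs (assuming the API of roughly this vintage, run from
inside the example repo above):

    from mercurial import hg, ui

    repo = hg.repository(ui.ui(), '.')
    fl = repo.file('foo')            # the filelog for 'foo'
    for rev in range(len(fl)):
        lr = fl.linkrev(rev)         # first changeset that introduced
        ctx = repo[lr]               # this file revision
        print('%d -> changeset %d: %s' % (rev, lr, ctx.description()))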

Where we get into trouble is when we make incorrect assumptions about
the properties of the schema. The schema was actually designed with an
eye to avoiding a very similar problem: this is why we hash in the
parent revs. This lets us avoid issues like:

foo->bar->foo
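
For reference, the hashing here is roughly what revlog.py does: the
parents are sorted first, so the node doesn't depend on their order.
A sketch:

    import hashlib

    def hash(text, p1, p2):
        # shadows the builtin, as revlog.py itself does; p1/p2 are
        # 20-byte parent nodes, sorted so their order doesn't matter
        s = hashlib.sha1(min(p1, p2))
        s.update(max(p1, p2))
        s.update(text)
        return s.digest()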

The second foo is distinguishable from the first by virtue of having bar
as a parent. If we have two people separately make the graphs

foo->bar->foo

foo->bar->foo

and merge them, we get the "right" *local* result: no duplicates. But
globally, things can be confused. If the identical file-level changes
are part of different higher level changes (other files, permissions,
user, branch, date, description, etc.) then the same file change will be
part of more than one changeset, something the schema can't directly
record in the single linkrev slot we have available. 

The linkrev is not "wrong"; it simply points to the first changeset
containing the change. Usually, but not always, that is the only such
changeset.

The three operations where we currently use file level DAGs and linkrev
are:

a) file-level merges, for finding the ancestor

This is what the cited issue is about. Here we're currently ignoring
that a given file revision might be part of multiple changesets and
that the file-level history is a (not-quite) subgraph of the changeset
graph. We should generally defer to the changeset graph here; no
tinkering with the file graph is really needed.

b) annotating a file

This can get confused if you have identical changes on unmerged
branches. On merged branches (and with backouts), the output will be
correct in the sense that it will report the first appearance of lines.

c) showing logs of single files

Similarly, this will also report only the first of identical changes,
and notably not report backouts. It also won't report deletions, because
deletions happen at the manifest level.

So we've got three separate issues:

1) file revisions don't map to a unique changeset
2) the file DAG is not a proper subgraph of the changeset DAG
3) deletions aren't represented at the file level

Note that all the required information to avoid problems in (a), (b) and
(c) *is* in the tree. We can still always follow the pointers down from
the top-level to the file level to find all the changesets, a more
expensive operation. Git, which has nothing equivalent to linkrevs,
has to do this for all its file-level operations, which means some of
them (e.g. annotate) are much slower.
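
As a sketch of that more expensive operation (a hypothetical helper,
not an existing function):

    def changesets_with_filenode(repo, path, fnode):
        # Brute force: scan every changeset and yield the ones whose
        # manifest carries exactly this file revision, ignoring linkrev.
        for rev in range(len(repo)):
            ctx = repo[rev]
            if path in ctx and ctx.filenode(path) == fnode:
                yield rev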

> 2.1.1. NON-BREAKING METHOD: USING EXISTING METADATA CAPABILITIES
> 
> Currently the filelogs have metadata that tracks whether the file has
> been copied and the revision it has been copied from. If such metadata
> exists, the file contents text is rewritten to
> "\1\n{metadata}\1\n{actual data}". This is the actual text that is
> hashed, so if we e.g. include a hash of the date, commit message, and
> the user name in the metadata, the two different commits will be
> disambiguated, and the file revlog graph will, in our example, be
> identical to the changelog graph.
> 
> This means, in pseudo notation, that in localrepository._filecommit
> we would add:
> "meta['salt'] = sha1(ctx.date()[0], ctx.description(), ctx.user())"

I'm afraid such a salt is insufficient. Ideally, you'd want to put the
changeset's hash id in here, but we haven't calculated that yet. And in
fact, we can't: there's a cryptographically strong chicken and egg
problem there. The above salt will break if, say, you import identical
exported patches, except one is a git patch with a +x bit and the other
a traditional patch that loses the bit. Now you've got two different
changesets still sharing file revisions, and when you try to back one of
them out, you're back where you started.

I don't think this problem is fixable, but I'm just going to assume it
is for the sake of argument for the rest of the message.
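
For concreteness, here's the proposed salt spelled out; a sketch only,
and since hashlib.sha1 takes a single string, the fields need to be
joined with a delimiter:

    import hashlib

    def salt(ctx):
        # NUL-delimited so field boundaries stay unambiguous
        data = '%d\0%s\0%s' % (ctx.date()[0], ctx.description(),
                               ctx.user())
        return hashlib.sha1(data).digest()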

Alternately, your salt could be a random number, but this would make
separately applied identical patches appear distinct, which is
undesirable.

(When we discovered this linkrev problem about four years ago, I
realized my parent hashing trick was probably a mistake, because we only
really care about it at the topmost level and including the parent
hashes directly in the changelog contents would have dealt with it
there. Something for MercurialNG, perhaps.)

> in an appropriate place. This, however, means that we need to strip
> away the metadata on each file read, using a Python substring, which
> may make filelog.read slower. We furthermore need to modify
> filelog.cmp to always take the slow path, for the same reason.

Both of these overheads are significant, especially for larger files.
The other downside is that it means that you've now got "duplicate"
revisions in the filelog. If these aren't immediately adjacent, the size
impact can be significant. Also, this means adding some extra
incompressible data to every file revision.

Also note that this approach slows down EVERY SINGLE FILE REVISION
ACCESS to fix a fairly rare case (identical file revisions in different
changesets with the same parent file revisions). Even if it's a 1%
overhead (and I'm sure it's bigger than that), on the net it's a loss.
Is it more important that diff and update be fast, or annotate?
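
To make the overhead concrete, the envelope parsing is roughly this (a
sketch, not the exact in-tree code):

    def split_meta(text):
        # texts carrying metadata look like "\1\n{metadata}\1\n{data}"
        if not text.startswith('\x01\n'):
            return {}, text
        end = text.index('\x01\n', 2)
        meta = {}
        for line in text[2:end].splitlines():
            key, value = line.split(': ', 1)
            meta[key] = value
        return meta, text[end + 2:]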

> Doing this means that old and new clients can coexist seamlessly on the
> same repository. Old clients will still write the collated revlogs if
> "hash(text,p1, p2)" are the same (i.e. if the file has the same contents
> and parents), but if the other revision has been written by a new
> client, the revisions will not be collated.
> 
> It does not, however, give correct merges for already-written
> revisions that have been erroneously collated, nor will it give
> correct annotations or graphlogs for those revisions.
> 
> The performance issues introduced by having to strip away metadata and
> do string comparisons for each .cmp call could be mitigated by moving
> the filelog data reading into C, returning a string tuple with the
> metadata and file contents respectively. This will provide much of
> the speed of the revlog change (see below), while still working with
> older clients.

Actually, I don't think it can, as Python strings 'own' their buffers.
This also doesn't help the .cmp call, which can usually skip
reconstructing revisions entirely. And it actually gets used a fair
amount.
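
For context, the fast path in question is essentially this sketch of
revlog.cmp:

    def cmp(self, node, text):
        # 'hash' is the revlog hash from the earlier sketch, not the
        # builtin. Returns True iff the candidate text differs from
        # the stored revision; no delta-chain reconstruction needed.
        p1, p2 = self.parents(node)
        return hash(text, p1, p2) != node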

> In short, these are the advantages of this approach:
> - No format changes; old and new clients can co-exist
> - Few and rather simple code changes
> 
> Disadvantages:
> - Performance regression for metadata access
> - Kinda hackish (but hardly more than current metadata access).

Unfortunately, it's a hack to the -schema- as opposed to a hack to the
code, which means we're stuck with its impacts going forward. All things
being equal, code hacks are much preferable.

All things aren't equal here, of course, so let's weigh the pros and
cons. My preferred approach right now is to fix only case (a), merge,
by calculating the file ancestor more carefully (a sketch of that fix
follows the pros and cons below). On the plus side:

- this gets merges right with people's -current history-
- is a reasonably simple, isolated change that might be doable for 1.4
- has no impact on performance or storage size elsewhere

On the downside:

- it doesn't fix (b) "hg annotate <file>"
- it doesn't fix (c) "hg log <file>"

On the other hand, both of those issues are relatively minor. If we want
to make the latter more accurate (and show deletes), we probably need to
skip the fast path anyway. 
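
The shape of the fix I have in mind is roughly this (a hypothetical
helper; the point is to take the ancestor from the changeset DAG
rather than from the collapsed file DAG):

    def file_merge_base(c1, c2, path):
        # changeset-level common ancestor first...
        ca = c1.ancestor(c2)
        # ...then use its version of the file as the merge base,
        # instead of asking the filelog DAG directly
        if path in ca:
            return ca[path]
        return None              # file absent in the ancestor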

So that leaves us with slowing down the rarely used annotate command to
fix an uncommon and relatively harmless bug vs a schema change with
performance and size impacts across the board.

I'm unconvinced. I'm going to take a stab at fixing it my way before
1.4; we can talk about this problem further later.

-- 
http://selenic.com : development and support for Mercurial and Linux



