Analysis of issue 1327

Fri Nov 6 10:07:04 CST 2009

This mail describes the issue of file revlogs not being a subgraph of
the changelog, as well as it explores several solutions for rectifying,
or mitigating, the problem. (The problem has been reported as issue1327
in terms of merge being wrong, but this mail describes the root cause
for it).

Since this is a rather long mail, and we do hope that people who care
about Mercurial's inner workings actually want to read it through to the
end, we provide a TOC to make the mail more manageable (hopefully). The
sections are in all caps so it should be relatively painless to search
for them.

1. BACKGROUND
2. SOLUTIONS
2.1. FIXING THE FILE REVLOG
2.1.1. NON-BREAKING METHOD: USING EXISTING METADATA CAPABILITIES
2.1.2. INTRODUCING A NEW REVLOG SUB-FORMAT SUPPORTING METADATA AND SALT
2.1.2. THE FUTURE: OTHER FORMATS
2.2. FIXING USES OF THE FILE REVLOG
3. SUMMARY
A. WHAT WE'RE GONNA DO, IN-HOUSE

Each subsection of the solutions will also present some advantages and
disadvantages of the individual solution, and we will try to make some
coherent recommendation in the summary. So grab some cookies and a
hot drink. :o)

1. BACKGROUND

In short, the following very simple script illustrates what is wrong:

hg init a
cd a
echo foo > foo
hg ci -Amfoo
echo bar > foo
hg ci -mbar
hg backout -r tip -mbackout
hg up 0
echo bar > foo
hg ci -mbar2

This produces the following changelog graph:

@  3: bar2
|
| o  2: backout
| |
| o  1: bar
|/
o  0: foo

Notice in particular that 3 is not an ancestor of 2. If we, however,
look at the file revlog it produces the following graph:

o  2: backout
|
o  1: bar
|
o  0: foo

Oops.

The reason this happens is because a change's node is computed as
"hash(text,p1, p2)", where (p1, p2) are the filelog's parents. Seeing as
p1 and p2 are the same for the change of foo in the changelog's
revisions 1 and 3, the latter will be seen as being the exact identical
commit to the first one, which it is not.

This might be seen as a problem for just this simple toy case, but
imagine you have a very large repository with tens of thousands of
files, an individual file may not be touched in hundreds of revisions
and this will trigger the same behaviour as the parents are the
filelog's parents and not the changelog's parents.

If we try to merge our two heads, then according to the current filelog
there are no multiple heads and a merge is not required, thus the result
is merely that the backout wins. This is rather surprising if you try to
reason about the contents of the files according to the changelog graph
as one head contains "foo", another contains "bar", and the ancestor
contains "foo", then normal merging rules dictate that the result ought
to be "bar". (One could argue that it would be preferable for the
backout to win in this case, but that sidesteps the issue of the file
revlog being wrong, i.e. there has been introduced an ancestral
relationship in the filelog graph that simply doesn't exist; also, it
would be far better if this case produced a merge conflict rather than
a simple merge, but that requires a closer cooperation with the merger
to work and is beside the point of this mail; in general, there is no
'correct' way to merge things like this, but in order to make an
informed decision, a future merge program, will at least need the
correct relationships.)

2. SOLUTIONS

In general, there are two ways to address this: Fix the filelogs to be
correct, or fix all the uses of the filelog. The first can be done
without breaking compatibility, albeit at a performance cost. It can
also be done much better by introducing slight (flag-controlled) changes
to the revlog format, which will as a side-effect optimize metadata
access (currently used for file copies). Both solutions will be
described below.

Fixing everything else is much more complicated, as the Mercurial code
throughout tacitly assumes that the filelog has the 'subgraph property',
i.e. that it's a subgraph of the changelog. Note, that having this
property certainly has its benefits, most notably efficient access to
filelogs when using hg log <filename>. In this case, this is also the
problem: There are quite a few places where the code goes directly
through the filelog, instead of via the changelog, and merge is just one
of them. Changing this will also incur a performance penalty for all
those operations (more or less; merge should be the same, for instance).

Finally, note that the manifest revlog currently has the same property,
but since the graph structure in it is never really used for anything,
this is not a problem.

2.1. FIXING THE FILE REVLOG

Both methods discussed below are based on "salting the file revlog",
i.e. making sure the hash is taken over more than the file data and the
parents. What exactly the salt should be is somewhat orthogonal to the
design, but we assume that it's based on commit data (message, date,
extra dictionary...), so as to uphold the following properties:

P1: If two commits with same parents are exactly identical, the revlogs
    on all three levels should be merged.

P2: If two commits with same parents (and files) differ in some way, be
    it date, branch, description or otherwise, no revlog merges should
    occur on any level.

Property P1 is nice to avoid duplicate commits. It holds in the current
system, and will hold with the solutions discussed below as well. P2
doesn't hold in the current system, but will (to an extent which can be
controlled by choosing what to include in the salt) hold in the
solutions below.

The salt can be a SHA1 of the chosen data, and used to update the file
revlog hash in some way ranging from simple exclusive-or to
a SHA1-update.

2.1.1. NON-BREAKING METHOD: USING EXISTING METADATA CAPABILITIES

Currently the filelogs have metadata that tracks whether the file has
been copied and the revision it has been copied from. If such metadata
exists, the file contents text will be rewritten to
"\1\n{metadata}\1\n{actual data}" This is the actual text that is hashed
so if we e.g. include a hash of the date, commit message, and the user
name in the metadata, the two different commits will be disambiguated,
and the file revlog graph will, in our example, be identical to the
changelog graph.

This means, in pseudo notation, that in localrepository._filecommit we
would add a:
"meta['salt'] = sha1(ctx.date()[0], ctx.description(), ctx.user())"
in an appropriate place. This, however, means that we need to strip away
the metadata for each file read, using a Python substring, making it
possibly slower (in filelog.read). We, furthermore, need to modify
filelog.cmp to always take the slow path for the same reason.

Doing this means that old and new clients can coexist seamlessly on the
same repository. Old clients will still write the collated revlogs if
"hash(text,p1, p2)" are the same (i.e. if the file has the same contents
and parents), but if the other revision has been written by a new
client, the revisions will not be collated.

It does, however, not give correct merges for already written revisions
that have been erroneously collated, nor will it give correct
annotations or graphlogs got those revisions.

The performance issues introduced by having to strip away metadata and
do string comparisons for each .cmp call could be mitigated by moving
the filelog data reading into C, returning a string tuple with the
metadata and file contents respectively. This will provide a good deal
of the same speed of the revlog change (see below), but without the
disadvantage of not working with older clients.

In short, these are the advantages of this approach:
- No format changes; old and new clients can co-exist
- Few and rather simple code changes

Disadvantages:
- Performance regression for metadata access
- Kinda hackish (but hardly more than current metadata access).

2.1.2. INTRODUCING A NEW REVLOG SUB-FORMAT SUPPORTING METADATA AND SALT

A better (with respect to design, prettyness and performance) approach
is to change the revlog format slightly to support metadata in a native
way. This can be done in the following way:

1. Introduce a new global revlog flag: HAS_METADATA
2. Use the current unused per-entry flag field (16 bits) for metadata
   length when that flag is set; the flags are then moved down to the
   start of the metadata (or dropped, since they are unused at the
   moment).
3. Introduce a new global revlog flag: SALTED
   (which must be used with HAS_METADATA)

The local flags field will now encode the metadata length, in bytes,
which allows for 64kiB of metadata, which should be much more than
needed. Metadata will come before the actual filedata, which no
delimiters needed, since the size is now known. Access will be efficient
for the same reason. If the log is salted, metadata will start with the
salt in a well-known format (and size), and can be removed silently by
revlog so the metadata API doesn't see it.

The same for the local flags, if we move those down. Depending on taste,
metadata can be kept in the .i always, or moved to .d if there is a .d
file. Also depending on taste and further discussions, metadata can be
compressed or just be prepended after compression. Apart from salt,
metadata will only be used for file copies (which is the only thing it's
used for currently).

Compared to 2.1.1 there are a number of advantages to this approach:
- Revlogs will have metadata as a first-class concept, which may have
  uses in the future.
- Access to salt and other metadata will be efficient; more so
  than today.
- It's not really a hack in the same way current metadata seems to be.
- We can add another global revlog flag: HAS_GLOBALMETADATA while we're
  at it, since we're breaking the format anyway. Such global metadata
  can then, later on, be efficiently used for light-weight copies.
  Global metadata would presumably be placed in the index file, before
  the first revision entry.

There are obvious drawbacks as well:
- Adding new flags makes revlogs incompatible for reading.
- It will need a new requirement as well, since backwards wire-
  compatibility will be hard to achieve with this method.

2.1.2. THE FUTURE: OTHER FORMATS

This is a topic of much debate, I am sure, but I personally think that
we should, on the long term, move towards a solution a bit like this:

- A revlog containing the changelog and possibly manifest. No real
  reason for the two levels, as I see it.
- A revstore for file data, supporting multiple objects (one per file),
  so we can do delta-compression like we do now, and light-weight
  renames. Either by giving files a unique id on birth and track renames
  in the manifest, or by using revstore metadata.
- If the manifest is not combined into the changelog, revstore can be
  used for it as well. Revstore should of course support 'parent diff',
  as should revlog.

The disadvantage is again that logs of individual files will be slower,
so if that's an important issue we could retain a (correct) revlog for
each file instead, and still do renaming by assigning unique id's to
files at birth. There are many possibilities.

2.2. FIXING USES OF THE FILE REVLOG

The other approach is to accept that the file revlog has a broken graph,
and work around it; essentially by never using its graph structure like
the case with the manifest. As already discussed, this requires changing
many places where the file revlog is assumed to be a subgraph of the
changelog. This would also be what the user expects, if she reads hg
help log and similar. It doesn't say anything about hg log <filename>
showing an approximate filelog, or similar.

We will not go into much detail with this approach, which we see as less
desirable, although it does have an obvious advantage:
- New and old clients can co-exits. No format changes are required.

But:
- Performance regression in all access paths which start with
  the file revlog.
- A lot of code changes required, in some cases somewhat complex.

Of course one can settle for a compromise, fixing only merge, say, and
leaving the others to fight for the scraps :-)

3. SUMMARY

All approaches, like ever so often, has its pros and cons. If it's
important to fix things quickly for new changesets, (2.1.1) could be
used. If it's also important to fix it quickly for old changesets,
perhaps (2.2) as well just for merges, to keep it simple (although the
merge code is complex) and non-regressing.

On the other hand, a design based on (2.1.2) would fix the revlogs in
a nice non-regressing way, and at the same time make metadata access
faster for the existing use (currently file copies), while of course not
fixing it 'in the past'.

It would also enable us to prepare revlogs for global metadata useful
for future light-weight copy information.

A. WHAT WE'RE GONNA DO, IN-HOUSE

Once the revlog (or at least ancestor choosing) problem is fixed, we'll
experiment (at work) with extending kdiff3 in the following way: A new
command line argument which receives the three changelog hashes (which
are in env-vars), and, then, when it encounters a line in a file
like this:

(base)  (local)  (other)
   X       Y        X

it will consult the three hashes: If hash for base and other are the
same, resolve in favor of local. Otherwise, generate a conflict. This
will generate conflicts for all backout cases, for which there is no
plausible way for the merger to decide what to pick. Currently, kdiff3
would always pick local.

To get this up and running quickly, we'll initially implement (2.1.1) at
work, and patch kdiff3 to see how it works. The present situation is
untenable for us. Longer term, we would prefer something like (2.1.2).

-- 
Sune Foldager
Henrik Stuart