Analysis of issue 1327

Sune Foldager cryo at
Fri Nov 6 13:49:55 CST 2009

Matt Mackall wrote:

> You're glossing over an important detail here, which is linkrevs. For
> our readers, I'll expand on it. [...]

I know quite well how they work, so I'll snip this part...

> Where we get into trouble is when we make incorrect assumptions about
> the properties of the schema.

Sure, I guess... but then almost all Mercurial code involving files makes
these incorrect assumptions.

> and merge them, we get the "right" *local* result: no duplicates. But
> globally, things can be confused. If the identical file-level changes
> are part of different higher level changes (other files, permissions,
> user, branch, date, description, etc.) then the same file change will be
> part of more than one changeset, something the schema can't directly
> record in the single linkrev slot we have available.

This is one way to look at it. I prefer to look at it like this: The
nodes have been incorrectly merged in the filelog, where they should
have been separate nodes.
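To make the merging concrete, here is a minimal sketch of Mercurial-style
node hashing (the real code is in mercurial/revlog.py; the function name
here is illustrative): the node is the SHA-1 of the sorted parents plus the
text, so two changesets that make the identical change to the identical
parent produce the identical filelog node, and only one copy (with one
linkrev) is stored.

```python
import hashlib

NULLID = b"\0" * 20  # the null revision id

def filelog_node(text, p1=NULLID, p2=NULLID):
    """SHA-1 over the sorted parent ids followed by the file text."""
    a, b = sorted([p1, p2])
    return hashlib.sha1(a + b + text).digest()

# Two changesets making the same file change against the same parent
# hash to the same node, so the filelog keeps a single entry whose one
# linkrev slot can only point at the first changeset.
n1 = filelog_node(b"print('hello')\n")
n2 = filelog_node(b"print('hello')\n")
assert n1 == n2
```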


> I'm afraid such a salt is insufficient. Ideally, you'd want to put the
> changeset's hash id in here, but we haven't calculated that yet.

We know, which is why we salt with part of the data used to calculate it.
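As a toy illustration of the salting idea (not Mercurial's actual
implementation; `node` and `salted_node` are hypothetical names): mixing
changeset metadata that is known before the changeset hash exists, such as
user, date, and description, into the hashed input makes otherwise-identical
file revisions come out as distinct nodes.

```python
import hashlib

NULLID = b"\0" * 20

def node(text, p1=NULLID, p2=NULLID):
    a, b = sorted([p1, p2])
    return hashlib.sha1(a + b + text).digest()

def salted_node(text, salt, p1=NULLID, p2=NULLID):
    # Hypothetical: prepend changeset metadata (user|date|description)
    # so identical file changes in different changesets get different
    # filelog nodes and therefore separate linkrev slots.
    return node(salt + text, p1, p2)

assert salted_node(b"x\n", b"alice|2009-11-06|fix") != \
       salted_node(b"x\n", b"bob|2009-11-07|backout")
```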

> And in
> fact, we can't: there's a cryptographically strong chicken and egg
> problem there.

We are aware of that also ;-).

> The above salt will break if, say, you import identical
> exported patches, except one's a git patch with a +x bit, and another in
> a traditional patch that loses the bit. Now you've got two different
> changesets still sharing file revisions, and when you try to back one of
> them out, you're back where you started.
> I don't think this problem is fixable, but I'm just going to assume it
> is for the sake of argument for the rest of the message.

By including more data in the salt, perhaps. This example is highly
contrived, IMO, and this problem isn't unique to this solution, is it?

> Alternately, your salt could be a random number, but this would make
> identical patches applied appear distinct, which is undesirable.


>> in an appropriate place. This, however, means that we need to strip away
>> the metadata for each file read, using a Python substring, making it
>> possibly slower. We, furthermore, need to modify
>> filelog.cmp to always take the slow path for the same reason.
> Both of these overheads are significant, especially for larger files.
> The other downside is that it means that you've now got "duplicate"
> revisions in the filelog. If these aren't immediately adjacent, the size
> impact can be significant. Also, this means adding some extra
> uncompressible data to every file revision.

You only get those for backouts and cherry-picks, though. I am sure the
performance overhead is significant, but how about the performance
overhead involved in not using the filelog for history information anymore?

> Also note that this approach slows down EVERY SINGLE FILE REVISION
> ACCESS to fix a fairly rare case (identical file revisions in different
> changesets with the same parent file revisions).

Sadly, the case is far more common than it would seem. We hit it all the
time at work, and there are many of us working on large repos. We realize,
of course, that this slows down all access.

> Even if it's a 1%
> overhead (and I'm sure it's bigger than that), on the net it's a loss.
> Is it more important that diff and update be fast, or annotate?

Or hg log <filename>, which would probably suffer the most? I don't
know. But it's not a rare issue, at least.

> Unfortunately, it's a hack to the -schema- as opposed to a hack to the
> code, which means we're stuck with its impacts going forward. All things
> being equal, code hacks are much preferable.

Yes, I agree.

> All things aren't equal here, of course, so let's weigh the pros and
> cons. My preferred approach right now is to fix only case (a) merge by
> calculating the file ancestor more carefully. On the plus side:
> - this gets merges right with people's -current history-
> - is a reasonably simple, isolated change that might be doable for 1.4
> - has no impact on performance or storage size elsewhere
> On the downside:
> - it doesn't fix (b) "hg annotate <file>"
> - it doesn't fix (c) "hg log <file>"
> On the other hand, both of those issues are relatively minor. If we want
> to make the latter more accurate (and show deletes), we probably need to
> skip the fast path anyway. 

Yes, deletes skip the fast path currently. It also misses the graph
relationship, but I don't think that's an inherent property.

> I'm unconvinced. I'm going to take a stab at fixing it my way before
> 1.4, we can talk about this problem further later.

I was pretty sure you would be ;-). At any rate, my real preference is
2.1.2, involving code and revlog changes. The rare issue with backing out
one of two almost-identical patches remains there, of course, but it
remains with the way you're going to solve it as well.


More information about the Mercurial-devel mailing list