Future of copy metadata

Gregory Szorc gregory.szorc at gmail.com
Sun Dec 18 16:31:25 EST 2016


Mercurial currently stores file copy/rename metadata as a "header" in
filelog revision data. Furthermore, there is some wonkiness with p1 and p2
in the filelog when copies are at play (see _filecommit() in localrepo.py).
This metadata means copies/renames can be followed without expensive
run-time "similarity" detection, which is great, especially for large
repositories.
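
To illustrate the "header" mentioned above, here is a minimal sketch of how such metadata could be separated from file content. It assumes the header is framed by a pair of `\x01\n` markers and holds `key: value` lines with `copy` and `copyrev` keys. The function name `parse_filelog_meta` is hypothetical and this is not Mercurial's own code.

```python
# Sketch only: the framing and key names below are assumptions about the
# filelog format, and parse_filelog_meta is a hypothetical helper.
META_MARKER = b"\x01\n"


def parse_filelog_meta(raw: bytes):
    """Split raw filelog revision data into (metadata dict, file content)."""
    if not raw.startswith(META_MARKER):
        return {}, raw  # no header present: everything is file content
    # Find the marker that closes the header, skipping the opening one.
    end = raw.index(META_MARKER, len(META_MARKER))
    meta = {}
    for line in raw[len(META_MARKER):end].splitlines():
        key, value = line.split(b": ", 1)
        meta[key.decode()] = value.decode()
    return meta, raw[end + len(META_MARKER):]


# Made-up revision data for a file recorded as a copy of old/name.txt.
raw = (
    b"\x01\n"
    b"copy: old/name.txt\n"
    b"copyrev: " + b"a" * 40 + b"\n"
    b"\x01\n"
    b"file content\n"
)
meta, content = parse_filelog_meta(raw)
```

With this framing, following a copy is a dictionary lookup on `meta["copy"]` rather than a content comparison.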

However, people or automated processes don't always perform the necessary
invocations of `hg copy` or `hg rename` to record copy/rename metadata. And
historically there have been a number of bugs or feature deficiencies where
copy/rename metadata is lost or not recorded where it should have been.
Coupled with the design of having copy metadata in the filelog data (which
is part of the hash and the merkle tree contributing to the changeset
node), this means that if copy metadata isn't correct from the beginning,
it is wrong forever. That's a pretty painful constraint.
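
The "wrong forever" constraint can be seen in a small sketch. It assumes a revlog-style node computed as SHA-1 over the sorted parent nodes followed by the full revision text, header included. `node_hash` is a hypothetical stand-in, not Mercurial's API, and the paths and hashes are made up.

```python
import hashlib

NULL_ID = b"\0" * 20  # assumed placeholder for "no parent"


def node_hash(text: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> bytes:
    """Hash the sorted parents followed by the full revision text."""
    first, second = sorted((p1, p2))
    digest = hashlib.sha1(first)
    digest.update(second)
    digest.update(text)
    return digest.digest()


content = b"same file content\n"
# Identical content, but stored with a copy header in front of it.
with_copy = (
    b"\x01\ncopy: old/name.txt\ncopyrev: " + b"a" * 40 + b"\n\x01\n" + content
)

plain_node = node_hash(content)
copy_node = node_hash(with_copy)
```

Because the header is hashed with the content, `plain_node` and `copy_node` differ. Correcting the metadata after the fact would change the file node, and with it every manifest and changeset node that depends on it.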

Copy/rename inaccuracy is a frequent complaint among Mozilla developers
doing lots of code archeology. In short, they can't trust the recorded
metadata, so they fall back to a Git conversion of the repo when they know
copies/renames are in play (Git performs copy/rename detection at operation
run-time).

I recall a very informal conversation with mpm at the 3.8 Sprint in March
about this topic and he seemed to express a desire to move copy/rename
detection/metadata out of filelogs. I vaguely recall him suggesting it be
computed at run-time and cached if performance dictates. I also recall him
saying something about how modern research in the area of copy detection
has enabled better solutions than "measure the percentage of identical
lines."
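
For reference, the baseline "percentage of identical lines" heuristic can be sketched as follows. This is a minimal illustration using Python's `difflib`, not the algorithm Mercurial or Git actually uses. The names `line_similarity` and `guess_copy_source` and the 0.5 threshold are assumptions.

```python
import difflib


def line_similarity(old: str, new: str) -> float:
    """Return the fraction of lines shared by two texts, from 0.0 to 1.0."""
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    if not old_lines and not new_lines:
        return 1.0  # two empty files are trivially identical
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 2.0 * matched / (len(old_lines) + len(new_lines))


def guess_copy_source(new_text: str, candidates: dict, threshold: float = 0.5):
    """Pick the most similar candidate path, or None if none reach threshold."""
    best_path, best_score = None, threshold
    for path, text in candidates.items():
        score = line_similarity(text, new_text)
        if score >= best_score:
            best_path, best_score = path, score
    return best_path


# Made-up contents of files that existed before the change.
removed = {"old.txt": "a\nb\nc\n", "other.txt": "x\ny\n"}
source = guess_copy_source("a\nb\nc\nd\n", removed)
```

Each new file is compared against every candidate, which is why doing this at run-time gets expensive on large repositories unless the results are cached.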

I was wondering if there have been any formal discussions or proposals on
the future of copy metadata. I am most interested in:

* Whether there are plans for (or even an extension implementation of) a
supplemental copy metadata "database." The goal would be to correct
deficiencies in the set-in-stone filelog-based metadata.
* Whether there are plans to move copy metadata out of filelog revisions
completely. (This would make the filelogs simpler and more clearly separate
file content from metadata.)
* If we're talking about new designs for copy/rename metadata, should
improvements to linkrev be discussed at the same time?
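
To make the first bullet concrete, one hypothetical shape for a supplemental "database" is an overlay consulted before the immutable filelog metadata. Everything here (the `CopyOverlay` class, its methods, the key layout) is an illustrative assumption, not an existing Mercurial or extension API.

```python
from typing import Dict, Optional, Tuple

FileRev = Tuple[str, str]     # (path, file node as hex), assumed key layout
CopySource = Tuple[str, str]  # (source path, source file node as hex)


class CopyOverlay:
    """Hypothetical corrections layered over recorded copy metadata."""

    def __init__(self, recorded: Dict[FileRev, CopySource]):
        # Stands in for the copy metadata baked into filelog revisions.
        self._recorded = recorded
        # Mutable corrections; None means "recorded copy is wrong, no source".
        self._corrections: Dict[FileRev, Optional[CopySource]] = {}

    def correct(self, filerev: FileRev, source: Optional[CopySource]) -> None:
        """Record a correction without touching any hashed data."""
        self._corrections[filerev] = source

    def copy_source(self, filerev: FileRev) -> Optional[CopySource]:
        """Prefer a correction when present, else fall back to the filelog."""
        if filerev in self._corrections:
            return self._corrections[filerev]
        return self._recorded.get(filerev)


overlay = CopyOverlay({("b.txt", "11" * 20): ("a.txt", "22" * 20)})
# Add a copy that was never recorded at commit time.
overlay.correct(("c.txt", "33" * 20), ("a.txt", "22" * 20))
```

Since the corrections live outside the hashed filelog data, they can be amended without changing any node. How such data would be exchanged between repositories is left open here.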