Bug 883 - File/dir renames consume extra space in repository
Status: CONFIRMED
Alias: None
Product: Mercurial
Classification: Unclassified
Component: Mercurial
Version: unspecified
Hardware: All All
Importance: normal feature
Assignee: Bugzilla
URL:
Keywords:
Duplicates: 5575
Depends on:
Blocks:
 
Reported: 2007-12-19 15:11 UTC by Jesse Glick
Modified: 2023-03-07 02:17 UTC
CC List: 43 users

See Also:
Python Version: 3.8


Description Jesse Glick 2007-12-19 15:11 UTC
When files or dirs are renamed in Hg, repository size is increased, I guess by
about the compressed size of those files:

$ hg init
$ cp /boot/vmlinuz-2.6.22-14-generic f
$ hg add f
$ hg ci -m 1
$ du --si
1.8M	./.hg/store/data
1.8M	./.hg/store
1.8M	./.hg
3.6M	.
$ hg ren f g
$ hg ci -m 2
$ du --si
3.5M	./.hg/store/data
3.5M	./.hg/store
3.5M	./.hg
5.3M	.
$ ls -Rl .hg/store/data
.hg/store/data:
total 3328
-rw-r--r-- 1 jglick jglick 1692145 2007-11-09 05:42 f.d
-rw-r--r-- 1 jglick jglick      64 2007-11-09 05:42 f.i
-rw-r--r-- 1 jglick jglick 1692204 2007-11-09 05:42 g.d
-rw-r--r-- 1 jglick jglick      64 2007-11-09 05:42 g.i

For a repository which is already hundreds of megabytes, doing major source
reorganizations is out of the question for this reason. This is a serious
drawback compared to Subversion; or even arguably to CVS, where moving a dir
means you only pay a penalty in history, not future usage.

mpm has written regarding implementation:

"Currently fixing the renaming issue would present a layering
violation. That is, individual revlogs have no knowledge of any other
revlog. So when we ask a revlog to retrieve version <x> of some file,
it has to have all the data internally."
Comment 1 Thomas Arendsen Hein 2007-12-20 10:15 UTC
Generally having support for referencing other revlogs could allow for other
usages, too, e.g. splitting revlogs if they grow too big, either to circumvent fs
or backup limitations, or to prevent new changes breaking hard links for already
huge revlogs.
Comment 2 Vadim Lebedev 2008-01-31 10:38 UTC
In response to mpm's:
"Currently fixing the renaming issue would present a layering
violation. That is, individual revlogs have no knowledge of any other
revlog. So when we ask a revlog to retrieve version <x> of some file,
it has to have all the data internally."

Actually we can store in the revlog a reference to a generic external object,
identified by some kind of "url" and (maybe) a hash.
Initially it can be used to implement renames and copies, but it can evolve
into some kind of super svn:external facility later (like an hg repo which
retrieves files directly from an external svn or git repository).
Comment 3 Matt Mackall 2008-03-09 16:48 UTC
Ok, here's a proposed fix and the problems that subsequently crawl out from
under the rock:

In filelog, override revlog.revision. Add metadata that says "the revision
returned by revlog is not a full revision as promised but a revision of file
x@rev + the body here treated as a delta." Then filelog.revision can instantiate
a temporary filelog object for x, get the specified revision, and apply the
delta. Do the appropriate steps in filelog.add to make this work.

Now with a little luck, getting the -next- revision from the filelog will just
work. Otherwise, we'll need to hack revlog.revision to call itself (and thereby
filelog.revision) to grab the base revision.

So now we've got a scheme that mostly does away with the layering violations as
revlog doesn't have to have any special knowledge about other revlogs (it's all
in the filelog class, which already knows how to find and open revlog from a
pathname). It even gets the case where c@z is a copy of b@y which is a copy of
a@x right automatically. 
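
A minimal sketch of the filelog-level override described above, as a toy model
rather than Mercurial's actual code. ToyRevlog, ToyFilelog, add_copy and the
(start, end, replacement) delta format are hypothetical names invented for
illustration; only the idea (the filelog, not the revlog, resolves x@rev and
treats the stored body as a delta) comes from this comment.

```python
# Toy model only: these classes are NOT Mercurial's real revlog/filelog API.

def apply_delta(base, delta):
    """Apply a list of non-overlapping (start, end, replacement) edits to base."""
    out, pos = [], 0
    for start, end, data in delta:
        out.append(base[pos:start])
        out.append(data)
        pos = end
    out.append(base[pos:])
    return b"".join(out)


class ToyRevlog:
    """Self-contained store: it has no knowledge of any other revlog."""

    def __init__(self):
        self.entries = []

    def add(self, payload):
        self.entries.append(payload)
        return len(self.entries) - 1

    def revision(self, rev):
        return self.entries[rev]


class ToyFilelog:
    """Layer above the revlog that knows how to open another file's log."""

    def __init__(self, store, path):
        self.store = store
        self.revlog = store.setdefault(path, ToyRevlog())

    def add(self, text):
        return self.revlog.add({"text": text})

    def add_copy(self, src_path, src_rev, delta):
        # Metadata says: "this is src_path@src_rev plus the body as a delta".
        return self.revlog.add({"copy": (src_path, src_rev), "delta": delta})

    def revision(self, rev):
        entry = self.revlog.revision(rev)
        if "copy" not in entry:
            return entry["text"]
        src_path, src_rev = entry["copy"]
        # Temporary filelog for the copy source; recursion handles chains.
        base = ToyFilelog(self.store, src_path).revision(src_rev)
        return apply_delta(base, entry["delta"])


store = {}
x = ToyFilelog(store, "a").add(b"hello world\n")
y = ToyFilelog(store, "b").add_copy("a", x, [])
z = ToyFilelog(store, "c").add_copy("b", y, [(0, 5, b"HELLO")])
chained = ToyFilelog(store, "c").revision(z)  # c@z -> b@y -> a@x
```

In this toy, the c@z -> b@y -> a@x chain resolves through plain recursion in
ToyFilelog.revision, while ToyRevlog never looks outside its own entries.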

But we've also got a huge compatibility problem. An old client can't just pull
this data and expect it to work. Instead, we've got to add a new version of the
wire protocol that allows us to send these sorts of deltas to new clients, but
sends full revisions to old clients. And a new client would like to take old
client data and deltify the copies, which may not be possible at pull time (for
instance, if the destination revlog is sent before the source revlog). Also,
hashes at the revlog layer and at the filelog layer no longer agree. Ouch.

In short: not an easy problem.

Marking deferred.
Comment 4 Jred 2008-03-27 15:21 UTC
@mpm: I would argue that the two problems -- revlog index cross-references, and
the wire protocol -- could be viewed as 2 completely separate problems. 

One of the main problems right now in Mercurial seems to be a lack of viable
cross-path-rev referencing method, in the revlog index scheme. If the index
scheme was allowed to reference URI's from other paths (internal or external),
instead of just revlog data with a matching name, that would be a simple fix for
a whole list of issues. 

This reminds me of the discussion in the mailing list about combining
HistoryTrimming, PartialCloning, Overlays, and Obliterate methods. An in-place
replacement of revlog data with its hash value, and a "reason for missing data"
that includes a URI for a third-party data source, could be a combined fix for
all of these features/issues. That "third-party data source URI" could just as
easily reference paths and revs inside the same repository, as external
repository URI's.

Now, separating the wire protocol, so that older clients get what they expect,
rather than what data is actually held locally by the revlog, is not necessarily
easy. It is possible, provided all the requested data is online *somewhere*.
Attempts to push-pull revlog data that isn't available online could be a defined
failure condition, for the "old client" wire protocol. So I would say that
internal repository reference URI's are probably the easiest, to interpret into
this "old client" wire protocol. 

Does Mercurial already have any way of signaling current repository version,
and/or available extensions, on each end of a push-pull connection? That would
be an easy way of signaling which wire protocol can be used optimally, in any
given transfer. If it doesn't already exist, maybe a push/pull flag or attribute
could be added, like a "wire protocol version specifier"?
Comment 5 Matt Mackall 2008-03-27 18:13 UTC
Incompatibility with old clients is a non-starter, so viewing it as two problems
is as well.

Current clients have file revision hashes that include the current metadata for
the copy info. If we change what we store, we break the hash -> old clients
break. So we've either got to fake the contents (and destroy the concept of
revlog id = hash of contents) or break compatibility.
Comment 6 Eric Hopper 2008-04-01 10:18 UTC
My feeling is that it's possible to make this happen without changing the
essential meaning of either the index or data files.

One rather unsubtle and probably bad idea would be to allow index files to
reference other data files via a combination of numerical linkrev (referencing a
changeset) and filerev hash (referencing a manifest entry in that changeset). 
If the filerev hash were null then the information would be ignored.  If not
they would be taken as a 'base' on which to build the current file image, along
with the delta range stuff from the main data file that's already there.

Keeping the wire protocol unaffected after doing so will be tricky but I
definitely think it's doable.  If the wire protocol is unchanged though,
divining the need for the new way of storing references to other data files for
incoming changesets is going to be a pain.  Incoming changes will have to be
scanned for copies.
Comment 7 Matt Mackall 2008-04-01 13:59 UTC
You're missing the first conceptual hurdle: if we change what we're storing in
the revlog, we change the hashes. Revlog is a self-contained black box. You hand
it "data", it hands you back an identifier hash. If we change our data from
"copy + full revision" to "copy + delta", revlog will hand us back a different
identifier. Thus, old and new clients will disagree about the hash for "file x
containing X, copied from y@z".

To get past this, we would need to hoist both the hash calculation and checking
up out of revlog into filelog (and changelog, and manifest). Then when we
checked in a copy, we'd have to first calculate the hash for "copy + full
revision", then calculate the delta, then tell revlog "please store 'copy +
delta' but with the hash for 'copy + full revision'". 

To recover a revision, we'd have to get "copy + delta", look up the copy,
reconstruct that revision, apply the delta to get the full revision, then
calculate the hash of "copy + full revision" and compare it with the identifier
we were asked to retrieve.

On pull over the existing wire protocol, we'd have to do the above, and then
take our reconstructed "copy + full revision" and turn it into a delta (usually,
but not always, against an empty file).
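
The store-and-verify steps above can be sketched roughly as follows. This is a
simplified illustration, not Mercurial's implementation: HashHoistingStore,
make_delta and apply_delta are hypothetical, the prefix/suffix delta is a
stand-in for the real delta format, and the hash and copy-metadata layout are
only an approximation of what a filelog records.

```python
import hashlib

NULLID = b"\0" * 20


def node_hash(text, p1=NULLID, p2=NULLID):
    """SHA-1 over the sorted parents plus the text (simplified)."""
    a, b = sorted([p1, p2])
    return hashlib.sha1(a + b + text).digest()


def with_copy_meta(src_path, src_node, body):
    """Approximate 'copy + full revision' text: metadata header, then body."""
    meta = b"copy: %s\ncopyrev: %s\n" % (src_path.encode(), src_node.hex().encode())
    return b"\x01\n" + meta + b"\x01\n" + body


def make_delta(old, new):
    """Toy delta: replace the middle that differs after a common prefix/suffix."""
    limit = min(len(old), len(new))
    p = 0
    while p < limit and old[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and old[-1 - s] == new[-1 - s]:
        s += 1
    return (p, len(old) - s, new[p:len(new) - s])


def apply_delta(old, delta):
    start, end, data = delta
    return old[:start] + data + old[end:]


class HashHoistingStore:
    """Hypothetical layer that stores 'copy + delta' under the hash of
    'copy + full revision', and re-checks that hash on every read."""

    def __init__(self):
        self.texts = {}   # (path, node) -> full body
        self.copies = {}  # (path, node) -> (src_path, src_node, delta)

    def add(self, path, body):
        node = node_hash(body)
        self.texts[(path, node)] = body
        return node

    def add_copy(self, path, src_path, src_node, body):
        # 1. hash "copy + full revision"; 2. compute delta; 3. store the delta.
        node = node_hash(with_copy_meta(src_path, src_node, body))
        delta = make_delta(self.read(src_path, src_node), body)
        self.copies[(path, node)] = (src_path, src_node, delta)
        return node

    def read(self, path, node):
        if (path, node) in self.texts:
            return self.texts[(path, node)]
        src_path, src_node, delta = self.copies[(path, node)]
        body = apply_delta(self.read(src_path, src_node), delta)
        if node_hash(with_copy_meta(src_path, src_node, body)) != node:
            raise ValueError("integrity check failed for %s" % path)
        return body


store = HashHoistingStore()
f_node = store.add("f", b"kernel image bytes\n")
g_node = store.add_copy("g", "f", f_node, b"kernel image bytes\n")
```

Here the rename of f to g stores an empty delta, yet g_node is still the hash
an old client would compute for the full copied text under this simplified
scheme.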
Comment 8 Eric Hopper 2008-04-02 08:22 UTC
I understand that.  Perhaps instead of moving that much work up the revlog could
be given an external data handler when you asked it for data.  And for the write
side you could give it an optional argument with the data for revlog to use as
the base for whatever diffing algorithm it might choose to use.

The contract would be that the external data handler you passed on read would be
able to retrieve that base for any revision in which you passed on such a base
on write.
Comment 9 Eric Hopper 2008-04-02 08:26 UTC
Oh, better idea for write...

Pass in an optional external data handler on write.  If there is one it should
be able to provide the data for the base of the revision for diff purposes, and
it should be able to provide a cookie that will be given to the external data
handler for read.

That way the external data handler doesn't have to remember any associations
between the revision and the data.  It will be able to rely on the revlog to hand
it the cookie which will allow it to fetch those.
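
One way the handler-and-cookie contract could look, as a sketch under stated
assumptions: CopySourceHandler, base_for_write, fetch and the toy delta format
are all hypothetical and do not correspond to any existing revlog API. The
cookie here is simply a (path, rev) pair.

```python
def make_delta(old, new):
    """Toy delta: replace the middle that differs after a common prefix/suffix."""
    limit = min(len(old), len(new))
    p = 0
    while p < limit and old[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and old[-1 - s] == new[-1 - s]:
        s += 1
    return (p, len(old) - s, new[p:len(new) - s])


def apply_delta(old, delta):
    start, end, data = delta
    return old[:start] + data + old[end:]


class CopySourceHandler:
    """Hypothetical external data handler backed by a dict of revlogs."""

    def __init__(self, store, src_path=None, src_rev=None):
        self.store = store
        self.src = (src_path, src_rev)

    def base_for_write(self):
        # Returns (cookie, base text); the cookie is all a reader will need.
        return self.src, self.fetch(self.src)

    def fetch(self, cookie):
        path, rev = cookie
        return self.store[path].revision(rev, handler=self)


class ToyRevlog:
    def __init__(self):
        self.entries = []  # (cookie or None, full text or delta)

    def add(self, text, handler=None):
        if handler is None:
            self.entries.append((None, text))
        else:
            cookie, base = handler.base_for_write()
            self.entries.append((cookie, make_delta(base, text)))
        return len(self.entries) - 1

    def revision(self, rev, handler=None):
        cookie, payload = self.entries[rev]
        if cookie is None:
            return payload
        if handler is None:
            raise LookupError("revision needs an external data handler")
        # The revlog hands the stored cookie back to the handler.
        return apply_delta(handler.fetch(cookie), payload)


store = {"f": ToyRevlog(), "g": ToyRevlog()}
f0 = store["f"].add(b"kernel image bytes")
g0 = store["g"].add(b"kernel image bytes",
                    handler=CopySourceHandler(store, "f", f0))
reader = CopySourceHandler(store)  # carries no memory of the write
restored = store["g"].revision(g0, handler=reader)
```

The read-side handler is constructed without knowing which revision was the
base; everything it needs arrives in the cookie stored next to the delta.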
Comment 10 Kirill Smelkov 2008-06-29 05:47 UTC
Guys, I understand there are technical challenges in this issue, but maybe

  Something Could Be Done?

I think this issue should be on the list of major issues -- people usually convert
their svn repos with hg and git and compare sizes to see which DVCS to use.

And you know, because of this issue hg often loses.
Comment 11 Dirkjan Ochtman 2009-02-27 03:44 UTC
So here's an idea: discard the idea of redoing historical renames. People who
want to do efficient renames for their history will have to do a full hg-to-hg
conversion and work from there. Only future renames are supported. Would that be
acceptable? It would open up a whole host of options.
Comment 12 Matt Mackall 2009-02-27 06:56 UTC
djc: It would be the first format change that older versions would be unable to
pull from.

That means a MAJOR flag day. Keep in mind that we're regularly hearing from
people running 0.9.5 and there are operating systems that have just been
released containing 1.0.1. We really don't ever want to break old versions.

And that still leaves us with the large conceptual hurdle: cross-revlog hash
calculation.
Comment 13 Matt Mackall 2009-02-27 08:08 UTC
I've summarized the plan I've outlined here at:

http://www.selenic.com/mercurial/wiki/index.cgi/RenameSpaceSavingPlan
Comment 14 Jesse Glick 2009-02-27 09:35 UTC
Agreed with mpm. In my case, we have a multi-100Mb repo with >100k revs in
active use across at least a dozen public clones, by dozens of developers on
several continents using an unknown mixture of Hg client versions, with rev
hashes referred to in numerous public documents and issue reports. We would be
unlikely to ever undertake a repo conversion unless moving to another SCM or (in
the absence of shallow/narrow clones) splitting the repo into smaller pieces.

Ideally, upon release of an Hg version supporting cheap renames, we would
convert the server clones over a few maintenance hours during the weekend
sometime, using whatever tool was recommended; and then recommend to developers
that at their leisure they get the new version and either make fresh clones or
convert their local clones. Wire compatibility is not absolutely essential if
hashes are preserved - we could wait to change format for, say, a year after the
version of Hg with cheap rename support was released, so that everyone gets a
chance to upgrade - but certainly desirable.
Comment 15 Arne Babenhauserheide 2009-03-02 02:18 UTC
When renames can use pointers, could similar pointers also be used for shallow
copies, so past revisions can be loaded lazily? 

Actually a shallow copy only needs the data since the last snapshot, so
requesting earlier revisions could trigger a similar second request for the
data as cheap renames, but in that case for downloading it and then rechecking
the missing changeset.
Comment 16 Dirkjan Ochtman 2009-03-02 02:38 UTC
ArneBab, no, that doesn't make sense.

And, it's entirely off-topic for this issue.
Comment 17 Vsevolod Solovyov 2009-06-21 16:14 UTC
I'm working on it as it's my GSoC project.
Comment 18 Matt Mackall 2009-07-01 15:37 UTC
Degrading to bug.
Comment 19 Martin Geisler 2009-07-11 17:26 UTC
Progress report:

  http://markmail.org/message/bz46xb62hid57ewx

(for those who are not following the mailing list)
Comment 20 Matt Mackall 2010-01-01 13:24 UTC
No longer in progress
Comment 21 Jose Miguel Hernandez Miramontes 2010-01-09 01:21 UTC
Hello

I apologize if this was discussed before.

The problem of extra space arises because a new file is created in the
repo to track the future history of the renamed file. It repeats the
data to conserve history, more or less, if I understood correctly.

I read here http://mercurial.selenic.com/wiki/RenameSpaceSavingPlan
that there are a lot of complications in handling revlogs as deltas (they
are no longer self-contained).

Is there a way to work around this with another approach? For example:

Instead of creating deltas, the space could be saved by
compressing/packing the related revlogs and keep them compressed
together.
It could be a new operation of the filelog (the filelog tracks the
renames?) to decompress/compress the revlogs.

As all the history is in one revlog, maybe only one of them would
need to be uncompressed at a specific time.

I don't know if this makes sense, but I was thinking that it might be
easier to implement while keeping backward compatibility, as the revlogs'
content will not change, so the hashes do not need to change. It is just
one extra layer.
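
A rough way to gauge the packing idea, assuming nothing about how it would be
wired into Mercurial: the two byte strings below are hypothetical stand-ins
for f.d and g.d, using random data to mimic revlog contents that are already
compressed. Python's lzma module is used because its default dictionary is
large enough to notice a repeated 200 kB block; a compressor with a small
window (such as zlib's 32 KiB) would not find the duplicate.

```python
import lzma
import os

# Stand-ins for two revlog data files with (nearly) identical contents.
payload = os.urandom(200_000)
f_data = payload
g_data = b"copy metadata\n" + payload

# Packed separately, each file pays for the full payload.
separate = len(lzma.compress(f_data)) + len(lzma.compress(g_data))

# Packed together, the second copy is encoded as a back-reference.
together = len(lzma.compress(f_data + g_data))

ratio = together / separate
```

With these inputs the combined archive comes out at roughly half the size of
the two separate ones, at the cost of having to unpack before either revlog
can be read.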

What do you think?

Regards
Comment 22 Eugene 2010-07-10 07:01 UTC
My 2c: Version compatibility would not be possible anyway when repos are on
a shared folder or when they're copied as files. Both of these methods are
appealing due to ease/simplicity.

Also the impact of preserving compatibility on Mercurial's relatively clean
code/design should be considered. IMHO it's quite valuable. Plus, I'm sure
some admins would prefer to force an upgrade than have increased io/cpu
usage on their servers.

If there are optional default-off optimisations as part of, perhaps,
"Mercurial 2" people can choose if and when to force an upgrade.

I myself need this for a proposed project which would store mp3 podcasts in
hg to version-control their ID3 tags and filenames. Mercurial is great
except for the renames part.
Comment 23 gidyn 2012-05-01 02:00 UTC
The usual response when this is brought up on a mailing list is that a
compressed text file doesn't take up much space. This obviously doesn't
apply to binary resources, but doesn't always apply to text files either. If
someone does a lot of refactoring, it is indeed possible that rename/move
copying will take up more space in the repository than actual changesets.
Comment 24 Bugzilla 2012-05-12 08:46 UTC

--- Bug imported by bugzilla@serpentine.com 2012-05-12 08:46 EDT  ---

This bug was previously known as _bug_ 883 at http://mercurial.selenic.com/bts/issue883
Comment 25 Martin F 2013-01-29 04:44 UTC
Are there any plans for fixing this issue?
Comment 26 Matt Mackall 2013-01-29 05:01 UTC
There are plans:

http://mercurial.selenic.com/wiki/RenameSpaceSavingPlan

But plans do not translate to development resources or timetables.
Comment 27 Bugzilla 2015-02-10 01:01 UTC
Bug was inactive for 728 days, archiving
Comment 28 cowwoc2020 2015-02-10 01:19 UTC
Reopening because I believe there is high demand for this feature (48 users CCed).
Comment 29 Matt Mackall 2015-03-03 15:24 UTC
Bulk change: standard priority for features is 'wish'.
Comment 30 Matt Mackall 2015-04-17 13:28 UTC
Bulk move open feature requests to wish priority.
Comment 31 Bugzilla 2015-10-18 16:24 UTC
Bug was inactive for 184 days, archiving
Comment 32 gidyn 2015-11-12 07:16 UTC
Does bundle2 do anything to make this more feasible?
Comment 33 Pierre-Yves David 2015-11-13 18:17 UTC
Bundle2 makes it easier to carry a new changelog format. On the other hand, implementing this probably requires hash changes (if I remember the sprint discussion right), which makes it a significant jump. We might make that jump together with the tree manifest one (and the sha2 one).
Comment 34 Matt Mackall 2015-11-16 11:46 UTC
The hash issue is easy to work around. The primary problem is implicit forward references on clone/pull:

- rename B to A
- add a bunch of revisions to A
- do a clone
- client receives file data in alphabetical order
- client receives first A revision, delta against B (fine)
- client receives a bunch more A deltas, has to store a full A
- hasn't received B yet, so can't construct A

Because we can have simultaneous renames of A to B and B to A, no filename reordering can fix this problem.
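
The ordering problem in the list above can be illustrated with a small
simulation. It models only the order in which whole files arrive;
forward_references and the delta_base mapping are hypothetical helpers, not
part of any wire protocol.

```python
from itertools import permutations


def forward_references(order, delta_base):
    """Return the files whose cross-file delta base has not arrived yet.

    order      -- filenames in the order their data is streamed
    delta_base -- maps a filename to the file its first revision deltas against
    """
    seen = set()
    blocked = []
    for name in order:
        base = delta_base.get(name)
        if base is not None and base not in seen:
            blocked.append(name)
        seen.add(name)
    return blocked


# B was renamed to A: alphabetical streaming sends A before its base B.
rename_only = {"A": "B"}
alphabetical = forward_references(sorted(["A", "B"]), rename_only)
reordered = forward_references(["B", "A"], rename_only)

# Simultaneous renames A -> B and B -> A: every ordering is blocked.
swap = {"A": "B", "B": "A"}
swap_always_blocked = all(
    forward_references(list(order), swap) for order in permutations(["A", "B"])
)
```

Reordering fixes the single rename (reordered is empty) but cannot fix the
swap, which is why no filename ordering solves the general case.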
Comment 35 Bugzilla 2016-07-18 00:00 UTC
Bug was inactive for 150 days, archiving
Comment 36 Sudarshan S 2016-07-18 00:14 UTC
Re-opening bug; hasn't yet been fixed AFAIK and I'm still interested in a fix, especially since Java repos tend to get bloated without this.
Comment 37 Bugzilla 2017-02-09 00:00 UTC
Bug was inactive for 206 days, archiving
Comment 38 gidyn 2017-02-09 07:08 UTC
Re-opening as still an issue
Comment 39 Gábor Stefanik 2017-02-13 08:05 UTC
I don't really see the difficulties here, assuming generaldelta + bundle2.

It should be possible to simply extend the "delta base" field in the revlog entry header to allow e.g. revision 124 in the filelog for a.txt to specify a delta base of "b.txt@123". We still hash the final full text, as opposed to the delta, so no hash change is required.

As for constructing full revisions, it's only needed when you're redeltaing for some reason. Otherwise, the receiving end trusts the sender's deltas, and stores them without trying to redelta - IIRC this is what we do for most GD->GD operations. And since 4.1, we even have in-place repository upgrade and optimization, so we don't have to rely on redelta-on-clone.

So, mpm's scenario can play out like this:

- rename B to A
- add a bunch of revisions to A
- do a clone
- client receives file data in alphabetical order
- client receives first A revision as a delta against B, stores it
- client receives a bunch more A deltas => just store them
- client eventually receives a full A (if a full revision is needed on the client side, it's also needed on the server side), and stores it

In debugupgraderepo or local clone, we have random access to the source and destination revlogs, so AFAIK we can simply construct the new revlogs changeset-by-changeset instead of file-by-file.

Such an extension would, by analogy with generaldelta, need a new repo requirement - I propose the name foreigndelta.
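
A minimal sketch of the "store the sender's deltas as-is, resolve later"
behaviour proposed here. LazyStore, receive and the (path, rev) delta-base
key are hypothetical; the real revlog entry header has no such field today,
which is what the suggested foreigndelta requirement would add.

```python
def apply_delta(old, delta):
    """Toy delta: replace old[start:end] with data."""
    start, end, data = delta
    return old[:start] + data + old[end:]


class LazyStore:
    """Hypothetical receiver that never resolves a delta while receiving."""

    def __init__(self):
        self.entries = {}  # (path, rev) -> (delta base key or None, payload)

    def receive(self, path, rev, base, payload):
        # base may name a revision of another file that has not arrived yet;
        # the entry is stored untouched, trusting the sender's delta.
        self.entries[(path, rev)] = (base, payload)

    def revision(self, path, rev):
        base, payload = self.entries[(path, rev)]
        if base is None:
            return payload
        return apply_delta(self.revision(*base), payload)


# File data arrives in alphabetical order, so A precedes its delta base B.
stream = [
    ("A", 0, ("B", 0), (0, 0, b"")),      # rename: empty delta against B@0
    ("A", 1, ("A", 0), (0, 3, b"new")),   # ordinary delta within A
    ("B", 0, None, b"old text\n"),        # full text of the rename source
]

store = LazyStore()
for path, rev, base, payload in stream:
    store.receive(path, rev, base, payload)

a0 = store.revision("A", 0)
a1 = store.revision("A", 1)
```

Nothing is reconstructed until after the whole stream has been stored, so the
forward reference from A to B is harmless in this toy model.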
Comment 40 Arne Babenhauserheide 2017-03-24 04:51 UTC
I’m also still interested in a fix. I am using Mercurial to track my maildir (originally intended as a pressure test of Mercurial, but it became part of my general workflow), and that’s close to the worst case for rename space consumption since files are typically stored as new and then renamed as read (or when other flags are added), but the content typically does not change.

I have 9.6GiB of textual data with the history requiring 14GiB. I would happily use that repository for testing the effectiveness of a fix.
Comment 41 Yuya Nishihara 2017-05-28 06:18 UTC
*** Bug 5575 has been marked as a duplicate of this bug. ***
Comment 42 Bugzilla 2017-12-11 00:00 UTC
Bug was inactive for 150 days, archiving
Comment 43 Bugzilla 2018-05-10 00:00 UTC
Bug was inactive for 150 days, archiving
Comment 44 Arne Babenhauserheide 2018-05-10 12:47 UTC
The new link to the space saving plan is https://www.mercurial-scm.org/wiki/RenameSpaceSavingPlan
Comment 45 Gábor Stefanik 2018-09-21 08:15 UTC
As an alternative to my previous "foreigndelta" proposal, we now have a new option thanks to the new sparse revlog feature (bug 5480).
With sparse revlogs, it can now be economical to use the same filelog for multiple files - e.g. not opening a new filelog when a file is renamed/copied, but instead storing revisions for both filenames in the pre-existing filelog, or even using a single large filelog (or just a few) for all files in the repository. In this setup, the pre- and post-rename revisions end up in the same revlog, so no extension of the delta base field is required.

Previously, this would have been uneconomical due to the "delta chain span" constraint (and before generaldelta, the requirement for deltas to always be based on the previous revlog entry). Sparse revlogs relax this constraint, so having unrelated revisions stored in the same revlog file between the relevant deltas should no longer incur a performance penalty.
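
To make the shared-filelog idea concrete, here is a toy model in which one log
holds revisions for several filenames and any earlier entry may serve as a
delta base. SharedLog and its methods are hypothetical, and the
prefix/suffix delta is a stand-in for the real delta format.

```python
def make_delta(old, new):
    """Toy delta: replace the middle that differs after a common prefix/suffix."""
    limit = min(len(old), len(new))
    p = 0
    while p < limit and old[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and old[-1 - s] == new[-1 - s]:
        s += 1
    return (p, len(old) - s, new[p:len(new) - s])


def apply_delta(old, delta):
    start, end, data = delta
    return old[:start] + data + old[end:]


class SharedLog:
    """Hypothetical single log storing revisions of more than one filename."""

    def __init__(self):
        self.entries = []  # (filename, base rev or None, full text or delta)

    def add(self, filename, text, base_rev=None):
        if base_rev is None:
            payload = text
        else:
            payload = make_delta(self.revision(base_rev), text)
        self.entries.append((filename, base_rev, payload))
        return len(self.entries) - 1

    def revision(self, rev):
        _filename, base_rev, payload = self.entries[rev]
        if base_rev is None:
            return payload
        return apply_delta(self.revision(base_rev), payload)

    def stored_bytes(self):
        total = 0
        for _filename, base_rev, payload in self.entries:
            total += len(payload) if base_rev is None else len(payload[2])
        return total


log = SharedLog()
content = b"x" * 1000
r0 = log.add("f", content)                # original file
r1 = log.add("h", b"unrelated")           # unrelated revision in between
r2 = log.add("g", content, base_rev=r0)   # f renamed to g: delta skips over r1
```

The renamed revision adds no payload bytes even though an unrelated entry sits
between it and its base, which is the situation sparse revlogs are said to
make affordable.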
Comment 46 Arne Babenhauserheide 2019-01-15 15:26 UTC
Is there something I can do to help make this real? 

My time is severely constrained but I could at least test stuff and would also happily donate towards this.

I have a local repository whose .hg contains around 600k files (find ~/.local/share/mail/.hg | wc -l → 608907), so I could easily contribute some performance testing on a real, rename-heavy repository. It is a maildir and most files got renamed at least once. The .hg weighs around 17GiB now.
Comment 47 Bugzilla 2019-06-15 00:00 UTC
Bug was inactive for 150 days, archiving
Comment 48 Faheem Mitha 2019-06-15 00:20 UTC
Sadly, still present.

And I see this insane auto-archive thing is still active too.
Comment 49 Arne Babenhauserheide 2019-10-30 05:11 UTC
@Gábor Stefanik: Are you working on your idea of using sparse revlogs? If yes, could you note it in the wiki? https://www.mercurial-scm.org/wiki/RenameSpaceSavingPlan
Comment 50 Bugzilla 2020-03-29 00:01 UTC
Bug was inactive for 150 days, archiving
Comment 51 Faheem Mitha 2020-03-29 02:26 UTC
Still very much present, unfortunately.
Comment 52 Bugzilla 2020-08-26 00:00 UTC
Bug was inactive for 150 days, archiving
Comment 53 Andrew Church 2020-08-26 00:41 UTC
Still present.
Comment 54 Bugzilla 2021-01-23 00:00 UTC
Bug was inactive for 150 days, archiving
Comment 55 Bugzilla 2021-06-25 00:00 UTC
Bug was inactive for 150 days, archiving
Comment 56 Faheem Mitha 2021-06-25 00:58 UTC
Still a thing.
Comment 57 Bugzilla 2021-12-04 00:00 UTC
Bug was inactive for 150 days, archiving
Comment 58 Faheem Mitha 2021-12-04 04:49 UTC
Still present.
Comment 59 Bugzilla 2022-05-04 00:01 UTC
Bug was inactive for 150 days, archiving
Comment 60 Faheem Mitha 2022-05-04 03:15 UTC
Still present.
Comment 61 Bugzilla 2022-10-07 00:00 UTC
Bug was inactive for 154 days, archiving
Comment 62 Andrew Church 2022-10-07 00:49 UTC
Still present.
Comment 63 Bugzilla 2023-03-07 00:00 UTC
Bug was inactive for 151 days, archiving
Comment 64 Andrew Church 2023-03-07 02:17 UTC
Still present.