When files or dirs are renamed in Hg, repository size is increased, I guess by about the compressed size of those files:

  $ hg init
  $ cp /boot/vmlinuz-2.6.22-14-generic f
  $ hg add f
  $ hg ci -m 1
  $ du --si
  1.8M ./.hg/store/data
  1.8M ./.hg/store
  1.8M ./.hg
  3.6M .
  $ hg ren f g
  $ hg ci -m 2
  $ du --si
  3.5M ./.hg/store/data
  3.5M ./.hg/store
  3.5M ./.hg
  5.3M .
  $ ls -Rl .hg/store/data
  .hg/store/data:
  total 3328
  -rw-r--r-- 1 jglick jglick 1692145 2007-11-09 05:42 f.d
  -rw-r--r-- 1 jglick jglick      64 2007-11-09 05:42 f.i
  -rw-r--r-- 1 jglick jglick 1692204 2007-11-09 05:42 g.d
  -rw-r--r-- 1 jglick jglick      64 2007-11-09 05:42 g.i

For a repository which is already hundreds of megabytes, doing major source reorganizations is out of the question for this reason. This is a serious drawback compared to Subversion; or even arguably to CVS, where moving a dir means you only pay a penalty in history, not future usage.

mpm has written regarding implementation: "Currently fixing the renaming issue would present a layering violation. That is, individual revlogs have no knowledge of any other revlog. So when we ask a revlog to retrieve version <x> of some file, it has to have all the data internally."
Generally, having support for referencing other revlogs could allow for other usages too, e.g. splitting revlogs if they grow too big, either to circumvent fs or backup limitations, or to prevent new changes breaking hard links for already huge revlogs.
In response to mpm's: "Currently fixing the renaming issue would present a layering violation. That is, individual revlogs have no knowledge of any other revlog. So when we ask a revlog to retrieve version <x> of some file, it has to have all the data internally." Actually, we can store in the revlog a reference to a generic external object, identified by some kind of "url" and (maybe) a hash. Initially it can be used to implement renames and copies, but it can evolve into some kind of super svn:external facility later (like an hg repo which retrieves files directly from external svn or git).
Ok, here's a proposed fix and the problems that subsequently crawl out from under the rock: In filelog, override revlog.revision. Add metadata that says "the revision returned by revlog is not a full revision as promised but a revision of file x@rev + the body here treated as a delta." Then filelog.revision can instantiate a temporary filelog object for x, get the specified revision, and apply the delta. Do the appropriate steps in filelog.add to make this work. Now with a little luck, getting the -next- revision from the filelog will just work. Otherwise, we'll need to hack revlog.revision to call itself (and thereby filelog.revision) to grab the base revision. So now we've got a scheme that mostly does away with the layering violations as revlog doesn't have to have any special knowledge about other revlogs (it's all in the filelog class, which already knows how to find and open revlog from a pathname). It even gets the case where c@z is a copy of b@y which is a copy of a@x right automatically. But we've also got a huge compatibility problem. An old client can't just pull this data and expect it to work. Instead, we've got to add a new version of the wire protocol that allows us to send these sorts of deltas to new clients, but sends full revisions to old clients. And a new client would like to take old client data and deltify the copies, which may not be possible at pull time (for instance, if the destination revlog is sent before the source revlog). Also, hashes at the revlog layer and at the filelog layer no longer agree. Ouch. In short: not an easy problem. Marking deferred.
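For readers who want to see the shape of the proposal above, here is a minimal sketch, not Mercurial code. ToyRevlog, ToyFilelog, add_copy, the store registry and the (start, end, data) delta format are all invented for illustration. The only point it makes is that the cross-file lookup can live entirely in the filelog layer, and that a chain c@z -> b@y -> a@x resolves by plain recursion.

```python
def apply_delta(base, edits):
    """Apply a toy delta: sorted, non-overlapping (start, end, data) edits."""
    out, pos = [], 0
    for start, end, data in edits:
        out.append(base[pos:start])
        out.append(data)
        pos = end
    out.append(base[pos:])
    return b"".join(out)


class ToyRevlog:
    """Self-contained store that knows nothing about other revlogs."""

    def __init__(self):
        self._entries = []

    def add(self, payload):
        self._entries.append(payload)
        return len(self._entries) - 1

    def revision(self, rev):
        return self._entries[rev]


class ToyFilelog(ToyRevlog):
    """Filelog layer: the only place that knows how to open another file."""

    def __init__(self, path, store):
        super().__init__()
        self.path = path
        self.store = store  # hypothetical registry: path -> ToyFilelog
        store[path] = self

    def add_copy(self, src_path, src_rev, edits):
        # Metadata says: "this is src_path@src_rev plus the body as a delta".
        return self.add({"copy": (src_path, src_rev), "delta": edits})

    def revision(self, rev):
        payload = super().revision(rev)
        if isinstance(payload, dict):
            src_path, src_rev = payload["copy"]
            base = self.store[src_path].revision(src_rev)  # may recurse
            return apply_delta(base, payload["delta"])
        return payload


store = {}
a = ToyFilelog("a", store)
x = a.add(b"hello world\n")
b = ToyFilelog("b", store)
y = b.add_copy("a", x, [])                  # b@y is an unchanged copy of a@x
c = ToyFilelog("c", store)
z = c.add_copy("b", y, [(0, 5, b"HELLO")])  # c@z is a modified copy of b@y
print(c.revision(z))
```

The sketch deliberately ignores the hard parts named in the comment: the wire protocol, old clients, and the fact that the hash of what is stored no longer matches the hash of the full revision.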
@mpm: I would argue that the two problems -- revlog index cross-references, and the wire protocol -- could be viewed as 2 completely separate problems. One of the main problems right now in Mercurial seems to be a lack of viable cross-path-rev referencing method, in the revlog index scheme. If the index scheme was allowed to reference URI's from other paths (internal or external), instead of just revlog data with a matching name, that would be a simple fix for a whole list of issues. This reminds me of the discussion in the mailing list about combining HistoryTrimming, PartialCloning, Overlays, and Obliterate methods. An in-place replacement of revlog data with its hash value, and a "reason for missing data" that includes a URI for a third-party data source, could be a combined fix for all of these features/issues. That "third-party data source URI" could just as easily reference paths and revs inside the same repository, as external repository URI's. Now, separating the wire protocol, so that older clients get what they expect, rather than what data is actually held locally by the revlog, is not necessarily easy. It is possible, provided all the requested data is online *somewhere*. Attempts to push-pull revlog data that isn't available online could be a defined failure condition, for the "old client" wire protocol. So I would say that internal repository reference URI's are probably the easiest, to interpret into this "old client" wire protocol. Does Mercurial already have any way of signaling current repository version, and/or available extensions, on each end of a push-pull connection? That would be an easy way of signaling which wire protocol can be used optimally, in any given transfer. If it doesn't already exist, maybe a push/pull flag or attribute could be added, like a "wire protocol version specifier"?
Incompatibility with old clients is a non-starter, so viewing it as two problems is as well. Current clients have file revision hashes that include the current metadata for the copy info. If we change what we store, we break the hash -> old clients break. So we've either got to fake the contents (and destroy the concept of revlog id = hash of contents) or break compatibility.
My feeling is that it's possible to make this happen without changing the essential meaning of either the index or data files. One rather unsubtle and probably bad idea would be to allow index files to reference other data files via a combination of numerical linkrev (referencing a changeset) and filerev hash (referencing a manifest entry in that changeset). If the filerev hash were null then the information would be ignored. If not, they would be taken as a 'base' on which to build the current file image, along with the delta range stuff from the main data file that's already there. Keeping the wire protocol unaffected after doing so will be tricky, but I definitely think it's doable. If the wire protocol is unchanged though, divining the need for the new way of storing references to other data files for incoming changesets is going to be a pain. Incoming changes will have to be scanned for copies.
You're missing the first conceptual hurdle: if we change what we're storing in the revlog, we change the hashes. Revlog is a self-contained black box. You hand it "data", it hands you back an identifier hash. If we change our data from "copy + full revision" to "copy + delta", revlog will hand us back a different identifier. Thus, old and new clients will disagree about the hash for "file x containing X, copied from y@z". To get past this, we would need to hoist both the hash calculation and checking up out of revlog into filelog (and changelog, and manifest). Then when we checked in a copy, we'd have to first calculate the hash for "copy + full revision", then calculate the delta, then tell revlog "please store 'copy + delta' but with the hash for 'copy + full revision'". To recover a revision, we'd have to get "copy + delta", look up the copy, reconstruct that revision, apply the delta to get the full revision, then calculate the hash of "copy + full revision" and compare it with the identifier we were asked to retrieve. On pull over the existing wire protocol, we'd have to do the above, and then take our reconstructed "copy + full revision" and turn it into a delta (usually, but not always, against an empty file).
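The hoisting described above can be sketched roughly as follows. This is an illustration under assumptions, not Mercurial's implementation: HoistedStore, with_copy_meta and the prefix/suffix toy delta are invented names, and the hash formula (SHA-1 over the sorted parent ids plus the text) and the copy metadata header are written the way I understand Mercurial to do it, so treat both as assumptions. The caller computes the id over "copy + full revision", the store keeps only "copy + delta", and a read reconstructs the full text before checking it against the id.

```python
import hashlib

NULLID = b"\0" * 20


def node_hash(text, p1=NULLID, p2=NULLID):
    """SHA-1 over the sorted parent ids followed by the full text (assumed)."""
    a, b = sorted([p1, p2])
    s = hashlib.sha1(a)
    s.update(b)
    s.update(text)
    return s.digest()


def with_copy_meta(src_path, src_node, body):
    """'copy + full revision': a metadata header followed by the file body."""
    header = b"\x01\ncopy: %s\ncopyrev: %s\n\x01\n"
    return header % (src_path, src_node.hex().encode()) + body


def make_delta(base, new):
    """Toy delta: trim the common prefix and suffix, keep the middle."""
    limit = min(len(base), len(new))
    p = 0
    while p < limit and base[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and base[-1 - s] == new[-1 - s]:
        s += 1
    return (p, len(base) - s, new[p:len(new) - s])


def apply_delta(base, delta):
    start, end, data = delta
    return base[:start] + data + base[end:]


class HoistedStore:
    """Stores what it is told to, under an id computed by the caller."""

    def __init__(self):
        self.blobs = {}  # node -> (source node or None, payload)

    def add_full(self, text):
        node = node_hash(text)
        self.blobs[node] = (None, text)
        return node

    def add_copy(self, src_path, src_node, body):
        full = with_copy_meta(src_path, src_node, body)
        node = node_hash(full)                 # id of "copy + full revision"
        delta = make_delta(self.read(src_node), full)
        self.blobs[node] = (src_node, delta)   # but store "copy + delta"
        return node

    def read(self, node):
        src_node, payload = self.blobs[node]
        if src_node is None:
            return payload
        text = apply_delta(self.read(src_node), payload)
        if node_hash(text) != node:            # verify after reconstruction
            raise ValueError("integrity check failed")
        return text


body = b"kernel image bytes\n" * 1000
store = HoistedStore()
orig = store.add_full(body)
moved = store.add_copy(b"f", orig, body)
print(len(store.blobs[moved][1][2]), "bytes stored for the renamed file")
```

In this toy run only the metadata header is stored for the rename, yet the id is the one an old client would compute for the full revision.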
I understand that. Perhaps instead of moving that much work up the revlog could be given an external data handler when you asked it for data. And for the write side you could give it an optional argument with the data for revlog to use as the base for whatever diffing algorithm it might choose to use. The contract would be that the external data handler you passed on read would be able to retrieve that base for any revision in which you passed on such a base on write.
Oh, better idea for write... Pass in an optional external data handler on write. If there is one, it should be able to provide the data for the base of the revision for diff purposes, and it should be able to provide a cookie that will be given to the external data handler for read. That way the external data handler doesn't have to remember any associations between the revision and the data. It will be able to ask the revlog to hand it the cookie, which will allow it to fetch those.
Guys, I understand there are technical challenges in this issue, but maybe Something Could Be Done? I think this issue should be one in the major list -- people usually convert their svn repos with hg and git and compare sizes to see which DVCS to use. And you know, because of this issue hg often loses.
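A rough sketch of the handler-plus-cookie contract from the two comments above. Everything here is hypothetical: CookieRevlog, CopySourceHandler, base_for_write and fetch are names made up for the example, and the cookie is simply a (path, rev) pair. The revlog stores the cookie opaquely next to the delta and hands it back to whatever handler is supplied on read.

```python
def make_delta(base, new):
    """Toy delta: trim the common prefix and suffix, keep the middle."""
    limit = min(len(base), len(new))
    p = 0
    while p < limit and base[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and base[-1 - s] == new[-1 - s]:
        s += 1
    return (p, len(base) - s, new[p:len(new) - s])


def apply_delta(base, delta):
    start, end, data = delta
    return base[:start] + data + base[end:]


class CookieRevlog:
    def __init__(self):
        self._entries = []  # (cookie, payload); cookie is None for full text

    def add(self, text, handler=None):
        if handler is None:
            self._entries.append((None, text))
        else:
            # The handler supplies the diff base and an opaque cookie.
            base, cookie = handler.base_for_write()
            self._entries.append((cookie, make_delta(base, text)))
        return len(self._entries) - 1

    def revision(self, rev, handler=None):
        cookie, payload = self._entries[rev]
        if cookie is None:
            return payload
        if handler is None:
            raise ValueError("revision %d needs an external data handler" % rev)
        # The revlog hands the cookie back; it never interprets it.
        return apply_delta(handler.fetch(cookie), payload)


class CopySourceHandler:
    """Hypothetical handler backed by a dict of path -> CookieRevlog."""

    def __init__(self, logs, src_path=None, src_rev=None):
        self.logs = logs
        self.src = (src_path, src_rev)

    def base_for_write(self):
        return self.fetch(self.src), self.src

    def fetch(self, cookie):
        path, rev = cookie
        return self.logs[path].revision(rev, self)


logs = {"f": CookieRevlog(), "g": CookieRevlog()}
r0 = logs["f"].add(b"line one\nline two\n")
r1 = logs["g"].add(b"line one\nline 2\n", CopySourceHandler(logs, "f", r0))
print(logs["g"].revision(r1, CopySourceHandler(logs)))
```

Note that the reading handler is constructed without any knowledge of which revision was copied from where; the cookie carries that.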
So here's an idea: discard the idea of redoing historical renames. People who want to do efficient renames for their history will have to do a full hg-to-hg conversion and work from there. Only future renames are supported. Would that be acceptable? It would open up a whole host of options.
djc: It would be the first format change that older versions would be unable to pull from. That means a MAJOR flag day. Keep in mind that we're regularly hearing from people running 0.9.5 and there are operating systems that have just been released containing 1.0.1. We really don't ever want to break old versions. And that still leaves us with the large conceptual hurdle: cross-revlog hash calculation.
I've summarized the plan I've outlined here at: http://www.selenic.com/mercurial/wiki/index.cgi/RenameSpaceSavingPlan
Agreed with mpm. In my case, we have a multi-100Mb repo with >100k revs in active use across at least a dozen public clones, by dozens of developers on several continents using an unknown mixture of Hg client versions, with rev hashes referred to in numerous public documents and issue reports. We would be unlikely to ever undertake a repo conversion unless moving to another SCM or (in the absence of shallow/narrow clones) splitting the repo into smaller pieces. Ideally, upon release of an Hg version supporting cheap renames, we would convert the server clones over a few maintenance hours during the weekend sometime, using whatever tool was recommended; and then recommend to developers that at their leisure they get the new version and either make fresh clones or convert their local clones. Wire compatibility is not absolutely essential if hashes are preserved - we could wait to change format for, say, a year after the version of Hg with cheap rename support was released, so that everyone gets a chance to upgrade - but certainly desirable.
When renames can use pointers, could similar pointers also be used for shallow copies, so past revisions can be loaded lazily? Actually a shallow copy only needs the data since the last snapshot, so requesting earlier revisions could trigger a similar second request for the data as cheap renames, but in that case for downloading it and then rechecking the missing changeset.
ArneBab, no, that doesn't make sense. And, it's entirely off-topic for this issue.
I'm working on it as it's my GSoC project.
Degrading to bug.
Progress report: http://markmail.org/message/bz46xb62hid57ewx (for those who are not following the mailing list)
No longer in progress
Hello, I apologize if this was discussed before. The extra space is used because a new file is created in the repo to track the future history of the renamed file. It repeats the data to conserve history, more or less, if I understood correctly. I read at http://mercurial.selenic.com/wiki/RenameSpaceSavingPlan that there are a lot of complications in handling revlogs as deltas (they are no longer self-contained). Is there a way to work around this with another approach? For example, instead of creating deltas, the space could be saved by compressing/packing the related revlogs and keeping them compressed together. It could be a new operation of the filelog (the filelog tracks the renames?) to decompress/compress the revlogs. As all the history is in one revlog, maybe only one of them would need to be uncompressed at a specific time. I don't know if this makes sense, but I was thinking that it might be easier to implement while keeping backward compatibility, as the revlogs' content will not change, so the hashes do not need to change. It is just one extra layer. What do you think? Regards
My 2c: Version compatibility would not be possible anyway when repos are on a shared folder or when they're copied as files. Both of these methods are appealing due to ease/simplicity. Also, the impact of preserving compatibility on Mercurial's relatively clean code/design should be considered. IMHO it's quite valuable. Plus, I'm sure some admins would prefer to force an upgrade than have increased io/cpu usage on their servers. If there are optional default-off optimisations as part of, perhaps, "Mercurial 2", people can choose if and when to force an upgrade. I myself need this for a proposed project which would store mp3 podcasts in hg to version-control their ID3 tags and filenames. Mercurial is great except for the renames part.
The usual response when this is brought up on a mailing list is that a compressed text file doesn't take up much space. This obviously doesn't apply to binary resources, but doesn't always apply to text files either. If someone does a lot of refactoring, it is indeed possible that rename/move copying will take up more space in the repository than actual changesets.
--- Bug imported by bugzilla@serpentine.com 2012-05-12 08:46 EDT --- This bug was previously known as _bug_ 883 at http://mercurial.selenic.com/bts/issue883
Are there any plans to fix this issue?
There are plans: http://mercurial.selenic.com/wiki/RenameSpaceSavingPlan But plans do not translate to development resources or timetables.
Bug was inactive for 728 days, archiving
Reopening because I believe there is high demand for this feature (48 users CCed).
Bulk change: standard priority for features is 'wish'.
Bulk move open feature requests to wish priority.
Bug was inactive for 184 days, archiving
Does bundle2 do anything to make this more feasible?
Bundle2 makes it easier to carry a new changelog format. On the other hand, implementing this probably requires hash changes (if I remember the sprint discussion right), which makes it a significant jump. We might make that jump with the tree manifest one (and the sha2 one).
The hash issue is easy to work around. The primary problem is implicit forward references on clone/pull:

- rename B to A
- add a bunch of revisions to A
- do a clone
- client receives file data in alphabetical order
- client receives first A revision, delta against B (fine)
- client receives a bunch more A deltas, has to store a full A
- hasn't received B yet, so can't construct A

Because we can have simultaneous renames of A to B and B to A, no filename reordering can fix this problem.
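The forward-reference scenario can be modelled in a few lines. This is only a toy model with an invented data layout: each revision is a (base, needs_full) pair, where base is None for a full text, an int for an earlier revision of the same file, or a (path, rev) pair for a revision of another file, and needs_full marks a point where the receiver has to materialize the full text. first_stuck_revision and resolvable are hypothetical helper names.

```python
def resolvable(filelogs, received, path, rev):
    """Walk the delta chain; fail if it crosses into a file not yet received."""
    while True:
        base, _needs_full = filelogs[path][rev]
        if base is None:
            return True
        if isinstance(base, tuple):
            path, rev = base
            if path not in received:
                return False
        else:
            rev = base


def first_stuck_revision(filelogs, order):
    """Return the first (path, rev) the receiver cannot build, or None."""
    received = set()
    for path in order:
        for rev, (_base, needs_full) in enumerate(filelogs[path]):
            if needs_full and not resolvable(filelogs, received, path, rev):
                return (path, rev)
        received.add(path)
    return None


# B renamed to A, then more revisions added to A.
simple = {
    "A": [(("B", 0), False), (0, True)],
    "B": [(None, False)],
}
# Simultaneous renames: A's history continues from B and vice versa.
swap = {
    "A": [(None, False), (("B", 0), False), (1, True)],
    "B": [(None, False), (("A", 0), False), (1, True)],
}

print(first_stuck_revision(simple, ["A", "B"]))  # alphabetical order
print(first_stuck_revision(simple, ["B", "A"]))  # reordered
print(first_stuck_revision(swap, ["A", "B"]))
print(first_stuck_revision(swap, ["B", "A"]))
```

In this model, reordering rescues the single rename but neither order works for the swap, which matches the closing sentence of the comment.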
Bug was inactive for 150 days, archiving
Re-opening bug; hasn't yet been fixed AFAIK and I'm still interested in a fix, especially since Java repos tend to get bloated without this.
Bug was inactive for 206 days, archiving
Re-opening as still an issue
I don't really see the difficulties here, assuming generaldelta + bundle2. It should be possible to simply extend the "delta base" field in the revlog entry header to allow e.g. revision 124 in the filelog for a.txt to specify a delta base of "b.txt@123". We still hash the final full text, as opposed to the delta, so no hash change is required.

As for constructing full revisions, it's only needed when you're redeltaing for some reason. Otherwise, the receiving end trusts the sender's deltas, and stores them without trying to redelta - IIRC this is what we do for most GD->GD operations. And since 4.1, we even have in-place repository upgrade and optimization, so we don't have to rely on redelta-on-clone.

So, mpm's scenario can play out like this:

- rename B to A
- add a bunch of revisions to A
- do a clone
- client receives file data in alphabetical order
- client receives first A revision as a delta against B, stores it
- client receives a bunch more A deltas => just store them
- client eventually receives a full A (if a full revision is needed on the client side, it's also needed on the server side), and stores it

In debugupgraderepo or local clone, we have random access to the source and destination revlogs, so AFAIK we can simply construct the new revlogs changeset-by-changeset instead of file-by-file.

Such an extension would, by analogy with generaldelta, need a new repo requirement - I propose the name foreigndelta.
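A minimal sketch of the "store the sender's deltas as-is" idea, under loud assumptions: ForeignDeltaStore and its methods are invented, the delta base is always written as a (path, rev) pair standing in for "b.txt@123", and the integrity hash is a plain SHA-1 of the full text rather than Mercurial's real node hash. The receiver appends entries in wire order without resolving any base, and only resolves (and checks the hash of the final full text) when a revision is read.

```python
import hashlib


def apply_delta(base, delta):
    """Toy delta: replace base[start:end] with data."""
    start, end, data = delta
    return base[:start] + data + base[end:]


class ForeignDeltaStore:
    def __init__(self):
        self.logs = {}  # path -> list of (base, payload, node)

    def receive(self, path, base, payload, node):
        # Trust the sender's delta: append it without resolving the base,
        # so it does not matter whether the base file has arrived yet.
        self.logs.setdefault(path, []).append((base, payload, node))

    def read(self, path, rev):
        base, payload, node = self.logs[path][rev]
        if base is None:
            text = payload
        else:
            text = apply_delta(self.read(*base), payload)
        # The id covers the final full text, not the stored delta.
        if hashlib.sha1(text).digest() != node:
            raise ValueError("integrity check failed for %s@%d" % (path, rev))
        return text


def sha1(data):
    return hashlib.sha1(data).digest()


b0 = b"old name\ncontents\n"
a0 = b"new name\ncontents\n"
a1 = a0 + b"more\n"

store = ForeignDeltaStore()
# Wire order is alphabetical: both a.txt entries arrive before b.txt.
store.receive("a.txt", ("b.txt", 0), (0, 3, b"new"), sha1(a0))
store.receive("a.txt", ("a.txt", 0), (len(a0), len(a0), b"more\n"), sha1(a1))
store.receive("b.txt", None, b0, sha1(b0))
print(store.read("a.txt", 1))
```

The sketch does not settle the open question from the thread, namely whether a receiver is ever forced to build a full text before the base file has arrived; it only shows that nothing in the storage step itself requires it.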
I’m also still interested in a fix. I am using Mercurial to track my maildir (originally intended as a pressure test of Mercurial, but it became part of my general workflow), and that’s close to the worst case for rename space consumption since files are typically stored as new and then renamed as read (or when other flags are added), but the content typically does not change. I have 9.6GiB of textual data with the history requiring 14GiB. I would happily use that repository for testing the effectiveness of a fix.
*** Bug 5575 has been marked as a duplicate of this bug. ***
The new link to the space saving plan is https://www.mercurial-scm.org/wiki/RenameSpaceSavingPlan
As an alternative to my previous "foreigndelta" proposal, we now have a new option thanks to the new sparse revlog feature (bug 5480). With sparse revlogs, it can now be economical to use the same filelog for multiple files - e.g. not opening a new filelog when a file is renamed/copied, but instead storing revisions for both filenames in the pre-existing filelog, or even using a single large filelog (or just a few) for all files in the repository. In this setup, the pre- and post-rename revisions end up in the same revlog, so no extension of the delta base field is required. Previously, this would have been uneconomical due to the "delta chain span" constraint (and before generaldelta, the requirement for deltas to always be based on the previous revlog entry). Sparse revlogs relax this constraint, so having unrelated revisions stored in the same revlog file between the relevant deltas should no longer incur a performance penalty.
Is there something I can do to help make this real? My time is severely constrained but I could at least test stuff and would also happily donate towards this. I have a local repository whose .hg contains around 600k files (find ~/.local/share/mail/.hg | wc -l → 608907), so I could easily contribute some performance testing on a real, rename-heavy repository. It is a maildir and most files got renamed at least once. The .hg weighs around 17GiB now.
Sadly, still present. And I see this insane auto-archive thing is still active too.
@Gábor Stefanik: Are you working on your idea of using sparse revlogs? If yes, could you note it in the wiki? https://www.mercurial-scm.org/wiki/RenameSpaceSavingPlan
Still very much present, unfortunately.
Still present.
Still a thing.
Bug was inactive for 154 days, archiving
Bug was inactive for 151 days, archiving