Windows people: please help check idea for a new Mercurial repository layout

Matt Mackall mpm at selenic.com
Sat Jun 14 11:45:46 CDT 2008


On Sat, 2008-06-14 at 17:43 +0200, Adrian Buehlmann wrote:
> On 14.06.2008 17:05, Matt Mackall wrote:
> > On Sat, 2008-06-14 at 12:22 +0100, Paul Moore wrote:
> >> 2008/6/14 Adrian Buehlmann <adrian at cadifra.com>:
> >>> So, a simple file/directory encoding strategy (for a new Mercurial repo
> >>> layout) would be to prepend a period to every directory- and filename
> >>> inside the .hg dir.
> >>>
> >>> If anyone knows this encoding doesn't work, please tell me!
> >> I see no reason why this would not work. However, it may be a bit more
> >> wide-ranging than is needed.
> > 
> > Actually, it's not wide-ranging enough - it doesn't solve the long
> > filename problem and we're not going to switch layouts unless we solve
> > both.
> 
> Ok. Good to know what the requirements are. So I can drop that track now.
> 
> > As I've mentioned before, it *is* possible to create both very long path
> > names *and* files with reserved filenames on Windows by using the
> > Unicode APIs. And that doesn't even require a layout change.
> 
> Sorry, I must have missed that one of your statements then.
> 
> Per my understanding, explorer.exe and other equally important programs can't
> handle such \\?\ paths (for example winzip and potentially other archivers).

Oh, I'm sure it will confuse the hell out of lots of things. But only
when users have such files in their repo. And only when they use tools
that poke in .hg. And there aren't many good reasons to do that beyond
backup and manual cloning. But really this is just a quick hack.

Here are the repo layout requirements:

a) does not randomize disk layout (i.e. no hashing)
b) avoids case collisions
c) uses only ASCII
d) handles stupidly long filenames
e) avoids filesystem quirks like reserved words and characters
f) mostly human-readable (optional)
g) reversible (optional)

(a) is important for performance. Filesystems are likely to store
sequentially created files in the same directory near each other on
disk. Disks and operating systems are likely to cache such data. Thus,
always reading and writing files in a consistent order gives the
operating system and hardware its best opportunity to optimize layout.
Ideally, the store layout should roughly mirror the working dir layout.
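
As a rough illustration of (a), the sketch below compares a store path that
mirrors the working dir with one derived from a hash. The function names, the
"data/" prefix, the ".i" suffix and the two-character fan-out are assumptions
made for this example only, not a description of any actual layout.

```python
import hashlib
import posixpath


def mirrored_store_path(tracked_path):
    # Hypothetical layout: the store path follows the working dir path,
    # so files from one directory stay together in one store directory.
    return "data/" + tracked_path + ".i"


def hashed_store_path(tracked_path):
    # Hypothetical layout: the store path is derived from a SHA-1 digest,
    # so the store directory has nothing to do with the original directory.
    digest = hashlib.sha1(tracked_path.encode("utf-8")).hexdigest()
    return "data/" + digest[:2] + "/" + digest[2:] + ".i"


if __name__ == "__main__":
    for name in ("src/a.c", "src/b.c", "src/c.c"):
        print(name, mirrored_store_path(name), hashed_store_path(name))
```

With the mirrored form, every file under src/ lands in data/src/, so walking
the working dir in order also walks the store in order. With the hashed form,
the store directory depends only on the digest.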

Point (g) is interesting. If we throw out cryptographically-strong
hashing because of (a), we either have to expand names to meet (b) and
(c) or throw out information and risk collisions. And we don't really
want that, so (g) may effectively be implied. It's also worth
considering (f): it makes understanding what's going on in the store a
hell of a lot easier, especially when something goes wrong. Which again
means: no hashing.
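
To make (b), (c) and (g) concrete, here is a minimal sketch of one possible
reversible escaping scheme. It is only an illustration: the names encode_name
and decode_name, the choice of "_" and "~" as escape characters, and the set
of escaped punctuation are all assumptions of this example. It does not
attempt (d) or the reserved-word half of (e).

```python
def encode_name(name):
    """Escape a path so the result is lower-case-safe, ASCII-only and reversible."""
    out = []
    for byte in name.encode("utf-8"):
        ch = chr(byte)
        if ch == "_":
            out.append("__")              # escape the escape character itself
        elif "A" <= ch <= "Z":
            out.append("_" + ch.lower())  # "A" -> "_a": no case collisions
        elif byte < 0x20 or byte >= 0x7F or ch in '\\:*?"<>|~':
            out.append("~%02x" % byte)    # non-ASCII and awkward bytes as hex
        else:
            out.append(ch)                # everything else, including "/", as-is
    return "".join(out)


def decode_name(encoded):
    """Invert encode_name. Assumes the input was produced by encode_name."""
    raw = bytearray()
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        if ch == "_":
            nxt = encoded[i + 1]
            raw.append(ord("_") if nxt == "_" else ord(nxt.upper()))
            i += 2
        elif ch == "~":
            raw.append(int(encoded[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(ch))
            i += 1
    return raw.decode("utf-8")


if __name__ == "__main__":
    for name in ("README", "readme", "Foo_bar.c"):
        print(name, "->", encode_name(name))
```

Note the cost: "README" is 6 characters and its encoded form is 12, and every
non-ASCII byte takes 3. A scheme like this keeps names readable and
reversible, but it only ever makes paths longer.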

Which means that adding (d) is hard, because just about any solution to
(b) and (c) will blow up path lengths, especially in the presence of
interesting character sets. If our absolute paths are limited to a mere
260 characters or so (MAX_PATH on Windows) and people want to have
multi-line filenames, we've got a problem.

So we probably have to find a way to write longer filenames (NTFS's real
limit is around 32k characters).
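
For reference, the \\?\ form mentioned earlier in the thread is the documented
Windows way to pass extended-length paths to the Unicode file APIs. The helper
below is a sketch of how such a path could be built; the function name is
hypothetical, and it only manipulates strings (via ntpath), so it does not
touch the filesystem or prove that any given tool copes with the result.

```python
import ntpath


def extended_length_path(path):
    """Return an absolute Windows path in \\\\?\\ extended-length form."""
    if path.startswith("\\\\?\\"):
        return path                      # already in extended-length form
    if not ntpath.isabs(path):
        raise ValueError("extended-length paths must be absolute: %r" % path)
    # The prefix turns off path normalization in the Win32 layer, so
    # normalize here: backslashes only, no "." or ".." components.
    path = ntpath.normpath(path)
    if path.startswith("\\\\"):
        # UNC path: \\server\share\x becomes \\?\UNC\server\share\x
        return "\\\\?\\UNC\\" + path[2:]
    return "\\\\?\\" + path


if __name__ == "__main__":
    print(extended_length_path("C:/repo/.hg/store/data/aux.i"))
```

Because normalization is skipped for prefixed paths, names such as aux.i are
passed through literally rather than being treated as a device name, which is
why the same trick covers both the length problem and reserved filenames.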

-- 
Mathematics is the supreme nostalgia of our time.


