[PATCH 2 of 5 v5] store: implement fncache basic path encoding in C

Mads Kiilerich mads at kiilerich.com
Wed Sep 12 14:22:57 CDT 2012


On 09/12/2012 08:36 AM, Noel Grandin wrote:
>
> On 2012-09-12 00:59, Adrian Buehlmann wrote:
>> On 2012-09-10 22:34, Bryan O'Sullivan wrote:
>>> store: implement fncache basic path encoding in C
>> I have a (possibly crazy) idea:
>>
>> What if we would do a new repo format - let's call it "fasthash" [1] -
>> with the following characteristics:
>>
>> a) fixes issue3621
>> b) does a slightly simpler encoding for hashed paths
>> c) uses the same encoding as we currently have for short paths
>>
>
> Why not just always hash the paths?

One reason not to use hashes is that Mercurial and many other tools 
store and visit files in alphabetic order. A simple backup/restore or 
recursive copy of files will place the actual file content on disk in 
alphabetical order and thus give some kind of 'defragmentation' for 
access in that order, making the block device access mostly sequential. 
That will make a difference, especially with small files on spinning 
media where we have read-ahead and relatively high seek times.

With Mercurial's current encoding of filenames the store will have almost 
the same sort order as the corresponding filenames, and we will often 
benefit from the sequential access. If path hashes were used in the 
filename encoding, access would be more random. This is allegedly one 
of the reasons Mercurial in some cases outperforms git.
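
To make the sort-order point concrete, here is a rough sketch. Note that
toy_encode is a made-up stand-in and not Mercurial's real filename
encoding, and the paths are invented; the only point is to compare the
order of store names against the order of the original names.

```python
import hashlib

def toy_encode(path):
    # Hypothetical stand-in, NOT Mercurial's real encoding:
    # escape uppercase letters as "_" + lowercase, keep the rest.
    return "".join("_" + c.lower() if c.isupper() else c for c in path)

def hashed(path):
    # Store name derived purely from a hash of the path.
    return hashlib.sha1(path.encode("utf-8")).hexdigest()

def order_positions(paths, key):
    """For each path taken in filename order, return its rank among
    the store names produced by `key`."""
    by_name = sorted(paths)
    by_store = sorted(paths, key=key)
    rank = {p: i for i, p in enumerate(by_store)}
    return [rank[p] for p in by_name]

# Invented example paths.
paths = ["src/a.c", "src/b.c", "src/c.c",
         "doc/readme.txt", "doc/api.txt", "tests/t1.py"]

print(order_positions(paths, toy_encode))  # ranks in increasing order
print(order_positions(paths, hashed))      # ranks in hash order
```

With the simple encoding, walking the files by name also walks the store
names in order; with hashed names the ranks follow the hash values
instead.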

- but that is all anecdotal evidence and might be irrelevant. 
Benchmarking of worst and best and realistic cases will tell how big the 
impact really is.

Another consideration is that directories with a lot of files perform 
badly on some filesystems. It might be necessary to use multiple levels 
of directories.
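
A minimal sketch of what multiple directory levels could look like if
names were hashed. The function name fanout_path and the levels/width
parameters are my own assumptions for illustration, not anything from
the patch.

```python
import hashlib
import posixpath

def fanout_path(name, levels=2, width=2):
    """Spread hashed names over nested directories.

    Hypothetical layout: the first `levels` groups of `width` hex
    digits of the digest become directory names, so no single
    directory ends up holding every file.
    """
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return posixpath.join(*(parts + [digest]))

# Invented example path.
print(fanout_path("data/some/file.i"))
```

With two levels of two hex digits each, there are at most 256 entries
per directory level above the leaves.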

A simple encoding is also faster than computing a hash ... and you have 
to use a 'secure' hash function unless you want to handle hash collisions.
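
The collision point can be sketched like this. weak_hash is a
deliberately tiny, hypothetical hash (one byte of a SHA-1 digest), so
that 257 invented names are guaranteed to collide; a store using such a
hash would need code to detect and resolve that.

```python
import hashlib

def weak_hash(name, nbytes=1):
    # Hypothetical weak hash: keep only the first `nbytes` bytes
    # (2 hex digits per byte) of the SHA-1 digest.
    return hashlib.sha1(name.encode("utf-8")).hexdigest()[:2 * nbytes]

def find_collisions(names, hashfn):
    """Return (first_name, later_name) pairs that share a hash value."""
    seen = {}
    collisions = []
    for name in names:
        h = hashfn(name)
        if h in seen and seen[h] != name:
            collisions.append((seen[h], name))
        else:
            seen[h] = name
    return collisions

# 257 distinct names but only 256 possible one-byte hash values, so
# at least one collision must occur.
names = ["file%d.txt" % i for i in range(257)]
collisions = find_collisions(names, weak_hash)
print(len(collisions) > 0)
```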

A final reason for keeping a scheme like the current one is that it is 
quite transparent and easy to figure out what goes where. Pure hashes 
make it much harder to debug storage issues.

But if we store all the mappings from 'real' name to encoded name anyway, 
then it might be possible to come up with some other naming scheme where 
we generate some 'random' names that have the same sort order as the 
actual filenames.
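
One very naive way to get such names, purely as an illustration: number
the paths in sorted order and use the zero-padded number as the store
name. order_preserving_names and its width parameter are hypothetical,
and this ignores the hard part, which is adding files later without
renaming existing ones.

```python
def order_preserving_names(paths, width=8):
    """Map each path to an opaque, fixed-width name whose sort order
    matches the sort order of the real paths."""
    ordered = sorted(set(paths))
    return {p: "%0*d" % (width, i) for i, p in enumerate(ordered)}

# Invented example paths.
paths = ["b/y.txt", "a/x.txt", "c/z.txt"]
mapping = order_preserving_names(paths)

# Sorting by store name gives the same order as sorting by real name.
print(sorted(paths, key=mapping.get) == sorted(paths))
```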

/Mads

