[PATCH 2 of 5 v5] store: implement fncache basic path encoding in C

Thu Sep 13 01:40:45 CDT 2012

On 2012-09-12 21:22, Mads Kiilerich wrote:
> On 09/12/2012 08:36 AM, Noel Grandin wrote:
>>
>> Why not just always hash the paths?
>
> One reason not to use hashes is that Mercurial and many other tools 
> store and visit files in alphabetic order. A simple backup/restore or 
> recursive copy of files will place the actual file content on disk in 
> alphabetical order and thus give some kind of 'defragmentation' for 
> access in that order, making the block device access mostly 
> sequential. That will make a difference, especially with small files 
> on spinning media where we have read-ahead and relatively high seek 
> times.
>
Why is that? Most tools simply visit in whatever order gets returned 
from the API.
What is it about mercurial that it requires sorted traversal?

> With Mercurials current encoding of filenames the store will have 
> almost the same sort order as the corresponding filenames and we will 
> often benefit from the sequential access. If path hashes were used in 
> the filename encoding it would be more random access. This is 
> allegedly one of the reasons Mercurial in some cases outperform git.
>
That could probably be fixed by doing initial traversal in whatever 
order the API supplies, and then doing a sorting pass before returning 
the result to the user.
>
> Another consideration is that directories with a lot files perform 
> badly on some filesystems. It might be necessary to use multiple 
> levels of directories.
>
Yeah, I think that's a given.
> A simple encoding is also faster than computing a hash ... and you 
> have to use a 'secure' hash function unless you want to handle hash 
> collisions.
>
A good hash function should pretty disappear into the noise on current 
CPUs. With the right choice of operations in your encoding method, the 
CPU's functional cores are gently ticking along, and you become 
completely constrained by memory bandwidth.

> A final reason for keeping a scheme like the current is that now it is 
> quite transparent and easy to figure out what goes where. Pure hashes 
> makes it much harder to debug storage issues.
>
Yeah, I can see that that helps. But sometimes we just have to live with 
a little more complexity in the name of performance :-)


-- Noel Grandin

Disclaimer: http://www.peralex.com/disclaimer.html