[PATCH 1 of 6] transaction: ensure journal is committed to file system

Henrik Stuart hg at hstuart.dk
Wed Apr 22 02:07:43 CDT 2009


Matt Mackall wrote:
> On Tue, 2009-04-21 at 21:14 +0200, Dirkjan Ochtman wrote:
>> On Tue, Apr 21, 2009 at 19:33, Henrik Stuart <hg at hstuart.dk> wrote:
>>> +    def _write(self, data):
>>>         # add enough data to the journal to do the truncate
>>> -        self.file.write("%s\0%d\n" % (file, offset))
>>> +        self.file.write(data)
>>>         self.file.flush()
>>> +        # ensure journal is synced to file system
>>> +        os.fsync(self.file.fileno())
>> I believe Matt stated that he didn't want this.
> 
> I'm not really thrilled by it. For starters, it's not actually
> sufficient. It doesn't guarantee that the directory containing the inode
> is synced to disk.

Except in the initial clone, the directory will already exist; only the
file needs to be written, and that is exactly what fsync guarantees. I
think most people would expect that if their system fails partway
through a clone they will have to redo the clone, whereas if a strip is
interrupted by a system crash, they will be rather more annoyed to end
up with a corrupted repository.
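
For what it's worth, something like the following (an untested sketch,
not part of the posted patch; the helper name and the directory fd
handling are placeholders) would also sync the directory entry on
POSIX, which should address the missing-inode concern for the freshly
created journal case:

    import os

    # illustrative sketch only; helper name and error handling are
    # placeholders, not part of the patch series
    def fsync_file_and_dir(f):
        # flush Python's buffers, then ask the OS to commit the file data
        f.flush()
        os.fsync(f.fileno())
        # also sync the containing directory so a freshly created
        # journal survives a crash (moot when the directory already
        # existed, as in the non-clone case above)
        dirpath = os.path.dirname(os.path.abspath(f.name))
        dirfd = os.open(dirpath, os.O_RDONLY)
        try:
            os.fsync(dirfd)
        finally:
            os.close(dirfd)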

> Second, this will get called a -lot- on some operations. We can
> currently call journal write hundreds of times per second. If we throw
> a sync in here, we're now writing to at least two (if not three or
> four) different blocks on disk with each call and incurring seek
> penalties.

Sure, that's the penalty of interleaving transaction and regular file
writes. I looked at cleaning that up, but it required a larger overhaul.

> If we have a flash device, we could burn through tens of thousands of
> block writes in a single clone. Ouch.
> 
> If we've got a filesystem like ext3 which (until very recently) defaults
> to ordered writes, each sync is basically a pipeline stall for every
> user of the filesystem.

We should probably not focus on what ext3 does or does not do, but look
at this in general terms.

> So look at the journal as an optimization for cleanup and less as an
> integrity tool. We can always use verify to find the last
> self-consistent revision.

Unfortunate naming then.

> If you're so inclined, we can add a companion to the journal that stores
> just that revision number and gets synced at the beginning of a
> transaction. Then we can add a slow path in recover that strips to that
> revision.

I propose that, if the general fsync isn't preferable, we add a flush
method to the transaction that can optionally be called by whoever is
using it. That way strip can set up the entire transaction, call fsync
to commit the journal to disk, and then proceed merrily with a
recoverable repository. It doesn't, of course, do anything for normal
revlog writes, but it won't be worse for those than it is today. I'll
send the entire patch series in a revised version in a bit.
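
Roughly, and only as an untested sketch rather than the real
transaction code (the method name sync and the class shape here are
placeholders), the optional method could look something like this:

    import os

    # illustrative sketch only; not the actual Mercurial transaction class
    class transaction(object):
        def __init__(self, journalpath):
            self.file = open(journalpath, "a+b")

        def _write(self, data):
            # journal entries are buffered and flushed, but not fsynced,
            # so the common write path stays as cheap as it is today
            self.file.write(data)
            self.file.flush()

        def sync(self):
            # optional: callers such as strip can commit the journal to
            # disk once, after the whole transaction has been set up
            os.fsync(self.file.fileno())

strip would then call sync() once after queuing all of its truncation
entries, paying a single fsync for the whole operation rather than one
per journal write.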

-- 
Kind regards,
  Henrik Stuart
