[PATCH 1 of 6] transaction: ensure journal is committed to file system

Matt Mackall mpm at selenic.com
Wed Apr 22 14:49:59 CDT 2009


On Wed, 2009-04-22 at 09:07 +0200, Henrik Stuart wrote:
> Matt Mackall wrote:
> > On Tue, 2009-04-21 at 21:14 +0200, Dirkjan Ochtman wrote:
> >> On Tue, Apr 21, 2009 at 19:33, Henrik Stuart <hg at hstuart.dk> wrote:
> >>> +    def _write(self, data):
> >>>         # add enough data to the journal to do the truncate
> >>> -        self.file.write("%s\0%d\n" % (file, offset))
> >>> +        self.file.write(data)
> >>>         self.file.flush()
> >>> +        # ensure journal is synced to file system
> >>> +        os.fsync(self.file.fileno())
> >> I believe Matt stated that he didn't want this.
> > 
> > I'm not really thrilled by it. For starters, it's not actually
> > sufficient. It doesn't guarantee that the directory containing the inode
> > is synced to disk.
> 
> Except in the initial clone, the directory will already be there; only
> the file needs to be written, which fsync does guarantee.

Bzzt.

fsync guarantees that the file's inode and data are on disk. It does NOT
guarantee that a directory entry pointing to your inode exists on disk. 
In the case of a newly-created file like our journal, we need to sync
the directory contents as well for the journal to be there (and not
orphaned) when we try to recover.
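
For illustration, getting a newly-created journal durably on disk looks
roughly like this (a sketch, not the proposed patch; the directory
fsync trick is POSIX-only and the path handling is simplified):

import os

def sync_new_file(path):
    # flush the file's data and inode to disk
    fd = os.open(path, os.O_RDWR)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    # flush the containing directory too, so the directory entry for
    # the newly-created file is on disk and the journal isn't orphaned
    # (opening a directory read-only for fsync works on POSIX, not on
    # Windows)
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)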

> > Second, this will get called a -lot- on some operations. We can
> > currently call journal write hundreds of times per second. If we throw
> > a sync in here, we're now writing to at least two (if not three or
> > four) different blocks on disk with each call and incurring seek
> > penalties.
> 
> Sure, that's the penalty of having interleaved transaction and regular
> file writes - I looked at cleaning it up, but it required a larger overhaul.

No, I'm -not even counting the regular file writes above-. An fsync on
journal will require at least a write to the journal's inode and one to
the last data block of the file. If our file grows into a new block,
we'll need to go off and touch block bitmaps and group descriptors and
superblocks or various equivalents. For every write to the journal.
Assuming everything else we do stays in cache, that's at least two seeks
per journal write which means the fastest we can ever hope to go on
spinning media is about 50 journal entries per second and probably
significantly less.
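
To spell out the arithmetic (assuming a typical ~10ms average seek for
a desktop drive; the numbers are illustrative, not measured):

# back-of-the-envelope, assuming ~10ms average seek on a desktop drive
seek_time = 0.010            # seconds per seek
seeks_per_write = 2          # journal inode + last data block, best case
time_per_write = seeks_per_write * seek_time   # 0.02 s
writes_per_second = 1 / time_per_write         # 50 journal entries/s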

At this point, you might be thinking to yourself: I benchmarked it and
it was slower but not that slow. That's because fsync() doesn't actually
hit the platter on commodity disk hardware (anything you're likely to
have on your desk or in your lap); it just goes into the disk's write
cache. Thus, it provides little protection against integrity failure on
spontaneous reboots and power failure. It mostly helps when you get a
blue screen.

There are heavy-handed things that can be done to make sure that write
barriers are actually respected on commodity hardware, but these are
also expensive and not enabled by default on the usual operating
systems. The moral is that fsync cannot be relied upon to do what the
POSIX standard says it does, because the ATA drive standards people
subverted it about 10 years ago.

Now let's do some math: if an operation that takes 1 minute with no
journal syncing grows to 1.5 minutes with "soft-syncing", and our odds
of power failure are constant over time, has syncing improved our
reliability or decreased it? Depending on the implementation details of
the writeback cache, we could very well decrease our reliability.
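
To put toy numbers on it (these are made up, purely illustrative):

# toy model: with a constant failure rate, the chance of being
# interrupted scales with how long the operation takes
failure_rate = 1e-6          # assumed probability of power loss per second
plain = 60                   # seconds, no journal syncing
softsync = 90                # seconds, fsync into the write cache
p_plain = failure_rate * plain        # 6.0e-5
p_softsync = failure_rate * softsync  # 9.0e-5
# the exposure window grew by 50%; if the "sync" never reaches the
# platter, we've paid the cost without gaining real durability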

> > If we have a flash device, we could burn through tens of thousands of
> > block writes in a single clone. Ouch.
> > 
> > If we've got a filesystem like ext3 which (until very recently) defaults
> > to ordered writes, each sync is basically a pipeline stall for every
> > user of the filesystem.
> 
> We should probably not focus on what ext3 does or does not, but look at
> this in general terms.

None of the above is ext3-specific, but it is worth noting that one of
the most popular file systems makes this all much worse. And killing
people's thumb drives (which generally have really weak wear-leveling)
isn't really appreciated either.

> > So look at the journal as an optimization for cleanup and less as an
> > integrity tool. We can always use verify to find the last
> > self-consistent revision.
> 
> Unfortunate naming then.

It's still a journal. It protects against failures in Mercurial, and
against many failure modes in the operating system or hardware. And
we've also got an fsck-equivalent.

I'm also going to note that journals are not fsck replacements. A disk
losing power in the middle of a write can and will write random data to
random sectors on occasion. Journal recovery won't catch that, which is
one of the reasons why filesystems like ZFS and BTRFS have added CRCs.
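
To make the distinction concrete: journal recovery only truncates files
back to recorded offsets, it never looks at the bytes. Catching silent
corruption takes a content check along these lines (zlib.crc32 here is
just a stand-in for whatever hash the filesystem or verify uses):

import zlib

def checksum(data):
    # record this alongside the data when writing
    return zlib.crc32(data) & 0xffffffff

def intact(data, expected):
    # journal replay can't do this; only a content check can
    return checksum(data) == expected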

> > If you're so inclined, we can add a companion to the journal that stores
> > just that revision number and gets synced at the beginning of a
> > transaction. Then we can add a slow path in recover that strips to that
> > revision.
> 
> I propose that, if the general fsync isn't preferable, we add a flush
> method to transaction that may be optionally called by whoever is using
> the transaction. That way strip can set up the entire transaction, call
> fsync, committing it to disk, and then proceed merrily with a
> recoverable repository. It doesn't, of course, do anything for normal
> revlog writing, but it won't be worse for those than it is today. I'll
> send the entire patch series in a revised version in a bit.

Sure. Though I'm not sure why you don't like my revision number
suggestion. It has the added benefit that we can say "rollback will take
you back to revision x".
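
Something like this is all I have in mind (a sketch; the file name and
the opener/strip interfaces are stand-ins, not actual Mercurial API):

import os

def start_transaction(opener, tiprev):
    # record the last known-good revision once, at transaction start,
    # and sync only this tiny file
    f = opener("journal.baserev", "w")
    f.write("%d\n" % tiprev)
    f.flush()
    os.fsync(f.fileno())
    f.close()

def slow_recover(opener, strip):
    # slow path: if the journal is damaged or missing, strip everything
    # after the recorded base revision
    f = opener("journal.baserev")
    baserev = int(f.read().strip())
    f.close()
    strip(baserev + 1)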

-- 
http://selenic.com : development and support for Mercurial and Linux



