This mail is about an issue that has been of concern to me for quite a
while and I think it is (well past) time to air it more widely and try
to come to a resolution.
This issue is how write barriers (the block-device kind, not the
memory-barrier kind) should be handled by the various layers.
The following is my understanding, which could well be wrong in
various specifics. Corrections and other comments are more than
welcome.
------------
What are barriers?
==================
Barriers (as generated by requests with BIO_RW_BARRIER) are intended
to ensure that the data in the barrier request is not visible until
all writes submitted earlier are safe on the media, and that the data
is safe on the media before any subsequently submitted requests
are visible on the device.
This is achieved by tagging requests in the elevator (or any other
request queue) so that no re-ordering is performed around a
BIO_RW_BARRIER request, and by sending appropriate commands to the
device so that any write-behind caching is defeated by the barrier
request.
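To make that concrete, here is a hedged sketch of the submission
side, using the 2.6-era interface this mail is about (WRITE_BARRIER
combines the WRITE bit with 1 << BIO_RW_BARRIER; the bio itself is
assumed to be fully set up - device, sector, pages, bi_end_io -
elsewhere):

  #include <linux/bio.h>
  #include <linux/fs.h>

  /*
   * Submit an already-prepared commit block as a barrier write.  The
   * elevator must not reorder other requests around it, and the
   * driver is expected to defeat any volatile write-behind cache.
   */
  static void submit_commit_block(struct bio *bio)
  {
          submit_bio(WRITE_BARRIER, bio);
  }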
Alongside BIO_RW_BARRIER is blkdev_issue_flush, which calls
q->issue_flush_fn. This can be used to achieve similar effects.
There is no guarantee that a device can support BIO_RW_BARRIER - it is
always possible that a request will fail with EOPNOTSUPP.
Conversely, blkdev_issue_flush must be supported on any device that
uses write-behind caching (if it cannot be supported, then
write-behind caching should be turned off, at least by default).
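A hedged sketch of the flush side, assuming the current
blkdev_issue_flush(bdev, error_sector) signature and that a NULL
error_sector is acceptable (most in-tree callers pass NULL):

  #include <linux/blkdev.h>

  /*
   * Flush a device's volatile write-behind cache.  Given the rule
   * above, -EOPNOTSUPP can only legitimately mean there is nothing
   * to flush, so treat it as success.
   */
  static int flush_writeback_cache(struct block_device *bdev)
  {
          int err = blkdev_issue_flush(bdev, NULL);

          if (err == -EOPNOTSUPP)
                  err = 0;
          return err;
  }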
We can think of there being three types of devices:
1/ SAFE. With a SAFE device, there is no write-behind cache, or if
there is it is non-volatile. Once a write completes it is
completely safe. Such a device does not require barriers
or ->issue_flush_fn, and can respond to them either by a
no-op or with -EOPNOTSUPP (the former is preferred).
2/ FLUSHABLE.
A FLUSHABLE device may have a volatile write-behind cache.
This cache can be flushed with a call to blkdev_issue_flush.
It may not support barrier requests.
3/ BARRIER.
A BARRIER device supports both blkdev_issue_flush and
BIO_RW_BARRIER. Either may be used to synchronise any
write-behind cache to non-volatile storage (media).
Handling of SAFE and FLUSHABLE devices is essentially the same and can
work on a BARRIER device. The BARRIER device has the option of more
efficient handling.
How does a filesystem use this?
===============================
A filesystem will often have a concept of a 'commit' block which makes
an assertion about the correctness of other blocks in the filesystem.
In the most gross sense, this could be the writing of the superblock
of an ext2 filesystem, with the "dirty" bit clear. This write commits
all other writes to the filesystem that precede it.
More subtle/useful is the commit block in a journal as with ext3 and
others. This write commits some number of preceding writes in the
journal or elsewhere.
The filesystem will want to ensure that all preceding writes are safe
before writing the barrier block. There are two ways to achieve this.
1/ Issue all 'preceding writes', wait for them to complete (bi_end_io
called), call blkdev_issue_flush, issue the commit write, wait
for it to complete, call blkdev_issue_flush a second time.
(This is needed for FLUSHABLE)
2/ Set the BIO_RW_BARRIER bit in the write request for the commit
block.
(This is more efficient on BARRIER).
The second, while much easier, can fail. So a filesystem should be
prepared to deal with that failure by falling back to the first
option.
Thus the general sequence might be:
a/ issue all "preceding writes".
b/ issue the commit write with BIO_RW_BARRIER
c/ wait for the commit to complete.
If it was successful - done.
If it failed other than with EOPNOTSUPP, abort
else continue
d/ wait for all 'preceding writes' to complete
e/ call blkdev_issue_flush
f/ issue commit write without BIO_RW_BARRIER
g/ wait for commit write to complete
if it failed, abort
h/ call blkdev_issue_flush
DONE
Steps b and c can be left out if it is known that the device does not
support barriers. The only way to discover whether it does is to try
it and see if it fails. The whole sequence is sketched in code below.
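Rendered as code, the sequence might look like the following hedged
sketch. The journal-layer helpers (issue_preceding_writes,
wait_for_preceding_writes, issue_and_wait_commit) are hypothetical
placeholders for whatever the filesystem already has; only
WRITE_BARRIER, blkdev_issue_flush and -EOPNOTSUPP are the real
interface:

  #include <linux/blkdev.h>
  #include <linux/fs.h>

  /* Hypothetical journal-layer helpers, provided elsewhere. */
  extern void issue_preceding_writes(void);
  extern void wait_for_preceding_writes(void);
  extern int issue_and_wait_commit(int rw);

  static int commit_with_fallback(struct block_device *bdev)
  {
          int err;

          issue_preceding_writes();                      /* a */

          err = issue_and_wait_commit(WRITE_BARRIER);    /* b, c */
          if (err != -EOPNOTSUPP)
                  return err;    /* 0: done; else: real failure, abort */

          /* No barrier support: fall back to explicit flushes. */
          wait_for_preceding_writes();                   /* d */
          err = blkdev_issue_flush(bdev, NULL);          /* e */
          if (err)
                  return err;
          err = issue_and_wait_commit(WRITE);            /* f, g */
          if (err)
                  return err;
          return blkdev_issue_flush(bdev, NULL);         /* h */
  }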
I don't think any filesystem follows all these steps.
ext3 has the right structure, but it doesn't include steps e and h.
reiserfs is similar. It does have a call to blkdev_issue_flush, but
that is only on the fsync path, so it isn't really protecting
general journal commits.
XFS - I'm less sure. I think it does 'a' then 'd', then 'b' or 'f'
depending on whether it thinks the device handles barriers,
and finally 'g'.
I haven't looked at other filesystems.
So the filesystems work OK on devices that support BIO_RW_BARRIER and
on devices that don't need any flush, but on devices that need
flushing yet don't support BIO_RW_BARRIER, none of them work. This
should be easy to fix.
HOW DO MD or DM USE THIS
========================
1/ Striping devices.
This includes md/raid0 md/linear dm-linear dm-stripe and probably
others.
These devices can easily support blkdev_issue_flush by simply
calling blkdev_issue_flush on all component devices.
These devices would find it very hard to support BIO_RW_BARRIER.
Doing this would require keeping track of all in-flight requests
(which some, possibly all, of the above don't) and then:
When a BIO_RW_BARRIER request arrives:
wait for all pending writes to complete
call blkdev_issue_flush on all devices
issue the barrier write to the target device(s)
as BIO_RW_BARRIER,
if that fails with -EOPNOTSUPP: re-issue without the flag, wait, flush.
Currently none of the listed modules do that.
md/raid0 and md/linear fail any BIO_RW_BARRIER with -EOPNOTSUPP.
dm-linear and dm-stripe simply pass the BIO_RW_BARRIER flag down,
which means data may not be flushed correctly: the commit block
might be written to one device before a preceding block is
written to another device.
I think the best approach for this class of devices is to return
-EOPNOTSUPP. If filesystems do the wait (which they all do already)
and the blkdev_issue_flush (which is easy to add), they don't need
BIO_RW_BARRIER support from these devices. A sketch of such a flush
function follows.
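Hedged sketch of such a flush function (stripe_conf is a made-up
stand-in for the per-module configuration; the real md and dm
structures differ):

  #include <linux/blkdev.h>
  #include <linux/genhd.h>

  struct stripe_conf {
          int nr_disks;
          struct block_device **bdevs;
  };

  /*
   * issue_flush_fn for a striped device: flushing the logical device
   * is just flushing every component; report the first error.
   */
  static int stripe_issue_flush(struct request_queue *q,
                                struct gendisk *disk,
                                sector_t *error_sector)
  {
          struct stripe_conf *conf = q->queuedata;
          int i, err = 0;

          for (i = 0; i < conf->nr_disks; i++) {
                  int e = blkdev_issue_flush(conf->bdevs[i],
                                             error_sector);
                  if (e && !err)
                          err = e;
          }
          return err;
  }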
2/ Mirror devices. This includes md/raid1 and dm-raid1.
These devices can trivially implement blkdev_issue_flush much like
the striping devices, and can support BIO_RW_BARRIER to some
extent.
md/raid1 currently tries. I'm not sure about dm-raid1.
md/raid1 determines if the underlying devices can handle
BIO_RW_BARRIER. If any cannot, it rejects such requests (-EOPNOTSUPP)
itself.
If all underlying devices do appear to support barriers, md/raid1
will pass a barrier-write down to all devices.
The difficulty comes if it fails on one device, but not all
devices. In this case it is not clear what to do. Failing the
request is a lie, because some data has been written (possibly too
early). Succeeding the request (after re-submitting the failed
requests) is also a lie as the barrier wasn't really honoured.
md/raid1 currently takes the latter approach, but will only do it
once - after that it fails all barrier requests (this retry-once
policy is sketched in code below).
Hopefully this is unlikely to happen. What device would work
correctly with barriers once, and then not the next time?
The answer is md/raid1. If you remove a failed device and add a
new device that doesn't support barriers, md/raid1 will notice and
stop supporting barriers.
If md/raid1 can change from supporting barriers to not, then maybe
some other device could too?
I'm not sure what to do about this - maybe just ignore it...
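For reference, the retry-once policy mentioned above, as a hedged
sketch. Every name here is hypothetical (the real md/raid1 code
tracks barrier support per component and must clone bios to retry);
only the shape of the decision is the point:

  #include <linux/bio.h>

  struct mirror_conf {
          int barriers_work;      /* cleared on first -EOPNOTSUPP */
  };

  /* Hypothetical helpers for the two possible outcomes. */
  extern void redo_as_plain_write_and_flush(struct mirror_conf *conf,
                                            struct bio *master_bio);
  extern void complete_master_bio(struct bio *master_bio, int error);

  static void mirror_barrier_done(struct mirror_conf *conf,
                                  struct bio *master_bio, int error)
  {
          if (error == -EOPNOTSUPP) {
                  /* Succeed this once, but never advertise again. */
                  conf->barriers_work = 0;
                  redo_as_plain_write_and_flush(conf, master_bio);
                  return;
          }
          complete_master_bio(master_bio, error);
  }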
3/ Other modules
Other md and dm modules (raid5, mpath, crypt) do not add anything
interesting to the above. Either handling BIO_RW_BARRIER is
trivial, or extremely difficult.
HOW DO LOW LEVEL DEVICES HANDLE THIS
====================================
This is part of the picture that I haven't explored greatly. My
feeling is that most if not all devices support blkdev_issue_flush
properly, and support barriers reasonably well providing that the
hardware does.
There is an exception I recently found, though.
For devices that don't support QUEUE_ORDERED_TAG (i.e. commands sent to
the controller can be tagged as barriers), SCSI will use the
SYNCHRONIZE_CACHE command to flush the cache after the barrier
request (a bit like the filesystem calling blkdev_issue_flush, but at
a lower level). However it does this without setting the SYNC_NV bit.
This means that a device with a non-volatile cache will be required --
needlessly -- to flush that cache to media.
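To show where the bit lives, a hedged sketch of a SYNCHRONIZE
CACHE (10) CDB builder (illustrative only, not the actual sd code
path; per SBC-2, SYNC_NV is bit 2 of byte 1):

  #include <linux/string.h>
  #include <scsi/scsi.h>

  static void build_sync_cache_cdb(unsigned char cdb[10], int sync_nv)
  {
          memset(cdb, 0, 10);
          cdb[0] = SYNCHRONIZE_CACHE;     /* opcode 0x35 */
          if (sync_nv)
                  cdb[1] |= 1 << 2;       /* SYNC_NV: syncing to a
                                             non-volatile cache is
                                             sufficient */
          /* LBA and block count left zero: sync the whole medium. */
  }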
So: some questions to help encourage response:
- Is the above substantially correct? Totally correct?
- Should the various filesystems be "fixed" as suggested above? Is
someone willing to do that?
- Is the approach to barriers taken by md appropriate? Should dm
do the same? Who will do that?
- Is setting the SYNC_NV bit really the right thing to do? Are there
any other places where the wrong sort of sync might be happening?
Are there any callers that require SYNC_NV to be clear?
- The comment above blkdev_issue_flush says "Caller must run
wait_for_completion() on its own". What does that mean?
- Are there other bits that we could handle better?
BIO_RW_FAILFAST? BIO_RW_SYNC? What exactly do they mean?
Thank you for your attention.
NeilBrown