Re: O_DIRECT question

Andrea Arcangeli wrote:
> When you have I/O errors during _writes_ (not reads!) the raid must
> kick the disk out of the array before the OS ever notices. And if
> it's software raid that you're using, the OS should kick out the
> disk before your app ever notices any I/O error. When the write I/O
> error happens, it's not a problem for the application to solve.

I thought it obvious that we were talking about non-recoverable errors that DO make it to the application. And any kind of mission-critical app most definitely does care about write errors. You don't want your db completing the transaction when it was only half recorded. It needs to know the write failed so it can back out and/or recover the data and record it elsewhere. You certainly don't want the users to think everything is fine, walk away, and have the system limp on, making things worse by the second.

> when the I/O error reaches the filesystem you're lucky if the OS
> doesn't crash (ext3 claims to handle it); if your app receives the
> I/O error all you should be doing is shutting things down
> gracefully, sending all the errors you can to the admin.

If the OS crashes due to an I/O error while reading user data, then something is seriously wrong and beyond the scope of this discussion. Suffice it to say that, given the semantics of write() and sound engineering practice, the application expects to be notified of errors so it can try to recover or fail gracefully. Whether it chooses to fail gracefully as you say it should, or to recover from the error, it needs to know that an error happened, and where.

> It doesn't matter much where the error happened; all that matters is
> that you didn't have a fault tolerant raid setup (your fault) and
> your primary disk just died and you're now screwed(tm). If you could
> trust that part of the disk is still sane you could perhaps attempt
> to avoid a restore from the last backup, otherwise all you can do is
> the equivalent of an e2fsck -f on the db metadata after copying what
> you can still read to the new device.

It most certainly matters where the error happened, because "you are screwed" is not an acceptable outcome in a mission-critical application. A well engineered solution deals with errors as best it can; it does not simply give up and tell the users they are screwed because the designer was lazy. There is a reason that read() and write() return the number of bytes _actually_ transferred, and the application is supposed to check that result to verify proper operation.
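
To illustrate that convention (a minimal sketch; write_all() and its error reporting are made up, not code from any real db):

#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: write the whole buffer, retrying short writes.
 * Returns 0 on success; on failure returns -1 with errno set, and
 * *done tells the caller exactly how many bytes made it to the file,
 * i.e. where the error happened and what it has to recover or redo. */
static int write_all(int fd, const char *buf, size_t len, size_t *done)
{
        *done = 0;
        while (*done < len) {
                ssize_t n = write(fd, buf + *done, len - *done);
                if (n < 0) {
                        if (errno == EINTR)
                                continue;  /* interrupted, just retry */
                        return -1;         /* real error: report it */
                }
                *done += n;                /* short write: keep going */
        }
        return 0;
}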

> Sorry but as far as ordering is concerned, O_DIRECT, fsync and
> O_SYNC offer exactly the same guarantees. Feel free to check the
> real life db code. Even bdb uses fsync.

No, there is a slight difference. An fsync() flushes all dirty buffers in an undefined order. With O_DIRECT or O_SYNC you can control the flush order, because you can simply wait for one set of writes to complete before starting another set that must not reach the disk until after the first. You can emulate that by placing an fsync() between the two sets of writes, but that also flushes any other dirty buffers whose ordering you do not care about. Also there is no aio version of fsync.
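
To sketch what I mean (illustrative only; the file name and the two records are made up): with O_SYNC a write() only returns once the data is on the disk, so completing set A before issuing set B enforces the order without touching unrelated dirty buffers:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        const char record_a[] = "journal record A";
        const char commit_b[] = "commit block B";

        /* O_SYNC: each write() returns only after the data has
         * reached the device, so completion implies "on disk". */
        int fd = open("journal.dat", O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd < 0)
                return 1;

        /* Set A must be on disk before set B is even issued... */
        if (write(fd, record_a, sizeof(record_a)) < 0)
                return 1;

        /* ...so B can never reach the disk ahead of A.  An fsync()
         * between two plain writes orders them too, but it also
         * flushes every other dirty buffer, ordered or not. */
        if (write(fd, commit_b, sizeof(commit_b)) < 0)
                return 1;

        close(fd);
        return 0;
}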

> Please try yourself, it's simple enough:
>
>        time dd if=/dev/hda of=/dev/null bs=16M count=100
>        time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=sync
>        time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=direct
>
> if you can measure any slowdown in the sync/direct you're welcome
> (it runs faster here... as it should). The pipeline stall is not
> measurable when it's so infrequent, and actually the pipeline stall
> is not a big issue when the I/O is contiguous and the dma commands
> are always large.
>
> aio is mandatory only while dealing with small buffers, especially
> while seeking to take advantage of the elevator.

sync has no effect on reads, so that test is pointless. direct saves the CPU overhead of the buffer copy, but it is a loss unless the cache is entirely cold. The large buffer size really has little to do with it; rather, it is the fact that the writes to /dev/null never block dd from issuing the next read. If dd were blocking on an actual output device, the input device would sit idle for however long dd stayed blocked.
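
For what it's worth, a minimal sketch of what an O_DIRECT read involves (device name and sizes made up): because O_DIRECT bypasses the page cache it skips the copy, but it can never hit a warm cache, and it forces the caller to align the buffer, offset, and length:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        void *buf;

        /* O_DIRECT requires the buffer, file offset, and transfer
         * size to be suitably aligned; block/page alignment is safe. */
        if (posix_memalign(&buf, 4096, 4096))
                return 1;

        int fd = open("/dev/hda", O_RDONLY | O_DIRECT);
        if (fd < 0)
                return 1;

        /* This read DMAs straight into buf, skipping the page cache:
         * no copy, but also no chance of a cache hit. */
        ssize_t n = read(fd, buf, 4096);

        close(fd);
        free(buf);
        return n < 0;
}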

In any case, this is a totally different example from your previous one, which had dd _writing_ to a disk. There dd would block for long periods due to O_SYNC, preventing it from reading the input pipe in a timely manner; with nothing draining the pipe, it fills up and tar blocks. In that case the large buffer size is actually a detriment: with a smaller buffer size, dd would block for shorter stretches, so it could empty the pipe more often and tar would block less.

> This whole thing is about performance; if you remove performance
> factors from the equation, you can stick to your O_SYNC 512 bytes
> at a time to the journal design. You're perfectly right that when
> you remove performance from the equation you can claim that O_DIRECT
> is much the same as O_SYNC.
>
> Guess what, if O_SYNC could run as fast as O_DIRECT while still
> passing through pagecache, O_DIRECT wouldn't exist. You can't
> pretend to describe the semantics of any kernel API if you remove
> performance considerations from it. It must be some not-useful
> university theory if they taught you that performance evaluation
> must not be present in the semantics. If that's the case, it's best
> you stop talking about semantics when you discuss any kernel APIs.
> A ton of kernel APIs are all about improving performance, so they'd
> all be the same if you only look at your performance-agnostic
> semantics; it's not just O_DIRECT that would become the same as
> O_SYNC.

You seem to have missed the point of this thread. Denis Vlasenko's message that you replied to simply pointed out that they are semantically equivalent, so O_DIRECT could be dropped provided that O_SYNC + madvise could be made to perform as well. Several people, including Linus, seem to like this idea and think it is quite possible.

