Re: O_DIRECT question — Linux Kernel

Andrea Arcangeli wrote:

On Tue, Jan 30, 2007 at 10:36:03AM -0500, Phillip Susi wrote:

Did you intentionally drop this reply off list?

No.


Then I'll restore the lkml to the cc list.

No, it doesn't... or at least can't report WHERE the error is.


O_SYNC doesn't report where the error is either, try a write(fd, buf,
10*1024*1024).

It should return the number of bytes successfully written before theerror, giving you the location of the first error. Also using smallerindividual writes ( preferably issued in parallel ) also allows theproblem spot to be isolated.

Typically you only want one sector of data to be written before youcontinue. In the cases where you don't, this might be nice, but as Isaid above, you can't handle errors properly.
Sorry but you're dreaming if you're thinking anything in real life
writes at 512bytes at time with O_SYNC. Try that with any modern
harddisk.

When you are writing a transaction log, you do; you don't need muchdata, but you do need to be sure it has hit the disk before continuing.You certainly aren't writing many mb across a dozen write() calls andonly then care to make sure it is all flushed in an unknown order. Whenorder matters, you can not use fsync, which is one of the reasons whydatabases use O_DIRECT; they care about the ordering.

Just grep for fsync in the db code of your choice (try postgresql) and
then explain me why they ever call fsync in their code, if you know
how to do better with O_SYNC ;).

Doesn't sound like a very good idea to me.


Why not a good idea to check any real life app?

I meant it is not a good idea to use fsync as you can't properly handleerrors.

The stalling is caused by cache pollution. Since you did not specify ablock size dd uses the base block size of the output disk. Whencombined with sync, only one block is written at a time, and no moreuntil the first block has been flushed. Only then does dd send downanother block to write. Without dd the kernel is likely allowing manymb to be queued in the buffer cache. Limiting output to one block at atime is not good for throughput, but allowing half of ram to be used bydirty pages is not good either.
Throughput is perfect. I forgot to tell I combine it with ibs=4k
obs=16M. Like it would be perfect with odirect too for the same
reason. Stalling the I/O pipeline once every 16M isn't measurable in

Throughput is nowhere near perfect, as the pipeline is stalled for quitesome time. The pipe fills up quickly while dd is blocked on the syncwrite, which then blocks tar until all 16 MB have hit the disk. Onlythen does dd go back to reading from the tar pipe, allowing it tocontinue. During the time it takes tar to archive another 16 MB ofdata, the write queue is empty. The only time that the tar process getsto continue running while data is written to disk is in the small timeit takes for the pipe ( 4 KB isn't it? ) to fill up.

The semantics of the two are very much the same; they only differ in theinternal implementation. As far as the caller is concerned, in bothcases, he is sure that writes are safe on the disk when they return, andreads semantically are no different with either flag. The internalimplementations lead to different performance characteristics, and theother post was simply commenting that the performance characteristics ofO_SYNC + madvise() is almost the same as O_DIRECT, or even better insome cases ( since the data read may already be in cache ).
The semantics mandates the implementation because the semantics make
up for the performance expectations. For the same reason you shouldn't
write 512bytes at time with O_SYNC you also shouldn't use O_SYNC if
your device risks to create a bottleneck in the CPU and memory.

No, semantics have nothing to do with performance. Semantics deals withthe state of the machine after the call, not how quickly it got there.Semantics is a question of correct operation, not optimal.

With both O_DIRECT and O_SYNC, the machine state is essentially the sameafter the call: the data has hit the disk. Aside from the performancedifference, the application can not tell the difference between O_DIRECTand O_SYNC, so if that performance difference can be resolved bychanging the implementation, Linus can be happy and get rid of O_DIRECT.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Follow-Ups:
- Re: O_DIRECT question
  - From: Andrea Arcangeli <[email protected]>

References:
- O_DIRECT question
  - From: Aubrey <[email protected]>
- Re: O_DIRECT question
  - From: Denis Vlasenko <[email protected]>
- Re: O_DIRECT question
  - From: Bill Davidsen <[email protected]>
- Re: O_DIRECT question
  - From: Denis Vlasenko <[email protected]>
- Re: O_DIRECT question
  - From: Andrea Arcangeli <[email protected]>

Prev by Date: Re: [PATCH] Fix /sys/device/.../power/state regression
Next by Date: Re: Fw: Re: [mm PATCH 4/6] RCU: (now) CPU hotplug
Previous by thread: Re: O_DIRECT question
Next by thread: Re: O_DIRECT question
Index(es):
- Date
- Thread

[Index of Archives] [Kernel Newbies] [Netfilter] [Bugtraq] [Photo] [Stuff] [Gimp] [Yosemite News] [MIPS Linux] [ARM Linux] [Linux Security] [Linux RAID] [Video 4 Linux] [Linux for the blind] [Linux Resources]