Re: O_DIRECT question — Linux Kernel

Denis Vlasenko wrote:

The difference is that you block exactly when you try to access
data which is not there yet, not sooner (potentially much sooner).

If an application (e.g. a database) needs to know whether data is _really_ there,
it should use aio_read (or something better, something which doesn't use signals.
Do we have this 'something'? I honestly don't know).

The application _IS_ using aio, which is why it can go and perform other work while it waits to be told that the read has completed. This is not possible with mmap because the task is blocked while faulting in pages, and unless it tries to access the pages, they won't be faulted in.
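As a rough illustration of that overlap, here is a minimal sketch using the POSIX aio_read() call named in the quoted text. The thread may well have Linux-native AIO (io_submit) in mind instead; this is not that interface. The names read_while_working() and do_other_work() are hypothetical, and the busy loop only stands in for real work a server would do while the read is in flight. Note that glibc implements POSIX AIO with user-space threads, and older glibc versions need -lrt at link time.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for useful work done while the read is pending. */
static unsigned long do_other_work(unsigned long state)
{
    return state * 31 + 7;
}

/*
 * Start an asynchronous read, keep working until it completes, then
 * collect the result. Returns bytes read, or -1 with errno set.
 * *polls reports how many units of other work were done meanwhile.
 */
ssize_t read_while_working(int fd, void *buf, size_t len, off_t off,
                           unsigned long *polls)
{
    struct aiocb cb;
    unsigned long state = 1;
    int err;

    assert(buf != NULL && polls != NULL);

    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* poll, no signal */

    *polls = 0;
    if (aio_read(&cb) != 0)
        return -1;                              /* submission failed */

    /* The task is not blocked here: it decides when to check. */
    while ((err = aio_error(&cb)) == EINPROGRESS) {
        state = do_other_work(state);
        (*polls)++;
    }

    ssize_t n = aio_return(&cb);                /* always reap the request */
    if (err != 0) {
        errno = err;                            /* error is per request */
        return -1;
    }
    return n;
}
```

A real program would wait with aio_suspend() or a completion notification rather than spin, but the point is the same: the error and the completion both belong to one specific request.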

In some cases, even this is not needed because you don't have any other
things to do, so you just do read() (which returns early), and chew on
data. If your CPU is fast enough and processing of data is light enough
so that it outruns disk - big deal, you block in page fault handler
whenever a page is not read for you in time.
If CPU isn't fast enough, your CPU and disk subsystem are nicely working
in parallel.

Being blocked in the page fault handler means the cpu is now idle because you can't go chew on data that _IS_ in core. The aio + O_DIRECT allows you to control when IO is started rather than rely on the kernel to decide when is a good time for readahead, and to KNOW when that IO is done so you can chew on the data.
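For readers who have not used O_DIRECT, the main practical wrinkle is alignment: the buffer, the length and the file offset generally have to be aligned, and the exact requirement depends on the filesystem and device. The sketch below assumes 4096 bytes, which is only an assumption, and the helper names (direct_round_up, direct_alloc, direct_read_chunk) are made up for illustration. The O_DIRECT fallback to 0 is there only so the sketch compiles where the flag is not exposed.

```c
#define _GNU_SOURCE          /* exposes O_DIRECT in <fcntl.h> on glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0           /* assumption: degrade to buffered I/O */
#endif

#define DIRECT_ALIGN 4096    /* assumed; the real requirement varies */

/* Round len up to the next multiple of DIRECT_ALIGN. */
size_t direct_round_up(size_t len)
{
    return (len + DIRECT_ALIGN - 1) / DIRECT_ALIGN * DIRECT_ALIGN;
}

/* Allocate a buffer aligned for direct I/O; release it with free(). */
void *direct_alloc(size_t len)
{
    void *p = NULL;

    assert((DIRECT_ALIGN & (DIRECT_ALIGN - 1)) == 0);  /* power of two */
    if (posix_memalign(&p, DIRECT_ALIGN, direct_round_up(len)) != 0)
        return NULL;
    return p;
}

/*
 * Read one chunk, bypassing the page cache where the filesystem allows.
 * buf, len and off are expected to be DIRECT_ALIGN-aligned.
 * Returns bytes read, or -1 with errno set (EINVAL is typical when the
 * filesystem rejects O_DIRECT or the alignment is wrong).
 */
ssize_t direct_read_chunk(const char *path, void *buf, size_t len, off_t off)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;

    ssize_t n = pread(fd, buf, len, off);
    int saved = errno;
    close(fd);
    errno = saved;
    return n;
}
```

This version blocks in pread(); combining the aligned buffer with an asynchronous submission is what gives the application control over when the IO starts and when it is known to be done.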

With O_DIRECT, you alternate:
"CPU is idle, disk is working" / "CPU is working, disk is idle".

You have this completely backwards. With mmap this is what you get because you chew data, page fault... chew data... page fault...

What do you want to do on I/O error? I guess you cannot do much -
any sensible db will shut itself down. When your data storage
starts to fail, it's pointless to continue running.

Ever hear of error recovery? A good db will be able to cope with one or two bad blocks, or at the very least continue operating the other tables or databases it is hosting, or flush transactions and switch to read-only mode, or any number of things other than abort().

You do not need to know which read() exactly failed due to bad disk.
Filename and offset from the start is enough. Right?

So, SIGIO/SIGBUS can provide that, and if your handler is of
	void (*sa_sigaction)(int, siginfo_t *, void *);
style, you can get fd, memory address of the fault, etc.
Probably kernel can even pass file offset somewhere in siginfo_t...

Sure... now what does your signal handler have to do in order to handle this error in such a way as to allow the one request to be failed and the task to continue handling other requests? I don't think this is even possible, let alone clean.
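To make the difficulty concrete, here is a sketch of the usual workaround for a single-threaded program: an SA_SIGINFO handler that records si_addr and jumps back out with siglongjmp(). All names (on_sigbus, install_sigbus_handler, checked_sum, fault_env) are hypothetical. It only covers the simplest case, one guarded region at a time with no locks or partial state to unwind, which arguably supports the point above rather than refuting it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;               /* resume point after a fault */
static volatile sig_atomic_t fault_armed;  /* set only around mapped access */
static void *volatile fault_addr;          /* si_addr of the last SIGBUS */

static void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    if (!fault_armed)
        abort();                  /* SIGBUS from code we are not guarding */
    fault_addr = info->si_addr;   /* faulting address inside the mapping */
    siglongjmp(fault_env, 1);     /* abandon the access in progress */
}

int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

/*
 * "Chew on" len bytes of a mapping. Returns 0 and stores the byte sum
 * on success, or -1 if a SIGBUS interrupted the access.
 */
int checked_sum(const volatile unsigned char *map, size_t len,
                unsigned long *sum)
{
    volatile unsigned long acc = 0;

    assert(map != NULL && sum != NULL);

    /* Saving the signal mask (second argument 1) re-enables SIGBUS
     * after the jump out of the handler. */
    if (sigsetjmp(fault_env, 1) != 0) {
        fault_armed = 0;
        return -1;                /* this one request fails */
    }
    fault_armed = 1;
    for (size_t i = 0; i < len; i++)
        acc += map[i];
    fault_armed = 0;
    *sum = acc;
    return 0;
}
```

Every access to mapped data has to be wrapped this way, the global jump buffer does not extend to multiple threads without more machinery, and mapping the faulting address back to a file and offset is left entirely to the application. With aio the same failure is just an error code on one request.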

You can still be multithreaded. The point is, with O_DIRECT
you _are forced_ to be multithreaded, or else performance will suck.

Or use aio. Simple read/write with the kernel trying to outsmart the application is nice for very simple applications, but it does not provide very good performance. This is why we have aio and O_DIRECT; because the application can manage the IO better than the kernel because it actually knows what it needs and when.

Yes, the application ends up being more complex, but that is the price you pay. You simply can't get it perfect in a general purpose kernel that has to guess what the application is really trying to do.

You think "Oracle". But this application may very well be
not Oracle, but diff, or dd, or KMail. I don't want to care.
I want all big writes to be efficient, not just those done by Oracle.
*Including* single threaded ones.

Then redesign those applications to use aio and O_DIRECT. Incidentally, I have hacked up dd to do just that and have some very nice performance numbers as a result.

Well, I too currently work with Oracle.
Apparently the people who wrote the damn thing have a very, eh, Oracle-centric
world-view. "We want direct writes to the disk. Period." Why? Does it
make sense? Are there better ways? - nothing. They think they know better.


Nobody has shown otherwise to date.


