On Saturday 20 January 2007 21:55, Michael Tokarev wrote:
> Denis Vlasenko wrote:
> > On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
> >> example, which isn't quite possible now from userspace. But as long as
> >> O_DIRECT actually writes data before returning from write() call (as it
> >> seems to be the case at least with a normal filesystem on a real block
> >> device - I don't touch corner cases like nfs here), it's pretty much
> >> THE ideal solution, at least from the application (developer) standpoint.
> >
> > Why do you want to wait while 100 megs of data are being written?
> > You _have to_ have threaded db code in order to not waste
> > gobs of CPU time on UP + even with that you eat context switch
> > penalty anyway.
>
> Usually it's done using aio ;)
>
> It's not that simple really.
>
> For reads, you have to wait for the data anyway before doing something
> with it. Omitting reads for now.
Really? All 100 megs _at once_? Linus described a fairly simple (conceptually)
idea here: http://lkml.org/lkml/2002/5/11/58
In short, a page-aligned read buffer can simply be unmapped,
with the page fault handler catching accesses to not-yet-read data.
As data comes in from disk, it gets mapped back into the process's
address space.
This way read() returns almost immediately and the CPU is free to do
something useful.
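For illustration only: the closest thing available from userspace today is
probably mmap(). It is a different mechanism from the one Linus describes,
but it has the same flavour: the call returns before any data is read, and
the page fault handler pulls pages in on first touch. A minimal sketch,
assuming Linux/POSIX; map_file_lazily() is a made-up helper name, not an
existing API:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a whole file read-only and return at once.
 * No file data is read here; each page is faulted in on first access.
 * Returns NULL on error or for an empty file. */
const char *map_file_lazily(const char *path, size_t *len_out)
{
    assert(len_out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return NULL;

    /* Hint: we expect to read front to back, so readahead can run ahead. */
    madvise(p, (size_t)st.st_size, MADV_SEQUENTIAL);

    *len_out = (size_t)st.st_size;
    return p;
}
```

The caller gets a pointer immediately and only stalls when it touches a page
that has not arrived from disk yet; munmap() releases the mapping.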
> For writes, it's not that problematic - even 10-15 threads is nothing
> compared with the I/O (O in this case) itself -- that context switch
> penalty.
Well, if you have some CPU-intensive thing to do (e.g. sort),
why not benefit from the lack of an extra context switch?
Assume that we have "clever writes" like Linus described.
/* something like "caching I/O over this fd is mostly useless" */
/* (this API looks easier to transition to
 * than fadvise etc. - it "looks like" O_DIRECT) */
fd = open(..., flags|O_STREAM);
...
/* Starts writeout immediately due to O_STREAM,
 * marks buf100meg's pages R/O to catch modifications,
 * but doesn't block! */
write(fd, buf100meg, 100*1024*1024);
/* We are free to do something useful in parallel */
sort();
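A rough approximation using calls that exist today (Linux only, 2.6.17+):
do a normal buffered write, then ask the kernel to start writeout of that
range with sync_file_range(). This is not the O_STREAM behaviour sketched
above: the data is still copied into the page cache, nothing is marked R/O,
and the call can still block if the request queue is full.
write_and_kick() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: buffered write at offset 'off', followed by a
 * request to start writeback of exactly that range without waiting for
 * it to finish. Returns bytes written, or -1 on error. */
ssize_t write_and_kick(int fd, const void *buf, size_t len, off_t off)
{
    assert(buf != NULL);

    ssize_t n = pwrite(fd, buf, len, off);  /* copies into the page cache */
    if (n <= 0)
        return n;

    /* Queue the dirty pages for writeout; do not wait for completion. */
    if (sync_file_range(fd, off, n, SYNC_FILE_RANGE_WRITE) < 0)
        return -1;

    return n;
}
```

After write_and_kick() returns, the caller can go off and sort(); a later
fdatasync(fd) waits for whatever is still in flight.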
> > I hope you agree that threaded code is not ideal performance-wise
> > - async IO is better. O_DIRECT is strictly sync IO.
>
> Hmm.. Now I'm confused.
>
> For example, oracle uses aio + O_DIRECT. It seems to be working... ;)
> As an alternative, there are multiple single-threaded db_writer processes.
> Why do you say O_DIRECT is strictly sync?
I mean that an O_DIRECT write() blocks until the I/O really is done.
A normal write can block for much less time, or not at all.
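For comparison, a minimal sketch of such a synchronous O_DIRECT write.
Assumptions: Linux, a filesystem that supports O_DIRECT, and 4096-byte
alignment being sufficient (the real requirement depends on the device and
filesystem). direct_write_fill() and DIRECT_ALIGN are made-up names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed alignment for buffer address and length; not universal. */
#define DIRECT_ALIGN 4096

/* Hypothetical helper: write 'len' bytes of 'fill' to 'path' with O_DIRECT.
 * 'len' must be a multiple of DIRECT_ALIGN.
 * Returns bytes written, or -1 on error. */
ssize_t direct_write_fill(const char *path, int fill, size_t len)
{
    assert(len % DIRECT_ALIGN == 0);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0600);
    if (fd < 0)
        return -1;      /* e.g. EINVAL where O_DIRECT is not supported */

    void *buf = NULL;
    if (posix_memalign(&buf, DIRECT_ALIGN, len) != 0) {
        close(fd);
        return -1;
    }
    memset(buf, fill, len);

    /* Bypasses the page cache: the caller sits here for the whole transfer. */
    ssize_t n = write(fd, buf, len);

    free(buf);
    close(fd);
    return n;
}
```

With a 100 meg buffer, the thread calling write() here does nothing else
until the transfer is over, which is the cost I am complaining about.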
> In either case - I provided some real numbers in this thread before.
> Yes, O_DIRECT has its problems, even security problems. But the thing
> is - it is working, and working WAY better - from the performance point
> of view - than "indirect" I/O, and currently there's no alternative that
> works as good as O_DIRECT.
Why did we bother to write Linux at all?
There were other Unixes which worked OK.
--
vda