Re: O_DIRECT question

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On Thu, 11 Jan 2007, Viktor wrote:
> 
> OK, madvise() used with mmap'ed file allows to have reads from a file
> with zero-copy between kernel/user buffers and don't pollute cache
> memory unnecessarily. But how about writes? How is to do zero-copy
> writes to a file and don't pollute cache memory without using O_DIRECT?
> Do I miss the appropriate interface?

mmap()+msync() can do that too.

Also, regular user-space page-aligned data could easily just be moved into 
the page cache. We actually have a lot of the infrastructure for it. See 
the "splice()" system call. It's just not very widely used, and the 
"drop-behind" behaviour (to then release the data) isn't there. And I bet 
that there's lots of work needed to make it work well in practice, but 
from a conceptual standpoint the O_DIRECT method really is just about the 
*worst* way to do things.

O_DIRECT is "simple" in the sense that it basically is a "OS: please just 
get out of the way" method. It's why database people like it, and it's why 
it has gotten implemented in many operating systems: it *looks* like a 
simple interface. 

But deep down, O_DIRECT is anything but simple. Trying to do a direct 
access with an interface that really isn't designed for it (write() 
_fundamentally_ has semantics that do not fit the problem in that you're 
supposed to be able to re-use the buffer immediately afterwards in user 
space, just as an example) is wrong in the first place, but the really 
subtle problems come when you realize that you can't really just "bypass" 
the OS.

As a very concrete example: people *think* that they can just bypass the 
OS IO layers and just do the write directly. It *sounds* like something 
simple and obvious. It sounds like a total no-brainer. Which is exactly 
what it is, if by "no-brainer" you mean "only a person without a brain 
will do it". Because by-passing the OS has all these subtle effects on 
both security and on fundamental correctness.

The whole _point_ of an OS is to be a "resource manager", to make sure 
that people cannot walk all over each other, and to be the central point 
that makes sure that different people doing allocations and deallocations 
don't get confused. In the specific case of a filesystem, it's "trivial" 
things like serializing IO, allocating new blocks on the disk, and making 
sure that nobody will see the half-way state when the dirty blocks haven't 
been written out yet.

O_DIRECT - by bypassing the "real" kernel - very fundamentally breaks the 
whole _point_ of the kernel. There's tons of races where an O_DIRECT user 
(or other users that expect to see the O_DIRECT data) will now see the 
wrong data - including seeign uninitialized portions of the disk etc etc. 

In short, the whole "let's bypass the OS" notion is just fundamentally 
broken. It sounds simple, but it sounds simple only to an idiot who writes 
databases and doesn't even UNDERSTAND what an OS is meant to do. For some 
reasons, db people think that they don't need one, and don't ever seem to 
really understand the concept fo "security" and "correctness". They 
understand it (sometimes) _within_ their own database, but seem to have a 
really hard time seeing past their own sandbox.

Some of the O_DIRECT breakage could probably be fixed:

 - An O_DIRECT operation must never allocate new blocks on the disk. It's 
   fundamentally broken. If you *cannot* write new blocks, and can only 
   read and re-write previous allocations, things are much easier, and a 
   lot of the races go away.

   This is probably _perfectly_ fine for the users (namely databases). 
   People who do O_DIRECT really expect to see a "raw disk image", but 
   they (exactly _because_ they expect a raw disk image) are perfectly 
   happy to "set up" that image beforehand.

 - An O_DIRECT operation must never race with any metadata operation, 
   most notably truncate(), but also any file extension operation like a 
   normal write() that extends the size of the file.

   This should be reasonably easy to do. Any O_DIRECT operation would just 
   take the inode->i_mutex for reading. HOWEVER. Right now it's a mutex, 
   not a read-write semaphore, so that is actually pretty painful. But it 
   would be fairly simple.

With those two rules, a lot of the complexity of the nasty side effects of 
O_DIRECT that the db people obviously never even thought about would go 
away. We'd still have to have some way to synchronize the page cache, but 
it could be as simple as having an O_DIRECT open simply _flush_ the whole 
page cache, and set some flag saying "can't do normal opens, we're 
exclusively open for O_DIRECT".

I dunno. A lot of filesystems don't want to (or can't) actually do a 
"write in place" ANYWAY (writes happen through the log, and hit the "real 
filesystem" part of the disk later), and O_DIRECT really only makes sense 
if you do the write in place, so the above rules would help make that 
obvious too - O_DIRECT really is a totally different thing from a normal 
IO, and an O_DIRECT write() or read() really has *nothing* to do with a 
regular write() or read() system call. 

Overloading a totally different operation with a flag is a bad idea, which 
is one reason I really hate O_DIRECT. It's just doing things badly on so 
many levels.

			Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux