Re: Linux 2.6.17-rc2 — Linux Kernel

On Wed, 19 Apr 2006, Trond Myklebust wrote:
> 
> Any chance this could be adapted to work with all those DMA (and RDMA)
> engines that litter our motherboards? I'm thinking in particular of
> stuff like the drm drivers, and userspace rdma.

Absolutely. Especially with "vmsplice()" (the not-yet-implemented "move 
these user pages into a kernel buffer") it should be entirely possible to 
set up an efficient zero-copy setup that does NOT have any of the problems 
with aio and TLB shootdown etc.

Note that a driver would have to support the splice_in() and splice_out() 
interfaces (which are basically just given the pipe buffers to do with as 
they wish), and perhaps more importantly: note that you need specialized 
apps that actually use splice() to do this.

That's the biggest downside by far, and is why I'm not 100% convinced 
splice() usage will be all that wide-spread. If you look at sendfile(), 
it's been available for a long time, and is actually even almost portable 
across different OS's _and_ it is easy to use. But almost nobody actually 
does. I suspect the only users are some apache mods, perhaps a ftp deamon 
or two, and probably samba. And that's probably largely it.

There's a _huge_ downside to specialized interfaces. Admittedly, splice() 
is a lot less specialized (ie it works in a much wider variety of loads), 
but it's still very much a "corner-case" thing. You can always do the same 
thing splice() does with a read/write pair instead, and be portable.

Also, the genericity of splice() does come at the cost of complexity. For 
example, to do a zero-copy from a user space buffer to some RDMA network 
interface, you'd have to basically keep track of _two_ buffers:

 - keep track of how much of the user space buffer you have moved into 
   kernel space with "vmsplice()" (or, for that matter, with any other 
   source of data for the buffer - it might be a file, it might be another 
   socket, whatever. I say "vmsplice()", but that's just an example for 
   when you have the data in user space).

   The kernel space buffer is - for obvious reasons - size limited in the 
   way a user-space buffer is not. People are used to doing megabytes of 
   buffers in user space. The splice buffer, in comparison, is maybe a few 
   hundred kB at most. For some apps, that's "inifinity". For others, it's 
   just a few tens of pages of data.

 - keep track of how much of the kernel space buffer you have moved to the 
   RDMA network interface with "splice()".

   The splice buffer _is_ another buffer, and you have to feed the data 
   from that buffer to the RDMA device manually.

In many usage schenarios, this means that you end up having the normal 
kind of poll/select loop. Now, that's nothing new: people are used to 
them, but people still hate them, and it just means that very few 
environments are going to spend the effort on another buffering setup.

So the upside of splice() is that it really can do some things very 
efficiently, by "copying" data with just a simple reference counted 
pointer. But the downside is that it makes for another level of buffering, 
and behind an interface that is in kernel space (for obvious reasons), 
which means that it's somewhat harder to wrap your hands and head around 
than just a regular user-space buffer.

So I'd expect this to be most useful for perhaps things like some HPC 
apps, where you can have specialized libraries for data communication. And 
servers, of course (but they might just continue to use the old 
"sendfile()" interface, without even knowing that it's not sendfile() any 
more, but just a wrapper around splice()).

			Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Follow-Ups:
- Re: Linux 2.6.17-rc2
  - From: Peter Naulls <[email protected]>

References:
- Linux 2.6.17-rc2
  - From: Linus Torvalds <[email protected]>
- Re: Linux 2.6.17-rc2
  - From: Diego Calleja <[email protected]>
- Re: Linux 2.6.17-rc2
  - From: Linus Torvalds <[email protected]>
- Re: Linux 2.6.17-rc2
  - From: Trond Myklebust <[email protected]>

Prev by Date: Re: [RESEND][RFC][PATCH 2/7] implementation of LSM hooks
Next by Date: Re: sata suspend resume ...
Previous by thread: Re: Linux 2.6.17-rc2
Next by thread: Re: Linux 2.6.17-rc2
Index(es):
- Date
- Thread

[Index of Archives] [Kernel Newbies] [Netfilter] [Bugtraq] [Photo] [Stuff] [Gimp] [Yosemite News] [MIPS Linux] [ARM Linux] [Linux Security] [Linux RAID] [Video 4 Linux] [Linux for the blind] [Linux Resources]