Re: Status of AIO

David S. Miller wrote:
> You didn't google hard enough, my blog entry on the topic
> comes up as the first entry when you google for "Van Jacobson
> net channels".


Thanks, I read the page... I find it a little extreme, and zero-copy aio can get the same benefits without all that hassle. Let me write this as a reply to the article itself:


> With SMP systems this "end host" concept really should be extended to
> the computing entities within the system, that being cpus and threads
> within the box.


I agree; all threads and cpus should be able to process network IO concurrently, and without wasting cpu cycles copying the data around 6 times. That does not, and should not, mean moving the TCP/IP protocol into user space.


> So, given all that, how do you implement network packet receive
> properly? Well, first of all, you stop doing so much work in interrupt
> (both hard and soft) context. Jamal Hadi Salim and others understood
> this quite well, and NAPI is a direct consequence of that understanding.
> But what Van is trying to show in his presentation is that you can take
> this further, in fact a _lot_ further.


I agree; a minimum of work should be done in interrupt context. Specifically, the interrupt handler should simply insert and remove packets from the queue and program the hardware's DMA registers to point at the packet buffer memory. If the hardware supports scatter/gather DMA, the upper layers can enqueue packet buffers to send from or receive into, and the interrupt handler just pulls packets off this queue when the hardware raises an interrupt to indicate it has completed the DMA transfer.
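
To make that concrete, the interrupt-time work boils down to something like this. This is only a rough sketch; the descriptor layout, flag names and helper functions are all made up for illustration, not taken from any real driver:

    #include <stdint.h>

    /* Hypothetical rx descriptor ring shared with the NIC -- names and
     * layout are invented here, not any real hardware's format. */
    struct rx_desc {
            uint64_t dma_addr;      /* bus address of the packet buffer        */
            uint32_t len;           /* filled in by the hardware on completion */
            uint32_t flags;         /* RX_DESC_DONE set by the hardware        */
    };

    #define RX_DESC_DONE    0x1
    #define RX_RING_SIZE    256

    struct rx_ring {
            struct rx_desc desc[RX_RING_SIZE];
            unsigned int   next_to_reap;    /* consumer index (interrupt handler) */
            unsigned int   next_to_fill;    /* producer index (upper layers)      */
    };

    /* Stand-ins for the upper layers, just so the sketch is complete. */
    static void deliver_packet(uint64_t dma_addr, uint32_t len) { (void)dma_addr; (void)len; }
    static void refill_rx_ring(struct rx_ring *ring) { (void)ring; }

    /*
     * The interrupt handler does no protocol work and no copying: it
     * reaps the descriptors the hardware has finished DMAing into,
     * hands the buffers up, and re-arms the ring with buffers queued by
     * the upper layers (with aio those could ultimately be user buffers).
     */
    static void nic_rx_interrupt(struct rx_ring *ring)
    {
            while (ring->desc[ring->next_to_reap].flags & RX_DESC_DONE) {
                    struct rx_desc *d = &ring->desc[ring->next_to_reap];

                    deliver_packet(d->dma_addr, d->len);
                    d->flags = 0;
                    ring->next_to_reap = (ring->next_to_reap + 1) % RX_RING_SIZE;
            }
            refill_rx_ring(ring);
    }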


This is how NT, and I believe BSD, have been doing things for some time now, and it is the direction the Linux kernel is moving in.

> A Van Jacobson channel is a path for network packets. It is
> implemented as an array'd queue of packets. There is state for the
> producer and the consumer, and it all sits in different cache lines so
> that it is never the case that both the consumer and producer write to
> shared cache lines. Network cards want to know purely about packets, yet
> for years we've been enforcing an OS determined model and abstraction
> for network packets upon the drivers for such cards. This has come in
> the form of "mbufs" in BSD and "SKBs" under Linux, but the channels are
> designed so that this is totally unnecessary. Drivers no longer need to
> know about what the OS packet buffers look like, channels just contain
> pointers to packet data.

I must admit, I am a bit confused by this. It sounds a lot like the pot calling the kettle black to me. Aren't SKBs and mbufs already just a form of the very queue of packets being advocated here? Don't they simply list the memory ranges for the driver to transfer to the NIC as a packet?
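
Just to make the comparison concrete, the channel being described is, as far as I can tell, roughly this (my own field names; the 64-byte cache line size is an assumption):

    #define CHANNEL_RING_SIZE 256   /* power of two, picked arbitrarily here */

    /*
     * A "channel" as described above: an array'd queue of raw packet
     * pointers, with producer and consumer state kept on separate cache
     * lines so the two sides never write to a shared line.
     */
    struct net_channel {
            /* written only by the producer (driver / interrupt side) */
            unsigned int head __attribute__((aligned(64)));

            /* written only by the consumer (socket / user side) */
            unsigned int tail __attribute__((aligned(64)));

            /* the queue itself: pointers to packet data, no skb or mbuf */
            void        *packet[CHANNEL_RING_SIZE] __attribute__((aligned(64)));
    };

Which is more or less what a driver's rx descriptor ring plus its array of skbs already is; the difference is only in who interprets the buffer metadata.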

> The next step is to build channels to sockets. We need some
> intelligence in order to map packets to channels, and this comes in the
> form of a tiny packet classifier the drivers use on input. It reads the
> protocol, ports, and addresses to determine the flow ID and uses this to
> find a channel. If no matching flow is found, we fall back to the basic
> channel we created in the first step. As sockets are created, channel
> mappings are installed and thus the driver classifier can find them
> later. The socket wakes up, and does protocol input processing and
> copying into userspace directly out of the channel.

How is this any different from what we have now, other than bypassing the kernel buffer? The TCP/IP layer looks at the incoming packet to decide which socket it belongs to, and copies it into the waiting buffer. Right now that waiting buffer is a kernel buffer, because at the time the packet arrives, the kernel does not have any user buffers.

If the user process uses aio, though, it can hand the kernel a few buffers to receive into ahead of time, so that once the packets have been classified they can be copied directly into the user's buffers.
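
As a sketch of the programming model I mean (POSIX aio is used here purely for illustration; whether a given kernel and libc actually deliver the data zero-copy, or even support asynchronous reads on sockets at all, is exactly what is up for discussion):

    #include <aio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NBUFS   4
    #define BUFSIZE (64 * 1024)

    static struct aiocb rx[NBUFS];

    /*
     * Hand the kernel several receive buffers ahead of time.  Once an
     * incoming packet has been classified to this socket, its payload
     * can go straight into whichever of these buffers is at the head of
     * the queue, instead of sitting in a kernel-side buffer first.
     */
    static void post_receives(int sock)
    {
            int i;

            for (i = 0; i < NBUFS; i++) {
                    memset(&rx[i], 0, sizeof(rx[i]));
                    rx[i].aio_fildes = sock;
                    rx[i].aio_buf    = malloc(BUFSIZE);
                    rx[i].aio_nbytes = BUFSIZE;
                    if (aio_read(&rx[i]) != 0)  /* queue the buffer; returns immediately */
                            perror("aio_read");
            }
            /* later: reap completed buffers in order with aio_error()/aio_return() */
    }

(NBUFS, BUFSIZE and post_receives() are just names I made up; on Linux this links against -lrt.)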

> And in the next step you can have the socket ask for a channel ID
> (with a getsockopt or something like that), have it mmap() a receive
> ring buffer into user space, and the mapped channel just tosses the
> packet data into that mmap()'d area and wakes up the process. The
> process has a mini TCP receive engine in user space.

There is no need to use mmap() and burden the user code with implementing TCP itself (which is quite a lot of work). The process can hand the kernel buffers by queuing multiple O_DIRECT aio requests, and the kernel can dump the data stream straight into them after stripping off the headers. When sending, the kernel can program the hardware to scatter/gather DMA directly from the user buffer attached to the aio request.
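
For the send side, the submission model already exists in the kernel's native aio interface; what is missing is network support behind it. Roughly, using libaio (again, only an illustration of the model, not something that does asynchronous socket I/O today; queue_sends() and QDEPTH are names I made up):

    #include <libaio.h>
    #include <stddef.h>

    #define QDEPTH 4

    /*
     * Batch several requests against the socket in one submission.
     * Each iocb carries a user buffer that the hardware could
     * scatter/gather DMA from directly, if the network stack were
     * wired up to do it.  The io_context_t comes from a prior
     * io_setup() / io_queue_init().
     */
    static int queue_sends(io_context_t ctx, int sock,
                           void *bufs[QDEPTH], size_t len)
    {
            struct iocb  cb[QDEPTH];
            struct iocb *cbp[QDEPTH];
            int i;

            for (i = 0; i < QDEPTH; i++) {
                    io_prep_pwrite(&cb[i], sock, bufs[i], len, 0);
                    cbp[i] = &cb[i];
            }
            /* one syscall submits the whole batch; completions are
             * reaped later with io_getevents() */
            return io_submit(ctx, QDEPTH, cbp);
    }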

> And you can take this even further than that (think, remote systems).
> At each stage Van presents a table of profiled measurements for a normal
> bulk TCP data transfer. The final stage of channeling all the way to
> userspace is some 6 times faster than what we're doing today, yes I said
> 6 times faster that isn't a typo.


Yes, we can and should have a 6-times speedup, but as I've explained above, NT has had that for 10 years without having to push TCP into user space.

