Re: [rfc] direct IO submission and completion scalability issues

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]


On Tue, Jul 31, 2007 at 06:19:17AM +0200, Nick Piggin wrote:
> On Mon, Jul 30, 2007 at 01:35:19PM -0700, Suresh B wrote:
> > So any suggestions for making this clean and acceptable to everyone?
> It is obviously a good idea to hand over the IO at the point which
> requires the least number of cachelines to be moved, and I think doing
> it in the block layer is right. Mostly you have to convince the block
> and driver maintainers I guess.

Yes. Implementation is the challenging part I guess.

> The scheduler really should be made interrupt-load aware anyway, so I
> don't have a problem with changing that; or scheduling kblockd at a
> higher priority, but I don't know if SCHED_FIFO is a good idea. Couldn't
> it be done in a softirq instead?

Yes, softirq context is one way. But just didn't want to penalize the running
task by taking away some of its cpu time. With CFS micro accounting, perhaps
we can track irq, softirq time and avoid penalizing the running task's cpu

> Latency for IO migration could be the most difficult problem to solve
> really. You don't give much details of the workload, profiles, etc... I
> hope this is for a real world test?

Improvement numbers quoted are from the OLTP database workload. We can look
into other workloads.

> Can the locking be improved in simpler ways first?
> Just some random questions...
> It looks like the main source of cacheline bouncing you're eliminating
> is from the initial starting of IO from an empty queue (ie. unplug).
> From then on, the submission is driven by completion, right?
> Why is the queue allowed to go empty in the first place in an IO critical
> workload?

This workload is using direct IO and there is no batching at the block layer
for direct IO. IO is submitted to the HW as it arrives.

> Are you loading up each CPU with as many disks as it can possibly handle
> plus a few more? If so, is that realistic? (I honestly don't know).

There is 3-4% iowait time in the system. So the cpu's are not 100% busy,
but there is quite a bit of direct IO going on.

> You say that you'd like to do this for direct IO only, but if it is more
> efficient, why not for buffered IO as well? (or is it not more efficient
> for buffered IO? if not, why?)

It is applicable for both direct IO and buffered IO. But the implementations
will differ. For example in buffered IO, we can setup in such a way that the
block plug timeout function runs on the IO completion cpu.

> AFAIKS, you'd still have significant queue_lock contention from other
> CPUs inserting requests into the list?

Correct. We have more potential to explore. Current implementation
is very elementary.

> What IO scheduler are you using? I assume noop...


> as a crazy experiment, what happens if you create per-cpu request queues?

or in other words, each kblockd thread catering multiple request queues
(perhaps one for each cpu or one for group of cpu's).

softirq context and each kblockd thread handling multiple request queues will
lead to further improvements.

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at
Please read the FAQ at

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux