Re: [rfc] direct IO submission and completion scalability issues

On Mon, Jul 30, 2007 at 01:35:19PM -0700, Suresh B wrote:
> On Mon, Jul 30, 2007 at 11:20:04AM -0700, Christoph Lameter wrote:
> > On Fri, 27 Jul 2007, Siddha, Suresh B wrote:
> 
> > > Observation #2: This introduces some migration overhead during IO submission.
> > > With the current prototype, every incoming IO request results in an IPI and
> > > context switch(to kblockd thread) on the interrupt processing cpu.
> > > This issue needs to be addressed and main challenge to address is
> > > the efficient mechanism of doing this IO migration(how much batching to do and
> > > when to send the migrate request?), so that we don't delay the IO much and at
> > > the same point, don't cause much overhead during migration.
> > 
> > Right.
> 
> So any suggestions for making this clean and acceptable to everyone?

It is obviously a good idea to hand over the IO at the point which
requires the least number of cachelines to be moved, and I think doing
it in the block layer is right. Mostly you have to convince the block
and driver maintainers I guess.

The scheduler really should be made interrupt-load aware anyway, so I
don't have a problem with changing that; or scheduling kblockd at a
higher priority, but I don't know if SCHED_FIFO is a good idea. Couldn't
it be done in a softirq instead?

Latency for IO migration could be the most difficult problem to solve
really. You don't give many details of the workload, profiles, etc... I
hope this is for a real world test? Can the locking be improved in simpler
ways first?
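
To make the batching trade-off concrete, here is a minimal userspace
sketch of one possible policy: send the migrate request once a batch
reaches a size limit, or once the oldest pending request has waited
longer than a delay limit. This is not taken from the prototype; the
names migrate_batch, batch_add, batch_should_flush and batch_flush, and
both limits, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical batch of requests waiting to be migrated to one target CPU. */
struct migrate_batch {
	size_t   nr_pending;    /* requests queued but not yet handed over */
	uint64_t oldest_ns;     /* arrival time of the oldest pending request */
	size_t   max_batch;     /* flush once this many requests are pending */
	uint64_t max_delay_ns;  /* flush once the oldest request waited this long */
};

static void batch_init(struct migrate_batch *b, size_t max_batch,
		       uint64_t max_delay_ns)
{
	assert(max_batch > 0);
	b->nr_pending = 0;
	b->oldest_ns = 0;
	b->max_batch = max_batch;
	b->max_delay_ns = max_delay_ns;
}

/* True if the batch is full or its oldest request has waited too long. */
static bool batch_should_flush(const struct migrate_batch *b, uint64_t now_ns)
{
	if (b->nr_pending == 0)
		return false;
	if (b->nr_pending >= b->max_batch)
		return true;
	return now_ns - b->oldest_ns >= b->max_delay_ns;
}

/* Queue one request; returns true if the caller should migrate now. */
static bool batch_add(struct migrate_batch *b, uint64_t now_ns)
{
	if (b->nr_pending == 0)
		b->oldest_ns = now_ns;
	b->nr_pending++;
	return batch_should_flush(b, now_ns);
}

/* Hand over everything pending; returns how many requests were migrated. */
static size_t batch_flush(struct migrate_batch *b)
{
	size_t n = b->nr_pending;

	b->nr_pending = 0;
	return n;
}
```

With a policy like this, the size limit bounds the per-request overhead
(one migrate request per max_batch IOs at best) and the delay limit
bounds the added latency, but something (a timer, or the next
submission) still has to call batch_should_flush() when no new IO
arrives.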

Just some random questions...

It looks like the main source of cacheline bouncing you're eliminating
is from the initial starting of IO from an empty queue (i.e. unplug).
From then on, the submission is driven by completion, right?

Why is the queue allowed to go empty in the first place in an IO critical
workload?

Are you loading up each CPU with as many disks as it can possibly handle
plus a few more? If so, is that realistic? (I honestly don't know).

You say that you'd like to do this for direct IO only, but if it is more
efficient, why not for buffered IO as well? (or is it not more efficient
for buffered IO? if not, why?)

AFAIKS, you'd still have significant queue_lock contention from other
CPUs inserting requests into the list? What IO scheduler are you using?
I assume noop... as a crazy experiment, what happens if you create per-cpu
request queues?
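
As a rough illustration of what the per-cpu queue experiment means, here
is a userspace sketch, not kernel code: each CPU index gets its own
queue with its own lock on its own cacheline, so submitters on
different CPUs never touch the same lock. The names cpu_queue,
queue_submit and queue_fetch, the queue count, the depth and the 64-byte
cacheline size are all assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_QUEUES   4    /* assumed number of CPUs */
#define QUEUE_DEPTH 128  /* assumed per-queue capacity */

struct request {
	unsigned long sector;
};

/* One queue per CPU, aligned so each lock sits on its own cacheline. */
struct cpu_queue {
	_Alignas(64) atomic_flag lock;        /* simple spinlock */
	size_t head;                          /* next slot to fetch from */
	size_t tail;                          /* next slot to submit into */
	struct request *slots[QUEUE_DEPTH];
};

static struct cpu_queue queues[NR_QUEUES];

static void queues_init(void)
{
	for (size_t i = 0; i < NR_QUEUES; i++) {
		atomic_flag_clear(&queues[i].lock);
		queues[i].head = 0;
		queues[i].tail = 0;
	}
}

static void queue_lock(struct cpu_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		; /* spin until the current holder releases the lock */
}

static void queue_unlock(struct cpu_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Add a request to this CPU's queue; returns 0, or -1 if the queue is full. */
static int queue_submit(unsigned int cpu, struct request *rq)
{
	struct cpu_queue *q;
	int ret = -1;

	assert(cpu < NR_QUEUES);
	q = &queues[cpu];
	queue_lock(q);
	if (q->tail - q->head < QUEUE_DEPTH) {
		q->slots[q->tail % QUEUE_DEPTH] = rq;
		q->tail++;
		ret = 0;
	}
	queue_unlock(q);
	return ret;
}

/* Take the oldest request from this CPU's queue; NULL if it is empty. */
static struct request *queue_fetch(unsigned int cpu)
{
	struct cpu_queue *q;
	struct request *rq = NULL;

	assert(cpu < NR_QUEUES);
	q = &queues[cpu];
	queue_lock(q);
	if (q->head != q->tail) {
		rq = q->slots[q->head % QUEUE_DEPTH];
		q->head++;
	}
	queue_unlock(q);
	return rq;
}
```

The obvious cost of such a split is that ordering, merging and sorting
only happen within one queue, so an IO scheduler that wants a global
view of the disk would lose it.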
