On Jan 2, 2007, at 10:36 PM, Chen, Kenneth W wrote:
Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
In the example you
gave earlier, task with min_nr of 2 will be woken up after 4
completed
events.
I only gave 2 ios/events in that example.
Does that clear up the confusion?
It occurs to me that people might not be aware how peculiar the
current io_getevent wakeup scheme is, to the extend of erratic
behavior.
Yeah. I have to admit that I hadn't fully grasped the behaviour of
the interface when we started this thread. I realized just how
insane it was as I was finishing off last night.
wakes up on event _4_. If someone ask me to describe algorithm
of io_getevents wake-up scheme in the presence of multiple
waiters, I call it erratic and un-deterministic.
I agree completely. This leads me to assume that essentially *no
one* has multiple waiters on a context who care about min_nr. That's
good news because we can consider it the slow path for now :)
So I can categorize my patchset as a bug fix instead of a
performance patch ;-) Let's be serious, this ought to be fixed
one way or the other.
Yeah. How about this:
Start with a cleanup of the event copying loop:
- prefault user event dest with fault_in_pages_writable()
- copy multiple events with _inatomic
- only copy events if min_nr are available
- back off and retry if _inatomic faults
- go back to sleep if too few events are available
- return once a copy doesn't fault
(I guess one could only fall back to prefaulting if the _inatomic
fails. I don't know if most event buffers fault or not, I'd guess not.)
This would get rid of the nonsense of consuming events (possibly
depriving other concurrent waiters) without returning them to
userspace in a timely fashion. We get rid of a lot of the per-event-
delivery overhead that the current loop suffers from.
The controversial part here happens when min_nr is larger than the
ring size. In that case I think we should consider min_nr to be
equal to the ring size. We'll return fewer events than userspace
asked for, but it'll still be a big batch.
Does io_getevents() returning +ve but < min_nr raise alarm bells for
anyone? It's already done in the timeout case, so arguably code
already handles it.
Now, on the waking and sleeping side:
- min_nr 0 and 1 as exclusive waiters in ctx->wait
- maintain a rbtree of with min_nr of waiters when > 1
- always wake ctx->wait if it's pending
- wake rb_first of the tree if nr_in_ring >= node->min_nr
The rbtree nodes would have a task_struct pointer and its waker would
use wake_up_process(). It'd be maintained under the ctx lock. It'll
be expensive to maintain in the case of lots of concurrent min_nr > 1
waiters, but we're declaring that the slow path.
The fast path of a single waiter with > 1 min_nr becomes a pointer
indirection through the root node of the rb tree. It should be just
like the current list_head use in the wait queue. If the wait queue
head and rbtree root node share a cacheline in the context then
checking them both (and almost always only finding one of them
populated) shouldn't cost the completion path much.
I don't have a strong opinion about using exclusive wake-ups or not
because I doubt we ever have concurrent waiters in the field. I'm
leaning towards avoiding the thundering herd, even if that risks the
event delivery stream waiting behind a woken waiter who might be
blocking while prefaulting their destination event buffer. We should
wake from io_getevents() before it sleeps or returns, then. I could
easily be convinced otherwise.
The wait queue lets us support lots of min_nr == 1 waiters without
the rbtree maintenance overhead.. they're not as self-evidently
broken in the current code as multiple min_nr > 1 waiters, so they
may well be in use? who knows. It's easy.
How's this all sound? I can throw this together if you'd prefer it
that way. Though I suspect you'll want to keep hacking on this :).
Finally, let's get a test case for multiple min_nr > 1 waiters into
autotest. I'd use our trivial example where the min_nr = 2 waiter
fires of 3 ios for the concurrent min_nr = 3 waiter to ensure that
things don't get stuck.
- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[Index of Archives]
[Kernel Newbies]
[Netfilter]
[Bugtraq]
[Photo]
[Stuff]
[Gimp]
[Yosemite News]
[MIPS Linux]
[ARM Linux]
[Linux Security]
[Linux RAID]
[Video 4 Linux]
[Linux for the blind]
[Linux Resources]