Evgeniy Polyakov wrote:
Possible solution:
a) it would be possible to have a "used" flag in each ring buffer entry.
That's too expensive, I guess.
b) kevent_wait needs another parameter which specifies the which is the
last (i.e., least recently added) entry in the ring buffer.
Everything between this entry and the current head (in ->kidx) is
occupied. If multiple threads arrive in kevent_wait the highest idx
(with wrap around possibly lowest) is used.
kevent_wait will not try to move more entries into the ring buffer
if ->kidx and the higest index passed in to any kevent_wait call
is equal (i.e., the ring buffer is full).
There is one issue, though, and that is that a system call is needed
to signal to the kernel that more entries in the ring buffer are
processed and that they can be refilled. This goes against the
kernel filling the ring buffer automatically (see below)
If thread calls kevent_wait() it means it has processed previous entries,
one can call kevent_wait() with $num parameter as zero, which
means that thread does not want any new events, so nothing will be
copied.
This doesn't solve the problem. You could only request new events when
all previously reported events are processed. Plus: how do you report
events if the you don't allow get_event pass them on?
Writable ring buffer does not sound too good to me - what if one thread
will overwrite the whole ring buffer so kernel's indexes can be screwed?
Agreed, there are problems. This is why I suggested the ring buffer can
be a structured. Parts of it might be read-only, other parts
read/write. I don't necessarily think the 'used' flag is the right way.
And front/tail pointer solution seems to be better.
Ring buffer processed not in FIFO order is wrong idea
Not necessarily, see my comments about CPU affinity in the previous mail.
- ring buffer can
be potentially very big and searching there for the entry, which was
been marked as 'free' by userspace is not a solution at all - userspace
in that case must provide ukevent so fast tree search would be used,
(and although it is already possible) it requires userspace to make
additional syscalls which is not what we want.
It is not necessary. I've proposed to only have a fron and tail
pointer. The tail pointer is maintained by the application and passed
to the kernel explicitly or via shared memory. The kernel maintains the
front pointer. No tree needed.
As a solution I can create folowing scheme:
there are two syscalls (or one with a switch) which get events and
commits them.
kevent_wait() becomes a syscall which waits until number of events or
one of them becomes ready and just copies them into ring buffer and
returns. kevent_wait() will fail with special error code when ring
buffer is full.
kevent_commit() frees requested number of events _from the beginning_,
i.e. from special index, visible from userspace. Userspace can create
special counters for events (and even put them into read-only ring
buffer overwriting some fields of kevent, especially if we will increase
it's size) and only call kevent_commit() when all events have zero usage
counter.
Right, that's basically the front/tail pointer implementation. That
would work. You just have to make sure that the kevent_wait() call
takes the current front pointer/index as a parameter. This way if the
buffer gets filled between the thread checking the ring buffer (and
finding it empty) and the syscall being handled the thread is not suspended.
I disagree that having possibility to have holes in the ring buffer is a
good idea at all - it requires much more complex protocol, which will
fill and reuse that holes, and the main disavantge - it requires to
transfer much more information from userspace to kernelspace to free the
ring entry in the hole - in that case it is already possible just to
call kevent_ctl(KEVENT_REMOVE) and do not wash the brain with new
approach at all.
Well, it would require more data transport of we'd use writable shared
memory. But I agree, it's far too complicated and might not scale with
growing ring buffer sizes.
- implementing the kevent_wait syscall the proposed way means we are
missing out on one possible optimization. The ring buffer is
currently only filled on kevent_wait calls. I expect that in really
high traffic situations requests are coming in at a higher rate than
the can be processed. At least for periods of time. If such
situations it would be nice to not have to call into the kernel at
all. If the kernel would deliver into the ring buffer on its own
this would be possible.
Well, it can be done on behalf of workqueue or dedicated thread which
will bring up appropriate mm context,
I think it should be done. It's potentially a huge advantage.
although it means that userspace
can not handle the load it requested, which is a bad sign...
I don't understand. What is not supposed to work? There is nothing
which cannot work with automatic posting since the get_event() call does
nothing but copying the event data over and wake a thread.
- the kevent_get_event syscall is not needed at all. All reporting
should be done using a ring buffer. There really is not reason to
keep two interfaces around which serve the same purpose. Making
the argument the kevent_get_event is so much easier to use is not
valid. The exposed interface to access the ring buffer will be easy,
too. In the OLS paper I more or wait hinted at the interfaces. I
think they should be like this (names are irrelevant):
Well, kevent_get_events() _is_ much easier to use. And actually having
only that interface it is possible to implement ring buffer with any
kind or protocol for its controlling - userspace can have a wrapper
which will call kevent_get_events() with pointer which shows to the
place in the shared ring buffer where to place new events, that wrapper
can handle essentially any kind of flags/parameters which are suitable
for that ring buffer implementation.
That's far too slow. The whole point behind the ring buffer is speed.
And emulation would defeat the purpose.
But since we started to implement ring buffer as a additional feature of
kevent, let's find the way all people will be happy with before removing
something which was proven to work correctly.
The get_event interface is basically the userlevel interface the runtime
(glibc probably) would provide. Programmers don't see the complexity.
I'm concerned about the get_event interface holding the kernel
implementation back. For instance, automatic filling the ring buffer.
This would not be possible if the program is free to mix
kevent_get_event and kevent_wait calls freely. If you do away with the
get_event syscall the automatic ring buffer filling is possible and a
logical extension.
The last three are exactly kevent_get_events() with different set of
parameters - it is possible to get events without sleeping, it is
possible to wait until at least something is ready and it is possible to
sleep for timeout.
Exactly. But these interfaces should be implemented at userlevel, not
at the syscall level. It's not necessary. The kernel interface should
be kept as small as possible and the get_event syscall is pure duplication.
They all already imeplemented. Just all above, and it was done several
months ago already. No need to reinvent what is already there.
Even if we will decide to remove kevent_get_events() in favour of ring
buffer-only implementation, winting-for-event syscall will be
essentially kevent_get_events() without pointer to the place where to
put events.
Right, but this limitation of the interface is important. It means the
interface of the kernel is smaller: fewer possibilities for problems and
fewer constraints if in future something should be changed (and smaller
kernel).
I agree that having special syscall to initialize kevent is a good idea,
and initial kevent implementation had it, but it was removed due to API
cleanup work by Cristoph Hellwing.
Well, he is wrong. If, for instance, init or any of the programs which
start first wants to use the syscall it couldn't because /dev isn't
mounted. The program might use libraries and therefore not have any
influence on whether the kevent stuff is used or not.
Yes, the /dev interface is useful for some/many other kernel interfaces.
But this is a core interface. For the same reason epoll_create is a
syscall.
Do you have _any_ kind of benchmarks with epoll() which would show that
it is feasible? ukevent is one cache line (well, 2 cache lines on old
CPUs), which can be setup way too far away from the time when it is
ready, and CPU which origianlly set that up can be busy, so we will lose
performance waiting until CPU becomes free instead of calling other
thread on different CPU.
If the period between the generation of the event (e.g., incoming
network traffic or sent data) and the delivery of the event by waking a
thread is too long, it makes not too much sense. But if the L2 cache
hasn't hasn't been flushed it might be a big advantage.
I think it's reasonable to only have the last queued entry for a CPU
handled special. And note, this is only ever a hint. If an event entry
was created by the kernel in one CPU but none of the threads which wait
to be waken is on that CPU, nothing has to be done.
No, I don't have a benchmark. But it is likely quite easily possible to
create a synthetic benachmark. Maybe with pipes.
It is possible to specify CPU id in kevent (not in ukevent, i.e. not
in shared by userspace structure, but in it's kernel representation),
and then check if currently active CPU is the same or not, but what if
it is not the same CPU?
Nothing special. It's up to the userlevel wrapper code. The CPU number
would only be a hint.
Entry order is important, since application can
take advantage of synchronization, so idea to skip some entries is bad.
That's something the application should be make a call about. It's not
always (or even mostly) the case that the ordering of the notification
is important. Furthermore, this would also require the kernel to
enforce an ordering. This is expensive on SMP machines. A locally
generated event (i.e., source and the thread reporting the event) can be
delivered faster than an event created on another CPU.
It is management task - kernel should not even know about someone has
died and can not process events it requested.
But the kernel has to be involed.
Userspace can open a control pipe (and setup a kevent handler for it)
and glibc will write there a byte thus awakening some other thread.
It can be done in userspace and should be done in userspace.
That's invasive. The problem is that no userlevel interface should have
to implicitly keep file descriptors open. This would mean the
application would be influenced since suddenly a file descriptor is not
available anymore. Yes, applications shouldn't care but they
unfortunately sometimes do.
Will we discuss it for death?
Kevent does not need to have absolute timeout.
Of course it does. Just because you don't see a need for it for your
applications right now it doesn't mean it's not a valid use.
Because timeout specified there is always related to the start of
syscall, since it is a timeout which specifies maximum time frame
syscall can live.
That's your current implementation. There is absolutely no reason
whatsoever why this couldn't be changed.
I created kevent_signal notifications - it allows user to setup any set
of interested signals before call to kevent_get_events() and friends.
No need to solve a problem with operation way when there is tactical and
strategical ones
Of course there is a need and I explained it before. Getting signal
notifications is in no way the same as changing the signal mask
temporarily. You cannot correctly emulate the case where you want to
block a signal while in the call as reenable it afterwards. Receiving
the signal as an event and then artificially raising it is not the same.
Especially timing-wise, the signal kevent might not be seen long after
the syscall returns because other entries are worked on first.
The opposite case is equally impossible to emulate: unblocking a signal
just for the duration of the syscall. These are all possible and used
cases.
- the KEVENT_REQ_WAKEUP_ONE functionality is good and needed. But I
would reverse the default. I cannot see many places where you want
all threads to be woken. Introduce KEVENT_REQ_WAKEUP_ALL instead.
I.e. to wake up only first thread always and in addon those threads
which have specified flag set? Ok, will put into todo foer the next
release.
It's a flag for an event. So the threads won't have the flag set. If
an event is delivered with the flag set, wake all threads. Otherwise
just one.
- there is really no reason to invent yet another timer implementation.
We have the POSIX timers which are feature rich and nicely
implemented. All that is needed is to implement SIGEV_KEVENT as a
notification mechanism. The timer is registered as part of the
timer_create() syscalls.
Feel free to add any interface you like - it is as simple as call for
kevent_user_add_ukevent() in userspace.
No, that's not what I mean. There is no need for the special
timer-related part of your patch. Instead the existing POSIX timer
syscalls should be modified to handle SIGEV_KEVENT notification. Again,
keep the interface as small as possible. Plus, the POSIX timer
interface is very flexible. You don't want to duplicate all that
functionality.
And I almost silently stay behind with the fact that it is possbile to
implement _all_ above ring buffer things in userspace with
kevent_get_events() and this functionality is there for almost a year :)
Again, this defeats the purpose completely. The ring buffer is the
faster interface, especially when coupled with asynchronous filling of
ring buffer (i.e., without a syscal).
Let's solve problem in order of theirs appearance - what do you think
about above interface for ring buffer?
Looks better, yes.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[Index of Archives]
[Kernel Newbies]
[Netfilter]
[Bugtraq]
[Photo]
[Stuff]
[Gimp]
[Yosemite News]
[MIPS Linux]
[ARM Linux]
[Linux Security]
[Linux RAID]
[Video 4 Linux]
[Linux for the blind]
[Linux Resources]