On Sat, Jan 07, 2006 at 12:36:25AM -0800, David S. Miller wrote:
> From: Eric Dumazet <[email protected]>
> Date: Sat, 07 Jan 2006 08:53:52 +0100
>
> > I have no problem with this, since the biggest server I have is a
> > 4-way, but are you sure big machines won't suffer from this single
> > spinlock?
>
> It is the main question.
>
> > Also I don't understand what you want to do after this single
> > spinlock patch. How is it supposed to help the 'ip route flush
> > cache' problem? In my case, I have about 600,000 dst-entries:
>
> I don't claim to have a solution to this problem currently.
>
> Doing RCU and going through the whole DST GC machinery is overkill for
> an active system. So, perhaps a very simple solution will do:
>
> 1) On rt_run_flush(), do not rt_free(), instead collect all active
> routing cache entries onto a global list, begin a timer to
> fire in 10 seconds (or some sysctl configurable amount).
>
> 2) When a new routing cache entry is needed, check the global
> list appended to in #1 above first, failing that do dst_alloc()
> as is done currently.
>
> 3) If timer expires, rt_free() any entries in the global list.
>
> The missing trick is how to ensure RCU semantics when reallocating
> from the global list.
The straightforward ways of doing this require a per-entry lock in
addition to the dst_entry reference count -- lots of read-side overhead.
More complex approaches use a generation number that is incremented
when an entry is added to or removed from the global list. When the
generation number overflows, the entry is unconditionally handed to
rt_free() rather than being added to the global list again. Then there
needs to be some clever code on the read side to detect the case where
the generation number changes while a reference is being acquired.
And memory barriers. Also lots of read-side overhead. And it is now
-always- necessary to acquire a reference on the read side.
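Just to illustrate the sort of read-side contortions that would be
needed, here is a rough sketch of the generation-number check. The
"gen" and "refcnt" fields and the try_take_ref() helper are made up
for this example -- they are not the real dst_entry layout -- and the
matching update-side barriers and checks are not shown:

        /*
         * Hypothetical sketch only: "gen" and "refcnt" are illustrative
         * fields, not real dst_entry members.  The generation number is
         * bumped whenever the entry moves to or from the global reuse list.
         */
        struct reusable_entry {
                atomic_t        refcnt;
                unsigned int    gen;
                /* ... the rest of the entry ... */
        };

        /* Returns 0 if the entry got recycled underneath us; caller retries. */
        static int try_take_ref(struct reusable_entry *e)
        {
                unsigned int g = e->gen;

                smp_rmb();              /* read gen before bumping refcnt */
                atomic_inc(&e->refcnt);
                smp_mb();               /* order the inc before the re-check */
                if (e->gen != g) {
                        atomic_dec(&e->refcnt); /* lost a race with reuse */
                        return 0;
                }
                return 1;
        }

And even this glosses over the matching checks and barriers on the
update side; the extra atomic operation and barriers in the lookup
path are exactly the read-side overhead complained about above.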
> The idea is that an active system will immediately repopulate itself
> with all of these entries just flushed from the table.
>
> RCU really doesn't handle this kind of problem very well. It truly
> excels when work is generated by process context work, not interrupt
> work.
Sounds like a challenge to me. ;-)
Well, one possible way to attack Eric's workload might be the following
(a rough sketch in C follows the list):
o Size the hash table to strike the appropriate balance between
read-side search overhead and memory consumption. Call the
number of hash-chain headers N.
o Create a hashed array of locks sized to allow the update to
proceed sufficiently quickly. Call the number of locks M,
probably a power of two. This means that M CPUs can be doing
the update in parallel.
o	Create an array of M^2 list headers (call it xfer[][]); since
	this is only needed during an update, it can be allocated
	and deallocated if need be. (I, with my big-server experience,
	would probably just create the array statically, since M is not
	likely to be too large. But your mileage may vary. And you really
	only need M*(M-1) list headers, since an entry that stays in its
	own partition never goes through xfer[][], but that makes the
	index calculation a bit more annoying.)
o Use a two-phase update. In the first phase, each updating
CPU acquires the corresponding lock and removes entries from
the corresponding partition of the hash table. If the new
location of a given entry falls into the same partition, it
is added back to the appropriate hash chain of that partition.
	Otherwise, the entry is added to xfer[dst][src], where "src" is
	the index of its old partition and "dst" the index of its new one.
o When all CPUs finish removing entries from their partition,
they check into a barrier. Once all have checked in, they
can start the second phase of the update.
o	In the second phase, each CPU removes from the xfer array the
	entries destined for its partition and adds each one to the
	appropriate hash chain.
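To make the above concrete, here is a rough kernel-style sketch of the
two phases. All of the names here (struct cache_entry, part_lock[],
xfer[][], PART_OF(), new_hash(), chain_for(), and
for_each_chain_in_partition()) are invented for illustration -- this is
not the real routing-cache code -- and the barrier between the phases
is only hinted at by a comment (one possible sleeping implementation
appears after the commentary list below):

        /*
         * Illustrative sketch only.  part_lock[] entries are initialized
         * with spin_lock_init() and the xfer[][] heads with INIT_LIST_HEAD()
         * at setup time.  Each xfer[][] element gets its own cache line,
         * for the reasons given in the commentary below.
         */
        #define NR_PARTS 8                      /* "M", a power of two */

        struct xfer_bucket {
                struct list_head head;
        } ____cacheline_aligned_in_smp;

        static spinlock_t part_lock[NR_PARTS];
        static struct xfer_bucket xfer[NR_PARTS][NR_PARTS];

        /* Phase 1: the CPU owning partition "src" empties its own chains. */
        static void flush_phase1(int src)
        {
                struct list_head *chain;
                struct cache_entry *e, *n;

                spin_lock(&part_lock[src]);
                for_each_chain_in_partition(chain, src) {
                        list_for_each_entry_safe(e, n, chain, hash_node) {
                                int dst = PART_OF(new_hash(e));

                                list_del(&e->hash_node);
                                if (dst == src)  /* stays in this partition */
                                        list_add(&e->hash_node,
                                                 chain_for(new_hash(e)));
                                else             /* hand off to partition dst */
                                        list_add(&e->hash_node,
                                                 &xfer[dst][src].head);
                        }
                }
                spin_unlock(&part_lock[src]);
        }

        /* ... all updating CPUs meet at a barrier here ... */

        /* Phase 2: the CPU owning partition "dst" drains its row of xfer[][]. */
        static void flush_phase2(int dst)
        {
                struct cache_entry *e, *n;
                int src;

                spin_lock(&part_lock[dst]);
                for (src = 0; src < NR_PARTS; src++) {
                        list_for_each_entry_safe(e, n, &xfer[dst][src].head,
                                                 hash_node) {
                                list_del(&e->hash_node);
                                list_add(&e->hash_node,
                                         chain_for(new_hash(e)));
                        }
                }
                spin_unlock(&part_lock[dst]);
        }

Note that phase 2 still takes the partition lock, so that concurrent
single-entry insertions and removals in that partition remain safe.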
Some commentary and variations, in the hope that this inspires someone
to come up with an even better idea:
o Unless M is at least three, there is no performance gain
over a single global lock with a single CPU doing the update,
since each element must now undergo four list operations rather
than just two.
o The xfer[][] array must have each entry cache-aligned, or
you lose big on cacheline effects. Note that it is -not-
sufficient to simply align the rows or the columns, since
each CPU has its own column when inserting and its own
row when removing from xfer[][].
o	And the data-skew effects are less severe if this procedure
	runs from process context; a spinning barrier must be used
	otherwise. But note that the per-partition locks could remain
	spinlocks, and only the barrier needs to involve sleeping (in
	case that helps; I am getting a bit ahead of my understanding
	of this part of the kernel). A sketch of such a barrier follows
	this list.
	So I half-agree with Dave -- the bulk update works better if it
	is done in process context, but updates that add or remove
	individual entries can still be done from interrupt context.
o	One (more complex!) way to reduce the effect of data skew is
	to divide the table among the updating CPUs by the number of
	elements rather than by the number of chains, but this would
	require a second barrier that completes before the locks are
	dropped, in order to avoid messing up concurrent single-element
	updates.
And this only handles the phase-1 data-skew effects. It is
possible to handle phase-2 data skew as well, but it takes an
up-front full-table scan and so it is not clear how it could
possibly be a performance win.
o Note that routing slows down significantly while an update is
in progress, because on average only half of the entries are
in the hash table during the update. I have no idea whether
or not this is acceptable. I am assuming that the same
mechanism that prevents two concurrent route-cache misses
from inserting two entries for the same route would also
prevent adding a new entry while one is already sitting in
the xfer array.
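For what it's worth, here is one way the sleeping barrier mentioned
above might look, assuming that all of the bulk updaters run in
process context. The names are again made up, and phase1_left would be
initialized to the number of updating CPUs (M) before phase 1 starts:

        static atomic_t phase1_left;            /* set to M before phase 1 */
        static DECLARE_WAIT_QUEUE_HEAD(phase1_wq);

        /* One-shot barrier between phase 1 and phase 2. */
        static void phase_barrier(void)
        {
                if (atomic_dec_and_test(&phase1_left))
                        wake_up_all(&phase1_wq);  /* last CPU wakes the rest */
                else
                        wait_event(phase1_wq,
                                   atomic_read(&phase1_left) == 0);
        }

The per-partition locks can then stay ordinary spinlocks; only this
hand-off sleeps.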
Thoughts?
Thanx, Paul