Re: rcu-refcount stacker performance

Thanks, Paul, that's a great idea!

The approach I'm testing right now just does a module_get(mod), which
is released when you manually disable the module by echoing it's name
into /sys/kernel/security/stacker/unload, so that must be done before
you can rmmod.  This way no refcounting is actually needed at all
in the time-critical loops, but it involves a change in the LSM API
which may well be rejected, so your approach may be the way to go.

thanks,
-serge

Quoting Paul E. McKenney (paulmck@us.ibm.com):
> On Thu, Jul 14, 2005 at 12:13:57PM -0500, serue@us.ibm.com wrote:
> > Quoting Paul E. McKenney (paulmck@us.ibm.com):
> > > On Thu, Jul 14, 2005 at 08:44:50AM -0500, serue@us.ibm.com wrote:
> > > > Quoting Paul E. McKenney (paulmck@us.ibm.com):
> > > > > My guess is that the reference count is indeed costing you quite a
> > > > > bit.  I glance quickly at the patch, and most of the uses seem to
> > > > > be of the form:
> > > > > 
> > > > > 	increment ref count
> > > > > 	rcu_read_lock()
> > > > > 	do something
> > > > > 	rcu_read_unlock()
> > > > > 	decrement ref count
> > > > > 
> > > > > Can't these cases rely solely on rcu_read_lock()?  Why do you also
> > > > > need to increment the reference count in these cases?
> > > > 
> > > > The problem is on module unload: is it possible for CPU1 to be
> > > > on "do something", and sleep, and, while it sleeps, CPU2 does
> > > > rmmod(lsm), so that by the time CPU1 stops sleeping, the code it
> > > > is executing has been freed?
> > > 
> > > OK, but in the above case, "do something" cannot be sleeping, since
> > > it is under rcu_read_lock().
> > 
> > Oh, but that's not quite what the code is doing, rather it is doing:
> > 
> > 	rcu_read_lock
> > 	while get next element from list
> > 		inc element.refcount
> > 		rcu_read_unlock
> > 		do something
> > 		rcu_read_lock
> > 		dec refcount
> > 	rcu_read_unlock
> 
> Color me blind this morning...  :-/  Yes, "do something" can legitimately
> sleep.  Sorry for my confusion!
> 
> > What I plan to try next is:
> > 
> > 	rcu_read_lock
> > 	while get next element from list
> > 		if (element->owning_module->state != LIVE)
> > 			continue
> > 		rcu_read_unlock
> > 		do something
> > 		rcu_read_lock
> > 	rcu_read_unlock
> > 
> > > > Because stacker won't remove the lsm from the list of modules
> > > > until mod->exit() is executed, and module_free(mod) happens
> > > > immediately after that, the above scenario seems possible.
> > > 
> > > Right, if you have some other code path that sleeps (outside of
> > > rcu_read_lock(), right?), then you need the reference count for that
> > > code path.  But the code paths that do not sleep should be able to
> > > dispense with the reference count, reducing the cache-line traffic.
> > 
> > Most if not all of the codepaths can sleep, however.  So unfortunately
> > that doesn't seem a feasible solution.  That's why I'm hoping there is
> > something inherent in the module unload code that I can take advantage
> > of to forego my own refcounting.
> 
> OK, so the only way that elements are removed is when a module is
> unloaded, right?
> 
> If your module trick does not pan out, how about the following:
> 
> o	Add a "need per-element reference count" global variable
> 
> o	Have a per-CPU reference-count variable.
> 
> o	Make your code snippet do something like the following:
> 
> 	rcu_read_lock()
> 	while get next element from list
> 		if (need per-element reference count)
> 			ref = &element.refcount
> 		else
> 			ref = &__get_cpu_var(stacker_refcounts)
> 		atomic_inc(ref)
> 		rcu_read_unlock()
> 		do something
> 		rcu_read_lock()
> 		atomic_dec(ref)
> 	rcu_read_unlock()
> 
> o	The point is to (hopefully) reduce the cache thrashing associated
> 	with the reference counts.
> 
> At module unload time, do something like the following:
> 
> 	need per-element reference count = 1
> 	synchronize_rcu()
> 	for_each_cpu(cpu)
> 		while (per_cpu(stacker_refcounts,cpu) != 0)
> 			sleep for a bit
> 
> 	/* At this point, all CPUs are using per-element reference counts */
> 
> If this approach does not reduce cache thrashing enough, one could use
> a per-task reference count instead of a per-CPU reference count.  The
> downside of doing this per-task approach is that you have to traverse
> the entire task list at unload time.  But module unloading should be
> quite rare.  If doing the per-task approach, you don't need atomic
> increments and decrements for the reference count, and you have excellent
> cache locality.
> 
> 							Thanx, Paul
> 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

References:
- rcu-refcount stacker performance
  - From: serue@us.ibm.com
- Re: rcu-refcount stacker performance
  - From: "Paul E. McKenney" <paulmck@us.ibm.com>
- Re: rcu-refcount stacker performance
  - From: serue@us.ibm.com
- Re: rcu-refcount stacker performance
  - From: "Paul E. McKenney" <paulmck@us.ibm.com>
- Re: rcu-refcount stacker performance
  - From: serue@us.ibm.com
- Re: rcu-refcount stacker performance
  - From: "Paul E. McKenney" <paulmck@us.ibm.com>

Prev by Date: Re: NUMA support for dual core Opteron
Next by Date: Re: [discuss] Re: NUMA support for dual core Opteron
Previous by thread: Re: rcu-refcount stacker performance
Next by thread: Re: rcu-refcount stacker performance
Index(es):
- Date
- Thread

[Index of Archives] [Kernel Newbies] [Netfilter] [Bugtraq] [Photo] [Gimp] [Yosemite News] [MIPS Linux] [ARM Linux] [Linux Security] [Linux RAID] [Video 4 Linux] [Linux for the blind]