Re: [RFC] RCU and CONFIG_PREEMPT_RT progress

On Tue, 10 May 2005, Paul E. McKenney wrote:

> On Tue, May 10, 2005 at 12:55:31PM +0200, Esben Nielsen wrote:
> [...]
> > > 
> > > The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> > > counter-based approach, which seems to work, but which can result in
> > > indefinite-duration grace periods.
> >
> > Is that really a problem in practice? With RCU, updates should happen
> > rarely.
> 
> Yes.  Several people have run into serious problems on small-memory
> machines under denial-of-service workloads.  Not pretty.
>
That is what I tell people at work: dynamic memory allocation is _not_ a
good thing in real-time embedded systems. They call me old-fashioned when
I say I want everything important pre-allocated in separate pools, but
doing so avoids exactly this kind of surprise, makes the system more
deterministic, and makes it easier to find leaks.
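
For illustration, a minimal pool of the kind I mean could look like this
(a sketch only; all names are made up, not from any kernel API):

  #define MAX_MSGS 128

  struct message {
          struct message *next;
          char payload[64];
  };

  static struct message slots[MAX_MSGS];  /* all storage reserved up front */
  static struct message *free_list;       /* chained through slots[] at boot */

  static struct message *pool_get(void)
  {
          struct message *m = free_list;

          if (m)
                  free_list = m->next;    /* O(1), no allocator involved */
          return m;                       /* NULL = pool empty: a hard,
                                             visible limit, not an OOM surprise */
  }

  static void pool_put(struct message *m)
  {
          m->next = free_list;
          free_list = m;
  }

(Locking against concurrent users omitted; a real pool would guard the
free list with a spinlock.)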
 
> [...]
> > 
> > Why bother? What is the overhead of checking relative to just doing the
> > increment/decrement?
> 
> The increment of the per-CPU counter pair is likely to incur a cache miss,
> and thus be orders of magnitude more expensive than the increment of
> the variable in the task structure.  I did not propose this optimization
> lightly.  ;-)
> 
> > I had another idea: Do it in the scheduler:
> >   per_cpu_rcu_count += prev->rcu_read_count;
> >   if(per_cpu_rcu_count==0) {
> >      /* grace period encountered */
> >   }
> >   (do the task switch)
> >   
> >   per_cpu_rcu_count -= next->rcu_read_count;
> > 
> > Then per_cpu_rcu_count is the rcu read count for all tasks except the
> > current one. During schedule() there is no current task, and therefore it
> > is the total count.
> 
> One downside of this approach is that it would maximally expand the
> RCU read-side critical sections -- one of the things we need is to
> avoid holding onto memory longer than necessary, as noted above.
> 
> Also, what prevents a grace period from ending while a task is
> preempted in an RCU read-side critical section?  Or are you intending
> that synchronize_rcu() scan the task list as well as the per-CPU
> counters?
Eh, if a task is preempted in an RCU read-side CS there IS no grace period!
The trick is to avoid having it hang there for too long (see my boosting
mechanism below).
> 
> > > [...] 
> > > 2.	#1 above, but use a seqlock to guard the counter selection in
> > > 	rcu_read_lock().  One potential disadvantage of this approach
> > > 	is that an extremely unlucky instance of rcu_read_lock() might
> > > 	be indefinitely delayed by a series of counter flips.  I am
> > > 	concerned that this might actually happen under low-memory
> > > 	conditions.  Also requires memory barriers on the read side,
> > > 	which we might be stuck with, but still hope to be able to
> > > 	get rid of.  And the per-CPU counter manipulation must use
> > > 	atomic instructions.
> >
> > Why not just use 
> >  preempt_disable(); 
> >  manipulate counters; 
> >  preempt_enable();
> > ??
> 
> Because we can race with the counter switch, which can happen on some
> other CPU.  Therefore, preempt_disable() does not help with this race.

Well, the per-cpu counters should be truly per-cpu, i.e. all tasks must
_only_ touch the counter on their current CPU. preempt_disable() prevents
the migration of the current task, right?
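
Something like this sketch is what I have in mind (the attached patch uses
local_irq_save()/restore() instead, which also covers use from interrupt
context):

  /* Sketch only: the per-cpu counter is only ever touched on its own CPU. */
  static inline void rcu_read_lock(void)
  {
          preempt_disable();      /* no migration while we touch the counter */
          __get_cpu_var(rcu_data).active_readers++;
          current->rcu_read_lock_nesting++;
          preempt_enable();       /* preemption is fine again; the scheduler
                                     refuses to migrate a task with
                                     rcu_read_lock_nesting > 0, see 3) below */
  }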
 
> 
> > > 3.	#1 above, but use per-CPU locks to guard the counter selection.
> > > 	I don't like this any better than #2, worse, in fact, since it
> > > 	requires expensive atomic instructions as well.
> >
> > local_irq_disable() and/or preempt_disable() should work, as the counters
> > are per-cpu, right?
> 
> Again, neither of these help with the race against the counter-switch,
> which would likely be happening on some other CPU.
As above :-)

> 
> > Or see the alternative above: the counter is per task but latched to a
> > per-cpu counter in schedule().
> 
> This could work (assuming that synchronize_rcu() scanned the task list
> as well as the per-CPU counters), but again would extend grace periods
> too long for small-memory machines subject to denial-of-service attacks.
> 
No need to traverse the list. Again, synchronize_rcu() waits for all
CPUs to check their number _locally_. The task doing that can simply check
its latched number, as it knows it is not in a read-side CS.
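
This is essentially what the rcu_qsctr_inc() change in the attached patch
does:

  /* A CPU only reports a quiescent state once its local reader count
     has drained to zero; no task-list scan is required. */
  static inline void rcu_qsctr_inc(int cpu)
  {
          struct rcu_data *rdp = &per_cpu(rcu_data, cpu);

          if (rdp->active_readers == 0)
                  rdp->passed_quiesc = 1;
  }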

> > [...] 
> > I was playing around with it a little while back. I didn't send a patch,
> > though, as I didn't have time. It works on a UP machine, but I don't have
> > an SMP machine to test on :-(
> 
> There are some available from OSDL, right?  Or am I out of date here?
Honestly, I can't remember if I mailed it to anyone. I only have the code
on my home PC, which broke down; right now I am running off a backup. I
have attached the patch to this mail. I don't have time to clean it up :-(
I see it also includes code for a test device with which I try to measure
the grace period.

> 
> > The ingredients are:
> > 1) A per-cpu read-side counter, protected locally by preempt_disable().
> 
> In order to avoid too-long grace periods, you need a pair of counters.
> If you only have a single counter, you can end up with a series of
> RCU read-side critical-section preemptions resulting in the counter
> never reaching zero, and thus the grace period never terminating.
> Not good, especially in small-memory machines subject to denial-of-service
> attacks.
See the boosting trick below; that takes care of it.

> 
> > 2) A task read-side counter (doesn't need protection at all).
> 
> Yep.  Also a field in the task struct indicating which per-CPU counter
> applies (unless the scheduler is managing this).
No, it is _always_ the local CPU that applies, as I have tried to ensure
with 3) below (which isn't tested at all).
> 
> > 3) Tasks having a non-zero read-side counter can't be migrated to other
> > CPUs. That is to make the per-cpu counter truly per-cpu.
> 
> If CPU hotplug can be disallowed, then this can work.  Do we really want
> to disallow CPU hotplug?  Especially given that some implementations
> might want to use CPU hotplug to reduce power consumption?  There also
> might be issues with tickless operation.
> 
Yeah, I noticed the hotplug code. What are the RT requirements for
hotplugging? With 4) below it shouldn't be a problem to wait until all
tasks are out of their read-side sections.

A better alternative is to make migration work like this (a rough,
untested sketch follows the list):
1) Deactivate the task.
2) Make the target CPU increment its rcu counter.
3) Migrate the task.
4) Make the source CPU decrement its rcu counter.
Then you can safely migrate tasks preempted in read-side CS as well.
Harder to code than my small hack, though.
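
Roughly, in code (migrate_rcu_reader() is a hypothetical helper, and
steps 2) and 4) would have to run on the counter's own CPU, e.g. via an
IPI, to keep the counters truly per-cpu):

  static void migrate_rcu_reader(task_t *p, int src, int dst)
  {
          deactivate_task(p, cpu_rq(src));          /* 1) off the runqueue  */
          per_cpu(rcu_data, dst).active_readers +=
                  p->rcu_read_lock_nesting;         /* 2) on the target CPU */
          set_task_cpu(p, dst);                     /* 3) migrate           */
          per_cpu(rcu_data, src).active_readers -=
                  p->rcu_read_lock_nesting;         /* 4) on the source CPU */
          activate_task(p, cpu_rq(dst), 0);
  }
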

> > 4) When the scheduler sees a SCHED_OTHER task with rcu-read-counter > 0,
> > it boosts the priority to the maximum non-RT priority so that it will not
> > be preempted by other SCHED_OTHER tasks. This makes the delay in
> > synchronize_kernel() somewhat deterministic (it depends on how
> > "deterministic" the RT tasks in the system are).
> 
> But one would only want to do this boost if there was a synchronize_rcu()
> waiting -and- if there is some reason to accelerate grace periods (e.g.,
> low on memory), right?  If there is no reason to accelerate grace periods,
> then extending them increases efficiency by amortizing grace-period
> overhead over many updates.
Well, I only do the boost if the task is actually preempted in the
read-side CS; that should be relatively rare. I could add a check and
only do the boost if a request for a grace period is pending. BUT the
problem is that when somebody requests a grace period later on, it would
have to traverse the runqueue and boost the tasks in read-side CS. That
sounds more expensive, but one can only know by testing....
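
In the attached patch the boost happens unconditionally in schedule(); the
check-on-request variant would look something like this
(rcu_grace_period_requested() is a hypothetical pending-grace-period flag):

  if (unlikely(prev->rcu_read_lock_nesting) &&
      prev->prio > MAX_RT_PRIO &&
      rcu_grace_period_requested()) {       /* hypothetical flag */
          prio_array_t *prev_array = prev->array;
          if (prev_array)
                  dequeue_task(prev, prev_array);
          prev->prio = MAX_RT_PRIO;         /* highest non-RT priority */
          if (prev_array)
                  enqueue_task_head(prev, prev_array);
  }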

> 
> > I'll try to send you what I've got when I get my computer at home back up.
> > I have only tested it on my UP laptop, where it seems to run fine.
> 
> I look forward to seeing it!
> 
Attached here :-)
> 							Thanx, Paul
> 

Esben
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/Makefile linux-2.6.12-rc1-RCU/Makefile
--- Orig/linux-2.6.12-rc1/Makefile	Sun Mar 20 20:30:55 2005
+++ linux-2.6.12-rc1-RCU/Makefile	Sun Mar 27 21:05:58 2005
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 12
-EXTRAVERSION =-rc1
+EXTRAVERSION =-rc1-RCU-SMP
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/drivers/char/Kconfig linux-2.6.12-rc1-RCU/drivers/char/Kconfig
--- Orig/linux-2.6.12-rc1/drivers/char/Kconfig	Sun Mar 20 20:31:03 2005
+++ linux-2.6.12-rc1-RCU/drivers/char/Kconfig	Fri Mar 25 00:19:07 2005
@@ -978,6 +978,13 @@
 	  The mmtimer device allows direct userspace access to the
 	  Altix system timer.
 
+
+config RCU_TESTER
+	tristate "Test device for exercising the RCU code"
+	default n
+	help
+	  Helps with debugging the RCU code. Say N unless you debug RCU.
+
 source "drivers/char/tpm/Kconfig"
 
 endmenu
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/drivers/char/Makefile linux-2.6.12-rc1-RCU/drivers/char/Makefile
--- Orig/linux-2.6.12-rc1/drivers/char/Makefile	Sun Mar 20 20:31:03 2005
+++ linux-2.6.12-rc1-RCU/drivers/char/Makefile	Fri Mar 25 00:19:11 2005
@@ -48,6 +48,8 @@
 obj-$(CONFIG_VIOTAPE)		+= viotape.o
 obj-$(CONFIG_HVCS)		+= hvcs.o
 
+obj-$(CONFIG_RCU_TESTER)	+= rcu_tester.o
+
 obj-$(CONFIG_PRINTER) += lp.o
 obj-$(CONFIG_TIPAR) += tipar.o
 
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/drivers/char/rcu_tester.c linux-2.6.12-rc1-RCU/drivers/char/rcu_tester.c
--- Orig/linux-2.6.12-rc1/drivers/char/rcu_tester.c	Thu Jan  1 01:00:00 1970
+++ linux-2.6.12-rc1-RCU/drivers/char/rcu_tester.c	Wed Mar 30 00:15:52 2005
@@ -0,0 +1,131 @@
+/*
+ * test device for testing RCU code
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+
+#define RCU_TESTER_MINOR	 222
+
+#define RCU_TESTER_WRITE	4247
+#define RCU_TESTER_READ 	4248
+
+#define ERCU_FAILED               35
+
+#ifndef notrace
+#define notrace
+#endif
+
+u64 notrace get_cpu_tick(void)
+{
+	u64 tsc;
+#ifdef ARCHARM
+	tsc = *oscr;
+#else
+	__asm__ __volatile__("rdtsc" : "=A" (tsc));
+#endif
+	return tsc;
+}
+
+void notrace loop(int loops)
+{
+	int i;
+
+	for (i = 0; i < loops; i++)
+		get_cpu_tick();
+}
+
+static spinlock_t write_lock;
+static int *protected_data = NULL;
+
+static int rcu_tester_open(struct inode *in, struct file *file)
+{
+	return 0;
+}
+
+static long rcu_tester_ioctl(struct file *file,
+			  unsigned int cmd, unsigned long args)
+{
+	switch (cmd) {
+	case RCU_TESTER_READ: {
+		int *dataptr, data1, data2;
+		rcu_read_lock();
+		dataptr = rcu_dereference(protected_data);
+		data1 = *dataptr;
+		loop(args);
+		data2 = *dataptr;
+		rcu_read_unlock();
+		if (data1 != data2)
+			return -ERCU_FAILED;
+		else
+			return 0; /* ok */
+	}
+	case RCU_TESTER_WRITE: {
+		int *newdata, *olddata;
+		newdata = kmalloc(sizeof(int), GFP_KERNEL);
+		if (!newdata)
+			return -ENOMEM;
+
+		spin_lock(&write_lock);
+		olddata = rcu_dereference(protected_data);
+		*newdata = *olddata + 1;
+		rcu_assign_pointer(protected_data, newdata);
+		spin_unlock(&write_lock);
+		synchronize_kernel();
+		kfree(olddata);
+		return 0;
+	}
+	default:
+		return -EINVAL;
+	}
+}
+
+static struct file_operations rcu_tester_fops = {
+	.owner		= THIS_MODULE,
+	.llseek		= no_llseek,
+	.unlocked_ioctl = rcu_tester_ioctl,
+	.open		= rcu_tester_open,
+};
+
+static struct miscdevice rcu_tester_dev =
+{
+	RCU_TESTER_MINOR,
+	"rcu_tester",
+	&rcu_tester_fops
+};
+
+static int __init rcu_tester_init(void)
+{
+	protected_data = kmalloc(sizeof(int), GFP_KERNEL);
+	if (!protected_data)
+		return -ENOMEM;
+	*protected_data = 0;
+
+	spin_lock_init(&write_lock);
+
+	if (misc_register(&rcu_tester_dev)) {
+		kfree(protected_data);
+		return -ENODEV;
+	}
+
+	return 0;
+}
+
+void __exit rcu_tester_exit(void)
+{
+	misc_deregister(&rcu_tester_dev);
+	synchronize_kernel();	/* let readers drain before freeing */
+	kfree(protected_data);
+	printk(KERN_INFO "rcu-tester device uninstalled\n");
+}
+
+module_init(rcu_tester_init);
+module_exit(rcu_tester_exit);
+
+MODULE_LICENSE("GPL");
+
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/include/linux/init_task.h linux-2.6.12-rc1-RCU/include/linux/init_task.h
--- Orig/linux-2.6.12-rc1/include/linux/init_task.h	Sun Mar 20 20:31:28 2005
+++ linux-2.6.12-rc1-RCU/include/linux/init_task.h	Fri Mar 25 00:22:00 2005
@@ -74,6 +74,7 @@
 	.usage		= ATOMIC_INIT(2),				\
 	.flags		= 0,						\
 	.lock_depth	= -1,						\
+	.rcu_read_lock_nesting = 0,				       	\
 	.prio		= MAX_PRIO-20,					\
 	.static_prio	= MAX_PRIO-20,					\
 	.policy		= SCHED_NORMAL,					\
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/include/linux/rcupdate.h linux-2.6.12-rc1-RCU/include/linux/rcupdate.h
--- Orig/linux-2.6.12-rc1/include/linux/rcupdate.h	Wed Mar  2 08:37:50 2005
+++ linux-2.6.12-rc1-RCU/include/linux/rcupdate.h	Sun Mar 27 18:12:45 2005
@@ -41,6 +41,8 @@
 #include <linux/percpu.h>
 #include <linux/cpumask.h>
 #include <linux/seqlock.h>
+#include <linux/sched.h>
+#include <linux/interrupt.h>
 
 /**
  * struct rcu_head - callback structure for use with RCU
@@ -85,6 +87,7 @@
  * curlist - current batch for which quiescent cycle started if any
  */
 struct rcu_data {
+	int             active_readers;
 	/* 1) quiescent state handling : */
 	long		quiescbatch;     /* Batch # for grace period */
 	int		passed_quiesc;	 /* User-mode/idle loop etc. */
@@ -115,12 +118,14 @@
 static inline void rcu_qsctr_inc(int cpu)
 {
 	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-	rdp->passed_quiesc = 1;
+	if (rdp->active_readers == 0)
+		rdp->passed_quiesc = 1;
 }
 static inline void rcu_bh_qsctr_inc(int cpu)
 {
 	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-	rdp->passed_quiesc = 1;
+	if (rdp->active_readers == 0)
+		rdp->passed_quiesc = 1;
 }
 
 static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
@@ -183,14 +188,34 @@
  *
  * It is illegal to block while in an RCU read-side critical section.
  */
-#define rcu_read_lock()		preempt_disable()
+static inline void rcu_read_lock(void)
+{	
+	unsigned long flags;
+
+	local_irq_save(flags);
+ 	__get_cpu_var(rcu_data).active_readers++;
+	current->rcu_read_lock_nesting++;
+	local_irq_restore(flags);
+}
 
 /**
  * rcu_read_unlock - marks the end of an RCU read-side critical section.
  *
  * See rcu_read_lock() for more information.
  */
-#define rcu_read_unlock()	preempt_enable()
+static inline void rcu_read_unlock(void)
+{
+	unsigned long flags;
+	int cpu;
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	
+ 	per_cpu(rcu_data,cpu).active_readers--;
+	current->rcu_read_lock_nesting--;
+	rcu_qsctr_inc(cpu);
+	local_irq_restore(flags);
+}
 
 /*
  * So where is rcu_write_lock()?  It does not exist, as there is no
@@ -213,14 +238,34 @@
  * can use just rcu_read_lock().
  *
  */
-#define rcu_read_lock_bh()	local_bh_disable()
+static inline void rcu_read_lock_bh(void)
+{ 
+	unsigned long flags;
+
+	local_irq_save(flags);
+ 	__get_cpu_var(rcu_bh_data).active_readers++;
+	current->rcu_read_lock_nesting++;
+	local_irq_restore(flags);
+}
 
 /*
  * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
  *
  * See rcu_read_lock_bh() for more information.
  */
-#define rcu_read_unlock_bh()	local_bh_enable()
+static inline void rcu_read_unlock_bh(void)	
+{ 
+	unsigned long flags;
+	int cpu;
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	
+ 	per_cpu(rcu_bh_data,cpu).active_readers--;
+	current->rcu_read_lock_nesting--;
+	rcu_bh_qsctr_inc(cpu);
+	local_irq_restore(flags);
+}
 
 /**
  * rcu_dereference - fetch an RCU-protected pointer in an
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/include/linux/sched.h linux-2.6.12-rc1-RCU/include/linux/sched.h
--- Orig/linux-2.6.12-rc1/include/linux/sched.h	Sun Mar 20 20:31:29 2005
+++ linux-2.6.12-rc1-RCU/include/linux/sched.h	Fri Mar 25 00:22:00 2005
@@ -569,6 +569,9 @@
 	unsigned long ptrace;
 
 	int lock_depth;		/* Lock depth */
+	
+	int rcu_read_lock_nesting; /* how many RCU read-side regions the
+				      thread is in */
 
 	int prio, static_prio;
 	struct list_head run_list;
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/init/main.c linux-2.6.12-rc1-RCU/init/main.c
--- Orig/linux-2.6.12-rc1/init/main.c	Sun Mar 20 20:31:30 2005
+++ linux-2.6.12-rc1-RCU/init/main.c	Fri Mar 25 13:36:26 2005
@@ -688,8 +688,11 @@
 	system_state = SYSTEM_RUNNING;
 	numa_default_policy();
 
-	if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
-		printk(KERN_WARNING "Warning: unable to open an initial console.\n");
+	{
+	  int res = sys_open((const char __user *) "/dev/console", O_RDWR, 0);
+	  if(res<0)
+	    printk(KERN_WARNING "Warning: unable to open an initial console.: retval=%d\n",res);
+	}
 
 	(void) sys_dup(0);
 	(void) sys_dup(0);
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/kernel/rcupdate.c linux-2.6.12-rc1-RCU/kernel/rcupdate.c
--- Orig/linux-2.6.12-rc1/kernel/rcupdate.c	Wed Mar  2 08:37:30 2005
+++ linux-2.6.12-rc1-RCU/kernel/rcupdate.c	Fri Mar 25 13:08:38 2005
@@ -66,7 +66,10 @@
 	  {.lock = SPIN_LOCK_UNLOCKED, .cpumask = CPU_MASK_NONE };
 
 DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
+EXPORT_PER_CPU_SYMBOL(rcu_data);
 DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };
+EXPORT_PER_CPU_SYMBOL(rcu_bh_data);
+
 
 /* Fake initialization required by compiler */
 static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/kernel/sched.c linux-2.6.12-rc1-RCU/kernel/sched.c
--- Orig/linux-2.6.12-rc1/kernel/sched.c	Sun Mar 20 20:31:30 2005
+++ linux-2.6.12-rc1-RCU/kernel/sched.c	Wed Mar 30 00:06:39 2005
@@ -594,6 +594,9 @@
 	if (rt_task(p))
 		return p->prio;
 
+	if (p->rcu_read_lock_nesting)
+		return MAX_RT_PRIO;
+
 	bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
 
 	prio = p->static_prio - bonus;
@@ -986,6 +989,9 @@
 	if (unlikely(task_running(rq, p)))
 		goto out_activate;
 
+	if (unlikely(p->rcu_read_lock_nesting))
+		goto out_activate;
+
 #ifdef CONFIG_SCHEDSTATS
 	schedstat_inc(rq, ttwu_cnt);
 	if (cpu == this_cpu) {
@@ -1644,6 +1650,8 @@
 		return 0;
 	if (!cpu_isset(this_cpu, p->cpus_allowed))
 		return 0;
+	if (p->rcu_read_lock_nesting)
+		return 0;
 
 	/*
 	 * Aggressive migration if:
@@ -2633,6 +2641,7 @@
 need_resched:
 	preempt_disable();
 	prev = current;
+	  
 	release_kernel_lock(prev);
 need_resched_nonpreemptible:
 	rq = this_rq();
@@ -2675,6 +2684,7 @@
 		else {
 			if (prev->state == TASK_UNINTERRUPTIBLE)
 				rq->nr_uninterruptible++;
+
 			deactivate_task(prev, rq);
 		}
 	}
@@ -2710,6 +2720,17 @@
 	}
 
 	array = rq->active;
+
+	if (unlikely(prev->rcu_read_lock_nesting) &&
+	    prev->prio > MAX_RT_PRIO) {
+		prio_array_t *prev_array = prev->array;
+		if (prev_array)
+			dequeue_task(prev, prev_array);
+		prev->prio = MAX_RT_PRIO;
+		if (prev_array)
+			enqueue_task_head(prev, prev_array);
+	}
+
 	if (unlikely(!array->nr_active)) {
 		/*
 		 * Switch the active and expired arrays.
