Re: [PATCH 1/2] scalable rw_mutex

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, 2007-05-14 at 17:36 -0700, Paul E. McKenney wrote:
> On Mon, May 14, 2007 at 12:59:10PM +0200, Peter Zijlstra wrote:
> > Changes include:
> > 
> >  - wmb+rmb != mb
> >  - ->state folded into ->waiter
> > 
> > ---
> > Subject: scalable rw_mutex
> > 
> > Scalable reader/writer lock.
> > 
> > Its scalable in that the read count is a percpu counter and the reader fast
> > path does not write to a shared cache-line.
> > 
> > Its not FIFO fair, but starvation proof by alternating readers and writers.
> 
> Hmmm...  brlock reincarnates, but as sleeplock.  ;-)

/me googles... ooh, yeah, quite similar.

> I believe that there are a few severe problems in this code, search
> for "!!!" to quickly find the specific areas that concern me.

Thanks for the feedback, replies at the !!! sites.

> 						Thanx, Paul
> 
> > Signed-off-by: Peter Zijlstra <[email protected]>
> > ---
> >  include/linux/rwmutex.h |   82 ++++++++++++++++
> >  kernel/Makefile         |    3 
> >  kernel/rwmutex.c        |  232 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 316 insertions(+), 1 deletion(-)
> 
> List of races that must be resolved:
> 
> 1.	Read acquire vs. write acquire.
> 
> 	rw_mutex_read_lock() invokes __rw_mutex_read_trylock(), if
> 		this fails, invokes rw_mutex_read_lock_slow().
> 
> 		__rw_mutex_read_trylock() increments the per-CPU counter,
> 			does smp_mb(), picks up ->waiter:
> 				if non-NULL decrements the per-CPU
> 				counter, does a barrier(), does
> 				wake_up_process() on the task fetched
> 				from ->waiter.  Return failure.
> 
> 				Otherwise, return success.
> 
> 		rw_mutex_read_lock_slow() increments ->read_waiters,
> 			acquires ->read_mutex, increments the ->readers
> 			counter, and decrements the ->read_waiters
> 			counter.  It then fetches ->waiter, and, if
> 			non-NULL, wakes up the tasks.
> 			Either way, releases ->read_mutex.
> 
> 	rw_mutex_write_lock_nested(): acquires ->write_mutex, which
> 		prevents any writer-writer races.  Acquires ->read_mutex,
> 		which does -not- prevent readers from continuing to
> 		acquire.  Sets ->waiter to current, which -does-
> 		(eventually) stop readers.  smp_mb(), then invokes
> 		rw_mutex_writer_wait() for the sum of the per-CPU
> 		counters to go to zero.
> 
> 		!In principle, this could be indefinitely postponed,
> 		but in practice would require an infinite number of
> 		reading tasks, so probably OK to ignore.  ;-)
> 		This can occur because the readers unconditionally
> 		increment their per-CPU counters, and decrement it
> 		only later.
> 
> 	The smp_mb()s currently in the reader and the writer code
> 	forms a Dekker-algorithm-like barrier, preventing both the
> 	reader and writer from entering their critical section, as
> 	required.
> 
> 2.	Read acquire vs. write release (need to avoid reader sleeping
> 	forever, even in the case where no one ever uses the lock again).
> 
> 	rw_mutex_read_lock() invokes __rw_mutex_read_trylock(), if
> 		this fails, invokes rw_mutex_read_lock_slow().
> 
> 		__rw_mutex_read_trylock() increments the per-CPU counter,
> 			does smp_mb(), picks up ->waiter:
> 				if non-NULL decrements the per-CPU
> 				counter, does a barrier(), does
> 				wake_up_process() on the task fetched
> 				from ->waiter.  Return failure.
> 
> 				Otherwise, return success.
> 
> 		rw_mutex_read_lock_slow() increments ->read_waiters,
> 			acquires ->read_mutex, increments the ->readers
> 			counter, and decrements the ->read_waiters
> 			counter.  It then fetches ->waiter, and, if
> 			non-NULL, wakes up the tasks.
> 			Either way, releases ->read_mutex.
> 
> 	rw_mutex_write_unlock(): pick up ->read_waiters, release
> 		->read_mutex, if copy of ->read_waiters was non-NULL
> 		do slow path (but refetches ->read_waiters??? why???
> 		[Ah -- refetched each time through the loop in the
> 		rw_mutex_writer_wait() macro), NULL out ->waiter,
> 		then release ->write_mutex.
> 
> 		Slow path: Pick up ->waiter, make sure it is us,
> 			set state to uninterruptible, loop while
> 			->read_waiters less than the value fetched
> 			earlier from ->read_waiters, scheduling each time
> 			through, set state back to running.
> 
> 		(!!!This is subject to indefinite postponement by a
> 		flurry of readers, see the commentary for 
> 		rw_mutex_write_unlock() interspersed below.)
> 
> 	However, the fact that ->read_mutex is unconditionally released
> 	by the writer prevents the readers from being starved.
> 
> 3.	Read release vs. write acquire (similar to #2).
> 
> 	rw_mutex_read_unlock(): decrement the per-CPU counter, smb_mb(),
> 		pick up ->waiter, if non-NULL, wake it up.
> 
> 	rw_mutex_write_lock_nested(): acquires ->write_mutex, which
> 		prevents any writer-writer races.  Acquires ->read_mutex,
> 		which does -not- prevent readers from continuing to
> 		acquire.  Sets ->waiter to current, which -does-
> 		(eventually) stop readers.  smp_mb(), then invokes
> 		rw_mutex_writer_wait() for the sum of the per-CPU
> 		counters to go to zero.
> 
> 	As written, the writer never really blocks, so omitting the
> 	wakeup is OK.  (Failing to really block is -not- OK given realtime
> 	processes, but more on that later.)
> 
> 4.	Read release vs. write release -- presumably this one cannot
> 	happen, but wouldn't want to fall prey to complacency.  ;-)
> 
> 	Strangely enough, this -can- happen!!!	When the readers
> 	indefinitely postpone writer release, the readers will also
> 	be read-releasing...  So the readers will be needlessly waking
> 	up the task that is trying to finish releasing the write lock.
> 	:-/  Not a big deal in this case, but just shows to go that
> 	when reviewing locking algorithms, you should -never- restrict
> 	yourself to considering only things that seem possible.  ;-)
> 
> > Index: linux-2.6/include/linux/rwmutex.h
> > ===================================================================
> > --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> > +++ linux-2.6/include/linux/rwmutex.h	2007-05-14 10:34:32.000000000 +0200
> > @@ -0,0 +1,82 @@
> > +/*
> > + * Scalable reader/writer lock.
> > + *
> > + *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <[email protected]>
> > + *
> > + * This file contains the public data structure and API definitions.
> > + */
> > +#ifndef _LINUX_RWMUTEX_H
> > +#define _LINUX_RWMUTEX_H
> > +
> > +#include <linux/preempt.h>
> > +#include <linux/wait.h>
> > +#include <linux/percpu_counter.h>
> > +#include <linux/lockdep.h>
> > +#include <linux/mutex.h>
> > +#include <asm/atomic.h>
> > +
> > +struct rw_mutex {
> > +	/* Read mostly global */
> > +	struct percpu_counter	readers;
> > +
> > +	/* The following variables are only for the slowpath */
> > +	struct task_struct	*waiter;	/* w -> r waiting */
> > +	struct mutex		read_mutex;	/* r -> w waiting */
> > +	struct mutex		write_mutex;	/* w -> w waiting */
> 
> Priority-inheritance relationship?  Seems like this would be tough
> to arrange while still avoiding deadlock...
> 
> Readers currently do boost the writer via ->read_mutex.
> 
> !!!
> 
> Writers currently do -not- boost readers.  In fact, the identities of
> the readers are not tracked, so there is no way for the writer to tell
> what to boost.  Admittedly, write-to-read priority boosting is quite
> the can of worms if you allow more than one reader, but something will
> be needed for realtime kernels.
> 
> A brlock-like implementation can allow boosting in both directions,
> but has other issues (such as write-side performance even in the case
> where there are no readers).

You could short circuit the no readers case for a full brlock like
affair. But yeah, the write side will be horribly heavy when you have to
take nr_cpu_ids() locks.

> > +	atomic_t		read_waiters;
> > +
> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > +	struct lockdep_map dep_map;
> > +#endif
> > +};
> > +
> > +void __rw_mutex_init(struct rw_mutex *rw_mutex, const char * name,
> > +		struct lock_class_key *key);
> > +void rw_mutex_destroy(struct rw_mutex *rw_mutex);
> > +
> > +#define rw_mutex_init(rw_mutex)					\
> > +	do {							\
> > +		static struct lock_class_key __key;		\
> > +		__rw_mutex_init((rw_mutex), #rw_mutex, &__key);	\
> > +	} while (0)
> > +
> > +void rw_mutex_read_lock_slow(struct rw_mutex *rw_mutex);
> > +
> > +void rw_mutex_write_lock_nested(struct rw_mutex *rw_mutex, int subclass);
> > +void rw_mutex_write_unlock(struct rw_mutex *rw_mutex);
> > +
> > +int __rw_mutex_read_trylock(struct rw_mutex *rw_mutex);
> > +
> > +static inline int rw_mutex_read_trylock(struct rw_mutex *rw_mutex)
> > +{
> > +	int ret = __rw_mutex_read_trylock(rw_mutex);
> > +	if (ret)
> > +		rwsem_acquire_read(&rw_mutex->dep_map, 0, 1, _RET_IP_);
> > +	return ret;
> > +}
> > +
> > +static inline void rw_mutex_read_lock(struct rw_mutex *rw_mutex)
> > +{
> > +	int ret;
> > +
> > +	might_sleep();
> > +	rwsem_acquire_read(&rw_mutex->dep_map, 0, 0, _RET_IP_);
> > +
> > +	ret = __rw_mutex_read_trylock(rw_mutex);
> > +	if (!ret)
> > +		rw_mutex_read_lock_slow(rw_mutex);
> > +}
> > +
> > +void rw_mutex_read_unlock(struct rw_mutex *rw_mutex);
> > +
> > +static inline int rw_mutex_is_locked(struct rw_mutex *rw_mutex)
> > +{
> > +	return mutex_is_locked(&rw_mutex->write_mutex);
> > +}
> > +
> > +static inline void rw_mutex_write_lock(struct rw_mutex *rw_mutex)
> > +{
> > +	rw_mutex_write_lock_nested(rw_mutex, 0);
> > +}
> > +
> > +#endif /* _LINUX_RWMUTEX_H */
> > Index: linux-2.6/kernel/rwmutex.c
> > ===================================================================
> > --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> > +++ linux-2.6/kernel/rwmutex.c	2007-05-14 11:32:01.000000000 +0200
> > @@ -0,0 +1,229 @@
> > +/*
> > + * Scalable reader/writer lock.
> > + *
> > + *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <[email protected]>
> > + *
> > + * Its scalable in that the read count is a percpu counter and the reader fast
> > + * path does not write to a shared cache-line.
> > + *
> > + * Its not FIFO fair, but starvation proof by alternating readers and writers.
> > + */
> > +#include <linux/sched.h>
> > +#include <linux/rwmutex.h>
> > +#include <linux/debug_locks.h>
> > +#include <linux/module.h>
> > +
> > +/*
> > + * rw mutex - oxymoron when we take mutex to stand for 'MUTual EXlusion'
> > + *
> > + * However in this context we take mutex to mean a sleeping lock, with the
> > + * property that it must be released by the same context that acquired it.
> > + *
> > + * design goals:
> > + *
> > + * A sleeping reader writer lock with a scalable read side, to avoid bouncing
> > + * cache-lines.
> > + *
> > + * dynamics:
> > + *
> > + * The reader fast path is modification of a percpu_counter and a read of a
> > + * shared cache-line.
> > + *
> > + * The write side is quite heavy; it takes two mutexes, a writer mutex and a
> > + * readers mutex. The writer mutex is for w <-> w interaction, the read mutex
> > + * for r -> w. The read side is forced into the slow path by setting the
> > + * status bit. Then it waits for all current readers to disappear.
> > + *
> > + * The read lock slow path; taken when the status bit is set; takes the read
> > + * mutex. Because the write side also takes this mutex, the new readers are
> > + * blocked. The read unlock slow path tickles the writer every time a read
> > + * lock is released.
> > + *
> > + * Write unlock clears the status bit, and drops the read mutex; allowing new
> > + * readers. It then waits for at least one waiting reader to get a lock (if
> > + * there were any readers waiting) before releasing the write mutex which will
> > + * allow possible other writers to come in an stop new readers, thus avoiding
> > + * starvation by alternating between readers and writers
> > + *
> > + * considerations:
> > + *
> > + * The lock's space footprint is quite large (on x86_64):
> > + *
> > + *   88 bytes				[struct rw_mutex]
> > + *    8 bytes per cpu NR_CPUS		[void *]
> > + *   32 bytes per cpu (nr_cpu_ids)	[smallest slab]
> > + *
> > + *  408 bytes for x86_64 defconfig (NR_CPUS = 32) on a 2-way box.
> > + *
> > + * The write side is quite heavy; this lock is best suited for situations
> > + * where the read side vastly dominates the write side.
> > + */
> > +
> > +void __rw_mutex_init(struct rw_mutex *rw_mutex, const char *name,
> > +		struct lock_class_key *key)
> > +{
> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > +	debug_check_no_locks_freed((void *)rw_mutex, sizeof(*rw_mutex));
> > +	lockdep_init_map(&rw_mutex->dep_map, name, key, 0);
> > +#endif
> > +
> > +	percpu_counter_init(&rw_mutex->readers, 0);
> > +	rw_mutex->waiter = NULL;
> > +	mutex_init(&rw_mutex->read_mutex);
> > +	mutex_init(&rw_mutex->write_mutex);
> > +}
> > +EXPORT_SYMBOL(__rw_mutex_init);
> > +
> > +void rw_mutex_destroy(struct rw_mutex *rw_mutex)
> > +{
> > +	percpu_counter_destroy(&rw_mutex->readers);
> > +	mutex_destroy(&rw_mutex->read_mutex);
> > +	mutex_destroy(&rw_mutex->write_mutex);
> > +}
> > +EXPORT_SYMBOL(rw_mutex_destroy);
> > +
> > +#define rw_mutex_writer_wait(rw_mutex, condition)		\
> > +do {								\
> > +	struct task_struct *tsk = (rw_mutex)->waiter;		\
> > +	BUG_ON(tsk != current);					\
> > +								\
> > +	set_task_state(tsk, TASK_UNINTERRUPTIBLE);		\
> > +	while (!(condition)) {					\
> > +		schedule();					\
> 
> !!!
> 
> If there are at least as many of these scalable reader-writer locks
> as there are CPUs, and each such lock has a realtime-priority writer
> executing, you can have an infinite loop starving all the non-realtime
> readers, who are then unable to ever read-release the lock, preventing
> the writers from making progress.  Or am I missing something subtle here?

It will actually block the task. schedule() will make it sleep when
state is uninterruptible.

> > +		set_task_state(tsk, TASK_UNINTERRUPTIBLE);	\
> > +	}							\
> > +	tsk->state = TASK_RUNNING;				\
> > +} while (0)
> 
> Can't something like __wait_event() be used here?

Blame Oleg for this :-)
He suggested I not use waitqueues since I only ever have a single
waiter.

> > +void rw_mutex_read_lock_slow(struct rw_mutex *rw_mutex)
> > +{
> > +	struct task_struct *tsk;
> > +
> > +	/*
> > +	 * read lock slow path;
> > +	 * count the number of readers waiting on the read_mutex
> > +	 */
> > +	atomic_inc(&rw_mutex->read_waiters);
> > +	mutex_lock(&rw_mutex->read_mutex);
> > +
> > +	percpu_counter_inc(&rw_mutex->readers);
> > +
> > +	/*
> > +	 * wake up a possible write unlock; waiting for at least a single
> > +	 * reader to pass before letting a new writer through.
> > +	 */
> > +	atomic_dec(&rw_mutex->read_waiters);
> > +	tsk = rw_mutex->waiter;
> > +	if (tsk)
> > +		wake_up_process(tsk);
> > +	mutex_unlock(&rw_mutex->read_mutex);
> > +}
> > +EXPORT_SYMBOL(rw_mutex_read_lock_slow);
> > +
> > +int __rw_mutex_read_trylock(struct rw_mutex *rw_mutex)
> > +{
> > +	struct task_struct *tsk;
> > +
> > +	percpu_counter_inc(&rw_mutex->readers);
> > +	/*
> > +	 * ensure the ->readers store and the ->waiter load is properly
> > +	 * sequenced
> > +	 */
> > +	smp_mb();
> > +	tsk = rw_mutex->waiter;
> > +	if (unlikely(tsk)) {
> > +		percpu_counter_dec(&rw_mutex->readers);
> > +		/*
> > +		 * ensure the ->readers store has taken place before we issue
> > +		 * the wake_up
> > +		 *
> > +		 * XXX: or does this require an smp_wmb() and the waiter to do
> > +		 *   (smp_rmb(), percpu_counter(&rw_mutex->readers) == 0)
> 
> I don't think that -anything- is needed here as written.  The "sleeping"
> writers don't really block, they instead simply loop on schedule().  Of
> course that in itself is a -really- bad idea if case of realtime-priority
> writers...
> 
> So the writers need to really block, which will make the wakeup code
> much more ticklish.  In that case, it seems to me that smp_mb() will
> be needed here, but much will depend on the exact structure of the
> block/wakeup scheme chosen.

It does really sleep. And Nick thinks the suggested wmb + rmb will
suffice if I read his latest mail correctly.

> > +		barrier();
> > +		/*
> > +		 * possibly wake up a writer waiting for this reference to
> > +		 * disappear
> > +		 */
> 
> !!!
> 
> Suppose we get delayed here (preempted, interrupted, whatever) and the
> task referenced by tsk has exited and cleaned up before we get a chance
> to wake it up?  We might well be "waking up" some entirely different
> data structure in that case, not?

Yes, this is a real problem, I'll borrow a spinlock from one of the
mutexes.

> > +		wake_up_process(tsk);
> > +		return 0;
> > +	}
> > +	return 1;
> > +}
> > +EXPORT_SYMBOL(__rw_mutex_read_trylock);
> > +
> > +void rw_mutex_read_unlock(struct rw_mutex *rw_mutex)
> > +{
> > +	struct task_struct *tsk;
> > +
> > +	rwsem_release(&rw_mutex->dep_map, 1, _RET_IP_);
> > +
> > +	percpu_counter_dec(&rw_mutex->readers);
> > +	/*
> > +	 * ensure the ->readers store and the ->waiter load is properly
> > +	 * sequenced
> > +	 */
> > +	smp_mb();
> > +	tsk = rw_mutex->waiter;
> > +	if (unlikely(tsk)) {
> > +		/*
> > +		 * on the slow path; nudge the writer waiting for the last
> > +		 * reader to go away
> > +		 */
> 
> !!!
> 
> What if we are delayed (interrupted, preempted, whatever) here, so that
> the task referenced by tsk has exited and been cleaned up before we can
> execute the following?  We could end up "waking up" the freelist, not?

idem.

> > +		wake_up_process(tsk);
> > +	}
> > +}
> > +EXPORT_SYMBOL(rw_mutex_read_unlock);
> > +
> > +void rw_mutex_write_lock_nested(struct rw_mutex *rw_mutex, int subclass)
> > +{
> > +	might_sleep();
> > +	rwsem_acquire(&rw_mutex->dep_map, subclass, 0, _RET_IP_);
> > +
> > +	mutex_lock_nested(&rw_mutex->write_mutex, subclass);
> > +	BUG_ON(rw_mutex->waiter);
> > +
> > +	/*
> > +	 * block new readers
> > +	 */
> > +	mutex_lock_nested(&rw_mutex->read_mutex, subclass);
> > +	rw_mutex->waiter = current;
> > +	/*
> > +	 * full barrier to sequence the store of ->waiter
> > +	 * and the load of ->readers
> > +	 */
> > +	smp_mb();
> > +	/*
> > +	 * and wait for all current readers to go away
> > +	 */
> > +	rw_mutex_writer_wait(rw_mutex,
> > +			(percpu_counter_sum(&rw_mutex->readers) == 0));
> > +}
> > +EXPORT_SYMBOL(rw_mutex_write_lock_nested);
> > +
> > +void rw_mutex_write_unlock(struct rw_mutex *rw_mutex)
> > +{
> > +	int waiters;
> > +
> > +	might_sleep();
> > +	rwsem_release(&rw_mutex->dep_map, 1, _RET_IP_);
> > +
> > +	/*
> > +	 * let the readers rip
> > +	 */
> > +	waiters = atomic_read(&rw_mutex->read_waiters);
> > +	mutex_unlock(&rw_mutex->read_mutex);
> > +	/*
> > +	 * wait for at least 1 reader to get through
> > +	 */
> > +	if (waiters) {
> > +		rw_mutex_writer_wait(rw_mutex,
> > +			(atomic_read(&rw_mutex->read_waiters) < waiters));
> > +	}
> 
> !!!
> 
> Readers can indefinitely postpone the write-unlock at this point.
> Suppose that there is one reader waiting when we fetch ->read_waiters
> above.  Suppose that as soon as the mutex is released, a large number
> of readers arrive.  Because we have not yet NULLed ->waiter, all of
> these readers will take the slow path, incrementing the ->read_waiters
> counter, acquiring the ->read_mutex, and so on.  The ->read_mutex lock
> acquisition is likely to be the bottleneck for large systems, which
> would result in the count remaining 2 or higher indefinitely.  The
> readers would make progress (albeit quite inefficiently), but the
> writer would never get out of the loop in the rw_mutex_writer_wait()
> macro.

Ah, yeah. Ouch!

I missed this detail when I got rid of ->state. I used to clear the
reader slow path before unlocking the ->read_mutex.

> Consider instead using a generation number that just increments every
> time a reader successfully acquires the lock and is never decremented.
> That way, you can unambiguously determine when at least one reader has
> made progress.  An additional counter required, but might be able to
> multiplex with something else.  Or get rid of the current ->read_waiters
> in favor of the generation number?  This latter seems like it should be
> possible, at least at first glance...  Just change the name of the field,
> get rid of the decrement, and change the comparison to "!=", right?

Yes, this seems like a very good suggestion, I'll give it a shot.
Thanks!

> > +	rw_mutex->waiter = NULL;
> > +	/*
> > +	 * before we let the writers rip
> > +	 */
> > +	mutex_unlock(&rw_mutex->write_mutex);
> > +}
> > +EXPORT_SYMBOL(rw_mutex_write_unlock);
> > Index: linux-2.6/kernel/Makefile
> > ===================================================================
> > --- linux-2.6.orig/kernel/Makefile	2007-05-12 15:33:00.000000000 +0200
> > +++ linux-2.6/kernel/Makefile	2007-05-12 15:33:02.000000000 +0200
> > @@ -8,7 +8,8 @@ obj-y     = sched.o fork.o exec_domain.o
> >  	    signal.o sys.o kmod.o workqueue.o pid.o \
> >  	    rcupdate.o extable.o params.o posix-timers.o \
> >  	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
> > -	    hrtimer.o rwsem.o latency.o nsproxy.o srcu.o die_notifier.o
> > +	    hrtimer.o rwsem.o latency.o nsproxy.o srcu.o die_notifier.o \
> > +	    rwmutex.o
> > 
> >  obj-$(CONFIG_STACKTRACE) += stacktrace.o
> >  obj-y += time/
> > 
> > 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux