Re: [PATCH] cpuset semaphore depth check optimize

Thank-you, Roman, for looking into cpuset locks.

I am starting to see the picture you are painting, with a global
cpuset_sem to guard changes in the cpuset hierarchy (add/remove)
and a per-cpuset spinlock to briefly pin down an active cpuset
and guard select values (such as mems_allowed) for access from
allocator callbacks and other tasks.
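
To make sure we are picturing the same thing, here is a rough sketch of
the structure as I imagine it.  The 'lock' field is the hypothetical
per-cpuset spinlock; the other fields follow the existing struct cpuset
in kernel/cpuset.c:

	struct cpuset {
		unsigned long flags;		/* CS_* flag bits */
		cpumask_t cpus_allowed;		/* CPUs allowed to tasks here */
		nodemask_t mems_allowed;	/* Memory Nodes allowed to tasks here */

		atomic_t count;			/* count of tasks using this cpuset */
		struct cpuset *parent;		/* my parent */

		spinlock_t lock;		/* hypothetical: pins this cpuset and
						 * guards mems_allowed et al for
						 * allocator callbacks */
	};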

Roman wrote:
> If I read the source correctly, a cpuset cannot be removed or moved while 
> it's attached to a task, which makes it a lot simpler.

Yes - a cpuset cannot be removed while attached (count > 0).  And there
is no 'move' that I know of.  A rename(2) system call on a cpuset in
the cpuset filesystem gets the vfs default -EINVAL response.

So, yes, if I can pin a cpuset with a per-cpuset spinlock on it, then
its parent chain (and whatever else I take care to guard with that
spinlock) is held as well (the cpuset->parent chain is pinned).  I
guess this is some of what you meant by your phrase "makes it a lot
simpler".


> This means you have to take the second lock ...

By "second lock", did you mean what you described in your earlier
message as:

> low-level lock (maybe even a spinlock) which manages the state of an 
> active cpuset.


You also wrote:
> You can BTW avoid locking in cpuset_exit() completely in the common case:
> 
> 	tsk->cpuset = NULL;
> 	if (atomic_dec_and_test(&cs->count) && notify_on_release(cs)) {

I don't think that works.  And I suspect you are proposing the same bug
that I had, and fixed with the following patch:

> changeset:   1793:feaa2e5b1e0026994365daf8d1ad3bd745ba8a14
> user:        Paul Jackson <[email protected]>
> date:        Fri May 27 08:07:26 2005 +0019
> files:       kernel/cpuset.c
> description:
> [PATCH] cpuset exit NULL dereference fix
> 
> There is a race in the kernel cpuset code, between the code
> to handle notify_on_release, and the code to remove a cpuset.
> The notify_on_release code can end up trying to access a
> cpuset that has been removed.  In the most common case, this
> causes a NULL pointer dereference from the routine cpuset_path.
> However all manner of bad things are possible, in theory at least.
> 
> The existing code decrements the cpuset use count, and if the
> count goes to zero, processes the notify_on_release request,
> if appropriate.  However, once the count goes to zero, unless we
> are holding the global cpuset_sem semaphore, there is nothing to
> stop another task from immediately removing the cpuset entirely,
> and recycling its memory.
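
In other words, the final decrement and any notify_on_release handling
have to happen while holding cpuset_sem.  Roughly the shape of the fix
(a sketch, not the patch verbatim; check_for_release() here just stands
in for the release-notification path):

	void cpuset_exit(struct task_struct *tsk)
	{
		struct cpuset *cs = tsk->cpuset;

		tsk->cpuset = NULL;

		if (notify_on_release(cs)) {
			down(&cpuset_sem);	/* holds off rmdir freeing cs */
			if (atomic_dec_and_test(&cs->count))
				check_for_release(cs);	/* illustrative helper */
			up(&cpuset_sem);
		} else {
			atomic_dec(&cs->count);	/* no notify: cs not touched
						 * again after this */
		}
	}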


You also wrote:
> There may be a subtle problem with cpuset_fork()

Hmmm ... interesting.  I will think about this some more.

> The only (simple) solution I see is to do this:
> 
> 	lock();
> 	tsk->cpuset = current->cpuset;
> 	atomic_inc(&tsk->cpuset->count);
> 	unlock();

What "lock()" and "unlock()" is this?  Your "second lock",
aka "low-level lock (maybe even a spinlock)" ?

Separate question:

    When cpuset_excl_nodes_overlap() or cpuset_zone_allowed()
    callbacks are walking back up the cpuset hierarchy in the
    nearest_exclusive_ancestor() code, how do they handle the
    per-cpuset spinlocks?

    My assumption is that it would be like this.  Say my current task
    is attached to cpuset "/dev/cpuset/A/B/C", and it happens I will
    have to walk all the way to the top to find the exclusive ancestor
    that I am searching for.  I'd expect the spinlocks to get taken
    and released in the following order on these cpusets:

	    lock /A/B/C
	    lock /A/B
	    lock /A
	    lock /
	      ==> found what I was searching for
	    unlock /
	    unlock /A
	    unlock /A/B
	    unlock /A/B/C

    If at any point, perhaps in the creation of a cpuset, I had reason
    to want to get a child cpuset's spinlock while already holding the
    parent's, I would have to release the parent's lock and relock,
    starting with the child first and working upward.  The proper order
    to obtain these locks would always be bottom up.

    Does this resemble what you are suggesting, Roman?
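
To make the lock ordering above concrete, here is how I picture the
walk in code.  A sketch only, assuming the hypothetical per-cpuset
spinlock 'cs->lock'; the function name is made up, and the recursion is
only to make the unlock order explicit (an iterative version would have
to remember the path or re-walk from the top):

	/* Is 'node' allowed by the nearest mem_exclusive ancestor? */
	static int node_in_exclusive_ancestor(struct cpuset *cs, int node)
	{
		int allowed;

		spin_lock(&cs->lock);		/* locks taken bottom up */
		if (is_mem_exclusive(cs) || cs->parent == NULL)
			allowed = node_isset(node, cs->mems_allowed);
		else
			allowed = node_in_exclusive_ancestor(cs->parent, node);
		spin_unlock(&cs->lock);		/* released top down, on unwind */
		return allowed;
	}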

Your review of this is most appreciated.  Once again - thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <[email protected]> 1.925.600.0401
