Problem while stopping many threads within a module

Hello -

I'm having a strange thread scheduling issue in a project that I'm
working on.  We have a module, with an interface that can be called by
many (currently 50) threads simulatenously.  Threads that have entered
the kernel, sleep on a wait queue until everyone else has entered.  At
this point, a "master" process wakes up the first thread, which does
some work, then wakes up the second, etc.  After waking up its
successor, each thread changes its state to STOPPED and sends itself a
SIGSTOP.  Note that the threads are created with
CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND but NOT CLONE_THREAD so
there is no group stop.

Basically, the structure is the following:
kernel_entry_point() {
        wait until its your turn
        ...... do some work .... (serialized)
        wake up the next thread
        send SIGSTOP to yourself
}

At the same time, a monitoring process polls until all the threads
have stopped themselves:
monitor() {
repeat:
        for each thread
               if (thread->state < TASK_STOPPED)
                       yield()
                       goto repeat
}

Now, here's the problem.  On 2.6.9 UP (Preempt), it is often the case
that one thread gets "stuck" in between the wake up of the next thread
and stopping itself -- this causes the monitor to poll for extended
periods of time, since the thread remains RUNNING.  Strangely enough,
it generally gets unstuck by itself, sometimes within 10 seconds,
sometimes after as long as 10 minutes.  When peeking at the kernel
stack of the offending process via the monitor, I only see that it is
in schedule and the stack looks like this:

       c55e7ad0 00000086 c55e6000 c55e7a94 00000046 c55e6000 c55e7ad0 c0109c2d 
       00000000 c03ddae0 00000001 fd0b6c12 0013bc9f c6502130 001770fe fd478e5c 
       0013bc9f c55d546c c05d3960 00002710 c05d3960 c55e6000 c0106f25 c05d3960 
Call Trace:
[<c0106f25>] need_resched+0x27/0x32

It also continues to be charged ticks, indicating that its being
scheduled but is making no progress?  However, I can't find anything
that this thread could be spinning on.  Also, I don't understand why
there is no further context on the stack -- the thread does eventually
finish and never leaves the kernel, so the stack shouldn't be
corrupted...  How can it finish if it has nowhere to return?

I realize that this is a long shot, but if anyone has any ideas, I'd
appreciate hearing them.  Please let me know if I can provide any
further information.

Thanks,
-Yuly
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

References:
- Scheduler: SIGSTOP on multi threaded processes
  - From: Olivier Croquette <[email protected]>
- Re: Scheduler: SIGSTOP on multi threaded processes
  - From: Daniel Jacobowitz <[email protected]>
- Re: Scheduler: SIGSTOP on multi threaded processes
  - From: "Richard B. Johnson" <[email protected]>
- Re: Scheduler: SIGSTOP on multi threaded processes
  - From: "Richard B. Johnson" <[email protected]>
- Re: Scheduler: SIGSTOP on multi threaded processes
  - From: "Miquel van Smoorenburg" <[email protected]>

Prev by Date: [PATCH 2/12] UML - Remove include/asm-um/elf.h
Next by Date: Re: [PATCH] ppc32: Fix POWER3/POWER4 compiler error
Previous by thread: Re: Scheduler: SIGSTOP on multi threaded processes
Next by thread: Re: Scheduler: SIGSTOP on multi threaded processes
Index(es):
- Date
- Thread

[Index of Archives] [Kernel Newbies] [Netfilter] [Bugtraq] [Photo] [Stuff] [Gimp] [Yosemite News] [MIPS Linux] [ARM Linux] [Linux Security] [Linux RAID] [Video 4 Linux] [Linux for the blind] [Linux Resources]