Trond Myklebust wrote:
on den 01.06.2005 Klokka 16:59 (-0400) skreiv john cooper:
Yes later versions of the patch do. The version at hand
40-04 is based on 2.6.11. We intend to sync-up with a
more recent version of the RT patch pending resolution
of this issue.
Well it is pointless to concentrate on an obviously buggy patch. Could
you please sync up to rc5-rt-V0.7.47-15 at least: that looks like it
might be working (or at least be close to working).
I fully share your frustration of wanting to "use the
latest patch -- dammit". However there are other practical
constraints coming into play. This tree has accumulated a
substantial amount of fixes for scheduler violation assertions
along with associated testing and has faired well thus far.
The bug under discussion here is the last major operational
problem found in the associated testing process. Arriving
at this point also required development of target specific
driver/board code so a resync to a later version is not a
trivial operation. However it would be justifiable in the
case of encountering at an impasse with the current tree.
Could you then apply the following debugging patch? It should warn you
in case something happens to corrupt base->running_timer (something
which would screw up del_timer_sync()). I'm not sure that can happen,
but it might be worth checking.
Yes, thanks. Though the event trace does not suggest a
reentrance in __run_timer() but rather a preemption of it
during the call to rpc_run_timer() by a high priority
application task in the midst of an RPC. The preempting
task requeues the timer in the cascade at the tail of
xprt_transmit(). rpc_run_timer() upon resuming execution
unconditionally clears the RPC_TASK_HAS_TIMER flag. This
creates the inconsistent state.
No explicit deletion attempt of the timer (synchronous or
otherwise) is coming into play in the failure scenario as
witnessed by the event trace. Rather it is the implicit
dequeue of the timer from the cascade in __run_timer() and
attempt to track ownership of it in rpc_run_timer() via
RPC_TASK_HAS_TIMER which is undermined in the case of
preemption.
From earlier mail:
> There should be no instances of RPC entering call_transmit() or any
> other tk_action callback with a pending timer.
My description wasn't clear. The timeout isn't pending
before call_transmit(). Rather the RPC appears to be
blocked elsewhere and upon wakeup via __run_timer()/xprt_timer()
preempts ksoftirqd and does the __rpc_sleep_on()/__mod_timer()
at the very tail of xprt_transmit().
I have work-arounds which detect the preemption of
rpc_run_timer() and correct the timer state. But I
don't suggest they are general solutions or a
permanent fix.
In any case I will ascertain whether or not the problem here
still exists as soon as possible upon resync with a current
tree.
-john
--
[email protected]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[Index of Archives]
[Kernel Newbies]
[Netfilter]
[Bugtraq]
[Photo]
[Stuff]
[Gimp]
[Yosemite News]
[MIPS Linux]
[ARM Linux]
[Linux Security]
[Linux RAID]
[Video 4 Linux]
[Linux for the blind]
[Linux Resources]