Re: [feature] automatically detect hung TASK_UNINTERRUPTIBLE tasks

On Sun, Dec 02, 2007 at 10:10:27PM +0100, Ingo Molnar wrote:
> what if you considered - just for a minute - the possibility of this 
> debug tool being the thing that actually animates developers to fix such 
> long delay bugs that have bothered users for almost a decade meanwhile?

Throwing frequent debugging messages for non-buggy cases will
just lead people to ignore softlockup reports in general.

I don't think runtime instrumentation is the way to introduce
TASK_KILLABLE in general. The only way to get there is for people to
go through the source and identify the places where it makes sense.

> 
> Until now users had little direct recourse to get such problems fixed. 
> (we had sysrq-t, but that included no real metric of how long a task was 

Actually task delay accounting can measure this now.  iirc someone
had a latencytop based on it already.
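
For illustration, a minimal sketch of reading one such figure from user
space. It assumes a kernel built with task delay accounting and relies on
field 42 of /proc/<pid>/stat (delayacct_blkio_ticks, as documented in
proc(5)), which only covers block I/O delay, not every kind of
uninterruptible sleep. The function names are made up for this example,
and the value may simply read 0 where delay accounting is disabled.

```python
import os


def blkio_delay_ticks(stat_line: str) -> int:
    """Return field 42 (delayacct_blkio_ticks) from a /proc/<pid>/stat line."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split after the LAST ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field N lives at index N - 3.
    return int(rest[42 - 3])


def blkio_delay_seconds(pid: int) -> float:
    """Aggregated block I/O delay of a task, converted to seconds."""
    with open(f"/proc/{pid}/stat") as f:
        ticks = blkio_delay_ticks(f.read())
    # The field is expressed in clock ticks; SC_CLK_TCK gives ticks per second.
    return ticks / os.sysconf("SC_CLK_TCK")


if __name__ == "__main__":
    print(f"self: {blkio_delay_seconds(os.getpid()):.2f}s blocked on block I/O")
```

Sampling this twice and taking the difference gives the time a task spent
waiting on block I/O in between, without any backtrace in the kernel log.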

> blocked, so there was no direct link in the typical case and users had 
> no real reliable tool to express their frustration about unreasonable 
> delays.)
> 
> Now this changes: they get a "smoking gun" backtrace reported by the 
> kernel, and blamed on exactly the place that caused that unreasonable 
> delay. And it's not like the kernel breaks - at most 10 such messages 
> are reported per bootup.
> 
> We increase the delay timeout to say 300 seconds, and if the system is 
> under extremely high IO load then 120+ might be a reasonable delay, so 
> it's all tunable and runtime disable-able anyway. So if you _know_ that 
> you will see and tolerate such long delays, you can tweak it - but i can 

This means the user has to see their kernel log fill up with such
messages at least once, do a round trip to some mailing list to
learn that it is expected and not a kernel bug, and then tweak
some obscure parameters. That doesn't seem like a particularly
fruitful procedure to me.

> tell you with 100% certainty that 99.9% of the typical Linux users do 
> not characterize such long delays as "correct behavior".

It's about robustness, not the typical case. A system that throws
backtraces whenever something slightly unusual happens is not robust.

-Andi