Re: nmi_watchdog=2 regression in 2.6.21

On Tue, 2007-08-28 at 10:05 -0700, Stephane Eranian wrote:
> Daniel,
> 
> On Tue, Aug 28, 2007 at 07:34:44AM -0700, Daniel Walker wrote:
> > On Tue, 2007-08-28 at 02:12 -0700, Stephane Eranian wrote:
> > > Daniel,
> > > 
> > > On Mon, Aug 27, 2007 at 04:07:54PM -0700, Daniel Walker wrote:
> > > > On Mon, 2007-08-27 at 15:55 -0700, Stephane Eranian wrote:
> > > > 
> > > > > Yet the model name looks strange. So we need to run one more test,
> > > > > as the fam/model is not enough. What we need to check is whether or
> > > > > not this processor implements architectural perfmon or not.
> > > > > 
> > > > > Could you please compile and run the attached program and send me 
> > > > > the output?
> > > > 
> > > > The output below is all the output ..
> > > > 
> > > > eax=0x7280201: version=1  num_cnt=2
> > > > 
> > > Then you have a Core Duo processor and the commit from Bjorn should
> > > fix the problem. If it does not, then there is something else wrong.
> > > Unfortunately, I do not have a Core Duo machine to try and reproduce.
> > 
> > There must be something else wrong, cause the problem persists .. As I
> > said in past emails to Bjorn, I tested his commit in git, as well as the
> > latest git all with the same issue (as well as bisecting git)..
> > 
> > If the hardware is buggy then we need some way to determine that..
> > 
> Could you instrument check_nmi_watchdog() to verify that you terminate
> this function? Normally there is a safety mechanism in there.
> 
> Another  possibility is that you get flooded with NMI interrupts and
> do not make forward progress.

Here is some output from my boot log,

md: raid1 personality registered for level 1
device-mapper: ioctl: 4.11.0-ioctl (2006-10-12) initialised: [email protected]
ip_tables: (C) 2000-2006 Netfilter Core Team
TCP cubic registered
NET: Registered protocol family 1
NET: Registered protocol family 17
Testing NMI watchdog ... CPU#0: NMI appears to be stuck (0->0)!
CPU#1: NMI appears to be stuck (0->0)!
CPU#2: NMI appears to be stuck (0->0)!
CPU#3: NMI appears to be stuck (0->0)!
Starting balanced_irq
<hangs>

So it does appear to make it out of the check_nmi_watchdog() function
even though there is a problem with the watchdog. "Starting balanced_irq"
is in balanced_irq_init(). It usually hangs at this spot, but with
initcall_debug on it once hung a little further down.

It seems like there might be some other interrupt going off too early
that causes the system to hang.

As you can see from the log above, this system is a quad: two physical
CPUs, each with two cores.

Daniel
