Re: Device hang when offlining a CPU due to IRQ misrouting

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



"Darrick J. Wong" <[email protected]> writes:

> Hi there,
>
> I'm seeing a driver hang with 2.6.22-rc3 while being slightly stupid
> about offlining CPUs.  I suspect that this problem extends beyond a
> particular machine, as I've been able to replicate it with an IBM x3650
> and an IBM x3755.  This is what I'm doing:
>
> 1) I tie an IRQ to a particular CPU via /proc/irq/XXX/smp_affinity (IRQ
> 4341 is the network card and we're picking on CPU1 in this example):
> echo 2 > /proc/irq/4341/smp_affinity
>
> 2) I then take CPU1 offline:
> echo 0 > /sys/devices/system/cpu/cpu1/online
>
> 3) The kernel prints this:
> [ 1101.968040] Breaking affinity for irq 4341
> [ 1102.074019] CPU 1 is now offline
> [ 1102.081593] lockdep: not fixing up alternatives.
> [ 1112.886919] nfs: server 9.47.66.169 not responding, still trying
>
> After step 2 the system never sees interrupts from the network card and
> remains hung like that until CPU1 is brought back up.  It looks as
> though the kernel is trying to reroute the IRQ (or so I'm assuming from
> the "Breaking affinity" message), but this doesn't ever happen, so the
> the kernel stops seeing interrupts from the device.
>
> Granted, one should not be offlining the CPU that is currently
> designated to handle an IRQ, but I suspect that the kernel ought at a
> minimum to reject the offlining or route the IRQ to any online CPU
> instead of screwing things up.

I agree.

> There exists a similar scenario.  Set the IRQ affinity to a bunch of
> CPUs, watch /proc/interrupts to see which CPU is actually servicing the
> interrupts, then offline that CPU.  The kernel does not reroute the IRQ
> to any of the other CPUs and the device also hangs.
>
> The furthest that I've dug is that it works on 2.6.17 and is broken in
> 2.6.22-rc3 and 2.6.21.  Will git-bisect further, but I wanted to know if
> anyone else has seen this sort of problem.  afaik, this seems to happen
> with both IOAPIC and MSI interrupts, possibly more.

Thanks for the bug report.  I'm chuckling because I just submitted a
patch to count that whole code path as broken, based on code review.
It is trying to do something that the hardware can not reliably
accomplish.

Now I am surprised you were seeing this with MSI as well because
the hardware should theoretically work in that case.  However the
irq_fixup code has enough issues that I wouldn't be surprised if
it was just doing something stupid and wrong.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux