Re: NMI problems with Dell SMP Xeons

Hi,

On 6/7/06, Keith Owens <[email protected]> wrote:
> This problem is not KDB specific, although that is where it was first
> noticed.  Any code that sends a broadcast IPI 2 or an NMI IPI will
> crash these Dell boxes when there is a mismatch between the cpus known
> to the BIOS and the cpus known to the OS.

I missed this (broadcast IPI 2 causing problems)...

Before the OS starts all AP CPUs should be in a special halt state
(specifically "cli; hlt"), where the only interrupt sources that reach
the CPU are NMI, SMI and INIT. This is one of several reasons why
broadcast NMI is unsafe (not just on Dells).

If the Dell machines allow an IPI 2 to get through then unused CPUs
aren't in the "cli; hlt" state, and they would also respond to any
other broadcast IPIs. This would imply that any code that uses
send_IPI_allbutself(), send_IPI_all() or anything else that sends
broadcast IPIs would/could cause problems on these Dell machines.
Depending on whether Linux leaves the real mode IVT intact or not and
which interrupt vector/s are used, the result could be nothing at all,
erratic behaviour, triple faults/resets, or executing unintended BIOS
functions (which could also trash memory or in some cases cause device
failure).

The easiest solution here would be for Dell to release an updated BIOS
that fixes this bug (i.e. puts the AP CPUs into a "cli; hlt" state, as
per Intel's notes in section 7.5.4.2 "Typical AP Initialization
Sequence" of the system programming guide). This still won't/shouldn't
fix the "broadcast NMI" problem though.

> I will try forcing send_IPI_allbutself() to use the mask version rather
> than the broadcast shortcut.  Later tonight ...

I've been looking into this a little - it appears that Linux tries to
use one bit per CPU in the local APIC's "logical destination register"
and that in all cases at least one bit is set for each valid CPU. As
far as I can tell sending an NMI (or any other broadcast IPI) in
logical mode with no shorthand and "destination = 0xFF" should work
fine for both "cluster mode" and "flat mode". In this case I'd also
suggest that "clear_local_APIC()" clear the destination format
register (DFR) and the logical destination register (LDR), so that a
CPU that has been taken offline no longer receives broadcast NMIs.
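
To make the above concrete, here is a rough user-space sketch of the two
ICR encodings being compared, plus the flat mode acceptance test as I
understand it from the Intel manual. The bit positions follow the xAPIC
ICR layout; the macro and function names are made up for illustration
and are not the kernel's, and the match check is a simplification of
what the hardware does.

```c
#include <stdbool.h>
#include <stdint.h>

/* ICR low-word fields, xAPIC layout (names are illustrative only). */
#define ICR_DM_NMI        0x00000400u  /* delivery mode = NMI (bits 8-10 = 100b) */
#define ICR_DEST_LOGICAL  0x00000800u  /* destination mode = logical (bit 11) */
#define ICR_LEVEL_ASSERT  0x00004000u  /* level = assert (bit 14) */
#define ICR_SH_NONE       0x00000000u  /* shorthand: none (bits 18-19 = 00b) */
#define ICR_SH_ALLBUT     0x000C0000u  /* shorthand: all excluding self (11b) */

/* Hypothetical helper: ICR low word for an NMI with the given shorthand.
 * The vector field is ignored by the hardware for NMI delivery. */
static uint32_t icr_low_nmi(uint32_t shorthand)
{
    return shorthand | ICR_DEST_LOGICAL | ICR_LEVEL_ASSERT | ICR_DM_NMI;
}

/* ICR high word: the 8-bit destination lives in bits 24-31. */
static uint32_t icr_high(uint8_t dest)
{
    return (uint32_t)dest << 24;
}

/* Flat mode: a local APIC accepts a logical-mode message when the
 * message destination ANDed with its logical ID (LDR bits 24-31)
 * is non-zero. An LDR of zero therefore never matches. */
static bool flat_mode_accepts(uint8_t mda, uint32_t ldr)
{
    uint8_t logical_id = (uint8_t)(ldr >> 24);
    return (mda & logical_id) != 0;
}
```

With no shorthand the sender would write icr_high(0xFF) followed by
icr_low_nmi(ICR_SH_NONE). A CPU whose LDR has been cleared fails the
flat_mode_accepts() test, which is the point of clearing it in
"clear_local_APIC()".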

The "send_IPI_mask_sequence()" function also seems like a perfectly
valid option (except for the comments, which are easy enough to
change). IMHO it's just not as pretty, especially if it's to be used
for all broadcast IPIs (rather than just for broadcast NMIs) - I'd be
tempted to do an "#ifdef BORKED_DELL" around it in that case. :-)
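
For comparison, the "one IPI per CPU" idea boils down to something like
the sketch below. This is a user-space illustration only, not the
kernel's send_IPI_mask_sequence(): the 64-bit mask, the callback type
and record_ipi() are all assumptions of mine, standing in for the
cpumask and the real APIC write.

```c
#include <stdint.h>

#define NMI_VECTOR 2

/* Hypothetical callback that delivers one unicast IPI to one CPU. */
typedef void (*send_ipi_fn)(unsigned int cpu, int vector, void *ctx);

/* Walk the mask and send a separate IPI to each CPU whose bit is set.
 * The caller is expected to clear its own bit first if it wants
 * "all but self" behaviour. Returns the number of IPIs sent. */
static unsigned int send_ipi_mask_sequence_sketch(uint64_t mask, int vector,
                                                  send_ipi_fn send_one,
                                                  void *ctx)
{
    unsigned int sent = 0;

    for (unsigned int cpu = 0; cpu < 64; cpu++) {
        if (!(mask & (UINT64_C(1) << cpu)))
            continue;           /* CPU not in the mask: never touched */
        send_one(cpu, vector, ctx);
        sent++;
    }
    return sent;
}

/* Stand-in for the APIC write: records which CPUs were targeted. */
static void record_ipi(unsigned int cpu, int vector, void *ctx)
{
    (void)vector;
    *(uint64_t *)ctx |= UINT64_C(1) << cpu;
}
```

The difference from the shorthand is that CPUs the OS does not know
about are never addressed, at the cost of one ICR write per target.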


Cheers,

Brendan