with 2.6.15 kernel running on a Tyan S4881 quad processor board (with factory BIOS)
using Opterons 254s, I received the following MCEs:
CPU 18: Machine Check Exception: 4 Bank 0: b601a00000000833
TSC 9c3799943459 ADDR 4eee07800
CPU 18: Machine Check Exception: 4 Bank 2: d000400000000863
TSC 9c3799943d01
CPU 18: Machine Check Exception: 4 Bank 4: d42dc00100000813
TSC 9c379994422d ADDR 4eee05708
It was later determined to be a bad memory stick, but the problem was
'CPU 18'. Running the same hardware with 2.6.17-rc6 produced MCEs with:
'CPU 2' messages instead
as the output. Thought problem fixed, BUT.....
looking at 2.6.17-rc6 safe_smp_processor_id() in arch/x86_64/kernel/smp.c (This
function is called by the MCE handler code):
int safe_smp_processor_id(void)
{
int apicid, i;
if (disable_apic)
return 0;
apicid = hard_smp_processor_id();
-----> if (x86_cpu_to_apicid[apicid] == apicid)
return apicid;
for (i = 0; i < NR_CPUS; ++i) {
if (x86_cpu_to_apicid[i] == apicid)
return i;
}
/* No entries in x86_cpu_to_apicid? Either no MPS|ACPI,
* or called too early. Either way, we must be CPU 0. */
if (x86_cpu_to_apicid[0] == BAD_APICID)
return 0;
return 0; /* Should not happen */
}
I noticed the: if (x86_cpu_to_apicid[apicid] == apicid)
above.
NR_CPUS was 4 and apicid could be: 16, 17 18, or 19
definitely an out-of-bounds reference.
doug thompson
portion of boot.mesg follows:
SRAT: PXM 0 -> APIC 16 -> Node 0
SRAT: PXM 1 -> APIC 17 -> Node 1
SRAT: PXM 2 -> APIC 18 -> Node 2
SRAT: PXM 3 -> APIC 19 -> Node 3
SRAT: Node 0 PXM 0 0-a0000
SRAT: Node 0 PXM 0 0-d0000000
SRAT: Node 0 PXM 0 0-230000000
SRAT: Node 1 PXM 1 230000000-430000000
SRAT: Node 2 PXM 2 430000000-630000000
SRAT: Node 3 PXM 3 630000000-830000000
NUMA: Using 28 for the hash shift.
Bootmem setup node 0 0000000000000000-0000000230000000
Bootmem setup node 1 0000000230000000-0000000430000000
Bootmem setup node 2 0000000430000000-0000000630000000
Bootmem setup node 3 0000000630000000-0000000830000000
On node 0 totalpages: 2063996
DMA zone: 2596 pages, LIFO batch:0
DMA32 zone: 833240 pages, LIFO batch:31
Normal zone: 1228160 pages, LIFO batch:31
On node 1 totalpages: 2068480
Normal zone: 2068480 pages, LIFO batch:31
On node 2 totalpages: 2068480
Normal zone: 2068480 pages, LIFO batch:31
On node 3 totalpages: 2068480
Normal zone: 2068480 pages, LIFO batch:31
Nvidia board detected. Ignoring ACPI timer override.
ACPI: PM-Timer IO Port: 0x8008
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x10] enabled)
Processor #16 15:5 APIC version 16
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x11] enabled)
Processor #17 15:5 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x12] enabled)
Processor #18 15:5 APIC version 16
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x13] enabled)
Processor #19 15:5 APIC version 16
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[Index of Archives]
[Kernel Newbies]
[Netfilter]
[Bugtraq]
[Photo]
[Stuff]
[Gimp]
[Yosemite News]
[MIPS Linux]
[ARM Linux]
[Linux Security]
[Linux RAID]
[Video 4 Linux]
[Linux for the blind]
[Linux Resources]