On x86_64 kernel, level triggered irq migration gets initiated in the context
of that interrupt(after executing the irq handler) and following steps are
followed to do the irq migration.
1. mask IOAPIC RTE entry; // write to IOAPIC RTE
2. EOI; // processor EOI write
3. reprogram IOAPIC RTE entry // write to IOAPIC RTE with new destination and
// and interrupt vector due to per cpu vector
// allocation.
4. unmask IOAPIC RTE entry; // write to IOAPIC RTE
Because of the per cpu vector allocation in x86_64 kernels, when the irq
migrates to a different cpu, new vector(corresponding to the new cpu) will
get allocated.
An EOI write to local APIC has a side effect of generating an EOI write
for level trigger interrupts (normally this is a broadcast to all IOAPICs).
The EOI broadcast generated as a side effect of EOI write to processor may
be delayed while the other IOAPIC writes (step 3 and 4) can go through.
Normally, the EOI generated by local APIC for level trigger interrupt
contains vector number. The IOAPIC will take this vector number and
search the IOAPIC RTE entries for an entry with matching vector number and
clear the remote IRR bit (indicate EOI). However, if the vector number is
changed (as in step 3) the IOAPIC will not find the RTE entry when the EOI
is received later. This will cause the remote IRR to get stuck causing the
interrupt hang (no more interrupt from this RTE).
Current x86_64 kernel assumes that remote IRR bit is cleared by the time
IOAPIC RTE is reprogrammed. Fix this assumption by checking for remote IRR
bit and if it still set, delay the irq migration to the next interrupt
arrival event(hopefully, next time remote IRR bit will get cleared
before the IOAPIC RTE is reprogrammed).
Initial analysis and patch from Nanhai.
Clean up patch from Suresh.
Rewritten to be less intrusive, and to contain a big fat comment by Eric.
Cc: Nanhai Zou <[email protected]>
Cc: Suresh Siddha <[email protected]>
Cc: Asit Mallick <[email protected]>
Cc: Keith Packard <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/kernel/io_apic.c | 57 ++++++++++++++++++++++++++++++++++++++++-
1 files changed, 55 insertions(+), 2 deletions(-)
diff --git a/arch/x86_64/kernel/io_apic.c b/arch/x86_64/kernel/io_apic.c
index d8bfe31..2d52aed 100644
--- a/arch/x86_64/kernel/io_apic.c
+++ b/arch/x86_64/kernel/io_apic.c
@@ -152,6 +152,31 @@ static inline void io_apic_modify(unsigned int apic, unsigned int value)
writel(value, &io_apic->data);
}
+static int io_apic_level_ack_pending(unsigned int irq)
+{
+ struct irq_pin_list *entry;
+ unsigned long flags;
+ int pending = 0;
+ spin_lock_irqsave(&ioapic_lock, flags);
+ entry = irq_2_pin + irq;
+ for (;;) {
+ unsigned int reg;
+ int pin;
+ pin = entry->pin;
+ if (pin == -1)
+ break;
+ reg = io_apic_read(entry->apic, 0x10 + pin*2);
+ /* Is the remote IRR bit set? */
+ pending |= (reg >> 14) & 1;
+ if (!entry->next)
+ break;
+ entry = irq_2_pin + entry->next;
+ }
+ spin_unlock_irqrestore(&ioapic_lock, flags);
+ return pending;
+}
+
+
/*
* Synchronize the IO-APIC and the CPU by doing
* a dummy read from the IO-APIC
@@ -1418,9 +1443,37 @@ static void ack_apic_level(unsigned int irq)
ack_APIC_irq();
/* Now we can move and renable the irq */
- move_masked_irq(irq);
- if (unlikely(do_unmask_irq))
+ if (unlikely(do_unmask_irq)) {
+ /* Only migrate the irq if the ack has been received.
+ *
+ * On rare occaions the broadcast level triggered ack gets
+ * delayed going to ioapics, and if we reprogram the
+ * vector while Remote IRR is still the irq will never
+ * fire again.
+ *
+ * To prevent this scenario we read the Remote IRR bit
+ * of the ioapic. This has two affects.
+ * - On any sane system the read of the ioapic will
+ * flush writes (and acks) going to the ioapic from
+ * this cpu.
+ * - We get to see if the ACK has actually been delivered.
+ *
+ * Based on failed experiments of reprogramming the
+ * ioapic entry from outside of irq context starting
+ * with masking the ioapic entry and then polling until
+ * Remote IRR was clear before reprogramming the
+ * ioapic I don't trust the Remote IRR bit to be
+ * completey accurate.
+ *
+ * However there appears to be no other way to plug
+ * this race, so if the Remote IRR bit is not
+ * accurate and causing problems it is a hardware bug
+ * and you can go talk chipset vendor about it.
+ */
+ if (!io_apic_level_ack_pending(irq))
+ move_masked_irq(irq);
unmask_IO_APIC_irq(irq);
+ }
}
static struct irq_chip ioapic_chip __read_mostly = {
--
1.5.1.1.181.g2de0
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[Index of Archives]
[Kernel Newbies]
[Netfilter]
[Bugtraq]
[Photo]
[Stuff]
[Gimp]
[Yosemite News]
[MIPS Linux]
[ARM Linux]
[Linux Security]
[Linux RAID]
[Video 4 Linux]
[Linux for the blind]
[Linux Resources]