[PATCH 2.6.16] Shared interrupts sometimes lost

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,
 I've been having enormous fun just lately playing with my new Minitar
 wireless cards - based on the RaLink 2560 chip set.    I have a
 couple of PCI cards that seem to work well, and a PCMCIA that mostly
 works, but will lockup the computer under load, which isn't nice :-(

 This is using the "legacy" rt2500 driver derived from the vendor
 provided driver, rather than the newer rt2x00 which I cannot get to
 work at all.  These are all part of rt2400.sourceforge.net.

 Anyway, it turns out that my lockup problem is actually a problem in
 the Linux kernel rather than with the driver (though possibly aspects
 of the driver make the situation worse).  I have a patch, below,
 which fixes my problem for me.

 The problem is that after a while, the interrupts generated by the
 card stop being handled by Linux, so the ISR routine in the driver
 doesn't get called, and progress of all sorts stop.  The fact that
 this leads to a full system lockup is the fault of the driver, not
 Linux.  But Linux is definitely implicated in the loss of interrupts.
 I can work around the problem using "irqpoll", but that is a bit
 heavy handed, and shouldn't be needed.

 This is my first real introduction to the IRQ handling code in Linux,
 so please forgive any little errors.  I'm fairly sure the big picture
 is right, partly because the patch helps so much.

 To explain what I think is happening, let me start with a very simple
 case.  A number of PCI devices (this one included) have a number of
 events which can trigger an interrupt.  The events which are current
 are presented as bits in a register, and are ORed together (and
 possibly masked by another register) to make the IRQ line.
 When 1's are written to any bits in this register, it acknowledges
 the event and clears the bit.
 A typical code fragment is 
   events = read_register(INTERRUPTS);
   write_register(INTERRUPTS, events);
   ... handle each 1 bits in events ....

 This would normally clear all pending events and cause the interrupt
 line to go low (or at least to not be asserted).

 However there is room for a race here.  If an event occurs between
 the read and the write, then this will NOT de-assert the IRQ line.
 It will remain asserted throughout.

 Now if the IRQ is handled as an edge-triggered line (which I believe
 they are in Linux), then losing this race will mean that we don't see
 any more interrupts on this line.

 This is a race that would be fairly hard to hit, and is easily fixed
 by the driver.  After handling all of 'events', it should read the
 INTERRUPTS register again and see if there is anything more.  I tried
 this and it didn't fix my problem (though it might have made it
 happen less often, I'm not certain).

 In my particular case (and probably quite commonly with PCMCIA
 cards), the IRQ line is shared:

 10:      66178          XT-PIC  yenta, yenta, ohci_hcd:usb2, ohci_hcd:usb3, ehci_hcd:usb4, eth0

 So now the race is somewhat larger.
 If yenta-1 handles its interrupt events and then another event
 happens for yenta-1 before eth0 gets around to clearing its
 interrupts, the shared IRQ line will remain asserted after all IRQ
 handlers have been called, and so another edge will never be seen.

 This race - with a shared IRQ line - cannot be handled within any one
 driver.  There needs to be some sort of end-to-end check.

 In particular, the list of 'action' handlers needs to be called
 repeatedly until a full pass has been made through all, and none of
 them have reported that there was anything to do.  Only then can we
 be sure that the IRQ line has been de-asserted.

 The following patch does that, and "works for me".  It included a
 kernel-parameter which allows me to disable the fix so I can test
 that the fix really makes a difference.  It does.

 I am presenting this patch for comment at the moment rather than
 inclusion.  If it is generally acceptable, I'll remove the __setup
 stuff and put a proper change-log at the top etc.

 It might be appropriate to put a loop limit in so it doesn't more
 than 1000 times or something, just in case someone claims to always
 handle the interrupt.  Is that needed?

 Looking forward to your comments,
 NeilBrown


------------------------------
Signed-off-by: Neil Brown <[email protected]>

### Diffstat output
 ./kernel/irq/handle.c |   28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff ./kernel/irq/handle.c~current~ ./kernel/irq/handle.c
--- ./kernel/irq/handle.c~current~	2006-04-08 13:26:46.000000000 +1000
+++ ./kernel/irq/handle.c	2006-04-08 13:27:05.000000000 +1000
@@ -73,23 +73,47 @@ irqreturn_t no_action(int cpl, void *dev
 	return IRQ_NONE;
 }
 
+static int safeirq = 1;
+int __init unsafeirq_setup(char *str)
+{
+	safeirq = 0;
+	printk(KERN_INFO "IRQ safety-line removed - good luck\n");
+	return 1;
+}
+__setup("unsafeirq", unsafeirq_setup);
 /*
  * Have got an event to handle:
  */
 fastcall int handle_IRQ_event(unsigned int irq, struct pt_regs *regs,
-				struct irqaction *action)
+				struct irqaction *actionlist)
 {
 	int ret, retval = 0, status = 0;
+	struct irqaction *action = actionlist;
+	int repeat=0;
 
 	if (!(action->flags & SA_INTERRUPT))
 		local_irq_enable();
 
 	do {
 		ret = action->handler(irq, action->dev_id, regs);
-		if (ret == IRQ_HANDLED)
+		if (ret == IRQ_HANDLED) {
 			status |= action->flags;
+			repeat = 1;
+		}
 		retval |= ret;
 		action = action->next;
+		if (!action &&
+		    repeat &&
+		    safeirq &&
+		    (actionlist->flags & SA_SHIRQ)) {
+			/* at least one handler on the list did something,
+			 * and the interrupt is sharable, so give
+			 * every handler another chance, incase a new event
+			 * came in and is holding the irq line asserted.
+			 */
+			action = actionlist;
+			repeat = 0;
+		}
 	} while (action);
 
 	if (status & SA_SAMPLE_RANDOM)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux