Re: Problems with EDAC coexisting with BIOS

> On Wed, 2006-05-03, Alan Cox wrote:
> > something with NMI-signalled errors, I was wondering what the problems
> > with using NMI-signalled ECC errors were?
>
> The big problem with NMI is that it can occur *during* a PCI
> configuration sequence (ie during pci_config_* functions). That
> means we can't safely do some I/O, especially configuration space
> I/O in an NMI handler. At best we could set a flag and catch it
> afterwards.

This is roughly what I did in the NMI handling code I wrote for
bluesmoke.  If I remember correctly, there's a spinlock that
pci_read_config_dword() and friends acquire.  Basically I did
something like the following:

    /* Using spin_trylock() below avoids deadlock in the case where
     * the code interrupted by the NMI is holding the lock.
     */
    if (likely(spin_trylock(&lock))) {
            /* We got the lock.  Go ahead and access PCI config space
             * (and then drop the lock)...
             */
    } else {
            /* This case should be rare in practice.  Defer the access to
             * PCI config space outside NMI context (I wrote a little API
             * that facilitates doing this kind of stuff in a manner that
             * avoids deadlocks and race conditions associated with NMI
             * handlers).
             */
    }
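
To make the pattern above concrete, here is a minimal user-space
sketch in C11.  It is not the bluesmoke code and not a kernel API:
every name in it (cfg_locked, cfg_trylock(), nmi_handler(),
run_deferred(), checks_done) is made up for illustration, the lock is
a plain atomic flag standing in for the real config-space spinlock,
and the "NMI" is just an ordinary function call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* All names below are hypothetical stand-ins for illustration only. */
static atomic_bool cfg_locked;      /* models the PCI config-space lock */
static atomic_bool check_deferred;  /* set when the NMI could not take the lock */
static int checks_done;             /* counts simulated config-space accesses */

/* Try to take the lock without spinning; true on success. */
static bool cfg_trylock(void)
{
    return !atomic_exchange(&cfg_locked, true);
}

static void cfg_unlock(void)
{
    atomic_store(&cfg_locked, false);
}

/* Stand-in for reading error status from PCI config space. */
static void check_hw_errors_locked(void)
{
    assert(atomic_load(&cfg_locked));  /* caller must hold the lock */
    checks_done++;
}

/* Simulated NMI handler.  Returns true if the check ran immediately. */
static bool nmi_handler(void)
{
    if (cfg_trylock()) {
        check_hw_errors_locked();
        cfg_unlock();
        return true;
    }
    /* The interrupted code holds the lock: never spin here, just
     * record that a check is owed and return. */
    atomic_store(&check_deferred, true);
    return false;
}

/* Called later from normal (non-NMI) context, where spinning is safe.
 * Returns true if a deferred check was pending and has now run. */
static bool run_deferred(void)
{
    if (!atomic_exchange(&check_deferred, false))
        return false;
    while (!cfg_trylock())
        ;  /* spin until the lock is free */
    check_hw_errors_locked();
    cfg_unlock();
    return true;
}
```

The point of the sketch is only the control flow: the NMI path never
waits on the lock, and the deferred path consumes the flag exactly
once.  A real implementation also has to decide who calls the
deferred path and when, which this sketch leaves out.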

For anyone who is interested, the code may be obtained by going to
http://sourceforge.net/projects/bluesmoke/ and downloading the latest
version of the 'bluesmoke' package.  It's experimental stuff that
hasn't seen much testing.  Also it's not quite functional as-is
because a little piece of code still needs to be added somewhere that
enables SERR#.

I'm not necessarily advocating that NMI-driven error handling should
go into the mainstream kernel.  The expected benefits would have to
be weighed against the extra complexity that the code introduces.
However, the parts of the code that handle the basic NMI-related
synchronization issues are abstracted into a relatively clean
architecture-independent API that may be useful in other places where
NMIs (or similar types of exceptions/interrupts on other platforms)
are used.  I posted this code to LKML a while ago but it has since
been improved.  Perhaps having this type of code (or something
similar) in just one location would be an improvement over reinventing
the wheel in a number of places.

There is an issue regarding NMI-driven error handling that may be a
substantial pain to deal with on x86: When an NMI occurs, we can't be
sure whether it came from the watchdog or from a hardware error.
Therefore we must check the hardware for errors on each NMI.  This is
no better than polling at whatever frequency the watchdog runs.
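
As a toy illustration of that cost (a sketch under my own
assumptions, with made-up names such as hw_error_pending and
nmi_entry(), not real kernel code): since the handler cannot tell
the two NMI sources apart, it has to probe the error status on every
entry, so the number of hardware checks equals the number of NMIs
rather than the number of errors.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one flag stands in for the hardware error status. */
static bool hw_error_pending;
static int hw_checks;       /* how many times the hardware was probed */
static int errors_handled;  /* how many real errors were found */

/* Probe and clear the simulated error status. */
static bool hw_error_check_and_clear(void)
{
    bool pending = hw_error_pending;

    hw_checks++;
    hw_error_pending = false;
    return pending;
}

/* Simulated NMI entry point: the source is unknown, so always probe. */
static void nmi_entry(void)
{
    if (hw_error_check_and_clear())
        errors_handled++;
    /* Otherwise assume this was a watchdog tick. */

    assert(!hw_error_pending);  /* the probe above always clears the flag */
}
```

With, say, three watchdog ticks and one genuine error, this model
probes the hardware four times to find one error.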

The above problem can be worked around as follows (although I'm
not advocating that this be done in Linux): Modify local_irq_disable()
and local_irq_enable() so that instead of using cli/sti machine
instructions they adjust the interrupt priority level (controlled by
the local APIC on x86 processors) as follows:

    local_irq_disable()
    {
            /* Set interrupt priority level to (MAX_PRIORITY - 1).
             * In other words, all interrupts are masked out except
             * those whose priority is MAX_PRIORITY.
             */
    }

    local_irq_enable()
    {
            /* Set interrupt priority level to 0 (i.e. all interrupts
             * are enabled).
             */
    }

Then the watchdog may be implemented as a normal interrupt with
priority MAX_PRIORITY, and all other interrupts may be given lower
priorities.  NMI would then only be asserted for genuine hardware
errors (PCI parity errors, ECC memory errors, etc.).  This is
probably more trouble than it's worth.  However I think it's doable
in principle.
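
For what it's worth, the masking rule can be modelled in a few lines
of C.  This is only a toy model under my own assumptions, not real
local APIC programming: MAX_PRIORITY is given an arbitrary value of
15, device interrupts are assumed to use priorities 1 through
MAX_PRIORITY, an interrupt is assumed to be delivered only when its
priority is strictly above the current threshold, and the sketch_*
and irq_deliverable() names are invented here.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PRIORITY 15  /* assumed value, chosen only for the model */

static_assert(MAX_PRIORITY >= 2, "need room for at least one maskable level");

/* Models the per-CPU priority threshold (hypothetical variable). */
static unsigned int priority_threshold;

/* "Disable interrupts": mask everything below MAX_PRIORITY. */
static void sketch_irq_disable(void)
{
    priority_threshold = MAX_PRIORITY - 1;
}

/* "Enable interrupts": drop the threshold so every level gets through. */
static void sketch_irq_enable(void)
{
    priority_threshold = 0;
}

/* Assumed delivery rule: strictly above the threshold is delivered. */
static bool irq_deliverable(unsigned int prio)
{
    return prio > priority_threshold;
}
```

In this model a watchdog at MAX_PRIORITY still gets through while
"interrupts are disabled", and everything else is held off, which is
the property the scheme relies on.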



