Re: [PATCH] EDAC: core EDAC support code

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, 2006-03-10 at 13:13 -0800, Dave Peterson wrote:
> On Friday 10 March 2006 11:33, Arjan van de Ven wrote:
> > > Regarding the actual call to run it, I guess it depends on which of
> > > the following you prefer:
> > >
> > >     Scenario A
> > >     ----------
> > >     A more decentralized layout.  Here, the controls that govern the
> > >     error handling behavior for a given category of hardware (a
> > >     category might be "PCI devices" or "devices that use bus
> > >     technology XYZ") are grouped together with other stuff for that
> > >     category.
> >
> > this would basically make edac a place to report "help something went to
> > the gutter". Sure. I see that useful.
> >
> > In fact there are 3 layers then
> >
> > 1) low level "do check" function
> 
> I'm a bit unclear here.  Are you thinking of a single "do check"
> function that bus-specific code and chipset-specific code can hook
> into, or individual "do check" functions for various bus-specific and
> chipset-specific modules?

hmm ok so I want a function that takes a device as parameter, and checks
the state of that device for errors. Internally that probably has to go
via a function pointer somewhere to a device specific check method.

Or maybe a per test-type (pci parity / ECC / etc) check

int pci_check_parity_errors(struct pci_dev *dev, int flags);

something like that, or pci_check_and_clear_parity_errors()
(although that gets too long :)

drivers can call that, say, after firmware init or something to validate
their device is sanely connected. Maybe pci_enable_device() could call
it too.

This also needs a pci_suspend_parity_check() ... _resume_ ... so that
the driver can temporarily disable any checks, for example during device
reset/init. And then just before resume, it manually clears a check.




> 
> > 2) per bus code that calls the do check functions and whatever is needed
> > for bus checks
> >
> > 3) "EDAC" central command, which basically gathers all failure reports
> > and does something with them (push them to userspace or implement the
> > userspace chosen policy (panic/reboot/etc))
> 
> Are you suggesting something like the following?
> 
>     - The controls that determine how the error checking is done are
>       located within the various hardware subsystems.  For instance,
>       with PCI parity checking, this would include stuff like setting
>       the polling frequency and determining which devices to check.

yes. I would NOT make it overly tunable btw. For many things a sane
default can be chosen, and the effect of picking a different tuning
isn't all that big. Maybe think of it this way: a tuneable to a large
degree is an excuse for not determining the right value / heuristic in
the first place.



>     - When an error is actually detected, the subsystem that detected
>       the error (for instance, PCI) would feed the error information
>       to EDAC.  Then EDAC would determine how to respond to the error
>       (for instance, push it to userspace or implement the
>       userspace-chosen policy (panic/reboot/etc))

yup.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux