Re: [PATCH] EDAC: core EDAC support code

On Fri, 2006-03-10 at 13:13 -0800, Dave Peterson wrote:
> On Friday 10 March 2006 11:33, Arjan van de Ven wrote:
> > > Regarding the actual call to run it, I guess it depends on which of
> > > the following you prefer:
> > >
> > >     Scenario A
> > >     ----------
> > >     A more decentralized layout.  Here, the controls that govern the
> > >     error handling behavior for a given category of hardware (a
> > >     category might be "PCI devices" or "devices that use bus
> > >     technology XYZ") are grouped together with other stuff for that
> > >     category.
> >
> > this would basically make edac a place to report "help something went to
> > the gutter". Sure. I see that useful.
> >
> > In fact there are 3 layers then
> >
> > 1) low level "do check" function
> 
> I'm a bit unclear here.  Are you thinking of a single "do check"
> function that bus-specific code and chipset-specific code can hook
> into, or individual "do check" functions for various bus-specific and
> chipset-specific modules?

Hmm, OK, so I want a function that takes a device as a parameter and
checks the state of that device for errors. Internally that probably has
to go via a function pointer somewhere to a device-specific check method.

Or maybe one check per test type (PCI parity / ECC / etc.):

int pci_check_parity_errors(struct pci_dev *dev, int flags);

something like that, or pci_check_and_clear_parity_errors()
(although that gets too long :)

Drivers can call that, say, after firmware init or something, to validate
that their device is sanely connected. Maybe pci_enable_device() could
call it too.
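To make the idea concrete, here is a rough userspace sketch of such a check, not kernel code. Everything in it is an assumption for illustration: struct mock_pci_dev stands in for struct pci_dev, MOCK_CHECK_CLEAR is an invented flag covering the "check and clear" variant, and the per-device function pointer is just one possible way to reach a device-specific method.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev; holds only what this sketch needs. */
struct mock_pci_dev {
	const char *name;
	bool parity_error_latched;	/* models a latched parity-error status bit */
	int (*check)(struct mock_pci_dev *dev);	/* device-specific method, may be NULL */
};

/* Hypothetical flag: clear the latched error after reporting it. */
#define MOCK_CHECK_CLEAR 0x1

/* Fallback used when the device supplies no method of its own. */
static int mock_generic_parity_check(struct mock_pci_dev *dev)
{
	return dev->parity_error_latched ? 1 : 0;
}

/* Returns 1 if an error was pending, 0 if not, -1 if dev is NULL. */
int mock_pci_check_parity_errors(struct mock_pci_dev *dev, int flags)
{
	int (*check)(struct mock_pci_dev *);
	int pending;

	if (!dev)
		return -1;

	/* Dispatch through the device's own method when it has one. */
	check = dev->check ? dev->check : mock_generic_parity_check;
	pending = check(dev);

	/* Optionally acknowledge the error so the next call starts clean. */
	if (pending && (flags & MOCK_CHECK_CLEAR))
		dev->parity_error_latched = false;

	return pending;
}
```

A driver would call this once after its own init and treat a nonzero result as "the device is not sanely connected". Whether the clear belongs behind a flag or in a separate function is an open design choice.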

This also needs a pci_suspend_parity_check() and a matching _resume_
counterpart, so that the driver can temporarily disable any checks, for
example during device reset/init. Then, just before resuming checks, the
driver manually clears any error that was latched in the meantime.
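The suspend/clear/resume ordering could look roughly like the userspace sketch below. All names are hypothetical (struct pausable_dev, the pausable_* helpers, the nesting counter); none of this is an existing kernel interface. The point is only the order of operations around a reset.

```c
#include <stdbool.h>

/* Hypothetical device model: a latched error bit plus a suspend counter. */
struct pausable_dev {
	bool parity_error_latched;
	int check_suspended;	/* > 0 means checks are currently skipped */
};

void pausable_suspend_parity_check(struct pausable_dev *dev)
{
	dev->check_suspended++;
}

void pausable_resume_parity_check(struct pausable_dev *dev)
{
	if (dev->check_suspended > 0)
		dev->check_suspended--;
}

void pausable_clear_parity_errors(struct pausable_dev *dev)
{
	dev->parity_error_latched = false;
}

/* Returns 1 if an error is pending and checks are enabled, else 0. */
int pausable_check_parity_errors(struct pausable_dev *dev)
{
	if (dev->check_suspended > 0)
		return 0;
	return dev->parity_error_latched ? 1 : 0;
}

/* Driver-side reset: errors raised while suspended are discarded. */
void pausable_driver_reset(struct pausable_dev *dev)
{
	pausable_suspend_parity_check(dev);
	dev->parity_error_latched = true;	/* simulate noise caused by the reset */
	pausable_clear_parity_errors(dev);	/* clear just before resuming */
	pausable_resume_parity_check(dev);
}
```

A counter rather than a plain boolean is used here so that nested suspend/resume pairs behave sensibly. That is a choice made for the sketch, not something the proposal specifies.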

> > 2) per bus code that calls the do check functions and whatever is needed
> > for bus checks
> >
> > 3) "EDAC" central command, which basically gathers all failure reports
> > and does something with them (push them to userspace or implement the
> > userspace chosen policy (panic/reboot/etc))
> 
> Are you suggesting something like the following?
> 
>     - The controls that determine how the error checking is done are
>       located within the various hardware subsystems.  For instance,
>       with PCI parity checking, this would include stuff like setting
>       the polling frequency and determining which devices to check.

Yes. I would NOT make it overly tunable, btw. For many things a sane
default can be chosen, and the effect of picking a different tuning
isn't all that big. Maybe think of it this way: a tunable is, to a large
degree, an excuse for not determining the right value / heuristic in
the first place.



>     - When an error is actually detected, the subsystem that detected
>       the error (for instance, PCI) would feed the error information
>       to EDAC.  Then EDAC would determine how to respond to the error
>       (for instance, push it to userspace or implement the
>       userspace-chosen policy (panic/reboot/etc))

yup.
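
As a rough illustration of that split, the userspace sketch below has a bus-level poll loop calling a low-level per-device check and feeding every hit to a central reporting function, which applies the chosen policy. All names (the sketch_* types and functions, the two policies) are made up for this example and do not describe the actual EDAC code.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical policy and outcome values for the central layer. */
enum sketch_policy { SKETCH_POLICY_LOG, SKETCH_POLICY_PANIC };
enum sketch_action { SKETCH_ACTION_LOGGED, SKETCH_ACTION_PANIC_REQUESTED };

struct sketch_error_report {
	const char *subsystem;	/* e.g. "pci" */
	const char *device;
	const char *what;
};

struct sketch_edac_core {
	enum sketch_policy policy;	/* chosen by userspace in the real design */
	unsigned long reports_seen;
};

/* Layer 3: central collection point; decides how to respond. */
enum sketch_action sketch_edac_report(struct sketch_edac_core *core,
				      const struct sketch_error_report *r)
{
	core->reports_seen++;
	fprintf(stderr, "sketch: %s %s: %s\n", r->subsystem, r->device, r->what);
	return core->policy == SKETCH_POLICY_PANIC ?
		SKETCH_ACTION_PANIC_REQUESTED : SKETCH_ACTION_LOGGED;
}

struct sketch_dev {
	const char *name;
	int parity_error;	/* nonzero models a latched error */
};

/* Layer 1: low-level "do check" that reads and clears the error state. */
static int sketch_dev_check(struct sketch_dev *dev)
{
	int pending = dev->parity_error;

	dev->parity_error = 0;
	return pending;
}

/* Layer 2: per-bus code; returns the number of errors it reported. */
int sketch_pci_poll(struct sketch_edac_core *core,
		    struct sketch_dev *devs, size_t n)
{
	int errors = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (sketch_dev_check(&devs[i])) {
			struct sketch_error_report r = {
				"pci", devs[i].name, "parity error"
			};
			sketch_edac_report(core, &r);
			errors++;
		}
	}
	return errors;
}
```

The bus code knows how and when to check; the central piece knows only what to do with a report. That keeps the policy in one place.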


