Re: noisy edac

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, 2006-01-31 at 01:02 +0100, Gunther Mayer wrote:
> Dave Peterson wrote:
> 
> >On Monday 30 January 2006 15:44, Gunther Mayer wrote:
> >  
> >
> >>>For each individual type of error that is specific to a particular
> >>>low-level chipset driver (e752x, amd76x, etc.) there could be an entry
> >>>in the appropriate part of the sysfs hierarchy under the given chipset
> >>>driver.  This entry could have several settings that the user may choose
> >>>      
> >>>
> >>>from such as { ignore, syslog, panic }.  For the implementation, there
> >>    
> >>
> >>>could be a generic piece of code in the core EDAC module that a chipset
> >>>driver calls into.  The generic code would do the dirty work of creating
> >>>the sysfs entries (and destroying them when the chipset module is
> >>>unloading).  How does this sound?
> >>>      
> >>>
> >>Over-Engineered.
> >>    
> >>
> >
> >Do you have an alternate suggestion?
> >  
> >
> Just printk() the exact driver specific low-level error, even if non-fatal.
> 
> Single non-fatal errors just show your system recovers correctly.
> 
> Multiple (e.g. noisy) non-fatal are either an indication of a serious 
> problem
>   (e.g. after how many corrected ECC errors on the same address in which
>     time interval will you replace your dimm? How many S-ATA CRC-errors
>      will indicate marginal bad cabling? )
> or it shows the problem needs to be root analyzed. But don't disable the
> messages as this will only hide the real problem.
> 
> Concerning Non-Fatal PCI Express errors, the error cause registers need
> to be printed in case of error, too (see Intel Chipset Specifications)

EDAC currently presents a common interface for harvesting the number of
CEs (and other data) that occur AND maps to a DIMM label, which can be
populated to a mobo silk screen label.  This is via the sysfs interface.

Currently each of the MC drivers some of their output error messages in
their own pattern of output, while also funnelling common info to the
core module.  We don't want to lose that device specific information,
but it would be a better pattern to funnel that device specific output
to the EDAC Core module for presentation in a more uniforum manner,
through the new sysfs interface, such as:

/sys/drivers/system/edac/mc/mc_driver_error_report

or some such.

The EDAC stack is like a SCSI stack:  A core module and a lower level
instance driver.

We want to move some of the information that is currently being dumped
via printk() to the sysfs interface. Some not all.

The message that is being printed now that caused this post, just prints
some MC specific information, but not quite enough to be of real use.
It is extra noise, which needs to be examined in depth, to form a clean
error stack.

New MC drivers are in the works. New MC chipsets are coming down the
line and I don't want to have each driver output its OWN output that
proves more difficult to "harvest" over time, because each driver writer
does it his way.

EDAC is not perfect enough yet, but with time and more refactoring it
will approach there.

thanks

doug t

> -
> Gunther
> 
> 
> 
> 
> -------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
> for problems?  Stop!  Download the new AJAX search engine that makes
> searching your log files as easy as surfing the  web.  DOWNLOAD SPLUNK!
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=103432&bid=230486&dat=121642
> _______________________________________________
> bluesmoke-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/bluesmoke-devel

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux