Re: [RFC] Linux memory error handling

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, 15 Jun 2005, Russ Anderson wrote:

> >  This is highly undesirable if the same interrupt is used for MBEs.  A 
> > page that causes an excessive number of SBEs should rather be removed from 
> > the available pool instead.
> 
> As a practical point I think you are right that if there are enough 
> SBEs to cause a performance hit, migrating the data to a different 
> physical page would be a prudent thing to do.  But that functionality
> hasn't been implemented yet.

 There's another point actually -- if a memory location (here meaning a 
single entity covered by ECC; usually a 32-bit word or a 64-bit 
doubleword) is causing SBEs excessively, then a bit there probably 
somewhat less reliable due to wear, damage, etc.  But the normal memory 
error statistics apply to other bits at that location as usually, so now 
the probability of an MBE is greater.  So you'd better keep your data 
away.

> That may not always be the right setting for all customers.
> One possible way to deal with that would be to have different
> threshold settings for logging and page migration.  That would
> provide flexibility.

 Certainly.

> >                              Logging should probably take recent events 
> > into account anyway and take care of not overloading the system, e.g. by 
> > keeping only statistical data instead of detailed information about each 
> > event under load.
> 
> That's what the SBE thresholding does.  It avoids overloading the
> system by switching from interrupt mode to periodic polling
> mode, where detailed information can get dropped.

 But as I mentioned you have to be careful not to switch the MBE interrupt 
off as a side effect.  Actually the overhead for handling the interrupt 
shouldn't be that high, unless the error triggers in the handler itself.

> >  Note we have some infrastructure for that in the MIPS port -- we kill the 
> > triggering process, but we don't mark the problematic memory page as 
> > unusable (which is an area for improvement). 
> 
> Mips has some nice features when it comes to error recovery.

 Starting with synchronous reporting when possible. :-)

  Maciej
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux