Re: [PATCH] x86_64: mcelog tolerant level cleanup

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wednesday 16 May 2007 22:29, Tim Hockin wrote:
> From: Tim Hockin <[email protected]>
>
> Background:
>  The MCE handler has several paths that it can take, depending on various
>  conditions of the MCE status and the value of the 'tolerant' knob.  The
>  exact semantics are not well defined and the code is a bit twisty.
>
> Description:
>  This patch makes the MCE handler's behavior more clear by documenting the
>  behavior for various 'tolerant' levels.  It also fixes or enhances
>  several small things in the handler.  Specifically:
>      * If RIPV is set it is not safe to restart, so set the 'no way out'
>        flag rather than the 'kill it' flag.

Why? It is not PCC. We cannot return of course, but killing isn't returning.

>      * Don't panic() on correctable MCEs.

The idea behind this was that if you get an exception it is always a bit risky
because there are a few potential deadlocks that cannot be avoided.
And normally non UC is just polled which will never cause a panic.
So I don't quite see the value of this change.

> This patch also calls nonseekable_open() in mce_open (as suggested by akpm).

That should be a separate patch

> +		0: always panic on uncorrected errors, log corrected errors
> +		1: panic or SIGBUS on uncorrected errors, log corrected errors
> +		2: SIGBUS or log uncorrected errors, log corrected errors

Just saying SIGBUS is misleading because it isn't a catchable
signal.



> +
> +	/*
> +	 * If the error seems to be unrecoverable, something should be
> +	 * done.  Try to kill as little as possible.  If we can kill just
> +	 * one task, do that.  If the user has set the tolerance very
> +	 * high, don't try to do anything at all.
> +	 */
> +	if (kill_it && tolerant < 3) {
>  		int user_space = 0;
>
> -		if (m.mcgstatus & MCG_STATUS_RIPV)
> +		/*
> +		 * If the EIPV bit is set, it means the saved IP is the
> +		 * instruction which caused the MCE.
> +		 */
> +		if (m.mcgstatus & MCG_STATUS_EIPV)
>  			user_space = panicm.rip && (panicm.cs & 3);
> -
> -		/* When the machine was in user space and the CPU didn't get
> -		   confused it's normally not necessary to panic, unless you
> -		   are paranoid (tolerant == 0)
> -
> -		   RED-PEN could be more tolerant for MCEs in idle,
> -		   but most likely they occur at boot anyways, where
> -		   it is best to just halt the machine. */
> -		if ((!user_space && (panic_on_oops || tolerant < 2)) ||
> -		    (unsigned)current->pid <= 1)
> -			mce_panic("Uncorrected machine check", &panicm, mcestart);
> -
> -		/* do_exit takes an awful lot of locks and has as
> -		   slight risk of deadlocking. If you don't want that
> -		   don't set tolerant >= 2 */
> -		if (tolerant < 3)
> +
> +		/*
> +		 * If we know that the error was in user space, send a
> +		 * SIGBUS.  Otherwise, panic if tolerance is low.
> +		 *
> +		 * do_exit() takes an awful lot of locks and has a slight
> +		 * risk of deadlocking.
> +		 */
> +		if (user_space) {
>  			do_exit(SIGBUS);
> +		} else if (panic_on_oops || tolerant < 2) {
> +			mce_panic("Uncorrected machine check",
> +				&panicm, mcestart);
> +		}

Why did you remove the idle special case?


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux