Question about Kernel MCE and what exactly they could mean

Hello,

I have a customer who is getting MCE errors. The errors do
not occur consistently on a single CPU (this is an Opteron
system with separate memory buses), and the customer believes
they coincide with the times he sshes into the machine from
outside. They also appear to be a serious issue only just
after boot: if the machine survives the first few sshes, we
don't seem to have a problem later, so this looks like an
early boot-up issue of some sort.

The kernel is a SuSE 2.4 kernel variant (2.4.21-143);
I don't expect anyone to know offhand whether it is a bug.

I have debugged and fixed a large number of machines getting
kernel panic MCE errors (by replacing the CPU and, if it still
occurs, replacing the memory), but I have never seen one
triggered by something like inbound ssh, and have never seen
it move from CPU to CPU with what looks like the same "cause".

The arguments for it being hardware are:
	It gets MCE errors.

The arguments for it being software are:
	The errors aren't consistent across CPUs.
	The machine survives heavy HPL runs and does not appear
		to have broken hardware.
	A specific user action seems to cause the issue.

This issue does seem to occur several times a week.
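
A minimal sketch of how the two software-side observations (errors
spread across CPUs, errors following an ssh login) could be checked
against data rather than impressions. Everything in it is an
assumption: the timestamps and CPU numbers are made up, the
60-second window is arbitrary, the summarize() helper is
hypothetical, and pulling the events out of the real logs is left
out because the log format is not shown here.

```python
from collections import Counter
from datetime import datetime, timedelta


def parse(ts):
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp (assumed format)."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")


def summarize(mce_events, ssh_logins, window_seconds=60):
    """Count MCEs per CPU and MCEs that follow an ssh login closely.

    mce_events: list of (timestamp, cpu_number) tuples
    ssh_logins: list of login timestamps
    An MCE counts as "near ssh" if some login happened at most
    window_seconds before it.
    """
    window = timedelta(seconds=window_seconds)
    per_cpu = Counter(cpu for _, cpu in mce_events)
    near_ssh = sum(
        1
        for when, _ in mce_events
        if any(timedelta(0) <= when - login <= window for login in ssh_logins)
    )
    return per_cpu, near_ssh


# Made-up sample data, for illustration only.
mce_events = [
    (parse("2004-01-05 09:00:12"), 0),
    (parse("2004-01-07 14:30:05"), 1),
    (parse("2004-01-09 22:10:40"), 0),
]
ssh_logins = [
    parse("2004-01-05 09:00:03"),
    parse("2004-01-07 14:29:58"),
    parse("2004-01-08 08:00:00"),
]

per_cpu, near_ssh = summarize(mce_events, ssh_logins)
print("MCEs per CPU:", dict(per_cpu))
print("MCEs within 60s after an ssh login:", near_ssh, "of", len(mce_events))
```

If nearly every MCE lands inside the window and the per-CPU counts
are spread out, that would support the software theory; a handful of
events is of course too few to prove anything either way.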

Does anyone have any ideas on, or experience of, whether this
is a kernel bug or a hardware problem?

                        Roger
