Re: Can we ignore errors in mcelog if the server is running fine

On 7/27/06, Robert Hancock <[email protected]> wrote:
> Vikas Kedia wrote:
> > The server seems to be running fine. A. Can I ignore the following
> > mcelog errors? B. If not, what should I do to stop the server from
> > reporting mcelog errors?
>
> Looks like data cache ECC errors, meaning CPU 0 is faulty.
> Eventually, if it's not replaced, there will likely be some uncorrectable
> errors and the system will likely crash.

I am seeing similar but different errors.

[root@turyxsrv ~]# mcelog
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC 89a560bb249
ADDR 1dfa49690
 Northbridge Chipkill ECC error
 Chipkill ECC syndrome = 2021
      bit46 = corrected ecc error
 bus error 'local node response, request didn't time out
     generic read mem transaction
     memory access, level generic'
STATUS 9410c00020080a13 MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC a6550f2d4de
ADDR 1de74b670
 Northbridge Chipkill ECC error
 Chipkill ECC syndrome = 2021
      bit32 = err cpu0
      bit46 = corrected ecc error
 bus error 'local node origin, request didn't time out
     generic read mem transaction
     memory access, level generic'
STATUS 9410c00120080813 MCGSTATUS 0
MCE 2
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC afe4eba238a
ADDR 1d8049698
 Northbridge Chipkill ECC error
 Chipkill ECC syndrome = 2021
      bit46 = corrected ecc error
 bus error 'local node response, request didn't time out
     generic read mem transaction
     memory access, level generic'
STATUS 9410c00020080a13 MCGSTATUS 0
MCE 3
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC cc945738d0a
ADDR 194c4b670
 Northbridge Chipkill ECC error
 Chipkill ECC syndrome = 2021
      bit40 = error found by scrub
      bit46 = corrected ecc error
 bus error 'local node response, request didn't time out
     generic read mem transaction
     memory access, level generic'
STATUS 9410c10020080a13 MCGSTATUS 0
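For anyone decoding these records by hand, here is a minimal sketch that splits a STATUS value into the architectural x86 machine-check flag bits (63-57) and the low 16-bit MCA error code, using the simple bus-error layout `0000 1PPT RRRR IILL`. `decode_status` is a hypothetical helper, not part of mcelog. It deliberately ignores model-specific bits such as the AMD northbridge bits 32, 40 and 46 that mcelog names above, and its text labels are paraphrases chosen to line up with mcelog's wording. For the first record it reports VAL, EN and ADDRV set with UC clear, meaning a valid, corrected error with a valid address. This agrees with mcelog's "corrected ecc error".

```python
"""Decode an x86 MCi_STATUS value (hypothetical helper, not part of mcelog)."""

# Architectural flag bits in the top byte of MCi_STATUS.
STATUS_FLAGS = {
    63: "VAL (register contents valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Fields of the simple bus-error MCA code: 0000 1PPT RRRR IILL.
# Labels are paraphrases; only the bit layout is architectural.
PARTICIPATION = {0b00: "local node origin", 0b01: "local node response",
                 0b10: "local node observed", 0b11: "generic"}
REQUEST = {0b0000: "generic", 0b0001: "generic read", 0b0010: "generic write",
           0b0011: "data read", 0b0100: "data write", 0b0101: "instruction fetch",
           0b0110: "prefetch", 0b0111: "eviction", 0b1000: "snoop"}
MEM_IO = {0b00: "memory access", 0b01: "reserved", 0b10: "I/O",
          0b11: "other transaction"}
LEVEL = {0b00: "level 0", 0b01: "level 1", 0b10: "level 2", 0b11: "generic"}


def decode_status(status: int) -> dict:
    """Return the set architectural flags and, for bus errors, the decoded code."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF  # low 16 bits: MCA error code
    result = {"flags": flags, "mca_code": code}
    if (code & 0xF800) == 0x0800:  # bits 15-11 == 00001: bus error format
        result["bus_error"] = {
            "participation": PARTICIPATION[(code >> 9) & 0b11],
            "timeout": bool((code >> 8) & 1),
            "request": REQUEST.get((code >> 4) & 0xF, "reserved"),
            "mem_io": MEM_IO[(code >> 2) & 0b11],
            "level": LEVEL[code & 0b11],
        }
    return result


if __name__ == "__main__":
    # STATUS values copied from the mcelog output in this message.
    for value in (0x9410C00020080A13, 0x9410C00120080813, 0x9410C10020080A13):
        print(hex(value), decode_status(value))
```

Because the UC bit (61) is clear in all four records, each of them was corrected in hardware; the decoder only restates what the register says and cannot tell whether the CPU or a DIMM is at fault.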

These errors repeat whenever I perform any kind of operation.
How severe are Chipkill errors? Should I consider throwing away CPU 1
and getting another one?
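Since the concern above is that corrected errors may eventually become uncorrectable, one rough way to watch the trend is to tally records by reporting CPU, bank and the architectural UC bit (61) of STATUS. This is a minimal sketch assuming mcelog's text output keeps the `CPU <cpu> <bank> ...` and `STATUS <hex> ...` line layout shown above. `tally` is a hypothetical helper, not an mcelog feature.

```python
import re
from collections import Counter

UC_BIT = 61  # architectural "uncorrected error" flag in MCi_STATUS


def tally(mcelog_text: str) -> Counter:
    """Count records keyed by (cpu, bank, "corrected" | "uncorrected").

    Assumes each record has a "CPU <cpu> <bank> ..." line followed later
    by a "STATUS <hex> ..." line, as in the output quoted above.
    """
    counts = Counter()
    cpu = bank = None
    for line in mcelog_text.splitlines():
        m = re.match(r"CPU (\d+) (\d+)", line)
        if m:
            cpu, bank = int(m.group(1)), int(m.group(2))
            continue
        m = re.match(r"STATUS ([0-9a-fA-F]+)", line)
        if m and cpu is not None:
            status = int(m.group(1), 16)
            kind = "uncorrected" if (status >> UC_BIT) & 1 else "corrected"
            counts[(cpu, bank, kind)] += 1
            cpu = bank = None  # reset until the next record's CPU line
    return counts


if __name__ == "__main__":
    # Two records abridged from the output above.
    sample = """MCE 0
CPU 1 4 northbridge TSC 89a560bb249
ADDR 1dfa49690
STATUS 9410c00020080a13 MCGSTATUS 0
MCE 1
CPU 1 4 northbridge TSC a6550f2d4de
ADDR 1de74b670
STATUS 9410c00120080813 MCGSTATUS 0
"""
    print(tally(sample))
```

Running this periodically on saved mcelog output would show whether the corrected count on CPU 1, bank 4 keeps climbing or whether any uncorrected records start to appear.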

Regards,
Om.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
