Re: EDAC chipkill messages

--- Orion Poplawski <orion@cora.nwra.com> wrote:

> Robert Hancock wrote:
> > Orion Poplawski wrote:
> >> Can someone please explain to me what these mean?
> >>
> >> EDAC k8 MC1: general bus error: participating processor(local node
> 
> >> origin), time-out(no timeout) memory transaction type(generic
> read), 
> >> mem or i/o(mem access), cache level(generic)
> >> EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, syndrome 0xc8f4,
> row 
> >> 1, channel 0, label "": k8_edac
> >> EDAC MC1: CE - no information available: k8_edac Error Overflow
> set
> >> EDAC k8 MC1: extended error code: ECC chipkill x4 error
> >>
> >> Thanks!
> >>
> > 
> > Sounds like you're having some memory ECC errors.. some Memtest86,
> etc. 
> > runs may be in order. You may be able to figure out from this info
> what 
> > DIMM is having the problem.
> > 
> 
> That was my assumption as well, but was hoping someone could decode
> the 
> above information and point me to the problem chip.  I ran Memtest86 
> overnight but found no problems, but don't know if it needs to run in
> a 
> particular ECC mode.
> 
> This is a dual proc 275 system with 4 1GB DIMMs.  Guessing that MC1
> is 
> the controller on the second CPU.  Would row 1 be the second DIMM?

No that would be the FIRST DIMM, on Channel 0

Each DIMM has 2 ChipSelect Rows (CSROW)

Each csrow covers two channels across, therefore on a 4 socket memory
array, there are CSROWS 0 and 1 on the first DIMM row and CSROWS 2 and
3 on the second DIMM row.

WWWWWWWWW  XXXXXXXXXXX
YYYYYYYYY  ZZZZZZZZZZZ

The W and the Y DIMMs are channel 0
The X and the Z DIMMs are channel 1

csrows 0 and 1 would cross over Y and Z DIMMs
csrows 2 and 3 would cross over W and X DIMMs

The mapping problem occurs in then identifying each of the above goes
to which silk screen labeled sockets on the mobo.

Usually they are labeled:

H0_DIMM2A   H0_DIMM2B
H0_DIMM1A   H0_DIMM1B

where A is channel 0 and B is channel 1 and
the "DIMM1" would indicate the CSROWs 0 and 1
and "DIMM2" would indicate the CSROWs 2 and 3

The string 'label ""' can be filled in by a userspace script to
properly identify the DIMM silk screen according to the motherboard
used.

The lines with "EDAC MC1:" are EDAC CORE output messages, while the
"EDAC K8:" lines are EDAC Memory Controller driver messages.
"CE" is correctable error 
MC1 is memory controller 1 (0 based)

ECC ChipKill x4 was what found the error and corrected it.

The FRU, (field replaceable unit) is the DIMM located at socket
H1_DIMM1A, according to the labeling I mentioned above.

caveat: the detector is not 100% perfect but gives a general area to
look at, the DIMM specification. Sometimes other errors can cause what
looks like a memory error, but usually a bad memory DIMM is the root
cause of the vast majority of such errors.

In addition, memtest86+ doesn't find all the bad memory in all cases,
but it is still a VERY useful tool

doug thompson

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

References:
- Re: EDAC chipkill messages
  - From: Orion Poplawski <orion@cora.nwra.com>

Prev by Date: Re: Threading...
Next by Date: Re: [PATCH 8/15] ide: disable DMA in ->ide_dma_check for "no IORDY" case
Previous by thread: Re: EDAC chipkill messages
Next by thread: [PATCH 2.6.20-rc5] Gigaset ISDN driver error handling fixes (resend)
Index(es):
- Date
- Thread

[Index of Archives] [Kernel Newbies] [Netfilter] [Bugtraq] [Photo] [Stuff] [Gimp] [Yosemite News] [MIPS Linux] [ARM Linux] [Linux Security] [Linux RAID] [Video 4 Linux] [Linux for the blind] [Linux Resources]