Re: data corruption with nvidia chipsets and IDE/SATA drives

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi!

* On Sat, Dec 02, 2006 at 05:17 PM (-0800), Kurtis D. Rader wrote:

> On Sat, 2006-12-02 01:56:06, Christoph Anton Mitterer wrote:
> > The issue was basically the following: I found a severe bug mainly by
> > fortune because it occurs very rarely.  My test looks like the following:
> > I have about 30GB of testing data on my harddisk,... I repeat verifying
> > sha512 sums on these files and check if errors occur.  One test pass
> > verifies the 30GB 50 times,... about one to four differences are found in
> > each pass.
> 
> I'm also experiencing silent data corruption on writes to SATA disks
> connected to a Nvidia controller (nForce 4 chipset). The problem is
> 100% reproducible. Details of my configuration (mainboard model, lspci,
> etc.) are near the bottom of this message. What follows is a summation
> of my findings.
> 
> I have confirmed the corruption is occurring on the writes and not the
> reads. Furthermore, if I compare the original and copy while both are
> still cached in memory no corruption is found. But as soon as I flush the
> pagecache (by reading another file larger than memory) to force the copy
> of the file to be read from disk the corruption is seen. The corruption
> occurs with direct I/O and normal buffered filesystem I/O (ext3).
> 
> Booting with "mem=1g" (system has 4 GiB installed) makes no difference.
> So it isn't due to remapping memory above the 4 GiB boundary.  Booting to
> single user and ensuring no unnecessary modules (video, etc.) are loaded
> also makes no difference.
> 
> The problem affects both disks attached to the nVidia SATA controller but
> not the two disks attached to the PATA side of the same controller. All
> four disks are different models. The same SATA disks attached to
> the Silicon Image 3114 SATA RAID controller (on the same mainboard)
> experiences the same corruption but at a lower probability.  The same
> disks attached to a Promise TX2 SATA controller (in the same system)
> experience no corruption.
> 
> The system has run memtest86 for 24 hours with no errors.

Although your problem report seems rather clearly to be related to 
the disk sub-system (e.g. as it only seems to appear at writings), 
I would just like to point out that running "memtest86" for some 
time without getting any errors does not necessarily state that the 
memory is faultless. 

I recently had one case where a machine (Athlon XP 2200+) crashed
irregularly. "memtest86", running several days, didn't find anything,
but running the "stress test" of "Prime95" [1] for a few minutes 
clearly showed that the machine just miscalculated (Prime95's stress
test stops in this case).

Just removing and reinserting the two memory modules (2 x 256 MB 
DDR-RAM) fixed it. The machine is now stable and "Prime95" hasn't 
stopped due to computational errors anymore since then. I suppose 
that one of the modules wasn't seated in its socket correctly, but 
I don't know why "memtest86" (and "memtest86+") didn't find it.

Bye,
Steffen

[1] http://www.mersenne.org/freesoft.htm
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux