BUGHUNTING: sata_sil, Silicon Image 3114 controllers, 2.6.18 and numerous errors

Hello,

This is a long-winded explanation of events so I'll try to keep it concise.

I am running the following system (a large file server):

Athlon XP 2600+ CPU
1GB PC2700 RAM
A7N8X (nForce2) motherboard, BIOS 1008 (latest)
2x Silicon Image SiI3114 SATA controllers, output of lspci in [1]lspci.gz
D-Link DG530-T Gigabit PCI network card
Twinhan VisionPlus VisionDTV TV card

Drives:

2x WD3000JD 300GB Western Digital
2x WD3200JD 320GB Western Digital
3x 6L300S0 300GB Maxtor
3x 6B300S0 300GB Maxtor (identical to the above but with slightly older firmware and non-RoHS compliant)

The mess of problems began with _all_ these drives whilst using kernel 2.6.15-26-k7 in the Ubuntu Dapper distribution. I got hit with a bug concerning FUA:

http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/b0c495e4cf9d6d2/6ac40ae91be51b23?lnk=st&q=libpata+code+issues&rnum=1#6ac40ae91be51b23

At this stage my testing methodology was the following:

1) Make filesystems (reiserfs) on each of the drives

2) Make a huge file (11GB) by piping the output of /bin/yes "0123456789" to a file

3) Sync the disks
4) Calculate MD5 checksum of the file
5) Copy hugefile across to the next drive
6) Calculate MD5sum on new file
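
Steps 2 to 6 above can be sketched in Python roughly as follows. This is a minimal illustration, not the exact commands used in the test: the function names, paths and buffer sizes are hypothetical, and the content pattern simply mimics what /bin/yes "0123456789" produces. Note that reading a file back straight after copying it may be served from the page cache rather than from the disk.

```python
import hashlib
import os
import shutil


def make_hugefile(path, size_bytes, pattern=b"0123456789\n"):
    """Write the repeated pattern until at least size_bytes are written,
    then flush and fsync so the data is pushed to the device (steps 2-3)."""
    chunk = pattern * 4096
    written = 0
    with open(path, "wb") as f:
        while written < size_bytes:
            f.write(chunk)
            written += len(chunk)
        f.flush()
        os.fsync(f.fileno())
    return written


def md5_of(path, bufsize=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB blocks (steps 4, 6)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst (step 5), sync, and compare the checksums of both files."""
    before = md5_of(src)
    shutil.copyfile(src, dst)
    os.sync()  # Unix only; equivalent in spirit to running sync(1)
    after = md5_of(dst)
    return before == after, before, after
```

A call such as copy_and_verify("/mnt/sda1/hugefile", "/mnt/sdb1/hugefile") (hypothetical mount points) returns a flag plus both digests, so a mismatch can be logged alongside the kernel messages.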

Writing the hugefile to EACH drive would fill up the kernel log with errors like those seen all over the above linked thread, about one every 20 seconds, and the MD5sums of the copied files were different.

Noting that the problem occurred on both the Maxtor drives (with much more severity, leading to device resets) and the Western Digitals, I manually installed the 2.6.18 kernel from kernel.org, which disables FUA by default. This made the errors "disappear" on the Maxtor drives, but the Western Digital drives still displayed the same errors as before.

Because I am naive I went out and purchased another four drives, Seagates this time. I replaced the Western Digitals in the machine with the following model: 4x ST3250824NS (250GB) and started testing again. This time I get errors like:

[ 1876.112335] attempt to access beyond end of device
[ 1876.112429] sdc1: rw=0, want=4517265416, limit=488392002
[ 1876.112516] attempt to access beyond end of device
[ 1876.112588] sdc1: rw=0, want=2110783496, limit=488392002
[ 1876.112723] attempt to access beyond end of device
[ 1876.112793] sdc1: rw=0, want=4656529416, limit=488392002
[ 1876.122250] attempt to access beyond end of device
[ 1876.122339] sdc1: rw=0, want=4517265416, limit=488392002

and even a kernel Oops: [2]reiseroops.gz

These errors occurred after I copied hugefile to a destination drive and during the calculation of the md5sum, i.e. upon reading the new file.
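
To make the size of the overrun concrete, the "access beyond end of device" lines quoted above can be parsed and compared against the device limit. The following is a minimal Python sketch; the function name is hypothetical, and it assumes that want and limit are counted in 512-byte sectors, which is consistent with the reported limit matching a 250GB drive.

```python
import re

# Matches lines such as: "sdc1: rw=0, want=4517265416, limit=488392002"
LINE_RE = re.compile(r"(\w+): rw=(\d+), want=(\d+), limit=(\d+)")

SECTOR_BYTES = 512  # assumption: the kernel reports 512-byte sectors here


def parse_overruns(log_text):
    """Extract each out-of-range request and how far past the limit it is."""
    results = []
    for m in LINE_RE.finditer(log_text):
        dev = m.group(1)
        rw, want, limit = (int(g) for g in m.group(2, 3, 4))
        results.append({
            "dev": dev,
            "rw": rw,
            "want": want,
            "limit": limit,
            "excess_sectors": want - limit,
            "limit_bytes": limit * SECTOR_BYTES,
        })
    return results
```

Run over the dmesg excerpt above, this shows requested sectors several times larger than the partition itself, which suggests corrupted block addresses rather than reads that are merely slightly out of range.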

Digging deeper, I am now at the stage where I can report the following:

The errors are independent of the controller hardware. I tried one brand-new controller and one used controller verified as working (on Windows), with the same errors in each case.

The errors are independent of the type of drive. Seagates and Maxtors both exhibit the same errors.

The errors are independent of the filesystem. I tried ext2, ext3 and reiserfs 3.6 and got the same result with each.

The errors ONLY occur when reading from newly-created files. I am currently running badblocks -n on the drives, which will obviously take some time on drives this large, in order to find out whether this also happens with simple block reads/writes.

Also I can say this:

The hugefile copied from the FIRST drive on one controller to the FIRST drive on the other controller exhibited NO ERRORS in either direction. By this I mean a Seagate attached to port0 and a Maxtor attached to port0. [3]dmesg-detection.gz may help with this: it is the kernel's detection of the drives. Whether this is due to dumb luck or a quirk in this "bug", I don't know, but I will keep trying to make this error happen on the first drives in the system.

Copying the hugefile from the FIRST Seagate drive to the SECOND, THIRD and FOURTH Seagate drives makes md5sum report "input/output error" in every case, with associated "access beyond end of device" errors in dmesg. The same thing happens when I copy from the FIRST Maxtor drive to the second and third (not the fourth, as it contains NTFS data).

I will keep copying back and forth between drives in an effort to map out what is causing the error, but I'm going to need some pointers to track this to the source.

Any help appreciated,
Jonathan
