Fedora Users — SCSI disk errors, but disk diagnostics say disk is OK

Hi all,

Sorry for posting here, but I have a RH9 machine running the 2.4.20-8 kernel that's having problems, and the shrike-list is just about dead, so I thought I might have a better chance of getting responses here...

I have a brand new (3 months old) Seagate ST3146807LW 146GB U320 SCSI drive that is having write problems:

SCSI disk error : host 0 channel 0 id 12 lun 0 return code = 8000002
Info fld=0x7df6bbf, Deferred sd08:11: sense key Hardware Error
Additional sense indicates Internal target failure
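
As a rough aid to reading the return code above, here is a minimal sketch that splits it into its fields. It assumes the kernel prints the value in hex and uses the usual Linux SCSI result layout (status byte lowest, then message, host, and driver bytes). The function name decode_scsi_result is made up for illustration.

```python
def decode_scsi_result(result_hex: str) -> dict:
    """Split a Linux SCSI result word into its four one-byte fields.

    Assumption: the value is hex and laid out as
    driver_byte << 24 | host_byte << 16 | msg_byte << 8 | status_byte.
    """
    result = int(result_hex, 16)
    return {
        "status_byte": result & 0xFF,          # SCSI status returned by the target
        "msg_byte": (result >> 8) & 0xFF,      # SCSI message byte
        "host_byte": (result >> 16) & 0xFF,    # host adapter (low-level driver) code
        "driver_byte": (result >> 24) & 0xFF,  # mid-level driver code
    }


# The value from the log line above.
fields = decode_scsi_result("8000002")
for name, value in fields.items():
    print(f"{name}: 0x{value:02x}")
```

If that layout assumption holds, 8000002 is status 0x02 (CHECK CONDITION in the SCSI spec) with host byte 0 and driver byte 0x08, which I understand to mean that sense data is attached. That would be consistent with the sense key lines: the drive itself is reporting the error, rather than the host adapter flagging a bus problem.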

Eventually, the OS has enough problems, and it does this:

journal_bmap_Rsmp_e68c71a3: journal block not found at offset 6156 on sd(8,17)
Aborting journal on device sd(8,17).
ext3_abort called.
EXT3-fs abort (device sd(8,17)): ext3_journal_start: Detected aborted journal
Remounting filesystem read-only
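
To tie the sd(8,17) and sd08:11 references in these messages back to a device node, here is a small sketch. It assumes the standard Linux numbering for SCSI disks on major 8 (16 minor numbers per disk, with minor 0 of each group being the whole disk), and that the "08:11" form is major:minor in hex. The helper name sd_device_name is hypothetical.

```python
def sd_device_name(major: int, minor: int) -> str:
    """Map a (major, minor) pair for a SCSI disk to its /dev/sdX[N] name.

    Assumption: only block major 8 is handled, which covers sda through sdp
    with 16 minor numbers per disk.
    """
    if major != 8:
        raise ValueError("this sketch only handles SCSI disk major 8")
    if not 0 <= minor <= 255:
        raise ValueError("minor must be in the range 0..255")
    disk_index, partition = divmod(minor, 16)
    name = "/dev/sd" + chr(ord("a") + disk_index)
    # Partition 0 means the whole disk, so no numeric suffix.
    return name if partition == 0 else f"{name}{partition}"


# sd(8,17) from the ext3 messages (decimal major and minor).
print(sd_device_name(8, 17))

# sd08:11 from the SCSI error, read as hex major:minor.
major_hex, minor_hex = "08:11".split(":")
print(sd_device_name(int(major_hex, 16), int(minor_hex, 16)))
```

Under those assumptions both forms point at the same place, /dev/sdb1, which is the partition the mke2fs runs are being done on.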

It sounds like a simple bad disk, especially with the references to hardware errors, but I have shut down and rebooted with the Seagate SeaTools diagnostics CD and run both the quick diagnostics and the complete surface scan of the disk (4 hours!), and both come back clean. The vendor won't take the disk back unless SeaTools reports errors, and I tend to believe the tool, since I have used it before and it definitely catches things...

So my conclusion is that the problem is OS-related. I ran "mke2fs -c -j /dev/sdb1" last night to rebuild the filesystem, and I still get the same problems when trying to restore from tape. Tonight I am going to run "mke2fs -c -c -j /dev/sdb1" so that it does a complete destructive read/write test as it rebuilds the filesystem. I tried that last night, but gave up at 2AM as it started the second pass with 0x55555555, after 2 hours of writing 0xaaaaaaaa and verifying it, seemingly without problems (though I'm not sure whether it reports problems as it encounters them or only at the end; I had to ^C out). Disk I/O with the "-c -c" option slows the machine right down, and it's a main file server, so I can't do that during the day.

There are other disks on that SCSI chain, including another ST3146807LW in the same cabinet as the one that is having problems. The only thing unique about this one is that it is a single 140GB ext3 partition, whereas the other ST3146807LW is split into a ~90GB and a ~50GB partition.

Would anyone have any idea where to start looking for this problem? This machine is badly in need of patching, but because it is RH9 and not RHEL or Fedora, I'm not sure how to do that. Any thoughts would be appreciated.

-Tom

--
_______________________________________________________________________
Tom Haws               Manager, Systems Administration
trh@xxxxxxxxxxxxx      Timberline Forest Inventory Consultants
Tel: (250) 562-2628    1579 9th Ave, Prince George, B.C. Canada V2L 3R8
Fax: (250) 562-6942    http://www.timberline.ca
_______________________________________________________________________