Tom Haws wrote:
Hi all,
Sorry for posting here, but I have a RH9 machine running the 2.4.20-8
kernel that's having problems, and the shrike-list is just about dead,
so I thought I might have a better chance of getting responses here...
I have a brand new (3 months old) Seagate ST3146807LW 146GB U320 SCSI
drive that is having write problems:
SCSI disk error : host 0 channel 0 id 12 lun 0 return code = 8000002
Info fld=0x7df6bbf, Deferred sd08:11: sense key Hardware Error
Additional sense indicates Internal target failure
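As an aside, the return code in that first log line can be unpacked by hand. This is only a sketch: it assumes the usual Linux SCSI mid-layer layout for the result word (driver byte, host byte, message byte, status byte, from high to low), and the function name is made up for illustration.

```python
# Split a Linux SCSI result word into its four bytes.
# Assumed layout: driver byte (bits 31-24), host byte (23-16),
# message byte (15-8), SCSI status byte (7-0).

def decode_scsi_result(hex_text):
    """Parse the hex string printed after 'return code =' into fields."""
    result = int(hex_text, 16)
    return {
        "driver": (result >> 24) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "status": result & 0xFF,
    }

# Value copied from the kernel message above.
fields = decode_scsi_result("8000002")
for name, value in fields.items():
    print(f"{name:>6} = 0x{value:02x}")
```

A driver byte of 0x08 means sense data is available and a status byte of 0x02 is CHECK CONDITION, which matches the sense lines printed after it. It shows that the drive reported the error, not what caused it.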
Eventually, the OS has enough problems, and it does this:
journal_bmap_Rsmp_e68c71a3: journal block not found at offset 6156 on
sd(8,17)
Aborting journal on device sd(8,17).
ext3_abort called.
EXT3-fs abort (device sd(8,17)): ext3_journal_start: Detected aborted
journal
Remounting filesystem read-only
It sounds like a simple bad disk, especially with the references to
hardware errors, but I have shut down and rebooted with the Seagate
SeaTools diagnostics CD, and run both the quick diagnostics and the
complete surface scan of the disk, and it comes back clean (4 hours!).
The vendor won't take the disk back unless the SeaTools reports errors,
and I tend to believe the tool, since I have used it before and it
definitely catches things...
So my conclusion is that the problem is OS-related. I did an "mke2fs -c
-j /dev/sdb1" to rebuild the filesystem last night, and I still get the
same problems when trying to restore from tape. I am going to run
"mke2fs -c -c -j /dev/sdb1" tonight on the disk, to get it to do a
complete destructive read/write test as it rebuilds the filesystem. I
tried to do that last night, but gave up at 2AM as it started to do the
second pass with 0x55555555, after 2 hours of writing 0xaaaaaaaa and
verifying that seemingly without problems (though I'm not sure if it
reports problems as it encounters them or at the end; I had to ^C out).
The disk I/O with the "-c -c" option slows the machine right down, and
it's a main file server, so I can't do that during the day.
There are other disks on that SCSI chain, including another ST3146807LW
in the same cabinet as the one that is having problems. The only thing
unique about this one is that it is a single 140GB ext3 partition,
whereas the other ST3146807LW is partitioned into a ~90GB and a ~50GB
partition.
Would anyone have any idea where to start looking for this problem?
This machine is badly in need of patching, but because it is RH9 and not
RHEL or Fedora, I'm not sure how to do that. Any thoughts would be
appreciated.
Whenever you have SCSI problems, the very first thing to check is the
cabling and the terminator. Remember that SCSI only guarantees 3 m of
cable length (about 10 feet), so a drive at the end of a cable much
longer than that (and remember, that 10 feet includes the cable inside
the cabinet) is very likely to have issues.
As for the terminator, make sure that the terminators are enabled ONLY
on the units at the ENDS of the cable.
Controllers are typically at one end of a cable and should have the
terminator enabled. The last drive on the cable should also have its
terminator enabled. No other terminators should be enabled. Multiple
terminators will cause lots of problems.
So, measure your SCSI cable. Look at every drive on the cable and make
sure that ONLY the drive at the end has a terminator.
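If it helps to write the chain down before opening the cabinet, the two checks above can be sketched roughly as follows. Everything in it is hypothetical: the device names, the segment lengths, and the function name are placeholders, and the 3 m limit is simply the figure quoted above, so substitute your own measurements.

```python
# Rough sanity check for one SCSI bus, listed in physical cable order.
# All names and lengths below are illustrative assumptions.
MAX_BUS_METERS = 3.0  # limit quoted in the text above


def check_bus(devices, segments_m, max_m=MAX_BUS_METERS):
    """Return a list of problems found; an empty list means none were seen.

    devices:    list of (name, terminated) tuples in cable order.
    segments_m: cable lengths between neighbours, internal runs included.
    """
    problems = []
    total = sum(segments_m)
    if total > max_m:
        problems.append(f"cable is {total:.1f} m, over the {max_m:.1f} m limit")
    last = len(devices) - 1
    for i, (name, terminated) in enumerate(devices):
        at_end = i in (0, last)
        if at_end and not terminated:
            problems.append(f"{name} is at an end but is not terminated")
        if not at_end and terminated:
            problems.append(f"{name} is mid-cable but is terminated")
    return problems


# Hypothetical chain: a terminator left enabled on a mid-cable drive,
# and a total run that is too long.
bus = [("controller", True), ("disk A", False),
       ("disk B", True), ("disk C", True)]
problems = check_bus(bus, [0.5, 1.0, 2.0])
for line in problems:
    print(line)
```

An empty result only means the description you typed in looks sane; it says nothing about a damaged cable or a bad terminator.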
----------------------------------------------------------------------
- Rick Stevens, Senior Systems Engineer rstevens@xxxxxxxxxxxxxxx -
- VitalStream, Inc. http://www.vitalstream.com -
- -
- To err is human, to forgive, beyond the scope of the OS -
----------------------------------------------------------------------