Fedora Users — RAID drive failed, but SMART shows no errors?

One of my FC6 machines just claimed that one of two RAID-1 SCSI drives had an error:


Mar 12 21:44:33 headache kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Mar 12 21:44:33 headache kernel: sda: Current: sense key: Hardware Error
Mar 12 21:44:33 headache kernel:     Additional sense: Defect list error
Mar 12 21:44:33 headache kernel: end_request: I/O error, dev sda, sector 143363856
Mar 12 21:44:33 headache kernel: md: super_written gets error=-5, uptodate=0

Mar 12 21:44:33 headache kernel: raid1: Disk failure on sda3, disabling device.
Mar 12 21:44:33 headache kernel: Operation continuing on 1 devices

Mar 12 21:44:33 headache kernel: RAID1 conf printout:
Mar 12 21:44:33 headache kernel:  --- wd:1 rd:2
Mar 12 21:44:33 headache kernel:  disk 0, wo:1, o:0, dev:sda3
Mar 12 21:44:33 headache kernel:  disk 1, wo:0, o:1, dev:sdb3
Mar 12 21:44:33 headache kernel: RAID1 conf printout:
Mar 12 21:44:33 headache kernel:  --- wd:1 rd:2
Mar 12 21:44:33 headache kernel:  disk 1, wo:0, o:1, dev:sdb3

I have two SCSI drives off an Adaptec AIC-7902B U320 (rev 10) controller.

But smartctl gives this drive a clean bill of health:

[root@headache ~]# smartctl -H /dev/sda
smartctl version 5.36 [i386-redhat-linux-gnu] Copyright © 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

SMART Health Status: OK

I have three RAID-1 partitions on these disks. The one that reported an error was the largest one. I dropped the degraded partition, and hot-added it back. Immediately, another error was logged to /var/log/messages, for the same block, but despite the error, the kernel started resyncing the array:


Mar 12 22:37:33 headache kernel: Buffer I/O error on device sda3, logical block 35262625
Mar 12 22:37:41 headache kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Mar 12 22:37:41 headache kernel: sda: Current: sense key: Medium Error
Mar 12 22:37:41 headache kernel:     Additional sense: Unrecovered read error
Mar 12 22:37:41 headache kernel: Info fld=0x88b8f16
Mar 12 22:37:41 headache kernel: end_request: I/O error, dev sda, sector 143363862
Mar 12 22:37:41 headache kernel: Buffer I/O error on device sda3, logical block 35262625
Mar 12 22:37:41 headache kernel: md: bind<sda3>
Mar 12 22:37:42 headache kernel: RAID1 conf printout:
Mar 12 22:37:42 headache kernel:  --- wd:1 rd:2
Mar 12 22:37:42 headache kernel:  disk 0, wo:1, o:1, dev:sda3
Mar 12 22:37:42 headache kernel:  disk 1, wo:0, o:1, dev:sdb3
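As a cross-check on the log above, the "Info fld" value appears to be the failing sector address in hex. A minimal sketch of the conversion, using only the numbers already in the log (the variable names are just for illustration):

```shell
# Info fld from the second SCSI error, as logged (hex)
info_fld=0x88b8f16

# Convert to decimal; printf accepts C-style hex constants for %d
lba=$(printf '%d' "$info_fld")
echo "Info fld as decimal sector: $lba"

# Sectors reported by the two end_request lines
first_err=143363856
second_err=143363862

# Distance between the two reported sectors
delta=$((second_err - first_err))
echo "Second error is $delta sectors after the first"
```

The decimal value matches the sector in the second end_request line, and the two errors are 6 sectors apart, so both point at the same small region of the disk.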

Despite the second error, the resync of the failed partition completed successfully.
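For anyone wanting to repeat the drop and hot-add, the sequence would look roughly like the sketch below. The array name /dev/md2 is an assumption (the logs only name the member partitions), and the run wrapper is a hypothetical dry-run helper that only prints each command rather than executing it:

```shell
md=/dev/md2      # ASSUMPTION: array name is not shown in the logs
part=/dev/sda3   # the member the kernel marked as failed

# Hypothetical dry-run helper: prints the command instead of running it.
# Replace the body with "$@" to actually execute (as root).
run() { printf '+ %s\n' "$*"; }

plan=$(
  run mdadm "$md" --remove "$part"   # drop the already-failed member
  run mdadm "$md" --add "$part"      # hot-add it back; md starts a resync
  run cat /proc/mdstat               # watch resync progress
)
printf '%s\n' "$plan"
```

Since the kernel had already marked sda3 as faulty, no separate --fail step should be needed before --remove.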

smartctl -a shows 80000+ read errors corrected by ECC/fast, no rereads, and 6 rewrites. My knowledge of SMART is limited. The other drive in this array shows 50000+ read errors corrected by ECC/fast, no rereads and no rewrites.

So, are the 6 rewrites on this drive an indication of a looming failure?

My second question is about hot-swapping. The two drives are in a hot-swappable bay, connected to the Adaptec AIC-7902B U320 controller. Hardware-wise, the drives are hot-swappable, but what about software-wise? If I take this drive entirely off RAID-1, cut the power to the hot-swap bay, pull the drive out, replace it, plug it back in, and reenable power, will the FC6 kernel be able to deal with this?

If I cannot do this, my third question is what do I need to do, grub-wise, to be able to swap sdb with sda? sda is the one that's failing the RAID-1 array. If I can't hot-swap it, I'll need to replace it with the sdb drive, but right now grub is installed only on sda, so how do I install a copy of all the grub boot-related stuff on sdb?
