RAID drive failed, but SMART shows no errors?

One of my FC6 machines just claimed that one of two RAID-1 SCSI drives had an error:

Mar 12 21:44:33 headache kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Mar 12 21:44:33 headache kernel: sda: Current: sense key: Hardware Error
Mar 12 21:44:33 headache kernel:     Additional sense: Defect list error
Mar 12 21:44:33 headache kernel: end_request: I/O error, dev sda, sector 143363856
Mar 12 21:44:33 headache kernel: md: super_written gets error=-5, uptodate=0
Mar 12 21:44:33 headache kernel: raid1: Disk failure on sda3, disabling device.
Mar 12 21:44:33 headache kernel: Operation continuing on 1 devices
Mar 12 21:44:33 headache kernel: RAID1 conf printout:
Mar 12 21:44:33 headache kernel:  --- wd:1 rd:2
Mar 12 21:44:33 headache kernel:  disk 0, wo:1, o:0, dev:sda3
Mar 12 21:44:33 headache kernel:  disk 1, wo:0, o:1, dev:sdb3
Mar 12 21:44:33 headache kernel: RAID1 conf printout:
Mar 12 21:44:33 headache kernel:  --- wd:1 rd:2
Mar 12 21:44:33 headache kernel:  disk 1, wo:0, o:1, dev:sdb3

I have two SCSI drives off an Adaptec AIC-7902B U320 (rev 10) controller.

But smartctl gives this drive a clean bill of health:

[root@headache ~]# smartctl -H /dev/sda
smartctl version 5.36 [i386-redhat-linux-gnu] Copyright © 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

SMART Health Status: OK

I have three RAID-1 partitions on these disks. The one that reported the error is the largest. I dropped the failed partition from the degraded array and hot-added it back. Another error was immediately logged to /var/log/messages for the same block, but the kernel started resyncing the array anyway:

Mar 12 22:37:33 headache kernel: Buffer I/O error on device sda3, logical block 35262625
Mar 12 22:37:41 headache kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Mar 12 22:37:41 headache kernel: sda: Current: sense key: Medium Error
Mar 12 22:37:41 headache kernel:     Additional sense: Unrecovered read error
Mar 12 22:37:41 headache kernel: Info fld=0x88b8f16
Mar 12 22:37:41 headache kernel: end_request: I/O error, dev sda, sector 143363862
Mar 12 22:37:41 headache kernel: Buffer I/O error on device sda3, logical block 35262625
Mar 12 22:37:41 headache kernel: md: bind<sda3>
Mar 12 22:37:42 headache kernel: RAID1 conf printout:
Mar 12 22:37:42 headache kernel:  --- wd:1 rd:2
Mar 12 22:37:42 headache kernel:  disk 0, wo:1, o:1, dev:sda3
Mar 12 22:37:42 headache kernel:  disk 1, wo:0, o:1, dev:sdb3
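
As a cross-check on the messages above: the "Info fld" value in the second error appears to be the failing sector in hexadecimal, and it converts to the same number that the end_request line reports. A small sketch in shell arithmetic, with both values copied from the log:

```shell
#!/bin/sh
# Both values are copied from the kernel messages above.
info_fld=0x88b8f16          # "Info fld" from the sense data
reported=143363862          # sector from the end_request line

sector=$(($info_fld))       # hexadecimal to decimal
echo "Info fld as a decimal sector: $sector"

if [ "$sector" -eq "$reported" ]; then
    echo "Info fld matches the sector in the end_request line"
fi
```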

Despite the second error, the resync of the failed partition completed successfully.

smartctl -a shows 80000+ read errors corrected by ECC/fast, no rereads, and 6 rewrites. My knowledge of SMART is limited. The other drive in this array shows 50000+ read errors corrected by ECC/fast, no rereads and no rewrites.
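
Since smartctl -H only prints the drive's overall verdict, the individual logs may say more about the suspect sector. A sketch using options from the smartmontools man page; the helper name smart_report is made up for illustration, and the function is only defined here, not run:

```shell
#!/bin/sh
# smart_report is a hypothetical helper, not part of smartmontools.
smart_report() {
    dev=$1
    smartctl -H "$dev"             # overall health verdict
    smartctl -l error "$dev"       # error counter log (ECC and reread/rewrite counts)
    smartctl -l selftest "$dev"    # results of earlier self-tests
}

# Usage, as root:
#   smart_report /dev/sda
#   smartctl -t long /dev/sda      # starts an extended self-test in the background
```

Once it finishes, the result of the extended self-test shows up in the selftest log and should indicate whether the drive itself can still read the reported sector.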

So, are the 6 rewrites on this drive an indication of a looming failure? My second question: the two drives are in a hot-swappable bay, connected to the Adaptec AIC-7902B U320 controller. Hardware-wise, the drives are hot-swappable, but what about software-wise? If I take this drive entirely off RAID-1, cut the power to the hot-swap bay, pull the drive out, replace it, plug the new one back in, and re-enable power, will the FC6 kernel be able to deal with this?
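
For concreteness, the software side of that sequence might look like the sketch below. Several things in it are assumptions that nothing above confirms: that the three arrays are named /dev/md0 to /dev/md2 and sit on partitions 1 to 3, that the replacement disk comes back at the same 0:0:0:0 address shown in the kernel messages and is again named sda, and that the 2.6 kernel's /proc/scsi/scsi interface is available. Whether the Adaptec driver and the bay tolerate the power cycle is exactly the open question. The script only prints the commands unless DOIT=1 is set.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed. Set DOIT=1 to run them.
run() {
    if [ "${DOIT:-0}" = 1 ]; then sh -c "$1"; else echo "+ $1"; fi
}

DISK=/dev/sda            # failing drive (from the post)
GOOD=/dev/sdb            # surviving mirror (from the post)
ADDR="0 0 0 0"           # host channel id lun, taken from "sd 0:0:0:0"

detach_disk() {
    for n in 1 2 3; do   # assumed layout: md(N-1) is built on partition N
        run "mdadm /dev/md$((n - 1)) --fail $DISK$n --remove $DISK$n"
    done
    run "echo 'scsi remove-single-device $ADDR' > /proc/scsi/scsi"
}

attach_disk() {
    run "echo 'scsi add-single-device $ADDR' > /proc/scsi/scsi"
    run "sfdisk -d $GOOD | sfdisk $DISK"      # copy the partition table from the good disk
    for n in 1 2 3; do
        run "mdadm /dev/md$((n - 1)) --add $DISK$n"
    done
}

detach_disk
echo "# power down the bay, swap the drive, power it up again"
attach_disk
```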

If I cannot do this, my third question is what do I need to do, grub-wise, to be able to swap sdb with sda? sda is the one that's failing the RAID-1 array. If I can't hot-swap it, I'll need to replace it with the sdb drive, but right now grub is installed only on sda, so how do I install a copy of all the grub boot-related stuff on sdb?
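
For the grub part, the usual GRUB legacy approach is to map the second disk as (hd0) temporarily and run setup against it, so that it can boot on its own once it becomes the first disk. A sketch, assuming /boot is the first partition on each disk, which is not stated above; (hd0,0) would need adjusting otherwise. The function only prints the commands so they can be reviewed before being piped into grub.

```shell
#!/bin/sh
# Commands for the GRUB legacy shell. Assumption: /boot lives on the first
# partition of the disk, i.e. (hd0,0) once /dev/sdb is mapped as (hd0).
grub_cmds() {
    cat <<'EOF'
device (hd0) /dev/sdb
root (hd0,0)
setup (hd0)
quit
EOF
}

grub_cmds                      # review the commands first
# grub_cmds | grub --batch     # then run this as root to write the boot sector on sdb
```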
