Sam Varshavchik wrote:
...
But smartctl gives this drive a clean bill of health:
[root@headache ~]# smartctl -H /dev/sda
smartctl version 5.36 [i386-redhat-linux-gnu] Copyright © 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
SMART Health Status: OK
Try running a SMART test on the drive:
smartctl -t long /dev/sda
It will tell you how long the test will take; you'll
have to check once in a while with
smartctl -a /dev/sda
to get the result of the test. It will be at the end:
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -      12641                 - [-   -    -]
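If you don't want to read the whole -a output every time, something
like this shows just the self-test log (the grep pattern is simply what
matches the output of my smartctl 5.x, adjust if yours differs):

smartctl -a /dev/sda | grep -A 4 "Self-test log"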
I have three RAID-1 partitions on these disks. The one that reported an
error was the largest one. I dropped the degraded partition, and
hot-added it back. Immediately, another error was logged to
/var/log/messages, for the same block, but despite the error, the kernel
started resyncing the array:
...
If it were me, I would replace this disk. The next time you
run into this read error could be when sdb fails and you try
to resync a new sdb :-(
...
My second question: the two drives are in a hot-swappable bay, and
connected to the Adaptec AIC-7902B U320 controller. Hardware-wise, the
drives are hot-swappable, but what about software-wise? If I take this
drive entirely off RAID-1, cut the power to the hot-swap bay, pull the
drive out, replace it, plug it back in, and re-enable power, will the
FC6 kernel be able to deal with this?
On my system (with 8 146G hotswap SCSI drives on a dual
channel Adaptec AHA-3960D / AIC-7899A), I would:
0. Keep a window open with a "tail -10f /var/log/messages"
1. Take all partitions from the failing drive out of the array with mdadm
2. Remove the drive from the kernel:
echo "scsi remove-single-device 0 0 0 0" >/proc/scsi/scsi
The four zeros are: controller #, channel, SCSI id, LUN;
try "cat /proc/scsi/scsi" to see these numbers. If
you make a mistake and remove the wrong drive, you'll
have a problem...
3. Physically remove the drive
4. Insert a new drive
5. Tell the kernel that a new drive exists:
echo "scsi add-single-device 0 0 0 0" >/proc/scsi/scsi
This can take a while; the drive has to spin up.
6. Partition the drive
7. Add the partitions with mdadm, follow the sync in /proc/mdstat
8. After the sync, run grub to reinstall the boot loader
We have done this several times when drives have failed; a rough
command-level version of the steps is sketched below.
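Concretely, it would look something like the commands below. The
host/channel/id/lun numbers, the md device names and the partition
numbers are only guesses for your box, so check "cat /proc/scsi/scsi"
and "cat /proc/mdstat" before copying anything:

# 1. fail and remove every partition of the bad drive from its array
mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
mdadm /dev/md1 --fail /dev/sda2 --remove /dev/sda2
mdadm /dev/md2 --fail /dev/sda3 --remove /dev/sda3

# 2. detach the drive from the kernel (controller 0, channel 0, id 0, lun 0)
echo "scsi remove-single-device 0 0 0 0" >/proc/scsi/scsi

# 3./4. physically pull the old drive and insert the new one

# 5. tell the kernel about the new drive (takes a moment, it has to spin up)
echo "scsi add-single-device 0 0 0 0" >/proc/scsi/scsi

# 6. copy the partition table from the surviving drive
sfdisk -d /dev/sdb | sfdisk /dev/sda

# 7. add the partitions back and watch the resync
mdadm /dev/md0 --add /dev/sda1
mdadm /dev/md1 --add /dev/sda2
mdadm /dev/md2 --add /dev/sda3
cat /proc/mdstat

# 8. after the resync, reinstall the boot loader on the new drive with grub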
If I cannot do this, my third question is what do I need to do,
grub-wise, to be able to swap sdb with sda? sda is the one that's
failing the RAID-1 array. If I can't hot-swap it, I'll need to replace
it with the sdb drive, but right now grub is installed only on sda, so
how do I install a copy of all the grub boot-related stuff on sdb?
Hm? If you used the GUI to create the RAID partitions during
installation, GRUB should be on both drives.
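If it isn't, you can put GRUB on sdb by hand from the grub shell. This
assumes the mirrored /boot (or /) filesystem is the first partition on
the disk (sdb1); adjust (hd0,0) if yours lives elsewhere:

grub
grub> device (hd0) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
grub> quit

The device line temporarily maps sdb to (hd0), so the MBR written to it
will look for its stage files on the disk it actually boots from once
sdb ends up as the first disk.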
Mogens
--
Mogens Kjaer, Carlsberg A/S, Computer Department
Gamle Carlsberg Vej 10, DK-2500 Valby, Denmark
Phone: +45 33 27 53 25, Fax: +45 33 27 47 08
Email: mk@xxxxxx Homepage: http://www.crc.dk