Re: RAID drive failed, but SMART shows no errors?

Sam Varshavchik wrote:
...
But smartctl gives this drive a clean bill of health:

[root@headache ~]# smartctl -H /dev/sda
smartctl version 5.36 [i386-redhat-linux-gnu] Copyright © 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

SMART Health Status: OK

Try running a SMART test on the drive:

smartctl -t long /dev/sda

It will tell you how long the test will take to run; you'll
have to poll once in a while with

smartctl -a /dev/sda

to get the result of the test. It will be at the end:

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -      12641             - [-   -    -]
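
For example (just a sketch; the ten-minute interval below is
arbitrary), you can print only the self-test log instead of
the full -a output, or leave it running under watch:

smartctl -l selftest /dev/sda

watch -n 600 'smartctl -l selftest /dev/sda'

When the "Background long" entry shows Completed (or an
error), the test is done.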


I have three RAID-1 partitions on these disks. The one that reported an error was the largest one. I dropped the degraded partition, and hot-added it back. Immediately, another error was logged to /var/log/messages, for the same block, but despite the error, the kernel started resyncing the array:
...

If it were me, I would replace this disk. The next time you
run into this read error could be when sdb fails and you are
resyncing onto a new sdb: at that point sda is the only copy
left, and that bad block means lost data :-(

...
My second question is that the two drives are in a hot-swappable bay, and connected to the Adaptec AIC-7902B U320 controller. Hardware-wise, the drives are hot-swappable, but what about software-wise? If I take this drive entirely off RAID-1, cut the power to the hot-swap bay, pull the drive out, replace it, plug in back in, and reenable power, will the FC6 kernel be able to deal with this?

On my system (with eight 146 GB hot-swap SCSI drives on a dual
channel Adaptec AHA-3960D / AIC-7899A), I would:

0. Keep a window open with a "tail -10f /var/log/messages"
1. Take all partitions on the failing drive out of their arrays
   with mdadm (rough commands for this and the following steps
   are sketched after the list)
2. Remove the drive from the kernel:

echo "scsi remove-single-device 0 0 0 0" >/proc/scsi/scsi

The four zeros are: controller (host) number, channel, SCSI id, LUN.
Try "cat /proc/scsi/scsi" to see these numbers; if
you make a mistake and remove the wrong drive, you'll
have a problem...

3. Physically remove the drive
4. Insert a new drive
5. Tell the kernel that a new drive exists:

echo "scsi add-single-device 0 0 0 0" >/proc/scsi/scsi

This can take a while; the drive has to spin up.

6. Partition the drive
7. Add the partitions with mdadm, follow the sync in /proc/mdstat
8. After the sync, run grub to reinstall the boot loader
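
Roughly, steps 1, 2 and 5-7 look like this as commands. This
is only a sketch: the md devices, partition numbers and the
SCSI address (0 0 0 0) are placeholders, so substitute what
/proc/mdstat and "cat /proc/scsi/scsi" show on your machine.

# 1. fail and remove every sda partition from its array
mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
mdadm /dev/md1 --fail /dev/sda2 --remove /dev/sda2

# 2. detach the disk from the kernel
cat /proc/scsi/scsi
echo "scsi remove-single-device 0 0 0 0" >/proc/scsi/scsi

# 5. after the physical swap, attach the new disk
echo "scsi add-single-device 0 0 0 0" >/proc/scsi/scsi

# 6. copy the partition table from the surviving drive
#    (sfdisk is one way to do it)
sfdisk -d /dev/sdb | sfdisk /dev/sda

# 7. add the new partitions and watch the resync
mdadm /dev/md0 --add /dev/sda1
mdadm /dev/md1 --add /dev/sda2
cat /proc/mdstat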

We have done this several times when drives have failed.


If I cannot do this, my third question is what do I need to do, grub-wise, to be able to swap sdb with sda? sda is the one that's failing the RAID-1 array. If I can't hot-swap it, I'll need to replace it with the sdb drive, but right now grub is installed only on sda, so how do I install a copy of all the grub boot-related stuff on sdb?

Hm? If you used the GUI to create the RAID partitions during
installation, GRUB should be on both drives.
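
If it turns out GRUB only ended up on sda, you can put it on
sdb by hand from the grub shell. Something like this (a
sketch, assuming /boot is on the first partition of the
mirror and you want sdb to boot as the first disk if sda
disappears):

grub
grub> device (hd0) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
grub> quit

The "device" line maps (hd0) to /dev/sdb, so the MBR that
setup writes will look for its stage files on that drive
when the BIOS boots from it.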

Mogens

--
Mogens Kjaer, Carlsberg A/S, Computer Department
Gamle Carlsberg Vej 10, DK-2500 Valby, Denmark
Phone: +45 33 27 53 25, Fax: +45 33 27 47 08
Email: mk@xxxxxx Homepage: http://www.crc.dk

