Sam Varshavchik wrote:
...
But smartctl gives this drive a clean bill of health:
[root@headache ~]# smartctl -H /dev/sda
smartctl version 5.36 [i386-redhat-linux-gnu] Copyright © 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
SMART Health Status: OK
Try running a SMART test on the drive:
smartctl -t long /dev/sda
It will tell you how long the test will take; you'll
have to check once in a while with
smartctl -a /dev/sda
to get the result of the test. It will be at the end:
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -      12641                 - [-   -    -]
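If you don't want to read the whole -a output every time, something
like this shows just the self-test log (the grep pattern is simply what
matches the output of my smartctl 5.x, adjust if yours differs):

smartctl -a /dev/sda | grep -A 4 "Self-test log"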
I have three RAID-1 partitions on these disks. The one that reported an
error was the largest one. I dropped the degraded partition, and
hot-added it back. Immediately, another error was logged to
/var/log/messages, for the same block, but despite the error, the kernel
started resyncing the array:
...
If it were me, I would replace this disk. The next time you
run into this read error could be when sdb fails and you try
to resync a new sdb :-(
...
My second question: the two drives are in a hot-swappable bay, and
connected to the Adaptec AIC-7902B U320 controller. Hardware-wise, the
drives are hot-swappable, but what about software-wise? If I take this
drive entirely off RAID-1, cut the power to the hot-swap bay, pull the
drive out, replace it, plug it back in, and re-enable power, will the
FC6 kernel be able to deal with this?
On my system (with 8 146G hotswap SCSI drives on a dual
channel Adaptec AHA-3960D / AIC-7899A), I would:
0. Keep a window open with a "tail -10f /var/log/messages"
1. Take all partitions from the failing drive out of the array with mdadm
2. Remove the drive from the kernel:
echo "scsi remove-single-device 0 0 0 0" >/proc/scsi/scsi
The four zeros are: controller #, channel, SCSI id, LUN;
try "cat /proc/scsi/scsi" to see these numbers. If
you make a mistake and remove the wrong drive, you'll
have a problem...
3. Physically remove the drive
4. Insert a new drive
5. Tell the kernel that a new drive exists:
echo "scsi add-single-device 0 0 0 0" >/proc/scsi/scsi
This can take a while; the drive has to spin up.
6. Partition the drive
7. Add the partitions with mdadm, follow the sync in /proc/mdstat
8. After the sync, run grub to reinstall the boot loader
We have done this several times when drives have failed; a rough
command-level version of the steps is sketched below.
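Concretely, it would look something like the commands below. The
host/channel/id/lun numbers, the md device names and the partition
numbers are only guesses for your box, so check "cat /proc/scsi/scsi"
and "cat /proc/mdstat" before copying anything:

# 1. fail and remove every partition of the bad drive from its array
mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
mdadm /dev/md1 --fail /dev/sda2 --remove /dev/sda2
mdadm /dev/md2 --fail /dev/sda3 --remove /dev/sda3

# 2. detach the drive from the kernel (controller 0, channel 0, id 0, lun 0)
echo "scsi remove-single-device 0 0 0 0" >/proc/scsi/scsi

# 3./4. physically pull the old drive and insert the new one

# 5. tell the kernel about the new drive (takes a moment, it has to spin up)
echo "scsi add-single-device 0 0 0 0" >/proc/scsi/scsi

# 6. copy the partition table from the surviving drive
sfdisk -d /dev/sdb | sfdisk /dev/sda

# 7. add the partitions back and watch the resync
mdadm /dev/md0 --add /dev/sda1
mdadm /dev/md1 --add /dev/sda2
mdadm /dev/md2 --add /dev/sda3
cat /proc/mdstat

# 8. after the resync, reinstall the boot loader on the new drive with grub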
If I cannot do this, my third question is what do I need to do,
grub-wise, to be able to swap sdb with sda? sda is the one that's
failing the RAID-1 array. If I can't hot-swap it, I'll need to replace
it with the sdb drive, but right now grub is installed only on sda, so
how do I install a copy of all the grub boot-related stuff on sdb?
Hm? If you used the GUI to create the RAID partitions during
installation, GRUB should be on both drives.
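If it isn't, you can put GRUB on sdb by hand from the grub shell. This
assumes the mirrored /boot (or /) filesystem is the first partition on
the disk (sdb1); adjust (hd0,0) if yours lives elsewhere:

grub
grub> device (hd0) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
grub> quit

The device line temporarily maps sdb to (hd0), so the MBR written to it
will look for its stage files on the disk it actually boots from once
sdb ends up as the first disk.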
Mogens
--
Mogens Kjaer, Carlsberg A/S, Computer Department
Gamle Carlsberg Vej 10, DK-2500 Valby, Denmark
Phone: +45 33 27 53 25, Fax: +45 33 27 47 08
Email: mk@xxxxxx Homepage: http://www.crc.dk