Hi,
I accidentally discovered that my software RAID5 array was degraded:
one of the three disks had been kicked out of the array.
It turned out that disk /dev/sda (array member sda2) had been disabled
5 days before! I was totally unaware of the problem, which is odd.
syslog shows the moment the array got into trouble:
2010-02-23T17:10:31.974066+01:00 home07 kernel:
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
2010-02-23T17:10:31.974200+01:00 home07 kernel: ata1.00: cmd
ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
2010-02-23T17:10:31.974213+01:00 home07 kernel: res
40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
2010-02-23T17:10:31.974222+01:00 home07 kernel: ata1.00: status:
{ DRDY }
2010-02-23T17:10:31.974230+01:00 home07 kernel: ata1: hard
resetting link
2010-02-23T17:10:31.974240+01:00 home07 kernel: ata1: nv:
skipping hardreset on occupied port
2010-02-23T17:10:37.479060+01:00 home07 kernel: ata1: link is
slow to respond, please be patient (ready=0)
2010-02-23T17:10:42.018051+01:00 home07 kernel: ata1: SRST failed
(errno=-16)
2010-02-23T17:10:42.018074+01:00 home07 kernel: ata1: hard
resetting link
2010-02-23T17:10:42.018081+01:00 home07 kernel: ata1: nv:
skipping hardreset on occupied port
2010-02-23T17:10:47.521060+01:00 home07 kernel: ata1: link is
slow to respond, please be patient (ready=0)
2010-02-23T17:10:52.062048+01:00 home07 kernel: ata1: SRST failed
(errno=-16)
2010-02-23T17:10:52.062072+01:00 home07 kernel: ata1: hard
resetting link
2010-02-23T17:10:52.062079+01:00 home07 kernel: ata1: nv:
skipping hardreset on occupied port
2010-02-23T17:10:57.564052+01:00 home07 kernel: ata1: link is
slow to respond, please be patient (ready=0)
2010-02-23T17:11:27.094054+01:00 home07 kernel: ata1: SRST failed
(errno=-16)
2010-02-23T17:11:27.094086+01:00 home07 kernel: ata1: limiting
SATA link speed to 1.5 Gbps
2010-02-23T17:11:27.094096+01:00 home07 kernel: ata1: hard
resetting link
2010-02-23T17:11:27.094107+01:00 home07 kernel: ata1: nv:
skipping hardreset on occupied port
2010-02-23T17:11:32.139070+01:00 home07 kernel: ata1: SRST failed
(errno=-16)
2010-02-23T17:11:32.139103+01:00 home07 kernel: ata1: reset
failed, giving up
2010-02-23T17:11:32.139112+01:00 home07 kernel: ata1.00: disabled
2010-02-23T17:11:32.139147+01:00 home07 kernel: ata1.00: device
reported invalid CHS sector 0
2010-02-23T17:11:32.139158+01:00 home07 kernel: end_request: I/O
error, dev sda, sector 164842814
2010-02-23T17:11:32.139168+01:00 home07 kernel: md: super_written
gets error=-5, uptodate=0
2010-02-23T17:11:32.139177+01:00 home07 kernel: raid5: Disk
failure on sda2, disabling device.
2010-02-23T17:11:32.139186+01:00 home07 kernel: raid5: Operation
continuing on 2 devices.
2010-02-23T17:11:32.139194+01:00 home07 kernel: ata1: EH complete
2010-02-23T17:11:32.154040+01:00 home07 kernel: RAID5 conf
printout:
2010-02-23T17:11:32.154064+01:00 home07 kernel: --- rd:3 wd:2
2010-02-23T17:11:32.154069+01:00 home07 kernel: disk 0, o:0,
dev:sda2
2010-02-23T17:11:32.154074+01:00 home07 kernel: disk 1, o:1,
dev:sdb2
2010-02-23T17:11:32.154079+01:00 home07 kernel: disk 2, o:1,
dev:sdc2
2010-02-23T17:11:32.157289+01:00 home07 kernel: RAID5 conf
printout:
2010-02-23T17:11:32.157419+01:00 home07 kernel: --- rd:3 wd:2
2010-02-23T17:11:32.157426+01:00 home07 kernel: disk 1, o:1,
dev:sdb2
2010-02-23T17:11:32.157431+01:00 home07 kernel: disk 2, o:1,
dev:sdc2
Upon discovery I rebuilt the array of course:
2010-03-01T10:21:13.944903+01:00 home07 kernel: md:
bind<sda2>
2010-03-01T10:21:13.962118+01:00 home07 kernel: RAID5 conf
printout:
2010-03-01T10:21:13.962272+01:00 home07 kernel: --- rd:3 wd:2
2010-03-01T10:21:13.962284+01:00 home07 kernel: disk 0, o:1,
dev:sda2
2010-03-01T10:21:13.962293+01:00 home07 kernel: disk 1, o:1,
dev:sdb2
2010-03-01T10:21:13.962302+01:00 home07 kernel: disk 2, o:1,
dev:sdc2
2010-03-01T10:21:13.968175+01:00 home07 kernel: md: recovery of
RAID array md1
2010-03-01T10:21:13.968215+01:00 home07 kernel: md: minimum
_guaranteed_ speed: 1000 KB/sec/disk.
2010-03-01T10:21:13.968228+01:00 home07 kernel: md: using maximum
available idle IO bandwidth (but not more than 200000 KB/sec) for
recovery.
2010-03-01T10:21:13.968239+01:00 home07 kernel: md: using 128k
window, over a total of 81923328 blocks.
2010-03-01T10:42:39.093334+01:00 home07 dhclient[1804]:
DHCPREQUEST on br0 to 192.168.254.254 port 67
2010-03-01T10:42:39.095010+01:00 home07 dhclient[1804]: DHCPACK
from 192.168.254.254
2010-03-01T10:42:39.846397+01:00 home07 dhclient[1804]: bound to
192.168.254.7 -- renewal in 1797 seconds.
2010-03-01T10:46:41.279910+01:00 home07 kernel: md: md1: recovery
done.
2010-03-01T10:46:41.320932+01:00 home07 kernel: RAID5 conf
printout:
2010-03-01T10:46:41.320973+01:00 home07 kernel: --- rd:3 wd:3
2010-03-01T10:46:41.320984+01:00 home07 kernel: disk 0, o:1,
dev:sda2
2010-03-01T10:46:41.320996+01:00 home07 kernel: disk 1, o:1,
dev:sdb2
2010-03-01T10:46:41.321004+01:00 home07 kernel: disk 2, o:1,
dev:sdc2
What surprises me is that the system didn't inform me (except for the
syslog messages) that something was seriously wrong. It should have
done so with an alarming popup in X, or something similar.
Probably I just misconfigured something, but maybe Fedora has no
software installed (or available) to alert the user to such serious
events.
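To make concrete what kind of check I have in mind, here is a minimal
sketch, not something I know Fedora to ship. It scans text in the
/proc/mdstat format for arrays that report fewer working devices than
configured ones (e.g. "[3/2] [_UU]"). The function name find_degraded
and the SAMPLE text are my own inventions for illustration; the sample
is not copied from my machine.

```python
import re

# Array header line in /proc/mdstat, e.g. "md1 : active raid5 sdb2[1] sdc2[2]"
ARRAY_RE = re.compile(r"^(md\d+)\s*:")
# Status field, e.g. "[3/2] [_UU]": 3 configured devices, 2 working,
# "_" marks the slot of the missing member.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def find_degraded(mdstat_text):
    """Return (array, configured, working) for every degraded array."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            configured = int(status.group(1))
            working = int(status.group(2))
            if working < configured:
                degraded.append((current, configured, working))
            current = None  # status line seen, wait for the next header
    return degraded


# Hypothetical sample in mdstat format (illustrative values only).
SAMPLE = """\
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 sdb2[1] sdc2[2]
      163846656 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]

md0 : active raid1 sda1[0] sdb1[1]
      104320 blocks [2/2] [UU]

unused devices: <none>
"""

for name, configured, working in find_degraded(SAMPLE):
    print("WARNING: %s is degraded (%d of %d devices working)"
          % (name, working, configured))
```

Run from cron against open("/proc/mdstat").read() instead of SAMPLE,
something like this would at least produce a mail when an array loses a
member, assuming cron mail is delivered somewhere I actually read.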
Any suggestions?
Thanks,
Rolf