Sean Bruno wrote:
You have found yourself in the same situation I found myself in recently.
Actually my situation was slightly different, but the resulting problem is the
same. In my case, at reboot md decided that one partition of a mirror was out of
sync, and so initiated a re-sync with the other partition. However, the
partition which was active contained a bad sector, so the re-sync failed, over
and over again, just like yours is doing.
In order to fix my system I used the following steps.
The first step was to take the offending filesystem offline. Then I copied the
existing partition onto the good disk using dd, with the noerror option so it
would continue past read errors. In my case I knew that the read error was not
part of the actual filesystem in use because it passed fsck. When the copy was
complete I ran fsck on the new filesystem just to be sure it had copied ok.
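For anyone who wants to try the copy step before touching real partitions, here is a rough rehearsal on scratch files. It is only a sketch: the file names are made up, and conv=sync is an addition not mentioned in the message above (it pads a failed read with zeros so that later data keeps its offset). On the real system if= would be the failing partition and of= the partition on the good disk.

```shell
#!/bin/sh
# Rehearse the dd copy on scratch image files; nothing here touches /dev/sd*.
rehearse_copy() {
    work=$(mktemp -d) || return 1
    src="$work/src.img"    # hypothetical stand-in for the failing partition
    dst="$work/dst.img"    # hypothetical stand-in for the partition on the good disk

    # Build a 1 MiB source image.
    dd if=/dev/urandom of="$src" bs=4096 count=256 2>/dev/null

    # conv=noerror: carry on after a read error instead of aborting.
    # conv=sync:    pad a short or failed read with zeros up to a full block.
    dd if="$src" of="$dst" bs=4096 conv=noerror,sync 2>/dev/null

    # With no read errors the copy should be byte-identical to the source.
    if cmp -s "$src" "$dst"; then
        echo "copy matches"
    else
        echo "copy differs"
    fi
    rm -rf "$work"
}

result=$(rehearse_copy)
echo "$result"
```

On real devices, dd reports any read errors on stderr, so do not discard that output there; it tells you whether the copy was clean.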
After this I created a new RAID consisting of just the good partition (in my
case the RAID was md1 and the new partition was sda3):
# mdadm -C /dev/md1 --force -n 1 -l 1 /dev/sda3
As a temporary fix, until a new disk arrived, I ran
# e2fsck -c -d -f /dev/sdb3
to mark bad blocks (sdb3 was the failing partition).
Then I ran:
# mdadm --zero-superblock /dev/sdb3
to remove the md superblock from the partition so it was no longer part of a RAID.
Finally, I used mdadm to add the dodgy partition back into the RAID:
# mdadm -a /dev/md1 /dev/sdb3
and to grow the RAID to 2 partitions:
# mdadm --grow -n 2 /dev/md1
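The steps above, collected in order, could be wrapped in a small script along these lines. This is a sketch, not a tested recovery tool: the run helper and the DRY_RUN switch are made up for illustration, and the device names are the ones from this particular system. By default it only prints each command so the sequence can be checked first.

```shell
#!/bin/sh
# Dry run by default: each step is printed, not executed.
# Set DRY_RUN=0 (as root, after checking the device names) to run for real.
DRY_RUN=${DRY_RUN:-1}
MD=/dev/md1      # the array being rebuilt
GOOD=/dev/sda3   # partition on the good disk
BAD=/dev/sdb3    # failing partition

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rebuild_steps() {
    run mdadm -C "$MD" --force -n 1 -l 1 "$GOOD" &&   # one-member mirror on the good partition
    run e2fsck -c -d -f "$BAD" &&                     # scan the failing partition for bad blocks
    run mdadm --zero-superblock "$BAD" &&             # drop its old md superblock
    run mdadm -a "$MD" "$BAD" &&                      # add it to the new array
    run mdadm --grow -n 2 "$MD"                       # grow the mirror to two members
}

rebuild_steps
```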
Thanks for the assistance with this, Nigel. I was able to recover from
this 'double' failure with your procedure. I had purchased 2 new disks
in order to replace the failed drives and I am back up at this time.
Sean
You may want to do some additional testing to verify the status of the new
filesystem. In my original message I implied that fsck was sufficient, but as
Tony quite rightly pointed out, it isn't. On my failing disk I knew that the bad
block wasn't part of the active filesystem, so a simple copy/fsck was
sufficient. During the copy there were no errors, and a comparison of the two
filesystems showed no discrepancies.
When you copied your filesystem, did the system generate any error messages? If
so, you will probably want to investigate which file the bad block belonged to,
determine what impact a corrupted copy of that file might have, and
check whether you can restore that file from a backup.
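One possible way to track the file down on an ext2/ext3 filesystem is to convert the failing sector into a filesystem block number and then ask debugfs which inode owns it. The sketch below does only the arithmetic. The sector, partition start and block size are made-up example values, and it assumes the kernel reported an absolute disk sector in 512-byte units; if the reported sector is already relative to the partition, use 0 for the partition start.

```shell
#!/bin/sh
# Convert a failing disk sector to a filesystem block number.
sector_to_fs_block() {
    bad_sector=$1      # sector from the kernel log, in 512-byte units
    part_start=$2      # first sector of the partition (e.g. from fdisk -l)
    fs_block_size=$3   # filesystem block size in bytes (e.g. from tune2fs -l)
    echo $(( (bad_sector - part_start) * 512 / fs_block_size ))
}

# Example values only; substitute the numbers from your own system.
block=$(sector_to_fs_block 1000063 63 4096)
echo "filesystem block: $block"

# Then, as root, with the filesystem unmounted or mounted read-only:
#   debugfs -R "icheck $block" /dev/sdb3     # block number -> inode number
#   debugfs -R "ncheck <inode>" /dev/sdb3    # inode number -> path name
```

If icheck reports that no inode owns the block, the bad sector is in free space or filesystem metadata rather than in a file.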
--
Nigel Wade, System Administrator, Space Plasma Physics Group,
University of Leicester, Leicester, LE1 7RH, UK
E-mail : nmw@xxxxxxxxxxxx
Phone : +44 (0)116 2523548, Fax : +44 (0)116 2523555