FC3 machine, two disks, one mirrors the other.  Software RAID.  Lovely.

(Time passes.)

The first disk (a Hitachi Deskstar 7K250, if anyone cares) dies
suddenly.  The RAID software does the right thing (more or less: the
machine was unusable until after a reboot).

In a previous message, I asked "But now what?", and set out some
specific worries.  I have now changed the disk -- no, it didn't go
particularly well :-) -- and this message is to fill in what I've
learned.

*Thank you* to Bill Rugolsky for chipping in with some ideas; had I
understood them correctly, I might've done better :-(

(Repeat Hint: a nice Fedora Docs topic; it's SORELY UNDERDOCUMENTED in
general :-)  This article
[http://mark.foster.cc/articles/raid-rebuild.html] is an exception.)

On to the fun, step by step...

* Get RAID to "stand down" re the dead disk; something like...
  ("mdadm --query --detail /dev/md<whatever>" to get the facts...)

    # mdadm /dev/md0 --set-faulty /dev/sda1 --remove /dev/sda1
    # mdadm /dev/md1 --set-faulty /dev/sda2 --remove /dev/sda2
    # mdadm /dev/md2 --set-faulty /dev/sda5 --remove /dev/sda5
    # mdadm /dev/md3 --set-faulty /dev/sda6 --remove /dev/sda6

  [Yep, that worked fine.]

* Which physical disk is the guilty party?  (Oooh, shoulda thought
  about this _much_ earlier...)

  OK, I've gotta yank one disk out, now which one is it?  They're
  identical; oops.

  Well, it's the first one (/dev/sda, also logged as 'ata1'), so if I
  look in the motherboard manual and find where SATA channel 0 is
  connected, and follow that wire... that'll be right, won't it?
  [Update: yes, that is right.]

  A better plan? -- take the cover off, fire the machine up, it's only
  using one disk, right? -- so I feel which one is doing something,
  and take out the other.
  [Update: no, useless idea -- there's too much noise/vibration to be
  able to distinguish a doing-something disk from an idle one.]

* Yank old, put in new disk.  (*Mark* it as '/dev/sda' or whatever,
  for future reference!)

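For the "stand down" step above, with four arrays it is easy to mistype
one array/partition pair.  A minimal sketch of a dry-run helper that
only prints the commands so they can be checked by eye before being
pasted into a root shell.  The name print_stand_down is made up for
illustration (it is not part of mdadm), and the pairs in the usage line
are simply the ones from this message.

```shell
#!/bin/sh
# Dry run: print the mdadm stand-down command for each ARRAY PARTITION
# pair instead of executing it.  Nothing here touches a disk.
# print_stand_down is a hypothetical helper name, not an mdadm feature.
print_stand_down() {
    if [ $(( $# % 2 )) -ne 0 ]; then
        echo "usage: print_stand_down ARRAY PARTITION [ARRAY PARTITION ...]" >&2
        return 1
    fi
    while [ "$#" -gt 0 ]; do
        # $1 is the md array, $2 the member partition on the dead disk
        printf 'mdadm %s --set-faulty %s --remove %s\n' "$1" "$2" "$2"
        shift 2
    done
}

# The layout described in this message:
print_stand_down /dev/md0 /dev/sda1 /dev/md1 /dev/sda2 \
                 /dev/md2 /dev/sda5 /dev/md3 /dev/sda6
```

Printing rather than running keeps the helper harmless; the output can
be compared against "mdadm --query --detail" before anything is marked
faulty.
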
* Will the machine's GRUBness work when the disk is yanked and
  replaced?

  I just replaced what was /dev/sda with a fresh disk.  Has GRUB got
  hidden secrets, e.g. in the MBR, such that it won't boot without the
  expected first disk?

  Yes, this turned out to be exactly the case.  Put in the new disk,
  and it won't boot at all (because the first GRUB piece is in the MBR
  of the disk I just yanked).  So I booted with the rescue CD instead;
  stay tuned.

* Getting the right partitioning on the replacement disk.

  It would seem straightforward:

    # sfdisk -d /dev/sdb > /tmp/my-disk-partitioning
    # edit /tmp/my-disk-partitioning, replacing 'sdb' with 'sda'
    # sfdisk /dev/sda < /tmp/my-disk-partitioning

  FAILED: sfdisk off the rescue CD crashed hopelessly.  I had to do
  the partitioning "by hand" with the ever-reliable 'fdisk'.

* Sorting out GRUB (or "being able to boot the machine")

  BIG SURPRISE: the rescue CD DOES NOT INCLUDE 'grub' or
  'grub-install'!

  What I _think_ I should've done is:

  1. Before I did anything (i.e. with the old faltering /dev/sda), I
     should've done...

       dd if=/dev/sda of=/dev/sdb count=1

     ... to get GRUB-stage-0 (what GRUB itself calls "stage1") into
     the MBR of the disk I'm about to keep.  (With dd's default
     512-byte block size, count=1 copies the whole first sector,
     partition table included, so this assumes the two disks are
     partitioned identically.)

  2. Don't I also need to mangle /boot/grub/grub.conf...
     - to get rid of 'hiddenmenu' (so that I will be shown the
       choices)
     - add at least one item to the grub menu so that grub will look
       on the second disk (hd1)?
     (This is the critical step, and I'm not sure about it.)

  3. While rebooting (now with the new /dev/sda), tell the BIOS to
     boot from the second disk

  4. Upon successful boot from 2nd disk, do whatever I'm going to
     do... (below)

  5. If I wish, reverse the process:

       # copy grub-stage-0 onto MBR of new disk:
       dd if=/dev/sdb of=/dev/sda count=1

     Tell BIOS to go back to booting from first disk.

  Note 1: the above is ***UNTESTED***

  Note 2: this couldn't have worked if the old /dev/sda had been
  *DEAD* rather than just "failing".
  I'm not at all sure how I would've gotten grub-stage-0 onto the
  still-alive disk (/dev/sdb) at all, were I unwise enough to do a
  'shutdown -h'.

* A nice thing about this software RAID stuff...

  As far as I could tell, you can use a degraded partition
  (half-a-RAID-1) as a normal ext3 partition.  So, for example, you
  can do:

    mkdir /mnt/foo ; mount /dev/sdb2 /mnt/foo

  However, this may just be playing with fire (see next item).

* A _big mistake_ that I made! --

  At some point, with my new disk (/dev/sda) in and booted off the
  rescue CD, I did something like...

    dd if=/dev/sdb2 of=/dev/sda2

  ... i.e. brute-force copy all of a raid-1 partition onto its
  presently-empty cousin.

  Theory: "/dev/sda is empty and not in play; what harm can it do?"

  Answer (I think): Lots.  The RAID software snoops around on these
  (type 'fd') partitions and silently decides what to make of the
  situation.  This is really not what you want in this delicate
  state.  Information is OK ("I've spotted a degraded array on
  /dev/sdb2, which seems odd"), and doing nothing is OK, but "being
  helpful" isn't.  (Is there a kernel boot parameter to turn off RAID
  cleverness?)

* Where I ended up: by mounting my degraded /dev/sdb2, I was able to
  get access to a copy of GRUB, and run it.

  What I ended up with: a /boot/grub/ that had all the grub stuff in
  it, *but no kernel* (not really sure why).  I told the BIOS to boot
  from 2nd disk, and it duly dropped me into a GRUB prompt.

  As I was no longer sure I even had a copy of the kernel on my disk,
  I decided it was time for an FC4 install/upgrade :-)

* A happy ending: I had originally done an old-fashioned
  several-partitions install (/boot, /, /var, /home).  As I was
  forced into an install, I was going to "lose" /boot and /.
  Happily, however, the whole process stayed well away from the /home
  partition -- what I was most keen to preserve in the first place.

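Given how much the RAID layer decides on its own, it seems worth
checking which arrays the kernel currently considers degraded before
touching any member partition.  A minimal sketch, assuming the usual
/proc/mdstat layout in which each array's status line ends in
something like [UU] (healthy) or [_U] (one member missing).  The name
list_degraded is made up for illustration, and this is not something I
ran during the episode above.

```shell
#!/bin/sh
# Read /proc/mdstat-style text on stdin and print the name of every md
# array whose member map (e.g. [_U]) shows at least one missing member.
# list_degraded is a hypothetical helper name; the parsing assumes the
# common mdstat layout and may need adjusting for other kernels.
list_degraded() {
    awk '
        /^md[0-9]+ :/          { name = $1 }                # remember current array
        /\[[U_]*_[U_]*\]/      { if (name != "") print name } # "_" = missing member
    '
}

# On a live system (read-only, changes nothing):
if [ -r /proc/mdstat ]; then
    list_degraded < /proc/mdstat
fi
```

An empty result means no array reported a missing member; after an
"mdadm --add" the array should drop off the list once the resync has
finished.
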
* Anyway, after the install, all that was left was to tell the RAID
  array for /home of its new friend:

    # mdadm /dev/md3 --add /dev/sda6

Machine upgraded, no data lost, a little time wasted.

Again, I would advise more attention to this topic
(raid-you-set-up-a-year-or-two-ago-and-forgot-about => disk fails =>
getting things back painlessly).  With disks as cheap as they are,
*everyone* should consider mirroring.  But as things stand, I'd expect
a high failure rate even among the technically-inclined.

Will