Fedora Users — FOLLOW-UP: RAID-1 (mirroring) disk failed; now what? [failed]

FC3 machine, two disks, one mirrors the other.  Software RAID.
Lovely.  (Time passes.)  The first disk (a Hitachi Deskstar 7K250, if
anyone cares) dies suddenly.  The RAID software does the right thing
(more or less: the machine was unusable until after a reboot).

In a previous message, I asked "But now what?", and set out some
specific worries.  I have now changed the disk -- no, it didn't go
particularly well :-) -- and this message is to fill in what I've
learned.  *Thank you* to Bill Rugolsky for chipping in with some
ideas; had I understood them correctly, I might've done better :-(

(Repeat Hint: a nice Fedora Docs topic; it's SORELY UNDERDOCUMENTED in
general :-) This article [http://mark.foster.cc/articles/raid-rebuild.html]
is an exception.) On to the fun, step by step...

* Get RAID to "stand down" re the dead disk; something like...

  ("mdadm --query --detail /dev/md<whatever>" to get the facts...)

  # mdadm /dev/md0 --set-faulty /dev/sda1 --remove /dev/sda1
  # mdadm /dev/md1 --set-faulty /dev/sda2 --remove /dev/sda2
  # mdadm /dev/md2 --set-faulty /dev/sda5 --remove /dev/sda5
  # mdadm /dev/md3 --set-faulty /dev/sda6 --remove /dev/sda6

  [Yep, that worked fine.]
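
  A minimal sketch of picking out the failed members automatically
  rather than by eye.  It reads /proc/mdstat-style text and prints
  each array alongside any component the kernel has flagged "(F)".
  The function name failed_members is made up for illustration, and
  the mdadm loop in the trailing comment is untested.

```shell
# Print "array member" for every component marked failed, i.e. "(F)",
# in /proc/mdstat-style text read from stdin.
# (failed_members is a hypothetical helper name, not part of mdadm.)
failed_members() {
    awk '$2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                dev = $i
                # strip the trailing "[slot](F)" to leave the device name
                sub(/\[[0-9]+\]\(F\)$/, "", dev)
                print $1, dev
            }
    }'
}

# Intended use, as root (not run here):
#   failed_members < /proc/mdstat |
#   while read array member; do
#       mdadm "/dev/$array" --remove "/dev/$member"
#   done
```

  Members that have not been flagged yet still need the explicit
  --set-faulty step shown above.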

* Which physical disk is the guilty party?

  (Oooh, shoulda thought about this _much_ earlier...) OK, I've gotta
  yank one disk out, now which one is it?  They're identical; oops.

  Well, it's the first one (/dev/sda, also logged as 'ata1'), so if I
  look in the motherboard manual and find where SATA channel 0 is
  connected, and follow that wire...  that'll be right, won't it?
  [Update: yes, that is right.]

  A better plan? -- take the cover off, fire the machine up, it's only
  using one disk, right? -- so I feel which one is doing something,
  and take out the other.
  [Update: no, useless idea -- there's too much noise/vibration to be
  able to distinguish a doing-something disk from an idle one.]
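
  Another way to tell two identical disks apart is the serial number
  printed on the drive label.  The sketch below pulls the "Serial
  Number:" line out of `smartctl -i` or `hdparm -I` output.  It
  assumes one of those tools can actually talk to the drive on your
  kernel (not a given with SATA on older kernels), and serial_of is
  a made-up helper name.

```shell
# Extract the serial number from `smartctl -i` or `hdparm -I` output
# supplied on stdin.  (serial_of is a hypothetical helper name.)
serial_of() {
    sed -n 's/^[[:space:]]*Serial Number:[[:space:]]*//p' | head -n 1
}

# Intended use, as root, while the machine is still up (not run here):
#   smartctl -i /dev/sdb | serial_of     # serial of the disk to KEEP
#   hdparm -I /dev/sdb | serial_of       # alternative, same idea
```

  Write the surviving disk's serial down before opening the case;
  the disk whose label does not match is the one to pull.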

* Yank old, put in new disk.

  (*Mark* it as '/dev/sda' or whatever, for future reference!)

* Will the machine's GRUBness work when the disk is yanked and replaced?

  I just replaced what was /dev/sda with a fresh disk.  Has GRUB
  got hidden secrets, e.g. in the MBR, such that it won't boot without
  the expected first disk?

  Yes, this turned out to be exactly the case.  Put in the new
  disk, and it won't boot at all (because the first GRUB piece is in
  the MBR of the disk I just yanked).  So I booted with the rescue CD
  instead; stay tuned.

* Getting the right partitioning on the replacement disk.

  It would seem straightforward:

  # sfdisk -d /dev/sdb > /tmp/my-disk-partitioning
  (edit /tmp/my-disk-partitioning, replacing 'sdb' with 'sda')
  # sfdisk /dev/sda < /tmp/my-disk-partitioning

  FAILED: sfdisk off the rescue CD crashed hopelessly.  I had to do
  the partitioning "by hand" with the ever-reliable 'fdisk'.
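
  For the record, the hand-editing step can be done with sed.  This
  is only a sketch of the intended procedure, assuming a working
  sfdisk (which the rescue CD did not provide); retarget_dump and
  the /tmp file names are invented for illustration.

```shell
# Rewrite the device names in an `sfdisk -d` dump read from stdin,
# e.g. turn every /dev/sdb into /dev/sda.
# (retarget_dump is a hypothetical helper name.)
retarget_dump() {
    sed "s|/dev/$1|/dev/$2|g"
}

# Intended use, as root (not run here; this overwrites the partition
# table of the target disk, so double-check the device names):
#   sfdisk -d /dev/sdb > /tmp/sdb.dump
#   retarget_dump sdb sda < /tmp/sdb.dump > /tmp/sda.dump
#   sfdisk /dev/sda < /tmp/sda.dump
```

  Keeping the original dump around also gives you a written record
  of the layout if you end up in fdisk after all.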

* Sorting out GRUB (or "being able to boot the machine")

  BIG SURPRISE: the rescue CD DOES NOT INCLUDE 'grub' or
  'grub-install'!

  What I _think_ I should've done is:

  1.  Before I did anything (i.e. with the old faltering /dev/sda), I
      should've done...

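      dd if=/dev/sda of=/dev/sdb bs=446 count=1

      ... to get GRUB-stage-0 into the MBR of the disk I'm about to
      keep.  (bs=446 copies only the boot code; a bare count=1 would
      copy the whole 512-byte sector, partition table included.)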

  2.  Don't I also need to mangle /boot/grub/grub.conf...
  
      - to get rid of 'hiddenmenu' (so that I will be shown the choices)

      - add at least one item to the grub menu so that grub will look
        on the second disk (hd1)?

      (This is the critical step, and I'm not sure about it.)

  3.  While rebooting (now with the new /dev/sda), tell the BIOS to
      boot from the second disk

  4.  Upon successful boot from 2nd disk, do whatever I'm going to do... (below)

  5.  If I wish, reverse the process:

      # copy grub-stage-0 onto MBR of new disk (boot code only, so
      # the partition table I just made by hand is left alone):
      dd if=/dev/sdb of=/dev/sda bs=446 count=1

      Tell BIOS to go back to booting from first disk.

  Note 1: the above is ***UNTESTED***

  Note 2: this couldn't have worked if the old /dev/sda had been
  *DEAD* rather than just "failing".  I'm not at all sure how I
  would've gotten grub-stage-0 onto the still-alive disk (/dev/sdb) at
  all, were I unwise enough to do a 'shutdown -h'.
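
  An alternative to the dd approach, also UNTESTED here: have GRUB
  itself install onto the surviving disk.  The sketch below only
  generates the commands for the (legacy) grub shell.  It assumes
  /boot is the first partition on /dev/sdb and that a 'grub' binary
  is at hand, which means running it from the installed system, not
  from the rescue CD.  grub_mirror_script is a made-up helper name.

```shell
# Emit legacy-GRUB shell commands that install the boot loader on the
# given disk, treating that disk as (hd0), i.e. as the disk the BIOS
# will boot from once the other one is gone.
# (grub_mirror_script is a hypothetical helper name.)
grub_mirror_script() {
    disk=$1       # e.g. /dev/sdb
    bootpart=$2   # GRUB index of the /boot partition, counting from 0
    printf 'device (hd0) %s\n' "$disk"
    printf 'root (hd0,%s)\n' "$bootpart"
    printf 'setup (hd0)\n'
    printf 'quit\n'
}

# Intended use, as root on the running system (not run here):
#   grub_mirror_script /dev/sdb 0 | grub --batch
```

  Because GRUB reads its stage files from /boot on /dev/sdb, this
  should not depend on the old /dev/sda still being readable.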

* A nice thing about this software RAID stuff...

  As far as I could tell, you can use a degraded partition
  (half-a-RAID-1) as a normal ext3 partition.  So, for example,
  you can do: mkdir /mnt/foo ; mount /dev/sdb2 /mnt/foo

  However, this may just be playing with fire (see next item).

* A _big mistake_ that I made! --

  At some point, with my new disk (/dev/sda) in and booted off the
  rescue CD, I did something like...

  dd if=/dev/sdb2 of=/dev/sda2

  ... i.e. brute-force copy all of a raid-1 partition onto its
  presently-empty cousin.  Theory: "/dev/sda is empty and not in play;
  what harm can it do?"

  Answer (I think): Lots.  The RAID software snoops around on these
  (type 'fd') partitions and silently decides what to make of the
  situation.  This is really not what you want in this delicate
  state.  Information is OK ("I've spotted a degraded array on
  /dev/sdb2, which seems odd"), and doing nothing is OK, but "being
  helpful" isn't.  (Is there a kernel boot parameter to turn off RAID
  cleverness?)

* Where I ended up: by mounting my degraded /dev/sdb2, I was able to
  get access to a copy of GRUB, and run it.  What I ended up with: a
  /boot/grub/ that had all the grub stuff in it, *but no kernel* (not
  really sure why).  I told the BIOS to boot from 2nd disk, and it
  duly dropped me into a GRUB prompt.
  
  As I was no longer sure I even had a copy of the kernel on my disk,
  I decided it was time for an FC4 install/upgrade :-)

* A happy ending: I had originally done an old-fashioned
  several-partitions install (/boot, /, /var, /home).  As I was forced
  into an install, I was going to "lose" /boot and /.  Happily,
  however, the whole process stayed well away from the /home partition
  -- what I was most keen to preserve in the first place.

* Anyway, after the install, all that was left was to tell the RAID
  array for /home of its new friend:

  # mdadm /dev/md3 --add /dev/sda6

  Machine upgraded, no data lost, a little time wasted.
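
  After an --add the mirror resyncs in the background, and the array
  stays degraded until that finishes.  A small sketch for checking:
  it lists arrays whose status field in /proc/mdstat-style text still
  shows a missing member (an underscore, as in "[_U]").  The name
  degraded_arrays is invented for illustration.

```shell
# Print the name of every md array whose member-status field, e.g.
# "[_U]", still contains an underscore (a missing or rebuilding member).
# Reads /proc/mdstat-style text from stdin.
# (degraded_arrays is a hypothetical helper name.)
degraded_arrays() {
    awk '$2 == ":" && $1 ~ /^md/ { name = $1 }
         /blocks/ && $NF ~ /^\[[U_]*_[U_]*\]$/ { print name }'
}

# Intended use (not run here): wait until every mirror is whole again.
#   while [ -n "$(degraded_arrays < /proc/mdstat)" ]; do
#       sleep 60
#   done
```

  Plain "cat /proc/mdstat" shows the same information, plus a
  progress bar while the resync is running.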

Again, I would advise more attention to this topic
(raid-you-set-up-a-year-or-two-ago-and-forgot-about => disk fails =>
getting things back painlessly).  With disks as cheap as they are,
*everyone* should consider mirroring.  But as things stand, I'd expect
a high failure rate even among the technically-inclined.

Will