Repairing a Faulty Disk in a Software RAID Array

I was doing some system maintenance today and came across the following horrific screen:

/dev/md0:
        Version : 00.90.03
  Creation Time : Sun Nov 16 14:13:20 2007
     Raid Level : raid5
     Array Size : 732587712 (698.65 GiB 750.17 GB)
  Used Dev Size : 244195904 (232.88 GiB 250.06 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0  
    Persistence : Superblock is persistent

    Update Time : Wed Dec 31 10:41:15 2008
          State : clean, degraded
 Active Devices : 3
Working Devices : 3  
 Failed Devices : 1
  Spare Devices : 0

...

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      active sync   /dev/sdd1
       4       8        1        -      faulty spare

One of the drives in my fileserver had died! Time to back up and get that sucker running again.

Please note: The following is only a guide to help you replace a failed disk. I cannot guarantee it will work for you, but it is what I do and it has worked every time without any data loss.

As you can see, it is a 4-disk software RAID 5 array with no hot-swap spares. The following should work for most single-disk failure situations in RAID 1, 5, or 6.

It appears that sda has bailed on me. First things first: back up the machine. If anything goes wrong, you can rebuild from scratch.
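How you back up is entirely up to you; as a rough sketch, assuming you have a spare backup drive mounted at /mnt/backup (adjust the path and excludes to your own setup), an rsync along these lines will copy the filesystem across:

rsync -aAX --exclude=/proc --exclude=/sys --exclude=/dev --exclude=/mnt / /mnt/backup/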

You can see the faulty disk has already been removed from the array, but if yours hasn't been removed yet, the commands:

mdadm --manage /dev/md0 -f /dev/sda1  
mdadm --manage /dev/md0 -r /dev/sda1  

will mark it as failed (so it can be removed) and then remove the sda1 partition from the array.
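Before powering down, it can also help to note the serial number of the dying drive so you can pick out the right one physically. Assuming the drive still responds, either of these will print it (the second needs smartmontools installed):

hdparm -I /dev/sda | grep -i serial
smartctl -i /dev/sda | grep -i serial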

Shut down the machine and swap out the hard drives. Make sure you only replace the faulty drive, and don't mess up the order of the drives, because it'll be a pain to get the array back together if you do.

Boot up the machine. Your RAID array will be in the same degraded state. We need to partition the new drive exactly the same way we partitioned the drives in the existing array. Luckily this is a one-liner with sfdisk:

sfdisk -d /dev/sdb | sfdisk /dev/sda  

The above command dumps the partition table of sdb (you can use any of the functioning drives) and pipes it to sfdisk to partition sda the same way. It should only take a second.
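If you want to double-check the result before touching the array, list both partition tables and compare them by eye:

sfdisk -l /dev/sda
sfdisk -l /dev/sdb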

Then we can simply add the new drive to the array:

mdadm --manage /dev/md0 -a /dev/sda1  

If you take a look at cat /proc/mdstat or mdadm --detail /dev/md0, you should see that the array is recovering (with a percentage done).
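If you'd rather watch the rebuild tick along, watch will refresh the status for you every few seconds:

watch -n 5 cat /proc/mdstat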

Once the recovery is done, the array will be back to a clean state, good as new!

Good luck!