14-04-2014 | Remy van Elst
Last week Nagios alerted me about a broken disk in one of my client's testing servers. There is a best effort SLA on the thing, and there were spare drives of the same type and size in the datacenter. Lucky me. This particular data center is within biking distance, so I enjoyed a sunny ride there.
Simply put, I needed to replace the disk and rebuild the RAID 1 array. This server is a simple Ubuntu 12.04 LTS server with two disks running in RAID 1, no spare. The client has a tight budget, and since the server is not in production and only has a best effort SLA, that is fine with me. Consultant tip: make sure you have those things signed.
cat /proc/mdstat tells me the second disk (/dev/sdb) has failed:
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sda1 sdb1
      129596288 blocks [2/1] [U_]
U means up, _ means down [source]
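Reading the status string by eye works for one array; for a quick check across several, something like the following might help. It is only a sketch: the function name degraded_arrays is made up for this example, and it assumes the usual /proc/mdstat layout where the [U_] string follows the "mdX :" line.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every md array whose status
# string (for example [U_]) contains a "_", meaning a member is down.
# Reads /proc/mdstat by default; pass a file name to check saved output.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1; next }      # remember the current array
        name != "" {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^\[[U_]+\]$/) {      # status field such as [UU] or [U_]
                    if ($i ~ /_/) print name
                    name = ""                  # done with this array
                }
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling degraded_arrays without arguments checks the live system; an empty result means no array reports a missing member.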
First we remove the disk from the RAID array:
mdadm --manage /dev/md0 --remove /dev/sdb1
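mdadm only removes members that are already marked as failed (or are spares). In my case the kernel had already kicked the disk out, but if the member is still listed as active it has to be failed first. A minimal sketch of both steps, assuming the array and member names from this post; run_cmd, fail_and_remove and the DRY_RUN variable are names I made up for the example:

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run_cmd() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Mark a member as failed, then remove it from the array.
# Assumes the member is still listed as active in the array.
fail_and_remove() {
    array=$1
    member=$2
    run_cmd mdadm --manage "$array" --fail "$member" &&
    run_cmd mdadm --manage "$array" --remove "$member"
}
```

With DRY_RUN=1 set, fail_and_remove /dev/md0 /dev/sdb1 only prints the two mdadm commands, which is a cheap way to double check the device names before touching the array.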
Make sure the server can boot from a degraded RAID array:
grep BOOT_DEGRADED /etc/initramfs-tools/conf.d/mdadm
If it says true, continue on. If not, add or change it and rebuild the initramfs using the following command:
update-initramfs -u
(Thank you Karssen)
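The check and the change can be combined in a small helper. This is a sketch under assumptions: the function names are hypothetical, the file is expected to hold a plain BOOT_DEGRADED=true or BOOT_DEGRADED=false line, and sed -i is the GNU variant shipped with Ubuntu.

```shell
#!/bin/sh
# Succeeds when the config file enables booting from a degraded array.
boot_degraded_enabled() {
    grep -Eq '^[[:space:]]*BOOT_DEGRADED="?true"?[[:space:]]*$' "$1" 2>/dev/null
}

# Make sure BOOT_DEGRADED=true is set. Prints "unchanged" or "updated".
# When it prints "updated", the initramfs still has to be rebuilt
# (on Ubuntu: update-initramfs -u); this function does not do that.
ensure_boot_degraded() {
    conf=${1:-/etc/initramfs-tools/conf.d/mdadm}
    if boot_degraded_enabled "$conf"; then
        echo "unchanged"
        return 0
    fi
    if grep -q '^[[:space:]]*BOOT_DEGRADED=' "$conf" 2>/dev/null; then
        # replace the existing setting in place
        sed -i 's/^[[:space:]]*BOOT_DEGRADED=.*/BOOT_DEGRADED=true/' "$conf" || return 1
    else
        # no setting yet: append one
        echo 'BOOT_DEGRADED=true' >> "$conf" || return 1
    fi
    echo "updated"
}
```

Keeping the initramfs rebuild out of the function is deliberate here, so the file can be inspected before the boot image is regenerated.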
We can now safely shut down the server:
shutdown -h 10
Replacing the disk was an issue in itself: it is a Supermicro 512L-260B chassis where the disks are not in a drive bay; rather, they are screwed in from the bottom. Therefore the whole server needs to be removed from the rack (no rails...) to replace a disk.
Normally I would replace a disk while the server is on, but this server has no hot-swap disks, so that would be kind of an issue in a full rack.
After that, boot the server from the first disk (via the BIOS/UEFI). Make sure you boot to recovery mode. Select the root shell and remount the root filesystem read/write:
mount -o remount,rw /
Now copy the partition table to the new (in my case, empty) disk:
sfdisk -d /dev/sda | sfdisk /dev/sdb
This will erase data on the new disk.
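Since this step overwrites whatever is on the target, it does not hurt to keep the dumped layout in a file and to guard against mixing up the two disks. A sketch, assuming MBR partition tables (the sfdisk on Ubuntu 12.04 does not handle GPT); the function name and the default backup path are my own choices for the example:

```shell
#!/bin/sh
# Copy the partition table from one disk to another via a backup file.
# Usage: clone_partition_table SOURCE TARGET [BACKUP_FILE]
clone_partition_table() {
    src=$1
    dst=$2
    backup=${3:-/root/$(basename "$src").sfdisk}

    # refuse the obvious mistake of writing a disk onto itself
    if [ "$src" = "$dst" ]; then
        echo "source and target are the same disk" >&2
        return 1
    fi

    sfdisk -d "$src" > "$backup" || return 1   # save the layout of the good disk
    sfdisk "$dst" < "$backup"                  # write that layout to the new disk
}
```

For the server in this post the call would be clone_partition_table /dev/sda /dev/sdb, and the saved file doubles as a record of the layout should the remaining disk die later.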
Add the disk to the RAID array and wait for the rebuilding to be complete:
mdadm --manage /dev/md0 --add /dev/sdb1
This is a nice progress command:
watch cat /proc/mdstat
It will take a while on large disks:
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sda1 sdb1
      129596288 blocks [2/1] [U_]
      [=>...................]  recovery = 2.6% (343392/129596288) finish=67min speed=98840K/sec

unused devices: <none>
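If you would rather not stare at watch, the progress can also be pulled out of /proc/mdstat and polled from a script. A rough sketch, with hypothetical function names and assuming the recovery line looks like the one above:

```shell
#!/bin/sh
# Print the recovery/resync percentage of an array (for example "2.6%"),
# or nothing when no rebuild is running.
# Usage: rebuild_progress ARRAY [MDSTAT_FILE]
rebuild_progress() {
    awk -v name="$1" '
        /^md[0-9]+ :/ { current = $1 }             # track which array we are in
        current == name && /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { print $i; exit }  # first field ending in %
        }
    ' "${2:-/proc/mdstat}"
}

# Poll once a minute until the array no longer reports a rebuild.
wait_for_rebuild() {
    while p=$(rebuild_progress "$1") && [ -n "$p" ]; do
        echo "$1: $p"
        sleep 60
    done
    echo "$1: no rebuild in progress"
}
```

wait_for_rebuild md0 prints a line per minute and returns once the recovery line is gone. mdadm itself also has a --wait option that blocks until the resync or recovery finishes, which is simpler when no progress output is needed.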