Replacing an unavailable ZFS drive

The day I knew would eventually come has arrived: a drive in my ZFS array has become degraded. It’s my first drive failure in over four years, so not a bad run.

Here’s the status report:

  pool: tank
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 8h33m with 0 errors on Mon Jul 12 09:03:55 2021
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     UNAVAIL      0     0     0
            sde     ONLINE       0     0     0

The first thing that stands out to me is that the last scrub was back in July and it’s now early November. It’s recommended to scrub an array frequently, anywhere from weekly to monthly; four months between scrubs is well outside that range.

Monitoring

I had an automated weekly scrub job. It’s worked fine for years, so what happened to it? I checked root’s cron, my cron, and systemd timers before locating the culprit: /etc/cron.d/zfsutils-linux. This cron job calls into a script that is short enough to inline here:

#!/bin/sh -eu

# Scrub all healthy pools.
zpool list -H -o health,name 2>&1 | \
        awk 'BEGIN {FS="\t"} {if ($1 ~ /^ONLINE/) print $2}' | \
while read pool
do
        zpool scrub "$pool"
done

“Scrub all healthy pools” – ugh. The script only scrubs pools reporting ONLINE, so a degraded pool silently stops getting scrubbed. Since the last scrub was back in July, that means the pool has been in a degraded state for four months. This is horrible and I feel bad for not noticing. Even the ZFS dashboard that I created and look at every day didn’t illuminate the problem. Can I fix this monitoring problem?
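
That’s easy to verify: with the pool DEGRADED, the pipeline the cron job runs has nothing to hand to zpool scrub. Something like this (output assumed, based on the status above):

zpool list -H -o health,name
DEGRADED        tank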

The ZFS dashboard is built with Grafana, Prometheus, and node_exporter. It turns out node_exporter only recently learned how to expose a ZFS pool’s state. So step 1 is to update node_exporter, and step 2 is to update the dashboard. Except it’s not that simple. I have ZFS on Linux v0.7.5 installed (which I’ll informally keep referring to as ZFS for brevity), and it does not write to the location that node_exporter expects:

/proc/spl/kstat/zfs/*/state
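
Each pool exposes its state there as a short plain-text file that node_exporter reads. For this pool it would look something like this (hypothetical output, since my install doesn’t have the file):

cat /proc/spl/kstat/zfs/tank/state
DEGRADED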

These stats were added in ZFS 0.8. I’m stuck on ZFS 0.7.5 with Ubuntu 18.04. I say “stuck” as I’m using the official zfsutils-linux system package. I could do a distro upgrade to Ubuntu 20.04 to get ZFS 0.8.x, but that may be a little more risky than I’d like. Alternatively, I could build from source but circumventing package maintainers, whose duty consists of ensuring stability and compatibility, could be even riskier. I don’t want to rehash the perennial debate of upgrade vs clean install, but sometimes it comes down to a feeling. I built this NAS and am running several applications on it, and I’m worried about instability on upgrade. Things could go wrong and I could end up spending the next few days pulling my hair out troubleshooting.
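
For reference, confirming which version is actually in play (the zfs version subcommand only appeared in 0.8, so on 0.7.x ask the kernel module or the package instead):

modinfo zfs | grep '^version:'
dpkg-query -W zfsutils-linux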

Aside: perhaps my next NAS should be a TrueNAS Mini, run purely as a storage device, with applications decoupled onto compute nodes. I’d be much more confident performing updates on vanilla installations. Though I know plenty about system administration, I’m not a system administrator, and I’m happy to delegate responsibility to others.

I did find a suitable workaround. I created a Grafana alert that fires whenever zero activity is detected for any of the disks, as the only time there should be zero activity is when a disk is unavailable. There are probably edge cases with this approach, but it gives me confidence that the next failure will be much more obvious.
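
The underlying query is roughly the following, expressed here against the Prometheus HTTP API. The metric and label names are assumptions based on recent node_exporter defaults (older releases named the disk metrics differently), so treat this as a sketch of the idea rather than a drop-in alert:

# Any pool member reporting zero I/O over the last hour is suspicious.
curl -s 'http://localhost:9090/api/v1/query' \
        --data-urlencode 'query=rate(node_disk_io_time_seconds_total{device=~"sd[a-e]"}[1h]) == 0'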

Additionally, one should have more than one way to monitor system health, so that if one fails (e.g., Grafana is down or inaccessible) the other is still available (e.g., email). For email you really can’t go wrong with smartd.
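
A minimal sketch of that, assuming smartmontools is installed and the machine can send mail; the test schedule and addressee are the parts you’d tune. This is a single line in /etc/smartd.conf:

# Watch every drive, monitor all attributes (-a), email root on trouble (-m),
# and schedule a short self-test nightly at 02:00 plus a long test on
# Saturdays at 03:00 (-s).
DEVICESCAN -a -m root -s (S/../.././02|L/../../6/03)

After editing, restart the smartd service (named smartd or smartmontools depending on the distro) for the change to take effect.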

Before addressing the disk replacement, should one commence a scrub before swapping disks? I was unable to find any literature on this, and my intuition is conflicted. Scrubbing can be an intense process and could cause additional failures, but on the other hand I can see the benefit of knowing the data is pristine before replacement. I decided to get a scrub in, and thankfully there weren’t any data or disk issues in store.
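
For reference, kicking one off and checking on it is just:

zpool scrub tank
zpool status tank    # the "scan:" line shows scrub progress and any errors found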

Identification

The first step is to identify the degraded drive. This is an elementary problem when the case has per-drive status lights that point out the troublesome one. Unfortunately, my DIY NAS case lacks this, so when installing drives I labelled them with their Linux-assigned device names. I don’t trust those labels anymore, as there’s nothing preventing the system from shuffling drive names around. A quick Google search confirms this. The Arch Wiki mentions:

[The] order in which their corresponding device nodes are added is arbitrary

Aside: while the NAS is running Ubuntu 18.04, the Arch Wiki is still applicable and is oftentimes the best source of info.

We can see that the array is built from sd{a-e}, so we’ll query each drive using hdparm to find its serial number:

for i in a b c d e; do 
    echo -n "/dev/sd$i: "
    hdparm -I /dev/sd$i | awk '/Serial Number/ {print $3}'
done

Outputs:

/dev/sda: ZGYAAAAA
/dev/sdb: WDHBBBBB
/dev/sdc: ZGYCCCCC
/dev/sdd:  HDIO_DRIVE_CMD(identify) failed: Input/output error
/dev/sde: ZGYDDDDD
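
Since sdd won’t answer an ATA identify, its serial comes from elimination against the physical labels. Another handy cross-check is /dev/disk/by-id, whose symlink names embed each disk’s model and serial (a drive too far gone to identify itself may simply be missing here, which also narrows things down):

ls -l /dev/disk/by-id/ata-* | grep -v part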

Armed with serial numbers, I can turn off the system and locate the drive. Even though the case (LIAN LI PC-Q25B) has hot-swappable bays, I want to play it safe. I don’t want to risk further degrading the system by removing a working drive in a case of mistaken identity.

Drive Replacement

Replacing the drive itself was simple.
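
A rough sketch of the sequence, assuming the pool and device names from earlier and that the new disk shows up under the same /dev/sdd name (zpool replace takes both the old and new device if it doesn’t):

# Optional, if the old disk is still attached and responding:
zpool offline tank sdd

# Power down, swap the physical drive, boot back up, then:
zpool replace tank sdd

# Resilvering starts immediately; watch progress on the "scan:" line:
zpool status tank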

The wait is the long part, but at least the pool is still usable in the meantime. You may experience a stuck progress indicator in zpool status: in my case, resilvering started with an estimate of 18 hours, but a couple of days later it indicated several more weeks were necessary. If you’re like me, searching the internet will land you on the page What to Do When Resilver Takes Very Long, where you’ll find that if the drive is reported as healthy your options are either to wait it out or rebuild the array.

I’m impatient, so I detached the replacement drive, both with zpool detach and physically, rebooted the machine, reattached it, and ran a long SMART test. Everything passed, but interestingly enough, while I was overwriting all the data on the disk with:

dd if=/dev/zero of=/dev/sdd bs=1M status=progress

there were kernel exceptions about timeouts logged in dmesg:

[130927.781521] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[130927.783397] ata4.00: failed command: SMART
[130927.784801] ata4.00: cmd b0/d1:01:00:4f:c2/00:00:00:00:00/00 tag 23 pio 512 in
                         res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[130927.785736] ata4.00: status: { DRDY }
[130927.786218] ata4: hard resetting link

A timeout error is described by the libata wiki as:

Controller failed to respond to an active ATA command. This could be any number of causes. Most often this is due to an unrelated interrupt subsystem bug (try booting with ‘pci=nomsi’ or ‘acpi=off’ or ’noapic’), which failed to deliver an interrupt when we were expecting one from the hardware.

It’s helpful to know that there are a number of possible causes for the timeout errors, but that isn’t informative about what our next steps should be.

We haven’t looked at the SMART attributes of the drives yet. Backblaze uses five SMART attributes to determine whether a drive has an increased likelihood of failure, so let’s check those on our replacement drive:

# The Backblaze five, as smartctl names them (SMART attributes 5, 187, 188, 197, and 198).
smartctl -A /dev/sdd | grep \
 -e Reallocated_Sector_Ct \
 -e Reported_Uncorrect \
 -e Command_Timeout \
 -e Current_Pending_Sector \
 -e Offline_Uncorrectable

The output showed that there had been over 300 billion Command_Timeout reports on our replacement drive. Anything above 100 makes a drive a candidate for replacement at Backblaze. Considering the drive only has a few days’ worth of powered-on hours, that works out to roughly a million timeouts a second (300 billion over about three days, or roughly 260,000 seconds).

The replacement drive must be bad. It’s a shame that it was outside its warranty and return period, so a new drive was purchased.

…and the new drive worked flawlessly. A zpool replace of /dev/sdd was all that was needed. Resilvering was done after several hours, with a follow-up scrub a day later.

Conclusion

When one encounters the happy path in drive replacement, it can be a “that’s it?” moment. Going into this process having never done it before, I was a bit apprehensive. Searching for guides on drive replacement lands you on pages with many instructions, which can be overwhelming. It turns out a zpool replace can really be the only thing that’s needed.

But not everything will go as planned, as we saw here. The replacement drive can be bad as well. I’m thankful I did enough research to be confident that the probable cause of failure was the replacement drive and not another component in the system, like a bad SATA cable, hot-swap panel, or software issue.

And if anything, hopefully this article has motivated you to check on your monitoring system so you are aware as soon as there is a drive failure.

Comments

If you'd like to leave a comment, please email [email protected]