Understanding and Resolving ZFS Disk Failure
Introductory Concepts
This document is written for administrators who are familiar with computing hardware platforms and storage concepts such as RAID. If you're already versed in the general failure process, you can skip ahead to the sections on how to replace a drive and repairing the pool.
Degrees of verbosity
When a drive fails or has errors, a great deal of logging data is available on SmartOS, and we can drill down into it to help find the underlying cause of disk failure. The following commands present the cause of a disk failure in increasing verbosity:
* `zpool status`
* `iostat -en`
* `iostat -En`
* `fmadm faulty`
* `fmdump -et {n}days`
* `fmdump -eVt {n}days`
The `zpool status` command presents a high-level view of pool health.
`iostat` presents high-level error counts and specifics about the devices in question.
`fmadm faulty` tells us more specifically which event led to the disk failure. (`fmadm` can also be used to clear transitory faults; this, however, is outside the scope of this document. Refer to the fmadm man page for more information.)
`fmdump` is more specific still, presenting a log of the last {n} days of fault events. This information is often extraneous to replacing faulted disks, but if the problem is more complex than a simple single-disk failure, it is extremely useful in isolating a root cause.
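For example, to walk down that list on a node with a suspect disk (a sketch only; the device name c1t3d0 and the 7-day window are placeholders, so substitute values relevant to your system):
zpool status -x
iostat -En c1t3d0
fmadm faulty
fmdump -eVt 7days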
General failure process
ZFS is not the first component in the system to be aware of a disk failure. When a disk fails, becomes unavailable, or develops a functional problem, this general order of events occurs:
- A failed disk is detected and logged by FMA.
- The disk is removed by the operating system.
- ZFS sees the changed state and responds by faulting the device.
ZFS device (and virtual device) states
The overall health of a pool, as reported by `zpool status`, is determined by the aggregate state of all devices within the pool. Here are some definitions to help with clarity throughout this document.
ONLINE
All devices can (and should while operating optimally) be in the ONLINE state. This includes the pool, top-level VDEVs (parity groups of type mirror, raidz{1,2,3}) and the drives themselves. Transitory errors may still occur without the drive changing state.
OFFLINE
Only bottom-level devices (drives) can be OFFLINE. This is a manual administrative state, and healthy drives can be brought back online and returned to active service in the pool.
UNAVAIL
The device (or VDEV) in question cannot be opened. If a top-level VDEV is UNAVAIL, the pool will not be accessible or able to be imported. UNAVAIL devices may also report as FAULTED in some scenarios. Operationally, UNAVAIL disks are roughly equivalent to FAULTED disks.
DEGRADED
A fault in a device has occurred, impacting all VDEVs above it. The pool is still operable, but redundancy may have been lost in a VDEV.
REMOVED
The device was physically removed while the system was running. Device removal detection is hardware-dependent and might not be supported on all platforms.
FAULTED
All components of the pool (the pool itself, redundancy VDEVs, and the drives) can be in a FAULTED state. A FAULTED component is completely inaccessible. The severity of a FAULTED device depends largely on which device it is.
INUSE
This is a status reserved for spares which have been used to replace a faulted drive.
Degree of failure
Due to its integrated volume management characteristics, failures at different levels within ZFS impact the system and overall pool health to different degrees.
The pool itself
This is the worst possible scenario, typically resulting from the loss of more drives from a redundancy group than the group was designed to withstand. The pool itself has no concept of redundancy, instead relying on integrity to be maintained within the individual RAIDZ or mirror VDEVs. For instance, losing 2 disks out of a RAIDZ would result in both the VDEV and the pool (top-level VDEV) becoming FAULTED.
It should be noted that, should this scenario occur, it may still be possible to bring the pool ONLINE in rare cases, such as those brought on by a controller failure where a large swath of disks are FAULTED as a secondary cause. Under most scenarios, however, a FAULTED pool is unrecoverable and its data will need to be restored from backup.
Redundancy groups
If a redundancy group loses more disks than it has redundancy for (2 out of 2 disks in a mirror, 3 disks from a RAIDZ-2, or 2 from a RAIDZ), that VDEV will become FAULTED. A FAULTED state at the VDEV level will result in a FAULTED pool: each redundancy group should be thought of as your top-level protection against data loss, with the pool itself serving to stripe data across the redundancy groups.
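As an illustration of this layering, a pool striped across two RAIDZ-2 redundancy groups could be created as follows (a hypothetical sketch; the pool name and disk names are placeholders). Each group can lose up to two disks before it, and therefore the pool, becomes FAULTED:
zpool create tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 raidz2 c2t6d0 c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0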
Individual drive failure
An individual drive becoming FAULTED is not problematic for the pool or its redundancy group, as long as fewer disks fail than the group has redundancy for. For instance, 2 disks in a RAIDZ-2 VDEV can fail without the fault cascading upwards.
How to replace a drive
High-level overview of drive replacement
At a high level, replacing a specific faulted drive takes the following steps:
- Identify the FAULTED or UNAVAILABLE drive
- `zpool replace` the drive in question
- Wait for the resilver to finish
- `zpool offline` the faulted drive
- `zpool remove` the offlined drive
- Perform any necessary cleanup
These steps can vary somewhat depending on specific redundancy level and hardware configuration.
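A rough sketch of that sequence, using the device names from the example that follows (substitute your own pool and device names):
zpool status zones
zpool replace zones c1t3d0 c1t16d0
zpool status zones        # wait for the resilver to finish
zpool offline zones c1t3d0
zpool remove zones c1t3d0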
In-detail steps for drive replacement
Let's start with an example scenario involving multiple faulted and degraded drives:
[root@headnode (dc-example-1) ~]# zpool status
pool: zones
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: resilvered 7.64G in 0h6m with 0 errors on Fri May 26 10:45:56 2017
config:
NAME           STATE     READ WRITE CKSUM
zones          DEGRADED     0     0     0
  mirror-0     ONLINE       0     0     0
    c1t0d0     ONLINE       0     0     0
    c1t1d0     ONLINE       0     0     0
  mirror-1     DEGRADED     0     0     0
    c1t2d0     ONLINE       0     0     0
    c1t3d0     FAULTED      0     0     0  external device fault
  mirror-2     ONLINE       0     0     0
    c1t4d0     ONLINE       0     0     0
    c1t5d0     ONLINE       0     0     0
  mirror-3     DEGRADED     0     0     0
    1173487    UNAVAIL      0     0     0  was /dev/dsk/c1t16d0
    c1t6d0     ONLINE       0     0     0
  mirror-4     ONLINE       0     0     0
    c1t7d0     ONLINE       0     0     0
    c1t8d0     ONLINE       0     0     0
  mirror-5     DEGRADED     0     0     0
    spare-0    DEGRADED     0     0     0
      c1t10d0  REMOVED      0     0     0
      c1t11d0  ONLINE       0     0     0
    c1t9d0     FAULTED      0     0     0  external device fault
  mirror-6     ONLINE       0     0     0
    c1t12d0    ONLINE       0     0     0
    c1t13d0    ONLINE       0     0     0
logs
  c1t14d0      ONLINE       0     0     0
spares
  c1t15d0      INUSE     currently in use
  c1t16d0      ONLINE       0     0     0
errors: No known data errors
In the above example, there are two faulted devices and one that is unavailable. From an administrative perspective, these two states are functionally identical: you want to replace them with known working drives.
ZFS knows when a drive exceeds its error thresholds and will automatically take it out of the pool. This can happen for any type of failure. As an operator, all that matters is that the drive has faulted; the manufacturer can determine why it happened when you RMA it.
Identify the physical location of the FAULTED or UNAVAILABLE drive
Use `diskinfo` to get this information.
For instance, `zpool status` had shown c1t3d0 faulted:
NAME          STATE     READ WRITE CKSUM
zones         DEGRADED     0     0     0
[...]
  mirror-1    DEGRADED     0     0     0
    c1t2d0    ONLINE       0     0     0
    c1t3d0    FAULTED      0     0     0  external device fault
`diskinfo -cH` will show where c1t3d0 is located. For instance:
$ diskinfo -cH
=== Output from 00000000-0000-0000-0000-003590935999 (hostname):
<snip>
SCSI c1t4d0 HITACHI HUS723030ALS640 YHK16Z7G 2794.52 GiB ---- [0] Slot 02
SCSI c1t13d0 HITACHI HUS723030ALS640 YHJZMU7G 2794.52 GiB ---- [0] Slot 03
<snip>
SCSI c1t3d0 HITACHI HUS723030ALS640 YHK08JHG 2794.52 GiB ---- [1] Slot 05 <--- here
Blink the drive in question
Blinking the drive will require a third party tool for your storage controller(s). In most cases where LSI cards are in use, you will want sas2ircu for Solaris. Drive location on other platforms will not be covered here.
Install sas2ircu somewhere useful, and then run it similar to this:
p /opt/custom/bin/sas2ircu 0 locate 1:5 ON
The above command will light the LED on the 5th slot on the second ([1]) expander via the first (0) HBA.
Replace the drive in question
Now that the FAULTED/UNAVAIL drive has been identified for replacement, there are several different ways we can replace the drive.
Replace the drive with a spare
This is the preferred method, as it is less prone to human error. However, if the chassis does not have room for spares, it will not be possible.
zpool replace zones <bad_drive> <spare_drive>
For instance, using the example above, we would run `zpool replace zones c1t3d0 c1t16d0`, using the other available spare.
Once the drive has been replaced and you have verified the drive is resilvering (`zpool status`), offline the failed drive:
zpool offline zones c1t3d0
Even in a FAULTED state, drives must be 'offlined' prior to being removed.
To then remove the now-offline drive:
zpool remove zones c1t3d0
With this replacement approach, you also have the option to wait until the resilver completes before removing the drive. However, be cautious that this does not result in forgotten dead disks.
Continue and perform any necessary cleanup.
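Putting the spare-replacement path together (device names taken from the example above; adjust to your own pool), the full sequence is roughly:
zpool replace zones c1t3d0 c1t16d0
zpool status zones        # confirm the resilver has started
zpool offline zones c1t3d0
zpool remove zones c1t3d0
zpool status zones        # verify pool state after the resilver completes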
Replace the drive with one in the same slot (using cfgadm)
In order to replace a drive in the same slot as the faulted drive, it must be removed from the pool and unconfigured from the OS before a new disk can be inserted.
The following steps outline the general procedure:
- Offline `c1t3d0`, the disk to be replaced. You cannot unconfigure a disk that is in use.
- Use `cfgadm` to identify the disk to be unconfigured and unconfigure it (i.e. `c1t3d0`). The pool will continue to be available, though it will be degraded with the now-offline disk.
- Physically replace the disk. The Ready to Remove LED must be illuminated before you physically remove the faulted drive.
- Reconfigure `c1t3d0`.
- Bring the new `c1t3d0` online.
- Run the `zpool replace` command to replace the disk.
The following example walks through the steps to replace a disk in a ZFS storage pool with a disk in the same slot.
# zpool offline zones c1t3d0
# cfgadm | grep c1t3d0
sata1/3::dsk/c1t3d0 disk connected configured ok
# cfgadm -c unconfigure sata1/3
Unconfigure the device at: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:3
This operation will suspend activity on the SATA device
Continue (yes/no)? yes
# cfgadm | grep sata1/3
sata1/3 disk connected unconfigured ok
<Physically replace the failed disk c1t3d0>
# cfgadm -c configure sata1/3
# cfgadm | grep sata1/3
sata1/3::dsk/c1t3d0 disk connected configured ok
# zpool online zones c1t3d0
# zpool replace zones c1t3d0
# zpool status zones
pool: zones
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Tue Feb 2 13:17:32 2010
config:
NAME          STATE     READ WRITE CKSUM
zones         ONLINE       0     0     0
  mirror-0    ONLINE       0     0     0
    c1t0d0    ONLINE       0     0     0
    c1t1d0    ONLINE       0     0     0
  mirror-1    ONLINE       0     0     0
    c1t2d0    ONLINE       0     0     0
    c1t3d0    ONLINE       0     0     0
  mirror-2    ONLINE       0     0     0
    c1t4d0    ONLINE       0     0     0
    c1t5d0    ONLINE       0     0     0
errors: No known data errors
Note that the preceding zpool output might show both the new and old disks under a replacing heading. For example:
replacing       DEGRADED     0     0     0
  c1t3d0s0/o    FAULTED      0     0     0
  c1t3d0        ONLINE       0     0     0
This is normal and not cause for concern. The `replacing` status will remain until the replacement is complete.
Replacing a drive in the same slot (with devfsadm)
`devfsadm -Cv` can also be used instead of the above cfgadm commands to rebuild the device files.
This process is more straightforward than the cfgadm process above, and does not require the replacement drive to be in the same slot. The general steps are listed below, followed by a consolidated command sketch.
- Offline the faulted drive
- Physically remove the faulted drive
- Run `devfsadm -Cv` to unconfigure the old disk
- Insert the new drive
- Run `devfsadm -Cv` again to configure the new disk
- Online the disk as above
- Replace the disk (also as above)
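A consolidated sketch of the devfsadm path, reusing the c1t3d0 device name from the earlier example (substitute your own device):
zpool offline zones c1t3d0
<physically remove the faulted disk>
devfsadm -Cv
<insert the new drive>
devfsadm -Cv
zpool online zones c1t3d0
zpool replace zones c1t3d0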
Perform any necessary cleanup
Turn the drive notification LED off
When turning the drive notification light off on the chassis, be sure to use the same slot and chassis IDs as you did when enabling it.
$ p /opt/custom/bin/sas2ircu 0 locate 1:5 OFF
Spare replacement
Be certain to follow up: pull the failed drive and replace the spare in the pool, if one was used.
Validation
Use the `zpool status` command to verify that:
- The pool status is ONLINE
- There are one or more spare disks (if your environment uses them)
- The expected number of log devices is present
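A minimal validation pass might look like this (a sketch; adjust the pool name to your environment):
zpool status -x           # only pools with problems are reported in detail
zpool status zones        # review the spares and logs sections by eye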
Other Considerations
Mirrors - Special Considerations
Mirror management is different from working with RAIDZ{123} members: unlike with RAIDZ, there is no parity to be concerned with. Because of this, mirror members can be 'detached' where you would normally remove them with RAIDZ.
zpool detach zones c1t3d0
Detaching a device is only possible if there are valid replicas of the data.
Working with spares
Hot spare disks can be added with the `zpool add` command and removed with the `zpool remove` command.
zpool add zones spare <disk>
zpool remove zones <disk>
Once a spare replacement is initiated, a new "spare" VDEV is created within the configuration that will remain there until the original device is replaced. At this point, the hot spare becomes available again if another device fails.
An in-progress spare replacement can be cancelled by detaching the hot spare. This can only be done if the original faulted device has not yet been detached. If the disk it is replacing has been removed, then the hot spare assumes its place in the configuration.
zpool detach zones <disk>
Spares cannot replace log devices.
Working with ZIL logs
ZIL log devices are a special case in ZFS. They 'front' synchronous writes to the pool: sync writes land first on fast temporary storage so that storage consumers can continue immediately, and the ZIL mechanism then flushes those transactions from the log to permanent pool storage in bursts.
This in effect makes the ZIL a dangerous single point of failure for the pool in certain situations. For instance, if a single log device fails, it cuts out the middle of the data pipeline and loses any in-flight transactions: data which was acknowledged to the writer and present on the log, but not yet committed to the main pool, will be lost.
Running mirrored ZIL log devices is highly recommended and mitigates this single point of failure. Working with ZIL log mirrors is contextually identical to other VDEV mirrors: you use detach to remove a mirror member.
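For example, to add a mirrored pair of log devices and later detach one member (the device names here are hypothetical):
zpool add zones log mirror c1t20d0 c1t21d0
zpool detach zones c1t21d0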
If there is only a single ZIL log device, it is removed, not detached:
zpool remove zones c1t3d0
Please note that removal of a ZIL log is a potentially disruptive action and that it should only be done during a low I/O maintenance window.
Working with L2 ARC
L2 ARC devices can be added to the pool to provide it with secondary ARC caching for data and metadata. Whether one or multiple L2 ARC devices are added, they will be used in a 'striped' fashion. These are not mirrored devices, as the data they contain is transient.
To add a cache device:
zpool add zones cache <disk>
Replacement is much the same as with mirrors: single devices can be removed outright should they fail and new ones added.
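For example, to pull a failed cache device and add its replacement (hypothetical device names):
zpool remove zones c1t22d0
zpool add zones cache c1t23d0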
One significant caveat about L2ARC is that these devices are not free to the system: their metadata is maintained within primary ARC, which is in turn wired in system memory. Memory use considerations must be made if adding L2ARC devices.
Repairing the pool
Checksum errors
Checksum errors can occur transiently on individual disks or across multiple disks. The most likely culprits are bit rot or transient storage subsystem errors - oddities like signal loss due to solar flares and so on.
With ZFS, such errors are not of much concern, but some degree of preventative maintenance is necessary to keep them from accumulating into a failure.
From time to time you may see `zpool status` output similar to this:
NAME          STATE     READ WRITE CKSUM
zones         ONLINE       0     0     0
  mirror-0    ONLINE       0     0     0
    c1t0d0    ONLINE       0     0    23
    c1t1d0    ONLINE       0     0     0
Note the "23" in the CKSUM column.
If this number is significantly large or growing rapidly, the drive is likely in a "pre-failure" state and will fail soon; in the meantime it is (in this case) potentially compromising the redundancy of the VDEV.
Note that occasional checksum errors on individual drives are normal and expected behavior (if not optimal), as are many errors on a single drive which is about to fail. Many checksum failures across multiple drives, however, can be indicative of a significant storage subsystem problem: a damaged cable, a faulty HBA, or even power problems. If this is noticed, consider contacting Support for assistance with identification.
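Once the underlying cause has been investigated, a reasonable maintenance pass (a sketch using the example device above) is to scrub the pool and, if the scrub comes back clean, reset the error counters:
zpool scrub zones
zpool status zones        # wait for the scrub to complete and review the CKSUM column
zpool clear zones c1t0d0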
Hint: You can audit pool health across the entire datacenter from the headnode with: sdc-oneachnode -c 'zpool status -x'
Resilver
A zpool resilver is an operation that rebuilds the data and parity on a device, due to either a degraded device (for instance, a disk may temporarily disappear and need to 'catch up') or a newly replaced device. In other words, it copies data onto the degraded or new device from the remaining replicas.
Multiple resilvers can occur at the same time within multiple VDEVs.
Please note that resilvers can degrade performance on a busy pool. Plan performance projections accordingly.
Resilvers are automatic. They cannot (and should not) be interrupted, short of physical removal or failure of a device.
Scrub
Scrub examines all data in the specified pools to verify that it checksums correctly. For replicated (mirror or raidz) devices, ZFS automatically repairs any damage discovered during the scrub. The `zpool status` command reports the progress of the scrub and summarizes the results of the scrub upon completion.
To start a scrub:
zpool scrub zones
To stop a scrub:
zpool scrub -s zones
If a zpool resilver is in progress, a scrub cannot be started until the resilver completes.
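To check whether a resilver or scrub is already running before starting one:
zpool status zones | grep -E 'resilver|scrub'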
Scrub and resilver concurrency
Scrubbing and resilvering are very similar operations. The difference is that resilvering only examines data that ZFS knows to be out of date (for example, when attaching a new device to a mirror or replacing an existing device), whereas scrubbing examines all data to discover silent errors due to hardware faults or disk failure.
Because scrubbing and resilvering are I/O-intensive operations, ZFS only allows one at a time. If a scrub is already in progress, the "zpool scrub" command returns an error.
Autoreplace
By enabling ZFS `autoreplace` on a pool (a property disabled by default) you will enable your system to automatically use a spare drive to replace FAULTED/UNAVAIL drives.
It should be cautioned that there are potential drawbacks to this approach: in the event of something like misbehaving firmware or an HBA failure, multiple drives may be replaced and the replacements may then fault before the initial resilver completes, resulting in a scenario that is more difficult to recover from. Enabling autoreplace is highly inadvisable unless you have a responsive 24/7 DC operations team.
To enable:
zpool set autoreplace=on zones
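To confirm the current setting afterwards:
zpool get autoreplace zones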
Further assistance needed
If this document is unclear, incorrect, or does not appear to cover your specific scenario, please contact MNX Support.
Additional information
Please reference the associated man pages on your systems for further in-depth information:
zfs(1M), zpool(1M), cfgadm(1M), devfsadm(1M), fmadm(1M), fmd(1M), fmdump(1M)