Understanding and Resolving ZFS Disk Failure
Introductory Concepts
This document is written for administrators who are familiar with computing hardware platforms and storage concepts such as RAID. If you're already versed in the general failure process, you can skip ahead to the sections on how to replace a drive and repairing the pool.
Degrees of verbosity
When a drive fails or has errors, a great deal of logging data is available on SmartOS, and we can drill down into it to help find the underlying cause of disk failure. The following commands present the cause of a disk failure in increasing verbosity:
* `zpool status`
* `iostat -en`
* `iostat -En`
* `fmadm faulty`
* `fmdump -et {n}days`
* `fmdump -eVt {n}days`
The `zpool status` command presents a high-level view of pool health.
`iostat` presents high-level error counts and specifics about the devices in question.
`fmadm faulty` tells us more specifically which event led to the disk failure. (`fmadm` can also be used to clear transitory faults; this, however, is outside the scope of this document. Refer to the fmadm man page for more information.)
`fmdump` is more specific still, presenting a log of the last {n} days of fault events. This information is often extraneous to replacing faulted disks, but if the problem is more complex than a simple single-disk failure, it is extremely useful in isolating a root cause.
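For example, to walk down that list on a node with a suspect disk (a sketch only; the device name c1t3d0 and the 7-day window are placeholders, so substitute values relevant to your system):
zpool status -x
iostat -En c1t3d0
fmadm faulty
fmdump -eVt 7days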
General failure process
ZFS is not the first component in the system to be aware of a disk failure. When a disk fails, becomes unavailable, or develops a functional problem, this general order of events occurs:
- A failed disk is detected and logged by FMA.
- The disk is removed by the operating system.
- ZFS sees the changed state and responds by faulting the device.
ZFS device (and virtual device) states
The overall health of a pool, as reported by `zpool status`, is determined by the aggregate state of all devices within the pool. Here are some definitions to help with clarity throughout this document.
ONLINE
All devices can (and should while operating optimally) be in the ONLINE state. This includes the pool, top-level VDEVs (parity groups of type mirror, raidz{1,2,3}) and the drives themselves. Transitory errors may still occur without the drive changing state.
OFFLINE
Only bottom-level devices (drives) can be OFFLINE. This is a manual administrative state, and healthy drives can be brought back online and returned to active service in the pool.
UNAVAIL
The device (or VDEV) in question cannot be opened. If a top-level VDEV is UNAVAIL, the pool will not be accessible or able to be imported. UNAVAIL devices may also report as FAULTED in some scenarios. Operationally, UNAVAIL disks are roughly equivalent to FAULTED disks.
DEGRADED
A fault in a device has occurred, impacting all VDEVs above it. The pool is still operable, but redundancy may have been lost in a VDEV.
REMOVED
The device was physically removed while the system was running. Device removal detection is hardware-dependent and might not be supported on all platforms.
FAULTED
All components of the pool (the pool itself, redundancy VDEVs, and the drives) can be in a FAULTED state. A FAULTED component is completely inaccessible. The severity of a FAULTED device depends largely on which device it is.
INUSE
This is a status reserved for spares which have been used to replace a faulted drive.
Degree of failure
Due to its integrated volume management characteristics, failures at different levels within ZFS impact the system and overall pool health to different degrees.
The pool itself
This is the worst possible scenario, typically resulting from the loss of more drives from a redundancy group than the group was designed to withstand. The pool itself has no concept of redundancy, instead relying on integrity to be maintained within the individual RAIDZ or mirror VDEVs. For instance, losing 2 disks out of a RAIDZ would result in both the VDEV and the pool (top-level VDEV) becoming FAULTED.
It should be noted that, should this scenario occur, it may still be possible to bring the pool ONLINE in rare cases, such as those brought on by a controller failure where a large swath of disks are FAULTED as a secondary cause. Under most scenarios, however, a FAULTED pool is unrecoverable and its data will need to be restored from backup.
Redundancy groups
If a redundancy group loses more disks than it has redundancy for (2 out of 2 disks in a mirror, 3 disks from a RAIDZ-2, or 2 from a RAIDZ), that VDEV will become FAULTED. A FAULTED state at the VDEV level will result in a FAULTED pool: each redundancy group should be thought of as your top-level protection against data loss, with the pool itself serving to stripe data across the redundancy groups.
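As an illustration of this layering, a pool striped across two RAIDZ-2 redundancy groups could be created as follows (a hypothetical sketch; the pool name and disk names are placeholders). Each group can lose up to two disks before it, and therefore the pool, becomes FAULTED:
zpool create tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 raidz2 c2t6d0 c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0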
Individual drive failure
An individual drive becoming FAULTED is not problematic for the pool or its redundancy group, as long as fewer disks fail than the group has redundancy for. For instance, 2 disks in a RAIDZ-2 VDEV can fail without the fault cascading upwards.
How to replace a drive
High-level overview of drive replacement
At a high level, replacing a specific faulted drive takes the following steps:
- Identify the FAULTED or UNAVAILABLE drive
- `zpool replace` the drive in question
- Wait for the resilver to finish
- `zpool offline` the faulted drive
- `zpool remove` the offlined drive
- Perform any necessary cleanup
These steps can vary somewhat depending on specific redundancy level and hardware configuration.
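A rough sketch of that sequence, using the device names from the example that follows (substitute your own pool and device names):
zpool status zones
zpool replace zones c1t3d0 c1t16d0
zpool status zones        # wait for the resilver to finish
zpool offline zones c1t3d0
zpool remove zones c1t3d0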
In-detail steps for drive replacement
Let's start with an example scenario involving multiple faulted and degraded drives:
[root@headnode (dc-example-1) ~]# zpool status
pool: zones
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: resilvered 7.64G in 0h6m with 0 errors on Fri May 26 10:45:56 2017
config:
NAME           STATE     READ WRITE CKSUM
zones          DEGRADED     0     0     0
  mirror-0     ONLINE       0     0     0
    c1t0d0     ONLINE       0     0     0
    c1t1d0     ONLINE       0     0     0
  mirror-1     DEGRADED     0     0     0
    c1t2d0     ONLINE       0     0     0
    c1t3d0     FAULTED      0     0     0  external device fault
  mirror-2     ONLINE       0     0     0
    c1t4d0     ONLINE       0     0     0
    c1t5d0     ONLINE       0     0     0
  mirror-3     DEGRADED     0     0     0
    1173487    UNAVAIL      0     0     0  was /dev/dsk/c1t16d0
    c1t6d0     ONLINE       0     0     0
  mirror-4     ONLINE       0     0     0
    c1t7d0     ONLINE       0     0     0
    c1t8d0     ONLINE       0     0     0
  mirror-5     DEGRADED     0     0     0
    spare-0    DEGRADED     0     0     0
      c1t10d0  REMOVED      0     0     0
      c1t11d0  ONLINE       0     0     0
    c1t9d0     FAULTED      0     0     0  external device fault
  mirror-6     ONLINE       0     0     0
    c1t12d0    ONLINE       0     0     0
    c1t13d0    ONLINE       0     0     0
logs
  c1t14d0      ONLINE       0     0     0
spares
  c1t15d0      INUSE     currently in use
  c1t16d0      ONLINE       0     0     0
errors: No known data errors
In the above example, there are two faulted devices and one that is unavailable. From an administrative perspective, these two states are functionally identical: you want to replace them with known working drives.
ZFS knows when a drive exceeds its error thresholds and will automatically take it out of the pool. This can happen for any type of failure. As an operator, all that matters is that the drive has faulted; the manufacturer can determine why it happened when you RMA it.
Identify the physical location of the FAULTED or UNAVAILABLE drive
Use `diskinfo` to get this information.
For instance, `zpool status` had shown c1t3d0 faulted:
NAME          STATE     READ WRITE CKSUM
zones         DEGRADED     0     0     0
[...]
  mirror-1    DEGRADED     0     0     0
    c1t2d0    ONLINE       0     0     0
    c1t3d0    FAULTED      0     0     0  external device fault
`diskinfo -cH` will show where c1t3d0 is located. For instance:
$ diskinfo -cH
=== Output from 00000000-0000-0000-0000-003590935999 (hostname):
<snip>
SCSI c1t4d0 HITACHI HUS723030ALS640 YHK16Z7G 2794.52 GiB ---- [0] Slot 02
SCSI c1t13d0 HITACHI HUS723030ALS640 YHJZMU7G 2794.52 GiB ---- [0] Slot 03
<snip>
SCSI c1t3d0 HITACHI HUS723030ALS640 YHK08JHG 2794.52 GiB ---- [1] Slot 05 <--- here
Blink the drive in question
Blinking the drive will require a third party tool for your storage controller(s). In most cases where LSI cards are in use, you will want sas2ircu for Solaris. Drive location on other platforms will not be covered here.
Install sas2ircu somewhere useful, and then run it similar to this:
p /opt/custom/bin/sas2ircu 0 locate 1:5 ON
The above command will light the LED on the 5th slot on the second ([1]) expander via the first (0) HBA.
Replace the drive in question
Now that the FAULTED/UNAVAIL drive has been identified for replacement, there are several different ways we can replace the drive.
Replace the drive with a spare
This is the preferred method, as it is less prone to human error. However, if the chassis does not have room for spares, it will not be possible.
zpool replace zones <bad_drive> <spare_drive>
For instance, using the example above, we would run `zpool replace zones c1t3d0 c1t16d0`, using the other available spare.
Once the drive has been replaced and you have verified the drive is resilvering (`zpool status`), offline the failed drive:
zpool offline zones c1t3d0
Even in a FAULTED state, drives must be 'offlined' prior to being removed.
To then remove the now-offline drive:
zpool remove zones c1t3d0
With this replacement approach, you also have the option to wait until the resilver completes before removing the drive. However, be cautious that this does not result in forgotten dead disks.
Continue and perform any necessary cleanup.
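Putting the spare-replacement path together (device names taken from the example above; adjust to your own pool), the full sequence is roughly:
zpool replace zones c1t3d0 c1t16d0
zpool status zones        # confirm the resilver has started
zpool offline zones c1t3d0
zpool remove zones c1t3d0
zpool status zones        # verify pool state after the resilver completes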
Replace the drive with one in the same slot (using cfgadm)
In order to replace a drive in the same slot as the faulted drive, it must be removed from the pool and unconfigured from the OS before a new disk can be inserted.
The following steps outline the general procedure:
- Offline `c1t3d0`, the disk to be replaced. You cannot unconfigure a disk that is in use.
- Use `cfgadm` to identify the disk to be unconfigured and unconfigure it (i.e. `c1t3d0`). The pool will continue to be available, though it will be degraded with the now-offline disk.
- Physically replace the disk. The Ready to Remove LED must be illuminated before you physically remove the faulted drive.
- Reconfigure `c1t3d0`.
- Bring the new `c1t3d0` online.
- Run the `zpool replace` command to replace the disk.
The following example walks through the steps to replace a disk in a ZFS storage pool with a disk in the same slot.
# zpool offline zones c1t3d0
# cfgadm | grep c1t3d0
sata1/3::dsk/c1t3d0 disk connected configured ok
# cfgadm -c unconfigure sata1/3
Unconfigure the device at: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:3
This operation will suspend activity on the SATA device
Continue (yes/no)? yes
# cfgadm | grep sata1/3
sata1/3 disk connected unconfigured ok
<Physically replace the failed disk c1t3d0>
# cfgadm -c configure sata1/3
# cfgadm | grep sata1/3
sata1/3::dsk/c1t3d0 disk connected configured ok
# zpool online zones c1t3d0
# zpool replace zones c1t3d0
# zpool status zones
pool: zones
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Tue Feb 2 13:17:32 2010
config:
NAME          STATE     READ WRITE CKSUM
zones         ONLINE       0     0     0
  mirror-0    ONLINE       0     0     0
    c1t0d0    ONLINE       0     0     0
    c1t1d0    ONLINE       0     0     0
  mirror-1    ONLINE       0     0     0
    c1t2d0    ONLINE       0     0     0
    c1t3d0    ONLINE       0     0     0
  mirror-2    ONLINE       0     0     0
    c1t4d0    ONLINE       0     0     0
    c1t5d0    ONLINE       0     0     0
errors: No known data errors
Note that the preceding zpool output might show both the new and old disks under a replacing heading. For example:
replacing       DEGRADED     0     0     0
  c1t3d0s0/o    FAULTED      0     0     0
  c1t3d0        ONLINE       0     0     0
This is normal and not cause for concern. The `replacing` status will remain until the replacement is complete.
Replacing a drive in the same slot (with devfsadm)
`devfsadm -Cv` can also be used instead of the above cfgadm commands to rebuild the device files.
This process is more straightforward than the cfgadm process above, and does not require the replacement drive to be in the same slot. The general steps are listed below, followed by a consolidated command sketch.
- Offline the faulted drive
- Physically remove the faulted drive
- Run `devfsadm -Cv` to unconfigure the old disk
- Insert the new drive
- Run `devfsadm -Cv` again to configure the new disk
- Online the disk as above
- Replace the disk (also as above)
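A consolidated sketch of the devfsadm path, reusing the c1t3d0 device name from the earlier example (substitute your own device):
zpool offline zones c1t3d0
<physically remove the faulted disk>
devfsadm -Cv
<insert the new drive>
devfsadm -Cv
zpool online zones c1t3d0
zpool replace zones c1t3d0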
Perform any necessary cleanup
Turn the drive notification LED off
When turning the drive notification light off on the chassis, be sure to use the same slot and chassis IDs as you did when enabling it.
$ p /opt/custom/bin/sas2ircu 0 locate 1:5 OFF
Spare replacement
Be certain to follow up: pull the failed drive and replace the spare in the pool, if one was used.
Validation
Use the `zpool status` command to verify that:
- The pool status is ONLINE
- There are one or more spare disks (if your environment uses them)
- The expected number of log devices is present
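A minimal validation pass might look like this (a sketch; adjust the pool name to your environment):
zpool status -x           # only pools with problems are reported in detail
zpool status zones        # review the spares and logs sections by eye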
Other Considerations
Mirrors - Special Considerations
Mirror management is different from working with RAIDZ{123} members: unlike with RAIDZ, there is no parity to be concerned with. Because of this, mirror members can be 'detached' where you would normally remove them with RAIDZ.
zpool detach zones c1t3d0
Detaching a device is only possible if there are valid replicas of the data.
Working with spares
Hot spare disks can be added with the `zpool add` command and removed with the `zpool remove` command.
zpool add zones spare <disk>
zpool remove zones <disk>
Once a spare replacement is initiated, a new "spare" VDEV is created within the configuration that will remain there until the original device is replaced. At this point, the hot spare becomes available again if another device fails.
An in-progress spare replacement can be cancelled by detaching the hot spare. This can only be done if the original faulted device has not yet been detached. If the disk it is replacing has been removed, then the hot spare assumes its place in the configuration.
zpool detach zones <disk>
Spares cannot replace log devices.
Working with ZIL logs
ZIL log devices are a special case in ZFS. They 'front' synchronous writes to the pool: sync writes land first on fast temporary storage so that storage consumers can continue immediately, and the ZIL mechanism then flushes those transactions from the log to permanent pool storage in bursts.
This in effect makes the ZIL a dangerous single point of failure for the pool in certain situations. For instance, if a single log device fails, it cuts out the middle of the data pipeline and loses any in-flight transactions: data which was acknowledged to the writer and present on the log, but not yet committed to the main pool, will be lost.
Running mirrored ZIL log devices is highly recommended and mitigates this single point of failure. Working with ZIL log mirrors is contextually identical to other VDEV mirrors: you use detach to remove a mirror member.
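For example, to add a mirrored pair of log devices and later detach one member (the device names here are hypothetical):
zpool add zones log mirror c1t20d0 c1t21d0
zpool detach zones c1t21d0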
If there is only a single ZIL log device, it is removed, not detached:
zpool remove zones c1t3d0
Please note that removal of a ZIL log is a potentially disruptive action and that it should only be done during a low I/O maintenance window.
Working with L2 ARC
L2 ARC devices can be added to the pool to provide it with secondary ARC caching for data and metadata. Whether one or multiple L2 ARC devices are added, they will be used in a 'striped' fashion. These are not mirrored devices, as the data they contain is transient.
To add a cache device:
zpool add zones cache <disk>
Replacement is much the same as with mirrors: single devices can be removed outright should they fail and new ones added.
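For example, to pull a failed cache device and add its replacement (hypothetical device names):
zpool remove zones c1t22d0
zpool add zones cache c1t23d0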
One significant caveat about L2ARC is that these devices are not free to the system: their metadata is maintained within primary ARC, which is in turn wired in system memory. Memory use considerations must be made if adding L2ARC devices.
Repairing the pool
Checksum errors
Checksum errors can occur transiently on individual disks or across multiple disks. The most likely culprits are bit rot or transient storage subsystem errors - oddities like signal loss due to solar flares and so on.
With ZFS, such errors are not of much concern, but some degree of preventative maintenance is necessary to keep them from accumulating into a failure.
From time to time you may see `zpool status` output similar to this:
NAME          STATE     READ WRITE CKSUM
zones         ONLINE       0     0     0
  mirror-0    ONLINE       0     0     0
    c1t0d0    ONLINE       0     0    23
    c1t1d0    ONLINE       0     0     0
Note the "23" in the CKSUM column.
If this number is significantly large or growing rapidly, the drive is likely in a "pre-failure" state and will fail soon; in the meantime it is (in this case) potentially compromising the redundancy of the VDEV.
Note that occasional checksum errors on individual drives are normal and expected behavior (if not optimal), as are many errors on a single drive which is about to fail. Many checksum failures across multiple drives, however, can be indicative of a significant storage subsystem problem: a damaged cable, a faulty HBA, or even power problems. If this is noticed, consider contacting Support for assistance with identification.
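Once the underlying cause has been investigated, a reasonable maintenance pass (a sketch using the example device above) is to scrub the pool and, if the scrub comes back clean, reset the error counters:
zpool scrub zones
zpool status zones        # wait for the scrub to complete and review the CKSUM column
zpool clear zones c1t0d0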
Hint: You can audit pool health across the entire datacenter from the headnode with: sdc-oneachnode -c 'zpool status -x'
Resilver
A zpool resilver is an operation that rebuilds the data and parity on a device, due to either a degraded device (for instance, a disk may temporarily disappear and need to 'catch up') or a newly replaced device. In other words, it copies data onto the degraded or new device from the remaining replicas.
Multiple resilvers can occur at the same time within multiple VDEVs.
Please note that resilvers can degrade performance on a busy pool. Plan performance projections accordingly.
Resilvers are automatic. They cannot (and should not) be interrupted, short of physical removal or failure of a device.
Scrub
Scrub examines all data in the specified pools to verify that it checksums correctly. For replicated (mirror or raidz) devices, ZFS automatically repairs any damage discovered during the scrub. The `zpool status` command reports the progress of the scrub and summarizes the results of the scrub upon completion.
To start a scrub:
zpool scrub zones
To stop a scrub:
zpool scrub -s zones
If a zpool resilver is in progress, a scrub cannot be started until the resilver completes.
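To check whether a resilver or scrub is already running before starting one:
zpool status zones | grep -E 'resilver|scrub'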
Scrub and resilver concurrency
Scrubbing and resilvering are very similar operations. The difference is that resilvering only examines data that ZFS knows to be out of date (for example, when attaching a new device to a mirror or replacing an existing device), whereas scrubbing examines all data to discover silent errors due to hardware faults or disk failure.
Because scrubbing and resilvering are I/O-intensive operations, ZFS only allows one at a time. If a scrub is already in progress, the "zpool scrub" command returns an error.
Autoreplace
By enabling ZFS `autoreplace` on a pool (a property disabled by default) you will enable your system to automatically use a spare drive to replace FAULTED/UNAVAIL drives.
It should be cautioned that there are potential drawbacks to this approach: in the event of something like misbehaving firmware or an HBA failure, multiple drives may be replaced and the replacements may then fault before the initial resilver completes, resulting in a scenario that is more difficult to recover from. Enabling autoreplace is highly inadvisable unless you have a responsive 24/7 DC operations team.
To enable:
zpool set autoreplace=on zones
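To confirm the current setting afterwards:
zpool get autoreplace zones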
Further assistance needed
If this document is unclear, incorrect, or does not appear to cover your specific scenario, please contact MNX Support.
Additional information
Please reference the associated man pages on your systems for further in-depth information:
zfs(1M), zpool(1M), cfgadm(1M), devfsadm(1M), fmadm(1M), fmd(1M), fmdump(1M)