Understanding disk space usage in instances

Modified: 08 Sep 2022 04:28 UTC

There are numerous tools that can be used to manage containers and hardware virtual machines in Triton. This page discusses those tools, as well as some of the key differences in how disk space is allocated, used, and managed between the two types of instances. Note: this page is primarily geared toward users of Linux, SmartOS, or FreeBSD operating systems. Windows users should consult their Windows documentation for troubleshooting and management tools to use inside their Windows server.

Disk usage in infrastructure containers running SmartOS or Container Native Linux

In containers, disk space is allocated from ZFS storage pools (known as zpools), which are constructed of virtual devices (vdevs). Virtual devices, in turn, are constructed of physical drives.

During provisioning of an infrastructure container running SmartOS or Container Native Linux, the disk space that gets allocated is based on the predefined quota set in the package that the instance is provisioned with.
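
For example, you can verify the allocated quota from inside a SmartOS container by checking the size reported for the root filesystem (the df command is covered in more detail below):

container# df -h /

The Size column reported for / reflects the quota defined by the package.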

Disk usage in hardware virtual machines

For hardware virtual machines, disks are allocated as ZFS volumes (zvols), and each instance comes with two mounted drives upon provisioning: an OS disk and a data disk (/dev/vdb).

The OS disk is not intended for any data storage, and is allocated the same size across the board. The /dev/vdb disk, however, is there for storing data and varies in capacity depending on the package size chosen for the instance.
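
For a quick check from inside a Linux-based hardware virtual machine, you can list the block devices to confirm which disk is which. The lsblk command is standard on most Linux images; the prompt, device names, and mount points below are illustrative and can vary by image:

hvm# lsblk
hvm# df -h

The /dev/vdb data disk typically appears with its own mount point; if it has not yet been formatted or mounted, consult the documentation for the image you provisioned.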

Monitoring and analyzing disk usage in instances

There are many tools available to monitor and analyze disk usage for instances. The following outlines a few of the more common tools used to monitor disk usage and i/o:

Working with df

The df command is used to view free disk space. This is useful for monitoring the amount of space left on disk devices and filesystems.

To view a current snapshot of disk space on an instance (in human-readable format), use the -kh options as shown below:

container# df -kh
Filesystem                                  Size  Used Avail Use% Mounted on
zones/2ccc2818-0948-e401-a957-b1aa3b5a2228   17G  427M   16G   3% /
/lib                                        263M  235M   28M  90% /lib
/lib/svc/manifest                           2.4T  761K  2.4T   1% /lib/svc/manifest
/lib/svc/manifest/site                       17G  427M   16G   3% /lib/svc/manifest/site
/sbin                                       263M  235M   28M  90% /sbin
/usr                                        417M  374M   44M  90% /usr
/usr/ccs                                     17G  427M   16G   3% /usr/ccs
/usr/local                                   17G  427M   16G   3% /usr/local
swap                                        512M   35M  478M   7% /etc/svc/volatile
/usr/lib/libc/libc_hwcap1.so.1              417M  374M   44M  90% /lib/libc.so.1
swap                                        256M  8.0K  256M   1% /tmp
swap                                        512M   35M  478M   7% /var/run

If you see / (root) or /tmp (swap) getting close to 100%, then the instance is in jeopardy of running out of disk space, which can lead to other problems (including hard hangs).

It's important to keep filesystems cleaned up by deleting any unnecessary data or moving files over to an external backup solution.
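
A common way to track down what is consuming space is to summarize directory usage with du, starting from the mount point that df reports as nearly full. The path below is only an example; walk whichever filesystem is filling up:

container# du -sk /var/* | sort -n | tail

On Linux images with GNU findutils, you can also locate individual large files directly, for example with find / -xdev -type f -size +100M.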

Working with iostat

The iostat utility reports i/o activity at a specified interval. You can use iostat to monitor disk activity and help ensure disk i/o performance stays healthy.

To view how busy disk i/o activity is, use the -xnmz options along with an interval to run (in seconds), as shown below:

container# iostat -xnmz 10
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.1    0.0  0.0  0.0    0.2    0.6   0   0 lofi1
    0.0    0.0    0.2    0.2  0.0  0.0    0.0    0.0   0   0 ramdisk1
   38.7  262.7  100.7 20106.7  0.0  0.3    0.0    0.8   0  10 c0t0d0
   38.7  262.7  100.7 20106.7 54.7  0.3  181.4    0.9   5  10 zones
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  246.3  514.9  230.9 33402.2  0.0  0.4    0.0    0.5   1  35 c0t0d0
  246.3  514.9  230.9 33402.2 33.6  0.5   44.1    0.6   3  36 zones
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  476.7    0.0 38763.1  0.0  0.5    0.0    1.0   1  18 c0t0d0
    0.0  476.7    0.0 38763.1 54.6  0.5  114.6    1.0  14  18 zones
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2  134.6    0.7 5569.8  0.0  0.0    0.0    0.2   0   1 c0t0d0
    0.2  134.6    0.7 5569.8  0.7  0.0    4.9    0.2   1   1 zones
                    extended device statistics

The above runs in 10-second iterations, providing an average of reads and writes per second, along with various other details. Please see the iostat(1M) man page for more information.

One of the key columns to focus on for monitoring disk i/o activity is the %b column, which indicates how busy the disks are.

Seeing this column spike to 100% on occasion is less concerning than seeing it at, or close to, 100% on a consistent basis. In general, if a device is constantly busy, then it's important to determine who or what is doing all of the disk i/o.
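
One quick way to attribute that activity is a DTrace one-liner that counts i/o requests by process name. This is a minimal sketch: it assumes you are running it from the global zone of the compute node (or with sufficient DTrace privileges), and it only reports processes that issue i/o while it runs:

computenode# dtrace -n 'io:::start { @[execname] = count(); }'

Let it run for a few seconds while the disks are busy, then press Ctrl-C to print a per-process count of i/o requests.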

Troubleshooting hard drive failures on compute nodes

Hard drive failures on compute nodes affect all instances that live on the server. There are a couple of key tools that can be used to help troubleshoot and identify potential problems with the underlying storage:

Verifying failures using iostat

For a quick look at potential disk errors, you can run the iostat command with the -En option as shown below:

computenode# iostat -En
c0t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: Generic  Product: STORAGE DEVICE   Revision: 9451 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c1t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: Kingston Product: DataTraveler 2.0 Revision: PMAP Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 82 Predictive Failure Analysis: 0
c2t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: WDC WD10EZEX-00K Revision: 1H15 Serial No: WD-WCC1S5975038
Size: 1000.20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 116 Predictive Failure Analysis: 0
c2t1d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: WDC WD5000AVVS-0 Revision: 1B01 Serial No: WD-WCASU7437291
Size: 500.11GB <500107862016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 121 Predictive Failure Analysis: 0
c2t2d0           Soft Errors: 0 Hard Errors: 26 Transport Errors: 0
Vendor: HL-DT-ST Product: DVDRAM GH24NS95  Revision: RN01 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 26 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0

If you see any errors being reported (soft, hard, or transport), then it's definitely worth investigating further for potential hard drive failures or other possible storage-related issues.

Working with the fault management configuration tool fmadm

The fmadm utility is used to administer and service problems detected by the Solaris Fault Manager, fmd(1M). If a component has been diagnosed as faulty, fmadm reports which component has failed and what response was taken for the failed device.

To view a list of failed components, run fmadm with the faulty option:

computenode# fmadm faulty
 -------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
 -------------- ------------------------------------  -------------- ---------
May 02 20:00:34 abe52661-52aa-ec45-983e-f019e465db53  ZFS-8000-FD    Major

Host        : headnode
Platform    : MS-7850   Chassis_id  : To-be-filled-by-O.E.M.
Product_sn  :

Fault class : fault.fs.zfs.vdev.io
Affects     : zfs://pool=zones/vdev=5dbf266cd162b324
                  faulted and taken out of service
Problem in  : zfs://pool=zones/vdev=5dbf266cd162b324
                  faulted and taken out of service

Description : The number of I/O errors associated with a ZFS device exceeded
                     acceptable levels.  Refer to
              http://illumos.org/msg/ZFS-8000-FD for more information.

Response    : The device has been offlined and marked as faulted.  An attempt
                     will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

The above provides an example of what you may see when a device in a zpool has been diagnosed as faulted. For more details on fmadm, please see the fmadm(1M) man page.
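
If you want more detail on a specific fault than fmadm faulty provides, you can pull the full event record with fmdump, using the EVENT-ID from the output above (the UUID below is the one from the example; substitute the one reported on your system):

computenode# fmdump -v -u abe52661-52aa-ec45-983e-f019e465db53

This prints the fault class, the affected resource, and the time the fault was diagnosed.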

Verifying the status of zpools

The zpool command manages and configures ZFS storage pools, which are collections of virtual devices (generally physical drives) that provide storage to ZFS datasets (zones).

You can obtain a quick snapshot of the health and status of storage pools by running the following zpool command:

computenode# zpool status
  pool: zones
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zones       ONLINE       0     0     0
          c2t0d0    ONLINE       0     0     0
        cache
          c2t1d0    ONLINE       0     0     0

errors: No known data errors

The above output indicates a healthy storage pool, with no errors and no disk maintenance activity (such as a scrub or resilver) in progress.
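
If a pool is not healthy, two follow-up commands are commonly used. The device name below is illustrative only; use the pool and device names reported by zpool status on your system:

computenode# zpool status -x
computenode# zpool replace zones c2t3d0

The first command limits the output to pools with problems (it simply prints "all pools are healthy" when there is nothing to report). The second replaces a faulted disk in the zones pool; if the replacement disk is in the same physical location, naming the old device alone is sufficient, otherwise specify both the old and the new device (zpool replace zones old-device new-device).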