Recovering a compute node in Triton

Modified: 08 Sep 2022 04:28 UTC

When a hardware fault occurs on a compute node (CN), the compute node and the instances that live on it can, in most cases, be recovered; Triton DataCenter makes this process straightforward.

General things to keep in mind

While we can (in most cases) recover the instances on a failed compute node, backup and recovery of instances is ultimately the customer's responsibility.

There is no permanent repository of metadata for compute nodes or instances.

The Triton heartbeater frequently sends batches of information back to the head node describing the server and the instances running on it. This information is cached by CNAPI and VMAPI and updated each time a new set of information arrives.

Thus, if a new compute node boots up and its disks already contain a zpool and some instances, the compute node will immediately appear in Triton as a fully set up compute node, and the instances will be accessible and manageable via the APIs.
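
As a quick way to see what those caches currently hold, you can query CNAPI and VMAPI from the head node. The commands below are a sketch using the standard sdc-cnapi, sdc-vmapi, and json wrappers; $CN_UUID stands in for the compute node's UUID, and the exact output fields may vary by release.

headnode# sdc-cnapi /servers/$CN_UUID | json -H hostname status setup
headnode# sdc-vmapi "/vms?server_uuid=$CN_UUID&state=running" | json -Ha uuid alias state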

Possible failure scenarios

Recovery process 

Delete the old, broken server from Triton in one of the two following ways (the server should be showing as Unavailable):
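
One command-line route is to remove the server record with sdc-server (the Operations Portal is the usual alternative). This is a sketch; $OLD_CN_UUID is a placeholder for the failed server's UUID:

headnode# sdc-server delete $OLD_CN_UUID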

Boot up the new or repaired server.

Verify that the server and all of its instances show up as RUNNING in the Operations Portal and/or on the new server itself by running vmadm list on the compute node.
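
A minimal on-node check, using vmadm's column selection; any instance whose state is not running will stand out:

computenode# vmadm list -o uuid,alias,state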

If the server was previously running fabrics, run the post-setup command to ensure the sdc_underlay NIC is updated.

Get the UUID of the underlay network:

headnode# sdc-napi /networks?name=sdc_underlay | json -H 0.uuid
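
If you want to reuse the value in the sdcadm step further down, capture it in a shell variable (the variable name is only an illustration):

headnode# UNDERLAY_NETWORK=$(sdc-napi /networks?name=sdc_underlay | json -H 0.uuid)
headnode# echo $UNDERLAY_NETWORK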

Get the UUIDs of the compute nodes you are operating on; these should all be tagged with the sdc_underlay NIC tag:

headnode# sdc-server list
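
If you need to map hostnames to UUIDs in a script rather than by eye, querying CNAPI returns the same information in an easily parsed form; the field names here are assumed from the standard CNAPI server objects:

headnode# sdc-cnapi /servers | json -Ha uuid hostname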

This command will need to be run once for each compute node that has been tagged with the sdc_underlay NIC tag:

headnode# sdcadm post-setup underlay-nics <UNDERLAY_NETWORK> <CN_UUID>
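
If several compute nodes carry the tag, a small shell loop saves repetition. This is a sketch; underlay-cns.txt is a hypothetical file with one compute node UUID per line, and $UNDERLAY_NETWORK is the network UUID captured earlier:

headnode# for CN_UUID in $(cat underlay-cns.txt); do sdcadm post-setup underlay-nics $UNDERLAY_NETWORK $CN_UUID; done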

Reboot the server again. Confirm the new server and all instances show as running.
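
If you prefer to trigger the reboot through the API rather than at the server console, CNAPI exposes a reboot endpoint; this is a sketch assuming that endpoint is available in your release, with $CN_UUID as a placeholder:

headnode# sdc-cnapi /servers/$CN_UUID/reboot -X POST

Once the server is back up, sdc-server list (or the Operations Portal) should show it as running, and vmadm list on the node should show its instances again.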