Recovering a compute node in Triton

Modified: 28 Apr 2022 01:26 UTC

When a hardware fault occurs on a compute node (CN), in most cases the compute node and the instances that live on it can be recovered, and Triton DataCenter makes the process straightforward.

General things to keep in mind

While we can (in most cases) recover the instances on a failed compute node, backup and recovery of instances is ultimately the customer's responsibility.

There is no permanent repository of metadata for compute nodes or instances.

The Triton heartbeater regularly sends information back to the head node describing the server and the instances running on it. This information is cached by CNAPI and VMAPI and updated each time a new report arrives.

Thus, if a new compute node boots up and its disks already contain a zpool and some instances, the compute node will immediately appear in Triton as a fully set up compute node and the instances will be accessible and manageable via the APIs.
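
For example, once such a compute node has reported in, its cached record can be inspected from the head node with the standard CNAPI wrapper. This is a minimal sketch; the UUID is a placeholder for the compute node in question:

headnode# sdc-cnapi /servers/<CN_UUID> | json -H hostname status setup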

Possible failure scenarios

Recovery process 

Delete the old, broken server from Triton (this server should be showing as Unavailable). This can be done either through the Operations Portal or from the head node command line:
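
For example, from the head node the broken server can be removed by UUID. This is a sketch rather than the only supported route; the same result can be achieved by deleting the server in the Operations Portal:

headnode# sdc-server delete <BROKEN_CN_UUID>

or, equivalently, by calling CNAPI directly:

headnode# sdc-cnapi /servers/<BROKEN_CN_UUID> -X DELETE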

Boot up the new or repaired server.

Verify that the server and all of its instances show up as RUNNING in the Operations Portal and/or on the new server itself by running vmadm list on the compute node.
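
A quick way to do this from the command line, assuming direct access to the compute node (or sdc-oneachnode from the head node):

computenode# vmadm list

headnode# sdc-oneachnode -n <CN_HOSTNAME> 'vmadm list'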

If the server was previously running fabrics, run the post-setup command to ensure the sdc_underlay NIC is updated.

Get the UUID of the underlay network:

headnode# sdc-napi /networks?name=sdc_underlay | json -H 0.uuid

Get the UUIDs of the compute nodes you are operating on; these should all be tagged with the sdc_underlay nictag:

headnode# sdc-server list
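
To confirm that a particular compute node carries the tag, its sysinfo record in CNAPI can be inspected. This is a sketch; the UUID is a placeholder:

headnode# sdc-cnapi /servers/<CN_UUID> | json -H 'sysinfo["Network Interfaces"]'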

This command will need to be run once for each compute node that has been tagged with the sdc_underlay NIC tag:

headnode# sdcadm post-setup underlay-nics <UNDERLAY_NETWORK> <CN_UUID>
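
Putting these pieces together, a sketch that captures the network UUID once and then applies it to each tagged compute node (the CN UUIDs are placeholders):

headnode# UNDERLAY=$(sdc-napi /networks?name=sdc_underlay | json -H 0.uuid)

headnode# for cn in <CN1_UUID> <CN2_UUID>; do sdcadm post-setup underlay-nics "$UNDERLAY" "$cn"; done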

Reboot the server again. Confirm the new server and all instances show as running.
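
To confirm this from the head node, the server status and its running instances can be checked via CNAPI and VMAPI (a sketch; the UUID is a placeholder):

headnode# sdc-cnapi /servers/<CN_UUID> | json -H status

headnode# sdc-vmapi "/vms?server_uuid=<CN_UUID>&state=running" | json -Ha alias state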