Recovering a compute node in Triton
When a hardware fault occurs on a compute node (CN), the compute node and the instances that live on it can, in most cases, be recovered; Triton DataCenter makes this process simple.
General things to keep in mind
While we can (in most cases) recover the instances on a failed compute node, backup and recovery of instances is ultimately the customer's responsibility.
There is no permanent repository of metadata for compute nodes or instances.
The Triton heartbeater regularly sends information back to the head node describing the server and the instances running on it. This information is cached by CNAPI and VMAPI and updated each time a new heartbeat arrives.
Thus, if a new compute node boots up and its disks already contain a zpool and some instances, the compute node will immediately appear in Triton as a fully set up compute node, and the instances will be accessible and manageable via the APIs.
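As a quick sanity check, you can confirm what CNAPI and VMAPI currently report for a given server from the head node. This is a minimal sketch only; CN_UUID is a placeholder for the compute node's UUID:

headnode# CN_UUID=<compute node UUID>
headnode# sdc-cnapi /servers/$CN_UUID | json -H hostname status setup
headnode# sdc-vmapi "/vms?server_uuid=$CN_UUID&state=active" | json -Ha uuid alias state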
Possible failure scenarios
- Compute node requires a new motherboard
- Compute node suffers from a complete chassis failure (in which case the physical disks can be migrated to a new chassis)
Recovery process
Delete the old, broken server from Triton in one of the following two ways (this server should be showing as Unavailable; a sketch for locating its UUID follows the two options):
- Delete in the Operations Portal using the Forget Server button at the foot of the Server Details Page
- Delete via CNAPI using this command (run from the head node via the Triton (tools) zone):
headnode# sdc sdc-cnapi /servers/:UUID -X DELETE
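If you need the server's UUID for the CNAPI call, one way to list all servers with their current status (the broken server should show a status other than running) is the following sketch, run from the head node:

headnode# sdc-cnapi /servers | json -Ha uuid hostname status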
Boot up the new or repaired server.
Verify that the server and all of its instances show up as RUNNING in the Operations Portal, and/or confirm on the new server itself by running vmadm list on the compute node.
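If you prefer not to log in to the compute node directly, the same vmadm check can be run from the head node with sdc-oneachnode; a sketch, where CN_UUID is again a placeholder for the compute node's UUID:

headnode# sdc-oneachnode -n $CN_UUID 'vmadm list'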
If the server was previously running fabrics, run the post-setup command to ensure the sdc_underlay NIC is updated.
Get the UUID of the underlay network:
headnode# sdc-napi /networks?name=sdc_underlay | json -H 0.uuid
Get the UUIDs of the compute nodes you are operating on; these should all be tagged with the sdc_underlay NIC tag:
headnode# sdc-server list
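The default sdc-server list output does not include NIC tags, so one crude, illustrative-only way to narrow the list down to the compute nodes whose sysinfo mentions the sdc_underlay tag is to loop over the servers in CNAPI:

# run from the head node; prints the UUID of each CN whose sysinfo mentions sdc_underlay
for CN in $(sdc-cnapi /servers | json -Ha uuid); do
    sdc-cnapi /servers/$CN | json -H sysinfo | grep -q sdc_underlay && echo $CN
done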
This command will need to be run once for each compute node that has been tagged with the sdc_underlay NIC tag:
headnode# sdcadm post-setup underlay-nics <UNDERLAY_NETWORK> <CN_UUID>
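Because the command takes one compute node at a time, a small loop avoids repeating it by hand. This is a sketch only; UNDERLAY and CN_UUIDS are hypothetical shell variables holding the underlay network UUID from above and the list of tagged compute node UUIDs:

# run from the head node
UNDERLAY=$(sdc-napi /networks?name=sdc_underlay | json -H 0.uuid)
for CN in $CN_UUIDS; do
    sdcadm post-setup underlay-nics $UNDERLAY $CN
done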
Reboot the server again. Confirm the new server and all instances show as running.
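The reboot can be issued from the console/IPMI or, if you prefer, via CNAPI's reboot endpoint; afterwards re-check the server status and instance states. A sketch, with CN_UUID again a placeholder for the compute node's UUID:

headnode# sdc-cnapi /servers/$CN_UUID/reboot -X POST
headnode# sdc-cnapi /servers/$CN_UUID | json -H status
headnode# sdc-vmapi "/vms?server_uuid=$CN_UUID&state=running" | json -Ha uuid alias state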