Checking the health of Triton
Periodically, and when investigating problems, you should check the overall health of your Triton installation.
This page describes a number of checks you can perform using tools that come with Triton, along with other techniques.
Status of core services and agents
The status of the Triton core services and agents can be checked using the sdc-healthcheck command as follows:
# sdc-healthcheck
ZONE STATE AGENT STATUS
global running - online
assets running - online
sapi running - online
binder running - online
amonredis running - online
ufds running - online
redis running - online
workflow running - online
papi running - online
sdc running - online
amon running - online
napi running - online
rabbitmq running - online
cnapi running - online
dhcpd running - online
dapi running - online
fwapi running - online
vmapi running - online
adminui running - offline
imgapi running - online
cloudapi running - online
manatee running - online
moray running - online
global running provisioner online
global running zonetracker online
global running heartbeat online
global running ur online
global running smartlogin online
The status field can show one of four values:
Value | Status Description |
---|---|
online | All good. |
offline | The Zone or Agent is stopped. |
error | The Zone specific check failed. |
svc-err | One or more services in the Zone is not online. |
The standard checks for each zone confirm that it is running and that all services are online (svcs -x).
The standard check for the agents is to attempt to connect to the agent using /opt/smartdc/agents/bin/ping-agent. For the Smartlogin agent, svcs is used to verify that the service is running.
Zone specific checks
The following specific checks are made in each zone to verify it is functioning as expected. Most comprise a call to an API endpoint.
Zone | Check |
---|---|
amon | sdc-amon /pub/admin/probes |
cloudapi | sdc-listdatacenters |
cnapi | sdc-cnapi /servers?headnode=true |
fwapi | sdc-fwapi /rules |
imgapi | sdc-imgapi /images?name=imgapi |
napi | sdc-napi /networks?name=admin |
sapi | sdc-sapi /services?name=sapi |
ufds | sdc-ldap search login=admin |
vmapi | vmadm lookup -1 tags.smartdc_role=vmapi |
workflow | sdc-workflow /workflows |
Parsing sdc-healthcheck output
Output from sdc-healthcheck can be generated in a parseable format using the -p flag.
# sdc-healthcheck -p
global:running:-:online
assets:running:-:online
sapi:running:-:online
binder:running:-:online
amonredis:running:-:online
--snip--
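The colon-separated format lends itself to simple filtering. The following is a minimal sketch, assuming the four fields appear in the order zone, state, agent, status as in the sample above. The function name `unhealthy_entries` is made up for illustration and is not part of Triton.

```shell
# Hypothetical helper: reads `sdc-healthcheck -p` output on stdin and prints
# every line whose fourth field (status) is anything other than "online".
# Assumes the field order zone:state:agent:status shown in the sample above.
# Lines with fewer than four fields are ignored.
unhealthy_entries() {
    awk -F: 'NF >= 4 && $4 != "online"'
}

# Usage on the head node:
#   sdc-healthcheck -p | unhealthy_entries
```

An empty result means every zone and agent reported online.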
Resolving health check errors
An error reported for any zone or agent is potentially a serious problem and could impair the ability to provision instances or perform other jobs in Triton. End user instances WILL NOT be affected by error conditions in the core services or agents.
Note: Many SmartOS commands have a -z flag that lets you run the command from the Global Zone (GZ) against a specific zone. Check the man pages or use --help to see if -z is available on a specific command. Commands can also be run inside a core service zone using sdc-login, for example:
# sdc-login adminui svcs -x
Both methods are used below.
Error: offline
- Attempt to restart the failed zone, e.g. boot the adminui zone using zoneadm:
# zoneadm -z $(sdc-vmname adminui) boot
- Re-check the zone state:
[root@headnode (mxpa) ~]# zoneadm -z $(sdc-vmname adminui) list -v
ID NAME STATUS PATH BRAND IP
28 faa81fbe-ffc5-4b51-bcee-2d2562f01daf running /zones/faa81fbe-ffc5-4b51-bcee-2d2562f01daf joyent-minimal excl
- Then re-run sdc-healthcheck.
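When several zones are offline at once, the boot commands can be generated from the parseable output (`sdc-healthcheck -p`). This is only a sketch: the helper name `offline_zone_boot_commands` is hypothetical, and it deliberately prints the commands for review rather than running them.

```shell
# Hypothetical helper: reads `sdc-healthcheck -p` output on stdin and prints
# (does not run) a boot command for each core zone reported as offline.
# Agent rows (agent field other than "-") and the global zone are skipped,
# since neither can be fixed by booting a zone.
offline_zone_boot_commands() {
    awk -F: '$4 == "offline" && $3 == "-" && $1 != "global" {
        printf "zoneadm -z $(sdc-vmname %s) boot\n", $1
    }'
}

# Usage on the head node:
#   sdc-healthcheck -p | offline_zone_boot_commands
```

Printing instead of executing keeps the operator in control of which zones are actually booted.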
Error: svc-err
- Check which service or services have failed, e.g. for the imgapi zone:
# svcs -x -z $(sdc-vmname imgapi)
svc:/smartdc/site/imgapi:default (Triton Image API)
Zone: 234c005d-63ca-49d2-b27d-7af560cae951
Alias: imgapi0
State: maintenance since 6 March 2014 09:26:37 UTC
Reason: Restarting too quickly.
See: http://illumos.org/msg/SMF-8000-L5
See: /zones/234c005d-63ca-49d2-b27d-7af560cae951/root/var/svc/log/smartdc-site-imgapi:default.log
Impact: This service is not running.
For any service showing a state of maintenance, you can attempt to restart it using svcadm. This command requires an action and a service name, but it is typically only necessary to provide enough of the name to uniquely identify the service. For example:
svc:/smartdc/site/imgapi:default
can be abbreviated to:
imgapi
To clear the maintenance state, use either of the two methods described above:
# svcadm -z $(sdc-vmname imgapi) clear imgapi
# sdc-login imgapi svcadm clear imgapi
Re-check the status of the services using svcs -x -z $(sdc-vmname imgapi). If the services still appear in this output, it is time to dig into the log files.
The log file name is shown in the output of svcs above. It can also be obtained using the -L flag to svcs:
# svcs -L -z $(sdc-vmname imgapi) imgapi
/zones/234c005d-63ca-49d2-b27d-7af560cae951/root/var/svc/log/smartdc-site-imgapi:default.log
On examining the log file you may be able to understand the underlying problem and resolve it. However, it is most likely you will need to raise a support issue with MNX Support at portal.mnxsolutions.com. Please provide a support bundle with all issues relating to the operation of the head node and core services.
Additional health checks
The following checks should be built into a regular overall health check of Triton. These can and should be automated via cron jobs or as part of a monitoring framework such as Nagios or Zabbix.
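As one illustration of such automation, the sketch below turns the parseable output of `sdc-healthcheck -p` into a single status line and an exit code that follows the usual Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN). The function name `check_triton_health` and the message wording are assumptions, not part of Triton.

```shell
# Hypothetical Nagios-style check. Reads `sdc-healthcheck -p` output on stdin
# (fields zone:state:agent:status), prints a one-line summary, and returns
# 0 if everything is online, 2 if anything is not, 3 if no rows were seen.
check_triton_health() {
    awk -F: '
        NF >= 4 { rows++ }
        NF >= 4 && $4 != "online" {
            # Zones have "-" in the agent field; agents are shown as zone/agent.
            name = ($3 == "-") ? $1 : $1 "/" $3
            sep = (bad == "") ? "" : ", "
            bad = bad sep name "=" $4
        }
        END {
            if (rows == 0) { print "UNKNOWN: no health check output"; exit 3 }
            if (bad != "") { print "CRITICAL: " bad; exit 2 }
            print "OK: all zones and agents online"
        }'
}

# Usage on the head node, e.g. from cron or a monitoring agent:
#   sdc-healthcheck -p | check_triton_health
```

Treating empty input as UNKNOWN rather than OK avoids reporting a healthy system when the health check itself failed to run.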
Compute node agent checks
The health of the agents on the Compute Nodes should be checked using svcs -x. This can be done from the Head Node using sdc-oneachnode as follows.
[root@headnode (mxpa) ~]# sdc-oneachnode -c svcs -x
=== Output from 44454c4c-3700-1039-8034-c2c04f445131 (CN1):
svc:/network/ntp:default (Network Time Protocol (NTP) Version 4)
State: maintenance since Tue Mar 11 15:11:57 2014
Reason: Maintenance requested by "svc:/smartdc/agent/ur:default"
See: /var/svc/log/smartdc-agent-ur:default.log
See: http://illumos.org/msg/SMF-8000-R4
See: ntpd(1M)
See: ntp.conf(4)
See: ntpq(1M)
See: /var/svc/log/network-ntp:default.log
Impact: This service is not running.
The -c flag tells sdc-oneachnode to run the command on Compute Nodes only and not on the Head Node.
Follow the same procedure as described under Error: svc-err above for any failed/maintenance services.
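With many Compute Nodes the raw output becomes long. The following sketch condenses it to one line per failed service. It assumes the output layout shown in the sample above: a `=== Output from <uuid> (<hostname>):` header per node, followed by `svcs -x` blocks whose first line starts with `svc:/`. The helper name `failed_services_by_node` is hypothetical.

```shell
# Hypothetical helper: reads `sdc-oneachnode -c svcs -x` output on stdin and
# prints "<hostname> <service FMRI>" for each service that svcs -x reports.
# Assumes hostnames contain no spaces, as in the sample output.
failed_services_by_node() {
    awk '
        /^=== Output from / {
            node = $5                 # fifth field looks like "(CN1):"
            gsub(/[():]/, "", node)   # strip the parentheses and colon
            next
        }
        /^svc:\// { print node, $1 }
    '
}

# Usage on the head node:
#   sdc-oneachnode -c svcs -x | failed_services_by_node
```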
Compute node networking
Compute nodes do not have plumbed interfaces in the Global Zone for the networks used in instances. Thus it is not possible to simply ping a Compute Node to determine if its networking is functioning. DO NOT be tempted to add a plumbed interface for any networks in the Global Zone of a compute node. This could significantly compromise the security of Compute Nodes which must remain isolated from the public internet.
CloudAPI endpoint
Although sdc-healthcheck performs a direct query on CloudAPI, it does so from within the global zone of the Head Node. This does not verify that CloudAPI is reachable from the public internet.
You should regularly poll the CloudAPI endpoint using sdc-listdatacenters to ensure it is responding in a timely manner. This should be done from a location that requires communication to pass over the wider internet.
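One way to make "timely" measurable is to wrap the poll in a time limit. The sketch below is an assumption-laden example, not a Triton tool: the helper name `timed_poll` is made up, the 10-second limit in the usage line is arbitrary, and it relies on the `timeout` utility from GNU coreutils being installed on the monitoring host.

```shell
# Hypothetical helper: run a command with a time limit (in seconds) and report
# how long it took. Returns 0 if the command succeeded within the limit and
# 2 if it failed or was killed by `timeout` (GNU coreutils, assumed present).
timed_poll() {
    limit=$1
    shift
    start=$(date +%s)
    if timeout "$limit" "$@" >/dev/null 2>&1; then
        status=0
        label=OK
    else
        status=2
        label=CRITICAL
    fi
    elapsed=$(( $(date +%s) - start ))
    echo "$label: '$*' finished in ${elapsed}s (limit ${limit}s)"
    return "$status"
}

# Usage from a host outside the data center, with the CloudAPI client tools
# already configured for your account:
#   timed_poll 10 sdc-listdatacenters
```

The elapsed time has one-second resolution, which is coarse but sufficient for spotting an endpoint that has stopped responding.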
Where to go next
You may want to review Troubleshooting Triton in order to become familiar with handling error conditions and common problems.