Troubleshooting Triton virtual networking and fabrics

Modified: 26 Jan 2023 22:12 UTC

There are multiple things that might go wrong in the world of Triton's fabrics. Several different components need to be working in order for fabrics to function, including NAPI, the portolan service, the Moray (and Manatee) instance that backs them, and the varpd daemon and overlay devices on each compute node.

In addition to the health of these components, the other thing that we need to look at is the actual network state and what different parts of the system believe that state to be. We'll quickly cover component health and then move on to the more complex question of verifying system state.

Verifying component health

The first thing you should do is determine which Moray corresponds to the portolan and NAPI records. Depending on the environment, this may very well be a separate instance of Moray with its own Manatee. If that Moray is not operational, use manatee-adm to check the state of the Manatee cluster that backs it.

If Moray is not operational, future lookups will fail; however, existing information is cached in the system to try to reduce the potential fallout from a broken component.
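As a quick sketch of checking that Manatee from the headnode, something like the following can be used (this assumes the default manatee zone; a fabric-specific Moray may be backed by a different Manatee, and older Manatee versions provide manatee-adm status rather than manatee-adm show):

headnode# sdc-login manatee
manatee# manatee-adm show

A healthy cluster reports a primary and a sync peer (and usually an async), with replication caught up.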

From there, the next thing to do is to check on the portolan service. The portolan service shows up in the output of the sdc-healthcheck command. In addition, you can log into that instance and follow its log (via tail -f $(svcs -L portolan) | bunyan) to determine the state of the service and whether or not queries are actively flowing.
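For example, from the headnode you can narrow the health output down to portolan and then log into its zone (this assumes sdc-login can resolve the zone by its alias; the exact columns printed by sdc-healthcheck vary by Triton version):

headnode# sdc-healthcheck | grep portolan
headnode# sdc-login portolan
portolan# tail -f $(svcs -L portolan) | bunyan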

From here, the next place to look is the individual compute nodes. Recall that on the compute nodes we have both the overlay devices themselves and the more general varpd daemon.

To determine the overall state of the system, the simplest thing to do is run: dladm show-overlay -f. This will display all of the entries and their current state:

computenode# dladm show-overlay -f
LINK                 STATUS  DETAILS
sdc_overlay4694778   ONLINE  -
sdc_overlay7957669   ONLINE  -

This shows us that both of these devices are online and working. The other thing to always check is the state of the varpd service itself, which can be done with svcs varpd:

computenode# svcs varpd
STATE          STIME    FMRI
online         Mar_24   svc:/network/varpd:default

It is possible to tail the logs for varpd by using tail -f $(svcs -L varpd) in the global zone on the compute node.
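If varpd is not online, the standard SMF tooling will usually explain why, including pointing at the same log file:

computenode# svcs -xv varpd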

Finally, the last thing to do for general checking is to look at the flags that ifconfig, ipadm, arp, and ndp report for the interfaces inside a zone. These can be quite useful. If, for example, a duplicate MAC address got used, we'll see a flag indicating that in ifconfig. If we don't even see an ARP entry for ourselves in the ARP table, something else has gone wrong.
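As a rough sketch of those in-zone checks (the zone name is a placeholder; these can also be run directly inside the zone):

computenode# zlogin <zone_uuid>
zone# ifconfig -a       # look for a DUPLICATE flag on the fabric interface
zone# ipadm show-addr   # addresses should be 'ok', not 'duplicate'
zone# arp -an           # we should at least see an entry for our own address
zone# ndp -an           # the IPv6 equivalent, if IPv6 is in use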

We'll go into more examples of all of these in the following section.

Common troubleshooting: I cannot ping another zone

You're most likely here because you cannot reach another zone on a given fabric network. With this in mind, there are several things and places we can look at; the following covers them in rough order of what may have gone wrong.

The following steps should be performed:

From the global zone

From the global zone, the most important thing to look at is the equivalent of the ARP tables for an overlay device -- these are referred to as the VL2 mappings.
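A minimal sketch of inspecting those mappings, using one of the overlay devices from the earlier example (on current platform images dladm show-overlay takes a -t option to print the cached target entries; check dladm(1M) on your image if the option differs):

computenode# dladm show-overlay -t sdc_overlay4694778

Each entry maps a MAC address on the fabric (the VL2 address) to the underlay IP address and port of the compute node hosting that instance; a missing or stale entry here is a common reason a ping never leaves the node.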