Troubleshooting Triton virtual networking and fabrics

Modified: 26 Jan 2023 22:12 UTC

Several things can go wrong with Triton's fabrics. For fabric networking to function, each of the following components must be working:

The Moray Database
The Portolan Entries
The CN varpd daemon
The CN overlay devices
The individual zone's state

In addition to component health, we also need to look at the actual network state and at what different parts of the system believe that state to be. We'll quickly cover component health and then move on to the more complex question of verifying system state.

Verifying component health

The first thing you should do is determine which Moray corresponds to the portolan and NAPI records. Depending on the environment, this may very well be a separate instance of Moray with its own Manatee. Refer to manatee-adm if Moray is not operational.

If Moray is not operational, future lookups will fail; however, existing information is cached in the system to reduce the potential fallout from a broken component.

Next, check the portolan service. The portolan service shows up in the output of the sdc-healthcheck command. In addition, you can log into that instance and follow the log (via tail -f $(svcs -L portolan) | bunyan) to determine the state of the service and whether queries are actively flowing.

Next, look at the individual compute nodes. Recall that each compute node has both the individual overlay devices and the more general varpd daemon.

To determine the overall state of the system, the simplest thing to do is run: dladm show-overlay -f. This will display all of the entries and their current state:

computenode# dladm show-overlay -f
LINK                STATUS  DETAILS
sdc_overlay4694778  ONLINE  -
sdc_overlay7957669  ONLINE  -

This shows us that both of these devices are online and working. Always check the state of the varpd service itself as well, which can be done with svcs varpd:

computenode# svcs varpd
STATE          STIME    FMRI
online         Mar_24   svc:/network/varpd:default

It is possible to tail the logs for varpd by using tail -f $(svcs -L varpd) in the global zone on the compute node.
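
On a compute node with many overlay devices, it can help to filter these two outputs down to the entries that need attention. The following is a minimal sketch rather than a Triton tool: the function names overlay_not_online and service_not_online are hypothetical, and both functions assume the column layout shown in the example output above (one header line, STATUS in the second column for dladm, STATE in the first column for svcs).

```shell
# Hypothetical helpers, not part of Triton. They only parse text on stdin
# and assume the column layout shown in the example output above.

# Print each overlay link whose STATUS column is not ONLINE.
# Usage: dladm show-overlay -f | overlay_not_online
overlay_not_online() {
    awk 'NR > 1 && NF >= 2 && $2 != "ONLINE" { print $1, $2 }'
}

# Print each service instance whose STATE column is not "online".
# Usage: svcs varpd | service_not_online
service_not_online() {
    awk 'NR > 1 && NF >= 3 && $1 != "online" { print $3, $1 }'
}
```

Both functions print nothing when every entry is healthy, so any output at all is a prompt to look at that device or at the varpd log.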

Finally, for general checking, always remember to check the ifconfig, ipadm, arp, and ndp flags that exist on the given interfaces inside a zone. These can be quite useful. For example, if a duplicate MAC address is in use, ifconfig shows a flag indicating that. If we don't even see an ARP entry for ourselves in the ARP tables, something else has gone wrong.

We'll go into more examples of all of these in the following section.

Common troubleshooting: I cannot ping another zone

We're here, most likely, because you cannot reach another zone on a given fabric network. There are several places to look, and the steps below work through them in turn.

The following steps should be performed:

Check to make sure the IP address in question exists. This can be determined either using the Operations Portal (AdminUI) or via a direct call to NAPI (via sdc-napi /networks/UUID/ips).
Assuming the IP exists, you now want to examine the ARP tables for the problematic zone.
To examine the ARP tables, use the command arp -na. The -a option tells arp(1) to print the entire table. The -n option tells arp not to do DNS reverse lookups, which is generally useful when we're uncertain about what's going on with the network.

For example, this is what we see inside one of our zones:

zone# arp -na
Net to Media Table: IPv4
Device   IP Address               Mask      Flags      Phys Addr
------ -------------------- --------------- -------- ---------------
net0   192.168.128.1        255.255.255.255          90:b8:d0:00:8c:32
net0   192.168.128.6        255.255.255.255 SPLA     90:b8:d0:b6:3f:b2

Review the Flags column, checking for flags such as d, y and U. The presence of any of these flags indicates uncertainty or a problem with an address.
If there is no entry for the address we're trying to reach, the zone has not been able to resolve it. If we know that the address exists but it does not appear here, that may indicate a problem with the tables themselves or with the series of portolan requests that are being made.
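
The check described above can be sketched as a small filter over the arp -na output. This is an illustration under stated assumptions, not a Triton tool: the function name arp_entry is hypothetical, and it assumes the layout shown in the example (device, IP address, mask, then an optional Flags value and an optional physical address, where only the physical address contains colons).

```shell
# Hypothetical helper, not part of Triton.
# Usage: arp -na | arp_entry 192.168.128.6
# Prints the MAC and flags for one IP and marks the entry "suspect" when
# the flags include d, y or U. Returns non-zero if the IP is not listed.
arp_entry() {
    awk -v ip="$1" '
        $2 == ip {
            found = 1
            flags = "-"
            mac = "-"
            # Flags and Phys Addr can each be blank, so classify the
            # trailing fields by content instead of by position.
            for (i = 4; i <= NF; i++) {
                if ($i ~ /:/) mac = $i
                else flags = $i
            }
            status = (flags ~ /[dyU]/) ? "suspect" : "ok"
            printf "%s mac=%s flags=%s %s\n", ip, mac, flags, status
        }
        END { if (!found) { print ip " missing"; exit 1 } }
    '
}
```

A "missing" result corresponds to the case where the zone has no entry for the address at all, and the non-zero exit status makes that easy to test for in a script.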

From the global zone

From the global zone, the most important thing to look at is the equivalent of the ARP tables for an overlay device -- these are referred to as the VL2 mappings.

To print these values, pass the -t option to dladm show-overlay.

computenode# dladm show-overlay -t sdc_overlay4694778
LINK                TARGET             DESTINATION
sdc_overlay4694778  90:b8:d0:b6:3f:b2  172.24.1.5:4789
sdc_overlay4694778  90:b8:d0:f4:d:64   172.24.1.4:4789

The target address is the MAC address that we're trying to send to on the instance, while the destination address is where the packet in question will actually be sent.
- If an address is listed here, check connectivity between the current host and the specified destination with something like an ICMP ping.
- If the address is not listed, we have no mapping. There could be a few reasons for this.
  - We may not have an actual mapping.
  - We may be unable to get a mapping.
  - At this point, check the FMA status of the device via dladm show-overlay -f as described in the earlier health check section.
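
Looking up a MAC address in the VL2 table by eye is error-prone, because the example output above prints one octet without its leading zero (f4:d:64), while arp -na pads every octet. The sketch below normalizes both forms before comparing. It is an illustration only: the function name vl2_dest is hypothetical, and it assumes the three-column layout shown above.

```shell
# Hypothetical helper, not part of Triton.
# Usage: dladm show-overlay -t sdc_overlay4694778 | vl2_dest 90:b8:d0:f4:0d:64
# Prints the DESTINATION for a target MAC; returns non-zero if the MAC
# has no mapping in the table.
vl2_dest() {
    awk -v want="$1" '
        # Lower-case the MAC and drop one leading zero from each two-digit
        # octet, so that "0d" and "d" compare equal.
        function norm(mac,    n, parts, i, out) {
            n = split(tolower(mac), parts, ":")
            out = ""
            for (i = 1; i <= n; i++) {
                if (length(parts[i]) == 2) sub(/^0/, "", parts[i])
                out = out (i > 1 ? ":" : "") parts[i]
            }
            return out
        }
        NR > 1 && norm($2) == norm(want) { print $3; found = 1 }
        END { exit !found }
    '
}
```

The printed value has the form address:port, so the host portion (for example "${dest%:*}" in a POSIX shell, where dest holds the function's output) is what to pass to ping when checking connectivity to the destination.
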
If no ARP or overlay entries are being added, we need to check the portolan service.
Check the portolan service logs by logging into the portolan zone and running tail -f $(svcs -L portolan) | bunyan as described above.
- Review the portolan logs to see if the process is actively answering queries.
- If no queries are being made, the compute node needs to be reviewed.
- If queries are being made but not answered, the problem likely lies within the tables stored in Moray and NAPI.
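
The decision made from the portolan logs can be written down as a two-question check. This is only a sketch of the reasoning above: the function name portolan_next_step and its yes/no arguments are hypothetical, and the final branch (queries seen and answered) is an assumption added for completeness rather than something stated in this guide.

```shell
# Hypothetical helper, not part of Triton.
# Usage: portolan_next_step QUERIES_SEEN QUERIES_ANSWERED   (each yes or no)
portolan_next_step() {
    if [ "$1" != "yes" ]; then
        # No queries are reaching portolan at all.
        echo "review the compute node"
    elif [ "$2" != "yes" ]; then
        # Queries arrive but are not answered.
        echo "review the tables stored in Moray and NAPI"
    else
        # Assumption: answered queries suggest portolan itself is healthy.
        echo "portolan is answering queries"
    fi
}
```

For example, portolan_next_step yes no prints the Moray and NAPI suggestion, which matches the last bullet above.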
