Troubleshooting Triton virtual networking and fabrics
There are several things that can go wrong in the world of Triton's fabrics. A number of different components need to be working in order for fabrics to function, including:
- The Moray database
- The portolan entries
- The CN varpd daemon
- The CN overlay devices
- The individual zone's state
In addition to the health of each component, we also need to look at the actual network state and what the different parts of the system believe that state to be. We'll quickly cover component health and then move on to the more complex question of verifying system state.
Verifying component health
The first thing you should do is determine which Moray corresponds to the portolan and NAPI records. Depending on the environment, this may very well be a separate instance of Moray with its own Manatee. Refer to manatee-adm if Moray is not operational.
If Moray is not operational, future lookups will fail; however, existing information is cached in the system to try to reduce the potential fallout from a broken component.
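For example, one way to confirm the health of the Manatee cluster backing that Moray is to log into its zone from the headnode and run manatee-adm. This is only a sketch: the zone alias shown here is illustrative, and the exact subcommand depends on the manatee-adm version in your deployment.

headnode# sdc-login manatee0    # alias of the Manatee zone backing this Moray; yours may differ
manatee# manatee-adm status     # or manatee-adm show on newer versions; look for a healthy primary/sync topology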
From there, the next thing to do is to look at the portolan service. The portolan service shows up in the output of sdc-healthcheck. In addition, you can log into that instance and follow the log (via tail $(svcs -L portolan) | bunyan) to determine the state of the service and whether or not queries are actively flowing.
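For example, a quick pass over portolan health from the headnode might look like the following sketch, assuming sdc-login is available and that the portolan zone can be reached by its service name:

headnode# sdc-healthcheck | grep portolan
headnode# sdc-login portolan
portolan# tail $(svcs -L portolan) | bunyan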
From here, the next place to look at things is on the individual compute nodes themselves. Recall that on the compute nodes we have both the individual overlay devices themselves and the more general varpd daemon.
To determine the overall state of the system, the simplest thing to do is run dladm show-overlay -f. This will display all of the entries and their current state:
computenode# dladm show-overlay -f
LINK                 STATUS    DETAILS
sdc_overlay4694778   ONLINE    -
sdc_overlay7957669   ONLINE    -
This shows us that both of these devices are online and working. The other thing to always check is the state of the varpd service itself, which can be done using the svcs command. We can check this with svcs varpd:
computenode# svcs varpd
STATE          STIME    FMRI
online         Mar_24   svc:/network/varpd:default
It is possible to tail the logs for varpd by using tail -f $(svcs -L varpd) in the global zone on the compute node.
Finally, the last thing to do for general checking is to always remember to check the ifconfig, ipadm, arp, and ndp output (and the flags they report) for the interfaces inside a zone. These can be quite useful. If, for example, a duplicate MAC address is in use, we'll see a flag indicating that in ifconfig. If we don't even see an ARP entry for our own address in the ARP table, something else has gone wrong.
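As a concrete starting point, the commands below, run from inside the affected zone, cover those checks. This is a sketch: the interface name net0 is illustrative, and the flags called out in the comments are examples of what to look for rather than an exhaustive list.

zone# ifconfig net0       # interface flags; a DUPLICATE flag points at an address conflict
zone# ipadm show-addr     # per-address state; a state of duplicate likewise indicates a conflict
zone# arp -na             # IPv4 ARP table; we should at least see our own entry here
zone# ndp -an             # IPv6 neighbor table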
We'll go into more examples of all of these in the following section.
Common troubleshooting: I cannot ping another zone
We're here, most likely, because you cannot reach another zone on a given fabric network. With this in mind, there are several things and places we can look at, ordered roughly by what is most likely to have gone wrong.
The following steps should be performed:
- Check to make sure the IP address in question exists. This can be determined either using the Operations Portal (AdminUI) or via a direct call to NAPI (via sdc-napi /networks/UUID/ips; see the example after this list).
- Assuming the IP exists, you now want to examine the ARP tables for the problematic zone.
  - To examine the ARP tables, use the command arp -na. The -a option tells arp(1M) to print the entire table. The -n option tells arp not to do DNS reverse lookups, which is generally useful when we're uncertain about what's going on with the network.
  - For example, this is what we see inside one of our zones:

    zone# arp -na
    Net to Media Table: IPv4
    Device   IP Address            Mask              Flags    Phys Addr
    ------   --------------------  ----------------  -------  -----------------
    net0     192.168.128.1         255.255.255.255            90:b8:d0:00:8c:32
    net0     192.168.128.6         255.255.255.255   SPLA     90:b8:d0:b6:3f:b2

  - Review the Flags column, checking for flags such as d, y, and U. The presence of these flags indicates uncertainty or problems with an address.
  - If we don't see an entry for the address that we're trying to reach, then this zone has not been able to resolve that address. If we know that the entry exists, but we don't see it, that may indicate a problem with the tables themselves or with the series of portolan requests that are being made.
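As referenced in the first step above, a direct NAPI call is the quickest way to confirm that the IP exists and what it is currently bound to. Here is a minimal sketch from the headnode, where NETWORK_UUID is a placeholder for the network's UUID and the selected fields follow NAPI's IP objects:

headnode# sdc-napi /networks/NETWORK_UUID/ips | json -Ha ip free belongs_to_uuid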
From the global zone
From the global zone, the most important thing to look at is the equivalent of the ARP tables for an overlay device -- these are referred to as the VL2 mappings.
- To print these values, pass the -t option to dladm show-overlay:

    computenode# dladm show-overlay -t sdc_overlay4694778
    LINK                 TARGET              DESTINATION
    sdc_overlay4694778   90:b8:d0:b6:3f:b2   172.24.1.5:4789
    sdc_overlay4694778   90:b8:d0:f4:d:64    172.24.1.4:4789

- The target address indicates the MAC address that we're trying to send to on the instance, while the destination address indicates where we will actually be sending the packet in question.
- If an address is indicated here, then we should check connectivity between the current host and the destination specified with something like an ICMP ping (see the sketch after this list).
- If the address is not listed, then that means we have no mapping. There could be a few reasons for this:
  - We may not have an actual mapping.
  - We may be unable to get a mapping.
  - At this point, check the FMA status of the device via dladm show-overlay -f as described in the earlier health check section.
- Assuming an address is indicated, you need to try an ICMP ping request.
- If there are no ARP or overlay entries being added, we need to check the portolan service.
- Check the portolan service logs by logging into the portolan zone and running tail -f $(svcs -L portolan) | bunyan as described above.
  - Review the portolan logs to see if the process is actively answering queries.
  - If no queries are being made, the compute node needs to be reviewed.
  - If queries are being made but not answered, the problem likely lies within the tables stored in Moray and NAPI.
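If the VL2 table does list a destination but traffic still is not flowing, a quick way to rule out underlay problems is to ping that destination's underlay address from the global zone, as referenced in the list above. This sketch reuses the destination from the earlier dladm show-overlay -t output; substitute the address from your own output:

computenode# ping 172.24.1.5
172.24.1.5 is alive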