Triton virtual networking design and architecture

Modified: 26 Jan 2023 22:12 UTC

Triton implements virtual networking through overlay networks that use VXLAN as the encapsulation protocol, together with our own protocol for determining where to find the destination container or instance. This is the fundamental building block for what the user guide refers to as a fabric. Recall from the Triton networking and fabric user guide that each customer has a unique fabric in each data center. This fabric is realized by giving each customer a unique VXLAN identifier, thus creating an overlay network for them.

Instance and compute node view

Consider the traditional two-interface model that we have for a compute node, with one interface used for admin traffic and one used for all customer traffic. When we have zones with VNICs, those VNICs are created directly over the customer interface. In each of the zones below, consider net0 to be the customer's public interface and net1 to be the customer's private interface. Both are VNICs that end up being created over the same physical interface, though they may have different VLAN tags associated with them. It looks roughly like:

 +---------------------------------------------------------------------+
 | Compute Node                                                        |
 |    +------------------+  +------------------+  +------------------+ |
 |    | Zone 0           |  | Zone 1           |  | Zone 2           | |
 |    | +------+------+  |  | +------+------+  |  | +------+------+  | |
 |    | | net0 | net1 |  |  | | net0 | net1 |  |  | | net0 | net1 |  | |
 |    +------------------+  +------------------+  +------------------+ |
 |         |      |              |      |              |       |       |
 |         +------+--------------+------+--------------+--+----+       |
 |                                                        |            |
 |    +--------+  +-------------+                         |            |
 |    | Agents |  | GZ Services |  +-----------+   +------+----+       |
 |    +--------+  +-------------+  |   Admin   |   | Customer  |       |
 |        |              |         | Interface |   | Interface |       |
 |        +--------------+---------|   ixgbe0  |   |  ixgbe1   |       |
 +---------------------------------+-----------+---+-----------+-------+

In the new world where customers use an overlay device to create private networks, the picture is different. While net0 still looks the same in all of the zones, net1 no longer sits directly on top of a physical device; it instead sits on top of an overlay device. The operating system names all data links following the pattern of a name followed by a number, for example net0, ixgbe0, external0. For overlay devices, we always use the name 'overlay'; however, the number that we use corresponds to their VXLAN identifier.
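The naming rule for overlay devices can be sketched as a pair of small helpers. This is only an illustration: the function names are hypothetical, and the range check relies on VXLAN identifiers being 24 bits wide.

```python
VXLAN_ID_MAX = (1 << 24) - 1  # VXLAN identifiers are 24 bits wide


def overlay_name(vxlan_id: int) -> str:
    """Return the data link name for the overlay device of a VXLAN identifier."""
    if not 0 <= vxlan_id <= VXLAN_ID_MAX:
        raise ValueError(f"VXLAN identifier out of range: {vxlan_id}")
    return f"overlay{vxlan_id}"


def vxlan_id_of(link_name: str) -> int:
    """Recover the VXLAN identifier from an overlay data link name."""
    prefix = "overlay"
    suffix = link_name[len(prefix):]
    if not link_name.startswith(prefix) or not suffix.isdigit():
        raise ValueError(f"not an overlay device name: {link_name}")
    return int(suffix)
```

With these helpers, VXLAN identifier 23 maps to the data link overlay23, matching the diagram below.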

In addition, the global zone will have a VNIC called underlay0, which is configured on the underlay network. We'll discuss the underlay network more in the next section.

The global zone also runs an important daemon that we call varpd. We'll touch more upon it in the sections on lookups and the flow of a packet.

Our new network interface model (with the addition of the overlay device) looks like this:

 +---------------------------------------------------------------------+
 | Compute Node                                                        |
 |    +------------------+  +------------------+  +------------------+ |
 |    | Zone 0           |  | Zone 1           |  | Zone 2           | |
 |    | +------+------+  |  | +------+------+  |  | +------+------+  | |
 |    | | net0 | net1 |  |  | | net0 | net1 |  |  | | net0 | net1 |  | |
 |    +------------------+  +------------------+  +------------------+ |
 |         |      |              |      |              |       |       |
 |         |      |              |      |              |       |       |
 |         | +-----------+       | +-----------+       | +-----------+ |
 |         | | overlay23 |       | | overlay24 |       | | overlay25 | |
 |         | +-----------+       | +-----------+       | +-----------+ |
 |         |                     |                     |               |
 |         +---------------------+---------------------+--+            |
 |                                                        |            |
 |          +-------+                  +-----------+      |            |
 |          | varpd |                  | underlay0 |--+   |            |
 |          +-------+                  +-----------+  |   |            |
 |              |                                     |   |            |
 |   +--------+ | +-------------+                     |   |            |
 |   | Agents | | | GZ Services |  +-----------+   +------+----+       |
 |   +--------+ | +-------------+  |   Admin   |   | Customer  |       |
 |        |     |        |         | Interface |   | Interface |       |
 |        +-----+--------+---------|   ixgbe0  |   |  ixgbe1   |       |
 +---------------------------------+-----------+---+-----------+-------+

The underlay network

The first piece of the puzzle is the introduction of an underlay network. In Triton, the primary special network is the admin network, which is required to be on a single layer two (L2) broadcast domain.

From this design, the common Triton environment uses two different interfaces, one for the admin network, and the other for what we call customer traffic -- all traffic that's not part of administering Triton, regardless of whether it's private traffic to the data center or it's traffic that's publicly routable to the broader Internet or a company Intranet. While this traffic could be further segregated among interfaces, the rest of this section will assume the common two interface environment, though it equally applies to the others.

The underlay network is separate from the Triton admin network. The underlay network only carries customer traffic; it isn't used to administer Triton. Like the admin network, it only exists in the scope of a given data center, and addresses on that network should not be accessible outside of the data center.

Every compute node that can support virtual networking is configured with an additional VNIC that is called underlay0. This VNIC belongs to the global zone and is configured on boot by Triton. The operating system is listening for VXLAN traffic on the IP address associated with underlay0 and when it sends VXLAN encapsulated traffic destined for a container or hardware virtual machine on another machine, it sends that traffic out of underlay0.

Underlay network physical design

Unlike the admin network, the underlay network does not have strict requirements in terms of the required topology. The following are the core rules that govern the underlay network:

It should be a separate network from the admin network.
All compute nodes on the underlay network must be able to reach each other; it's fine if there is routing.
Customers should not be able to send traffic to the underlay interfaces.

From here, there are many different designs that one could take. You could have a design where the underlay is simply one large layer three (L3) network on a single VLAN. Alternatively, it could be made up of many routable layer three (L3) networks, such as one VLAN and layer three (L3) network per rack in the data center, all interconnected and routable through various topologies.

Triton itself does not have any preference on what the physical topology is or should look like; operators should determine what design makes the most sense for their physical network.

Underlay network MTU size

Using encapsulation is not a free lunch: one side effect of using something like VXLAN is that it takes more bytes to put a packet on the wire. Traditionally, the default MTU on networks is 1500 bytes. An encapsulated packet adds another 40-60 bytes of overhead, which means that if the underlay network is at its default MTU, we can't run the overlay networks at a 1500 byte MTU.

To that end, the underlay network needs to enable jumbo frames and set the MTU to 9000. Setting it to 9000 allows us to do several important things. First, it allows us to leave the default MTU for public networks set to 1500, ensuring consistent behavior. Second, it means that we can increase the MTU that we use for private networks up from 1500 to, say, 8500 -- giving us plenty of room for potential overhead, while reducing the packet rate of streaming traffic.
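The MTU arithmetic can be worked through with a short sketch. It assumes an IPv4 underlay without IP options and an untagged inner Ethernet frame, and uses the standard header sizes for those protocols; the function name is hypothetical. Other configurations change the total, which is why the text above quotes a range.

```python
# Standard header sizes in bytes (assumption: IPv4 underlay, no IP options,
# no VLAN tag on the inner frame).
INNER_ETHERNET = 14   # the guest's Ethernet header travels as payload
VXLAN_HEADER = 8
UDP_HEADER = 8
OUTER_IPV4 = 20

ENCAP_OVERHEAD = OUTER_IPV4 + UDP_HEADER + VXLAN_HEADER + INNER_ETHERNET


def max_overlay_mtu(underlay_mtu: int) -> int:
    """Largest overlay MTU whose encapsulated packets fit in the underlay MTU."""
    return underlay_mtu - ENCAP_OVERHEAD


print(ENCAP_OVERHEAD)          # 50
print(max_overlay_mtu(1500))   # 1450: a 1500 byte overlay MTU would not fit
print(max_overlay_mtu(9000))   # 8950: comfortably above an 8500 byte overlay MTU
```

Under these assumptions, an 8500 byte overlay MTU leaves roughly 450 bytes of spare room on a 9000 byte underlay.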

Lookups -- the search for a MAC address

As we discussed earlier, there are two main parts to the overlay network.

Encapsulating and sending traffic on the overlay.
Figuring out where the remote host lives.

We mentioned earlier that the kernel is in charge of doing the encapsulation. When it encounters a destination that it isn't familiar with, it causes user land to do a lookup. We'll get more into the mechanics of the user/kernel interaction in the next section.

When doing these searches, we break addresses into three different categories.

VL3 address: A guest container or VM layer 3 address (usually IPv4 or IPv6). This address is on an overlay network and private to that customer.
VL2 address: A guest MAC address that is on an overlay network. This MAC address is assigned by NAPI.
UL3 address: An underlay layer 3 address. Every CN has one of these. This is the address that encapsulated packets are sent to and from. The UL3 network is one that no customer should be able to access.

There are two different questions that get asked:

If I have an IP address, what is the corresponding MAC address? (VL3->VL2 lookup)
If I have a MAC address, what is the IP address of the physical machine the container or hardware VM lives on? (VL2->UL3 lookup)
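The two questions can be pictured as two keyed tables. The sketch below uses in-memory dictionaries as hypothetical stand-ins for the real data store, and assumes that entries are keyed by VXLAN identifier because overlay addresses are private to each customer. The sample addresses are the ones used in the packet flow example later in this document.

```python
from typing import Optional

# Hypothetical stand-ins for the lookup tables, keyed by (VXLAN id, address).
VL3_TO_VL2 = {(42, "10.2.3.5"): "de:ad:be:ef:0:1"}
VL2_TO_UL3 = {(42, "de:ad:be:ef:0:1"): "172.6.7.9"}


def vl3_to_vl2(vxlan_id: int, ip: str) -> Optional[str]:
    """Which MAC address owns this overlay IP address?"""
    return VL3_TO_VL2.get((vxlan_id, ip))


def vl2_to_ul3(vxlan_id: int, mac: str) -> Optional[str]:
    """Which compute node (underlay IP address) hosts this overlay MAC address?"""
    return VL2_TO_UL3.get((vxlan_id, mac))


mac = vl3_to_vl2(42, "10.2.3.5")
print(mac, vl2_to_ul3(42, mac))
```

The same IP address on a different VXLAN identifier returns nothing, which reflects the isolation between customers.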

The authoritative source for this information is a Moray database that is owned by NAPI. Note that this may be a different Moray instance than the one that is used more generally by Triton.

Compute nodes initiate these lookups; however, they don't directly query the Moray database. Instead, they query one of many instances of the Portolan service. The compute nodes use a simple binary protocol over TCP to make lookup requests on behalf of the container or instance. For specifics on the protocol, please see the protocol design document in the Portolan repository.

The overall flow of a look up request is based on the following image:

                                                       +------+------+
                                 +------------+    +---| CN 0 | CN 1 |
  +-----------------+         +--| Portolan 0 |----|   +------+------+
  |                 |         |  +------------+    +---| CN 2 | CN 3 |
  |  Moray Database |         |                    |   +------+------+
  |                 |         |  +------------+    +---| CN 4 | CN 5 |
  | Updated by NAPI |---------+--| Portolan 1 |----|   +------+------+
  |  and Portolan   |         |  +------------+    +---| CN 6 | CN 7 |
  |                 |         |                    |   +------+------+
  +-----------------+         |  +------------+    +---| CN 8 | CN 9 |
                              +--| Portolan 2 |----|   +------+------+
                                 +------------+    +---| CN ...      |
                                                       +-------------+

The Portolan service is registered in DNS. The CNs do a DNS lookup to determine the set of Portolan servers that exist and maintain connections to them, distributing requests round robin and removing connections that are no longer valid.
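A minimal sketch of the round robin selection and the removal of bad backends follows. The class and method names are hypothetical, and the sketch takes a plain list of addresses; a real client would populate that list from the DNS lookup and hold an open connection per backend.

```python
class BackendPool:
    """Round robin over a set of backends, dropping ones that go bad."""

    def __init__(self, addresses):
        self.backends = list(addresses)
        self.next_index = 0

    def pick(self):
        """Return the next backend in rotation."""
        if not self.backends:
            raise RuntimeError("no usable backends")
        self.next_index %= len(self.backends)
        backend = self.backends[self.next_index]
        self.next_index += 1
        return backend

    def mark_bad(self, backend):
        """Remove a backend whose connection is no longer valid."""
        if backend not in self.backends:
            return
        position = self.backends.index(backend)
        self.backends.remove(backend)
        # Keep the rotation pointing at the same successor after the removal.
        if position < self.next_index:
            self.next_index -= 1


pool = BackendPool(["portolan0", "portolan1", "portolan2"])
print([pool.pick() for _ in range(4)])
pool.mark_bad("portolan1")
print([pool.pick() for _ in range(2)])
```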

A CN makes an svp (SDC VXLAN Protocol) request to one of the portolan servers. The portolan server will then make a request to the canonical moray database using the standard moray protocol.

To speed things up, as part of making the VL3->VL2 lookup request, we include the VL2->UL3 response, as this follows from the standard use of IP performing an ARP lookup and then sending traffic off to that guest. The VL3->VL2 lookups are cached in the instance as part of its ARP and NDP tables. The VL2->UL3 lookups are cached by the operating system kernel as part of its management of the overlay device.
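The effect of piggybacking the VL2->UL3 answer on the VL3->VL2 response can be sketched with a small caching client. Everything here is hypothetical: the class, the shape of the remote callable, and the use of plain dictionaries in place of the instance's ARP table and the kernel's overlay cache.

```python
class LookupClient:
    """Caches results so that one remote query answers both questions."""

    def __init__(self, remote):
        self.remote = remote        # callable: overlay IP -> (MAC, underlay IP)
        self.arp_cache = {}         # VL3 -> VL2, as kept by the instance
        self.overlay_cache = {}     # VL2 -> UL3, as kept by the overlay device
        self.remote_queries = 0

    def resolve_ip(self, ip):
        """VL3->VL2 lookup; also stores the piggybacked VL2->UL3 answer."""
        if ip not in self.arp_cache:
            self.remote_queries += 1
            mac, underlay_ip = self.remote(ip)
            self.arp_cache[ip] = mac
            self.overlay_cache[mac] = underlay_ip
        return self.arp_cache[ip]

    def resolve_mac(self, mac):
        """VL2->UL3 lookup served from the cache, or None on a miss."""
        return self.overlay_cache.get(mac)


client = LookupClient(lambda ip: ("de:ad:be:ef:0:1", "172.6.7.9"))
mac = client.resolve_ip("10.2.3.5")
print(client.resolve_mac(mac), client.remote_queries)
```

Resolving the IP address and then the MAC address costs a single remote query in this model, which is the saving the combined response is meant to provide.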

Network changes and shoot downs

The network is not a static place: instances are created and destroyed, IP addresses are reassigned, and whole instances are migrated. When these events occur, the information that we have cached needs to be invalidated in the tables and looked up again. Importantly, the system doesn't try to provide the new information; rather, it provides what we call a shoot down, an indication to a CN that something needs to be invalidated.

The Moray database that we use for this maintains what we call a shoot down table, essentially a series of invalidations that should be applied to a given overlay device on a given CN. An entry isn't removed from the table until the compute node acknowledges it. These entries are designed such that reapplying one may induce a short latency bump but is generally safe. This acknowledgement ensures that even if one of the components along the chain crashes, the shoot down is still applied.
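The acknowledge-before-remove behavior can be sketched as follows. The table layout, the names, and the idea of identifying entries by a numeric id are all assumptions made for illustration; the sketch only shows why an idempotent invalidation plus an explicit acknowledgement survives a crash between the two steps.

```python
class ShootdownTable:
    """Pending invalidations per (compute node, overlay device)."""

    def __init__(self):
        self._pending = {}   # (cn, overlay) -> list of (entry id, cache key)
        self._next_id = 0

    def add(self, cn, overlay, key):
        self._next_id += 1
        self._pending.setdefault((cn, overlay), []).append((self._next_id, key))
        return self._next_id

    def fetch(self, cn, overlay):
        """Return pending entries without removing them."""
        return list(self._pending.get((cn, overlay), []))

    def ack(self, cn, overlay, entry_id):
        """Remove an entry only once the compute node confirms it."""
        entries = self._pending.get((cn, overlay), [])
        self._pending[(cn, overlay)] = [e for e in entries if e[0] != entry_id]


def apply_shootdowns(table, cn, overlay, cache):
    for entry_id, key in table.fetch(cn, overlay):
        cache.pop(key, None)              # idempotent: safe to reapply
        table.ack(cn, overlay, entry_id)  # acknowledged only after applying


table = ShootdownTable()
table.add("cn-a", "overlay42", "de:ad:be:ef:0:1")
cache = {"de:ad:be:ef:0:1": "172.6.7.9"}
apply_shootdowns(table, "cn-a", "overlay42", cache)
print(cache, table.fetch("cn-a", "overlay42"))
```

If the compute node crashed after the cache removal but before the acknowledgement, the entry would still be pending and would simply be applied again, at the cost of one extra lookup.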

The flow of a packet

Consider two zones, each with a single interface, owned by the same customer. Zone 0 is on compute node A, while Zone 1 is on compute node B. In essence it looks like this:

 +-------------------------------+    +-------------------------------+
 | Compute Node A                |    | Compute Node B                |
 |  +--------------------------+ |    |  +--------------------------+ |
 |  | Zone 0                   | |    |  | Zone 1                   | |
 |  | +------+ 10.2.3.4/24     | |    |  | +------+ 10.2.3.5/24     | |
 |  | | net0 | de:ad:be:ef:0:0 | |    |  | | net0 | de:ad:be:ef:0:1 | |
 |  +--------------------------+ |    |  +--------------------------+ |
 |        |                      |    |        |                      |
 | +-----------+                 |    | +-----------+                 |
 | | overlay42 |                 |    | | overlay42 |                 |
 | +-----------+ +------------+  |    | +-----------+ +------------+  |
 |  +-------+    | underlay0  |  |    |  +-------+    | underlay0  |  |
 |  | varpd |    |172.6.7.8/24|  |    |  | varpd |    |172.6.7.9/24|  |
 |  +-------+    +------------+  |    |  +-------+    +------------+  |
 |      |               |        |    |      |               |        |
 | +-----------+   +-----------+ |    | +-----------+   +-----------+ |
 | |   Admin   |   | Customer  | |    | |   Admin   |   | Customer  | |
 | | Interface |   | Interface | |    | | Interface |   | Interface | |
 | |   ixgbe0  |   |  ixgbe1   | |    | |   ixgbe0  |   |  ixgbe1   | |
 +-------------------------------+    +-------------------------------+

Consider the case where Zone 0 runs the command:

# ping 10.2.3.5

The first thing that we should understand is how the overlay device, varpd, and underlay0 all fit together. Here's another image that better shows the different pieces of the outgoing data path:

  . . . . . . . . . . . . . . . . . . .  . . . . . . . . . . . . . . . .
  . Kernel *                          .  . Userland *                  .
  . ********                          .  .***********                  .
  .                                   .  .   +---------+               .
  .  +------------+      +---------+  .  .   | Virtual |   +--------+  .
  .  | TCP/IP/VND |      | Overlay |=======*=| ARP     |---| Lookup |  .
  .  +------------+   +->| Cache   |  .  . * | Daemon  |   | Plugin |  .
  .        |          |  | Lookup  |  .  . * +---------+   | Module |  .
  .    +------+       |  +---------+  .  . *               | (SVP)  |  .
  .    | VNIC |       |       |       .  . *               +--------+  .
  .    +------+       |       |       .  . * Async              |      .
  .        |          |       v       .  . * Upcall             |      .
  .  +------------+   |  +---------+  .  . . . . . . . . . . . .|. . . .
  .  |  Overlay   |->-+  | Encap   |  .                         |
  .  |  Device    |-<----| Plugin  |  .                         |
  .  +------------+      | Engine  |  .                   +----------+
  .        |             | (VXLAN) |  .                   | TCP      |
  .  +-------------+     +---------+  .                   | Portolan |
  .  | GZ IP Stack |                  .                   | Request  |
  .  +-------------+                  .                   +----------+
  .        |                          .
  .  +-----------------+              .
  .  | GZ VNIC, 9K MTU |              .
  .  | underlay0       |              .
  .  +-----------------+              .
  .        |                          .
  . . . . .|. . . . . . . . . . . . . .
           |
  +---------------------+
  |  Top of Rack Switch |
  +---------------------+

The zone generates the ICMP packet the same way that it normally does and sends it down into the IP stack. The IP stack doesn't necessarily have the MAC address for the IPv4 address 10.2.3.5. Because of that, it goes and generates an ARP request, and sends that out the VNIC. At the VNIC, the normal antispoof protection is applied and then instead of sending it out a physical device, as it would normally do, it instead sends it out to the overlay device.

When the packet reaches the overlay device, the first thing that we do is check its destination MAC address. Because this is an ARP request, it is a layer two (L2) broadcast to the address FF:FF:FF:FF:FF:FF. The kernel checks whether it has an entry for that address; it does not, so it queues the request for the varpd daemon.

The varpd daemon is the virtual ARP daemon. It logically performs the same service that ARP would traditionally have performed on a network, just without any broadcasting involved. To the network stack, it behaves in a similar way to ARP. That is, the stack has to assume that the lower level might disappear or that it may get no response. The daemon itself has various plugins which are in charge of answering three questions: what is the proper destination for this packet, should this packet be dropped, and should we inject anything into the guest. In Triton, we use the svp plugin, the SDC VXLAN protocol.

Normally a plugin just looks at the MAC header (the destination address and the ethertype) and chooses a course of action based on that. However, the broadcast address is special. We don't support broadcast traffic, and while we would normally drop it, we first check the ethertype to see if it is ARP traffic. Because it is ARP traffic, instead of dropping it we grab a copy of the corresponding packet and look at the ARP request. Here, we'd see that we're trying to figure out what MAC address corresponds to the IP 10.2.3.5, so we generate a VL3->VL2 lookup. varpd sends that out to a Portolan server, which asks Moray and replies.
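The decision described above can be sketched as a function over a raw Ethernet frame. This is an illustration only, not the svp plugin's code: the function name and the returned action labels are hypothetical, and the sketch handles only ARP for IPv4 over Ethernet on untagged frames.

```python
import struct

BROADCAST = b"\xff" * 6
ETHERTYPE_ARP = 0x0806
ARP_REQUEST = 1
ARP_FORMAT = "!HHBBH6s4s6s4s"  # htype, ptype, hlen, plen, oper, sha, spa, tha, tpa


def classify(frame: bytes):
    """Return a (hypothetical) action label and the address to look up."""
    if len(frame) < 14:
        return ("drop", None)
    dst, _src, ethertype = struct.unpack("!6s6sH", frame[:14])
    if dst != BROADCAST:
        return ("vl2_lookup", dst)        # unicast: find where the MAC lives
    if ethertype != ETHERTYPE_ARP or len(frame) < 42:
        return ("drop", None)             # broadcast is otherwise unsupported
    fields = struct.unpack(ARP_FORMAT, frame[14:42])
    oper, target_ip = fields[4], fields[8]
    if oper != ARP_REQUEST:
        return ("drop", None)
    return ("vl3_lookup", ".".join(str(b) for b in target_ip))


src = bytes.fromhex("deadbeef0000")
arp = struct.pack(ARP_FORMAT, 1, 0x0800, 6, 4, ARP_REQUEST,
                  src, bytes([10, 2, 3, 4]), b"\x00" * 6, bytes([10, 2, 3, 5]))
frame = BROADCAST + src + struct.pack("!H", ETHERTYPE_ARP) + arp
print(classify(frame))
```

For the ARP request in this walkthrough, the function reports a VL3->VL2 lookup for 10.2.3.5.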

When we receive the response to the VL3->VL2 lookup, we do two things. Recall that the VL3->VL2 response includes the VL2->UL3 answer as well. Therefore, the first thing that the plugin does is inject the VL2->UL3 mapping back into the kernel, which stores the mapping in the overlay device's lookup cache. This ensures that when the instance acts on the ARP reply that we'll inject shortly, the kernel won't have to make another varpd query to ask where the MAC address lives.

With the answer to the VL3->VL2 query, the plugin will work with varpd to create a fake ARP reply and inject that. Once that's been done, we end up dropping the original outgoing ARP packet and it never leaves the kernel.
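Constructing the injected ARP reply from the original request can be sketched as below. The function name is hypothetical and the sketch assumes ARP for IPv4 over untagged Ethernet; it only shows which fields are swapped to turn a request into a reply.

```python
import struct

ETHERTYPE_ARP = 0x0806
ARP_REPLY = 2
ARP_FORMAT = "!HHBBH6s4s6s4s"  # htype, ptype, hlen, plen, oper, sha, spa, tha, tpa


def build_arp_reply(request: bytes, answer_mac: bytes) -> bytes:
    """Build a reply saying that the requested IP address is at answer_mac."""
    _dst, requester_mac, _ethertype = struct.unpack("!6s6sH", request[:14])
    htype, ptype, hlen, plen, _oper, sha, spa, _tha, tpa = struct.unpack(
        ARP_FORMAT, request[14:42])
    ethernet = struct.pack("!6s6sH", requester_mac, answer_mac, ETHERTYPE_ARP)
    # The sender of the reply is the address that was asked about; the target
    # is whoever sent the request.
    arp = struct.pack(ARP_FORMAT, htype, ptype, hlen, plen, ARP_REPLY,
                      answer_mac, tpa, sha, spa)
    return ethernet + arp


requester = bytes.fromhex("deadbeef0000")
request = (b"\xff" * 6 + requester + struct.pack("!H", ETHERTYPE_ARP)
           + struct.pack(ARP_FORMAT, 1, 0x0800, 6, 4, 1, requester,
                         bytes([10, 2, 3, 4]), b"\x00" * 6, bytes([10, 2, 3, 5])))
reply = build_arp_reply(request, bytes.fromhex("deadbeef0001"))
print(reply.hex())
```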

The kernel only queues a certain amount of data for a given overlay device to ensure that if a user is trying to pathologically overload us, it can only end up using a fixed amount of memory.
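A byte-budgeted queue is one simple way to picture that bound. The sketch below is an assumption-laden illustration: the class name is hypothetical, and dropping the newest packet when the budget is exhausted is a policy chosen for the example, not something this document specifies.

```python
from collections import deque


class BoundedPacketQueue:
    """Holds packets awaiting a lookup, up to a fixed number of bytes."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.queued_bytes = 0
        self.packets = deque()

    def enqueue(self, packet: bytes) -> bool:
        """Queue a packet, or refuse it if that would exceed the budget."""
        if self.queued_bytes + len(packet) > self.max_bytes:
            return False   # assumed policy: drop rather than grow without bound
        self.packets.append(packet)
        self.queued_bytes += len(packet)
        return True

    def dequeue(self) -> bytes:
        """Remove and return the oldest queued packet."""
        packet = self.packets.popleft()
        self.queued_bytes -= len(packet)
        return packet


queue = BoundedPacketQueue(max_bytes=100)
print(queue.enqueue(b"x" * 60), queue.enqueue(b"y" * 60), queue.queued_bytes)
```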

With the response to the ARP query, the IP layer will now go and construct the proper packet for the ICMP ping request and send that down along to the MAC layer, which is where the VNIC's antispoof properties are checked. From there, it moves on and enters the overlay device.

The overlay device checks its lookup cache to see if we know where the destination for this packet is. Because we already cached that from the ARP lookup, we don't have to leave the kernel. We generate the VXLAN header for the packet, prepend it to the message, and then send it out through a kernel UDP socket to the destination that the lookup table gave us.

In this case, the lookup table would have told us that we should send the packet to the IP address 172.6.7.9. So a VXLAN packet is written out of ixgbe1 via the VNIC underlay0, with a source IP address of 172.6.7.8 and a destination address of 172.6.7.9, UDP port 4789 (the IANA-assigned port for VXLAN).
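The VXLAN header that gets prepended is eight bytes long: a flags byte with the "identifier valid" bit set, three reserved bytes, the 24-bit VXLAN identifier, and one more reserved byte. A minimal userland sketch of building it follows; the function name is hypothetical, and the real work happens in the kernel rather than in code like this.

```python
import struct

VXLAN_PORT = 4789            # IANA-assigned UDP port for VXLAN
VXLAN_FLAG_ID_VALID = 0x08   # the only flag that must be set


def vxlan_encapsulate(vxlan_id: int, inner_frame: bytes) -> bytes:
    """Prepend a VXLAN header to an Ethernet frame."""
    if not 0 <= vxlan_id < (1 << 24):
        raise ValueError(f"VXLAN identifier out of range: {vxlan_id}")
    # First word: flags in the top byte, the rest reserved.
    # Second word: identifier in the top 24 bits, low byte reserved.
    header = struct.pack("!II", VXLAN_FLAG_ID_VALID << 24, vxlan_id << 8)
    return header + inner_frame


packet = vxlan_encapsulate(42, b"inner ethernet frame")
print(packet[:8].hex())   # 0800000000002a00
```

The resulting bytes would form the payload of a UDP datagram sent to port 4789 on the destination compute node's underlay address.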

It then travels the normal switch path to reach compute node B. Compute node B receives the packet on 172.6.7.9 and unwraps the VXLAN packet. It looks at the VXLAN identifier and determines if it matches a known overlay device. Once it determines that it does, it delivers it into the kernel's general packet receive routine, the same logical one that it enters for a normal packet. However, in this case, it only considers VNICs created on top of overlay42, as opposed to those that would be on top of a physical device.
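The receive side can be sketched as the reverse operation: parse the eight byte VXLAN header, check that the identifier is marked valid, and match it against the overlay devices that exist on this compute node. The function name and the dictionary of known overlays are hypothetical stand-ins for kernel state.

```python
import struct

VXLAN_FLAG_ID_VALID = 0x08


def vxlan_decapsulate(packet: bytes, overlays: dict):
    """Return (overlay device, inner frame), or None if the packet is dropped."""
    if len(packet) < 8:
        return None
    flags_word, id_word = struct.unpack("!II", packet[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_ID_VALID:
        return None                      # identifier not marked valid
    device = overlays.get(id_word >> 8)  # identifier is the top 24 bits
    if device is None:
        return None                      # no overlay device for this identifier
    return device, packet[8:]


known = {42: "overlay42"}
packet = struct.pack("!II", 0x08 << 24, 42 << 8) + b"inner ethernet frame"
print(vxlan_decapsulate(packet, known))
```

Only after this match would the inner frame be handed to the VNICs created on top of that overlay device.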

At that point, the ICMP ping packet is received by Zone 1's IP stack, which will generate an ICMP reply, and send it back to Zone 0, where we do a similar series of look ups, just going in reverse.

Differences from traditional networks

Traffic on an overlay network is different from a traditional network. Importantly, the following kinds of traffic aren't supported:

Broadcast Traffic
Multicast Traffic

Traditional protocols that are required by IP, such as ARP and NDP, are instead emulated by varpd. This means that certain network debugging techniques such as trying to ping the broadcast address of an IP network and getting responses from everything on it, will not work.

Bootstrapping networking state in the global zone

The network state that a global zone needs in order to bootstrap overlay networking goes beyond what we traditionally passed to it, so we've changed how that state is delivered. Traditionally this information was passed in on the kernel command line. However, that was not very flexible.

Instead, we're now using what we call boot time modules. When the compute node boots, instead of passing in information via the kernel command line, we can pass in additional files to the boot loader, which will preserve them for when the operating system starts up. Each file is followed by a hash, which allows us to verify that the files were transmitted across the network correctly and were not subject to errors.

This information is read early in boot and is used to bootstrap the state necessary for the compute node in such a way that later dynamic updates will build on top of this state. This new method simplifies the act of passing in the necessary instructions for the underlay network such as the address for the VNIC, any necessary routes, and more.
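The integrity check on a boot time module can be sketched as a plain digest comparison. This document does not say which hash algorithm is used or how the hash is encoded, so the SHA-1 default, the hexadecimal encoding, the function name, and the sample file contents below are all assumptions for illustration.

```python
import hashlib


def verify_boot_module(data: bytes, expected_hex: str,
                       algorithm: str = "sha1") -> bool:
    """Check a module's contents against the hash delivered alongside it."""
    actual = hashlib.new(algorithm, data).hexdigest()
    return actual == expected_hex.strip().lower()


# Hypothetical module contents; the real file format is not described here.
module = b'{"underlay0": {"ip": "172.6.7.8", "netmask": "255.255.255.0"}}'
digest = hashlib.sha1(module).hexdigest()

print(verify_boot_module(module, digest))          # True
print(verify_boot_module(module + b" ", digest))   # False
```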
