A couple of months ago, my advisor asked me if I wanted to develop a small part of the DDiDD project: a component that checks incoming DNS packets and automatically replies to any with an invalid domain, freeing the DNS server from having to respond to them. Sounds simple, right? There’s one catch - the packets needed to be processed at line rate, which in my case meant 40 gigabits per second.

40 gigabits per second of pure DNS packets, assuming packets of about 80 bytes each, means that the program would have to process 62.5 million packets every second. That gives me 16 nanoseconds to process each packet, or about 67 CPU cycles on a single core of a 4.2GHz processor (assuming that the bridge between my NIC and my processor has zero latency). This is not enough time for the Linux kernel’s network stack to even send a packet (L. Rizzo, 2012, p. 3).

So, what I needed was a library that could provide:

  • Fast packet I/O at 40Gbps
  • Ability to create virtual interfaces to communicate with the DNS server on the same machine
  • Ability to read and modify packets before forwarding them

The Competition

There are many libraries out there that promise higher packet processing speeds than the Linux kernel. Most of them rely on hardware features of the NIC or on kernel bypass techniques. It’s interesting to note that most of these methods rely on polling rather than interrupts, since at such high packet rates the overhead of interrupts would actually slow down packet processing.

P4

Luckily, some of the NICs in the testbed I’m using contain FPGAs and support P4, which essentially turns our NIC into a programmable switch. However, P4 doesn’t offer an easy way of looking at a packet’s payload, only its headers. It also requires buying expensive, specialized hardware, which would limit where we could deploy the software.

Mellanox VMA

Mellanox’s VMA works by using LD_PRELOAD to override the kernel’s network calls with its own, which lowers the number of costly memcpys, interrupts, and context switches you have to do.

Solarflare EF_VI

Cloudflare does a much better job of explaining this on their blog than I will here. It works in a similar way to Mellanox’s solution.

Netmap

Netmap is a collection of kernel modules that allows for fast packet I/O, and it also lets you create virtual network interfaces for non-netmap programs to use. However, it requires patched drivers and supports fewer NICs than DPDK does.

DPDK

DPDK is another high-speed kernel bypass library, sponsored by Intel, which supports a wide array of network interfaces but has a troublesome API that requires rewriting the network stack for everything above the physical layer. While far from ideal, it was the library I was most familiar with, so I ended up using it for this project.

The Kernel NIC Interface

One of the more interesting modules in DPDK is the kernel NIC interface (KNI), which lets you create a virtual interface for non-DPDK programs to use. It’s also faster than traditional virtual interfaces, since it cuts out some of the costly transitions between kernelspace and userspace. This requires a kernel module, kmod/rte_kni.ko. For my use case, I’ll be loading it with carrier=on so I won’t have to bother with rte_kni_update_link().
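
Setting one of these interfaces up boils down to filling in an rte_kni_conf and calling rte_kni_alloc. Here’s a minimal sketch, assuming rte_kni_init() has already been called, mbuf_pool is an existing rte_mempool, and DPDK 19.x-era struct fields (they vary slightly between releases); error handling is mostly omitted:

  // The rte_kni.ko module must be loaded first, e.g. insmod rte_kni.ko carrier=on
  struct rte_kni_conf conf;
  struct rte_kni_ops ops;
  memset(&conf, 0, sizeof(conf));
  memset(&ops, 0, sizeof(ops));

  snprintf(conf.name, RTE_KNI_NAMESIZE, "vEth0");   // Interface name the kernel will see
  conf.group_id = 0;                                // Tied to physical port 0
  conf.mbuf_size = RTE_MBUF_DEFAULT_BUF_SIZE;

  ops.port_id = 0;   // Optional callbacks (MTU change, link up/down) are left unset

  struct rte_kni *kni = rte_kni_alloc(mbuf_pool, &conf, &ops);
  if (kni == NULL)
    rte_exit(EXIT_FAILURE, "Failed to create KNI interface\n");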

The developers also provide a handy example application in their repos, which forwards incoming data from a physical NIC interface to a KNI interface, and vice versa. The magic here happens in the kni_egress and kni_ingress functions, which mirror each other.

Each interface has RX and TX ring buffers, which store packets until they’re read. That makes sending and receiving packets without processing them rather simple. To transmit a packet, just push an rte_mbuf containing the packet into the TX ring, and to receive a packet, read the RX ring into a different rte_mbuf. These operations are handled by rte_eth_tx_burst and rte_eth_rx_burst for our physical interface, and rte_kni_tx_burst and rte_kni_rx_burst for our KNI one. So, all the program needs to do is read data in with rte_*_rx_burst and then write those rte_mbufs out to the other interface using rte_*_tx_burst.
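
As a rough sketch, one direction of that loop, NIC to KNI, looks something like this (this isn’t the example app’s exact code; PKT_BURST_SZ, port_id, and kni are assumed to be set up elsewhere):

  struct rte_mbuf *pkts[PKT_BURST_SZ];

  // Poll a burst of packets from RX queue 0 of the physical port
  uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, pkts, PKT_BURST_SZ);

  // Hand the same mbufs to the KNI interface
  unsigned nb_tx = rte_kni_tx_burst(kni, pkts, nb_rx);

  // Free anything the KNI's ring couldn't accept
  for (unsigned i = nb_tx; i < nb_rx; i++)
    rte_pktmbuf_free(pkts[i]);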

Since this example doesn’t modify the packets in any way, it’s fully transparent to the end user, except that all traffic is now routed through vEth* instead of eth*.

Fun With Ring Buffers

DPDK also exposes the ring buffers directly to the user, which is a core component of this project. By having kni_ingress write to a new ring buffer instead of the KNI TX ring, I can have another thread running to do work on those packets. Here’s what that looks like:

For this, I’ll need 4 threads. The first forwards packets from the NIC to the WORKER_RX_RING. The second reads from the WORKER_RX_RING and parses each packet: any DNS packet with a TLD that isn’t on ICANN’s list of valid TLDs is passed to the WORKER_TX_RING, while the rest continue on to the KNI interface. The third thread sends the invalid-TLD responses from the WORKER_TX_RING back out through the NIC, and the fourth passes outgoing packets from the KNI to the NIC.
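
Here’s a rough sketch of the rings and the worker loop. has_invalid_tld() is a hypothetical stand-in for the DNS check described in the next section, the ring size of 1024 is arbitrary, and PKT_BURST_SZ and kni are assumed to be defined elsewhere:

  // Single-producer/single-consumer rings shared between the forwarding
  // threads and the worker thread
  struct rte_ring *worker_rx_ring = rte_ring_create("WORKER_RX_RING", 1024,
      rte_socket_id(), RING_F_SP_ENQ | RING_F_SC_DEQ);
  struct rte_ring *worker_tx_ring = rte_ring_create("WORKER_TX_RING", 1024,
      rte_socket_id(), RING_F_SP_ENQ | RING_F_SC_DEQ);

  // Worker loop body: pull a burst, classify, and route each packet
  struct rte_mbuf *pkts[PKT_BURST_SZ];
  unsigned nb_rx = rte_ring_dequeue_burst(worker_rx_ring, (void **)pkts,
                                          PKT_BURST_SZ, NULL);

  for (unsigned i = 0; i < nb_rx; i++) {
    if (has_invalid_tld(pkts[i])) {
      // Will be rewritten into an NXDOMAIN reply; queue it back toward the NIC
      if (rte_ring_enqueue(worker_tx_ring, pkts[i]) < 0)
        rte_pktmbuf_free(pkts[i]);            // Ring full: drop
    } else {
      // Legitimate query: pass it on to the DNS server through the KNI
      if (rte_kni_tx_burst(kni, &pkts[i], 1) == 0)
        rte_pktmbuf_free(pkts[i]);            // KNI ring full: drop
    }
  }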

Decoding a DNS Packet

Now that we have our ring workers passing data between each other, we’ll also have to parse the incoming DNS packets and read them into our program. Here’s an example of a DNS query packet opened in Wireshark:

What we’re focusing on here is the DNS section, so we can skip the first 42 bytes, which are the layer 2-4 headers. At the beginning of the DNS header, we have 2 bytes that act as an identifier so the client can match replies to queries. After that, we have two bytes of flags, which RFC 1035 goes into in detail in section 4.1.1. The next 4 sets of 2 bytes list the number of questions, answers, name server resource records (RRs), and additional RRs in the packet. It’s important to note that these fields are big-endian, which means you’ll have to byte-swap them when running on a little-endian architecture like x86_64. We’ll skip the additional RRs and focus on the question RRs.
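
To make that layout concrete, here’s roughly how the fixed 12-byte header can be read. The dns_header struct is my own for this post (DPDK doesn’t provide one), pkt is assumed to be the rte_mbuf holding the packet, and the 42-byte offset assumes untagged Ethernet with no IP options:

  // All fields are big-endian on the wire, hence rte_be_to_cpu_16 below
  struct dns_header {
    uint16_t id;         // Identifier echoed back so the client can match replies
    uint16_t flags;      // QR, opcode, AA, TC, RD, RA, RCODE (RFC 1035 4.1.1)
    uint16_t qd_count;   // Number of questions
    uint16_t an_count;   // Number of answers
    uint16_t ns_count;   // Number of name server RRs
    uint16_t ar_count;   // Number of additional RRs
  } __attribute__((packed));

  // Skip the 14-byte Ethernet, 20-byte IPv4, and 8-byte UDP headers
  struct dns_header *dns = rte_pktmbuf_mtod_offset(pkt, struct dns_header *, 42);
  uint16_t questions = rte_be_to_cpu_16(dns->qd_count);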

Each query (or question resource record) splits the name into labels at each dot, so in our case ns5.SPOTIFY.COM becomes ns5, SPOTIFY, and COM. Each label is preceded by a byte giving its length, so ns5 is encoded as 03 6e 73 35, and the same applies to the other two labels. The name ends with the null terminator 00. Following that we’ve got indicators for the query type (0x0001 in this case, for an A record) and the query class (0x0001, for Internet addresses).

Knowing this, it’s trivial to implement an algorithm to go through the query name until we find the TLD:

  // Walk the query name label by label until we find the TLD.
  // qname points at the first length byte of the query name.
  std::string tld;
  int str_len, offset = 0;
  char *qname_start = qname;

  while (true) {
    // Read in the length of the current label
    str_len = *qname;
    // Jump to the next length byte
    qname = qname + str_len + 1;
    if (*qname != 0x0)
      offset += str_len + 1;   // Not the last label yet; track where the next one starts
    else
      break;                   // Next byte is the terminator: the label at
                               // qname_start + offset is the TLD
  }

  // Pull the TLD (e.g. "COM") out of the packet
  tld.assign(qname_start + offset + 1, (size_t)*(qname_start + offset));

In this case, the TLD is COM. That happens to be a valid TLD, so we’ll push this packet into the KNI’s RX_QUEUE and continue. But what if it isn’t?

Building a Packet

Thankfully, for our use case, I didn’t need to include authority sections or anything extra. All I had to do was modify the existing packet in place (thus saving on mallocs and memcpys) by swapping the destination address and port with the source address and port in the Ethernet, IPv4, and UDP headers.
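
Here’s a sketch of that swap, assuming pkt is the rte_mbuf holding the query, there are no VLAN tags or IP options, and DPDK 21.11+ field names (older releases call the Ethernet fields s_addr and d_addr):

  struct rte_ether_hdr *eth_hdr = rte_pktmbuf_mtod(pkt, struct rte_ether_hdr *);
  struct rte_ipv4_hdr *ip_hdr = (struct rte_ipv4_hdr *)(eth_hdr + 1);
  struct rte_udp_hdr *udp_hdr = (struct rte_udp_hdr *)(ip_hdr + 1);

  // Ethernet: swap source and destination MAC addresses
  struct rte_ether_addr tmp_mac = eth_hdr->src_addr;
  eth_hdr->src_addr = eth_hdr->dst_addr;
  eth_hdr->dst_addr = tmp_mac;

  // IPv4: swap source and destination addresses
  uint32_t tmp_ip = ip_hdr->src_addr;
  ip_hdr->src_addr = ip_hdr->dst_addr;
  ip_hdr->dst_addr = tmp_ip;

  // UDP: swap source and destination ports
  uint16_t tmp_port = udp_hdr->src_port;
  udp_hdr->src_port = udp_hdr->dst_port;
  udp_hdr->dst_port = tmp_port;

Following that, I’ll modify the flags to signal NXDOMAIN while keeping everything else the same, using a simple bitmask: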

// Modify DNS headers
*(dns_hdr + 2) |= 0b10000000;   // Set the QR bit: this packet is now a response
                                // (the opcode and remaining flags carry over)
*(dns_hdr + 3) = 0b00000011;    // RCODE 3: Name Error (NXDOMAIN)

One last thing to do now: regenerate the IPv4 checksum (and zero out the UDP one, which is optional over IPv4):

  // Recompute the IPv4 checksum (the field must be zeroed before calling rte_ipv4_cksum)
  ip_hdr->hdr_checksum = 0;
  ip_hdr->hdr_checksum = rte_ipv4_cksum(ip_hdr);
  udp_hdr->dgram_cksum = 0;    // A zero UDP checksum means "not computed" over IPv4

And that’s it! All that’s left to do is push the packet onto the WORKER_TX_RING, and it’ll be sent back out through the NIC.

The code for this project is available here