Sunday, October 26, 2014

Four Reasons Why 10GbE NIC Design Matters

Apple, Ferrari, & Manolo Blahnik have all proven the immense power artful design has over our purchasing decisions. But does this apply to products we don’t see? Is design just as critical there? Yes, and just last week a master mechanic who’d rebuilt my friend's transmission explained why. He handed me the failed Ford part, which was made out of pressed steel, and explained that it typically lasts 200K miles. He then showed me the equivalent Chevy part, which is cast, considerably heavier, and rarely fails. So why doesn’t Ford use a cast metal part? Simple: one of Ford's key design criteria is weight. Lower weight helps Ford meet the Corporate Average Fuel Economy (CAFE) regulations, which establish the ratio of high MPG to low MPG vehicles a company can sell. So to save two pounds on this part, perhaps 1/100 of an MPG, Ford is intentionally designing its truck transmission to fail at 200K miles.

Well, what about 10GbE NIC design? The NIC is perhaps the third most important subsystem in your server, behind the CPU & memory, so how does design matter here? For this discussion we’ll compare Solarflare’s SFN7002F to Intel’s X520-DA2 (which uses Intel's 82599 controller). Solarflare’s key design criterion is low latency, a fancy way of saying reducing the time it takes to get your data into & out of main memory. Contrast this with Intel, whose focus is on producing a general purpose commodity product that is designed to meet the widest range of requirements. We’re going to look at four areas that highlight significant performance differences between these two approaches: transmit & receive queues, MSI-X interrupt vectors, Receive Side Scaling (RSS) queues, and physical/virtual functions (a foundational approach to supporting virtualized computing).

First we need to look at what a network interface card (NIC) really does. A NIC receives information from an external network, and places it into your computer's memory. It also takes information provided to it, and places that on the network. Solarflare’s approach to networking has been honed over the past five years by servicing the financial markets of the world, the folks who dollarize every nanosecond of the trading day. As such, Solarflare has 1,024 transmit & receive queues, or Virtual NICs (vNICs), connected to each 10GbE port. On the Ethernet controller chip Solarflare has also placed a layer-2 network switch in front of those 1,024 vNICs. This network switch can use the packet’s VLAN tag to intelligently steer packets to the proper vNIC assigned to a given VLAN. Intel, on the other hand, has only 128 receive/transmit queues attached to each port, or 1/8th of what Solarflare provides.
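To make the VLAN steering idea concrete, here is a minimal software sketch of it in Python. The 802.1Q tag layout (TPID 0x8100 at byte offset 12, VLAN ID in the low 12 bits of the next two bytes) is standard; everything else, including the `steer` function, the `vlan_to_vnic` table, and the `make_frame` helper, is a hypothetical illustration and not Solarflare's actual hardware logic.

```python
import struct

NUM_VNICS = 1024       # per-port vNIC count quoted in the post
TPID_8021Q = 0x8100    # EtherType value that marks an 802.1Q VLAN tag


def vlan_id(frame: bytes):
    """Return the 802.1Q VLAN ID of an Ethernet frame, or None if untagged."""
    if len(frame) < 16:
        return None
    # Bytes 0-11 are the destination and source MACs; the tag follows them.
    tpid, tci = struct.unpack_from("!HH", frame, 12)
    if tpid != TPID_8021Q:
        return None
    return tci & 0x0FFF  # low 12 bits of the tag control field


def steer(frame: bytes, vlan_to_vnic: dict, default_vnic: int = 0) -> int:
    """Pick a vNIC index for a frame from a (hypothetical) VLAN table."""
    vnic = vlan_to_vnic.get(vlan_id(frame), default_vnic)
    assert 0 <= vnic < NUM_VNICS
    return vnic


def make_frame(vid=None, pcp=0) -> bytes:
    """Hypothetical helper: build a dummy frame, tagged if vid is given."""
    macs = bytes(12)
    payload = bytes(46)
    if vid is None:
        return macs + struct.pack("!H", 0x0800) + payload
    tci = (pcp << 13) | vid
    return macs + struct.pack("!HHH", TPID_8021Q, tci, 0x0800) + payload


if __name__ == "__main__":
    table = {100: 7, 200: 42}  # VLAN ID -> vNIC index, made up for the demo
    for vid in (100, 200, 300, None):
        print(vid, "->", steer(make_frame(vid), table))
```

A dictionary lookup stands in for the on-chip switch here. The point is only that the VLAN tag alone is enough to pick a queue before the packet ever reaches host memory.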

Message Signaled Interrupts (MSI-X) are a very common way to inform the processor that data is waiting at an I/O device to be picked up. With PCI Express, we no longer have dedicated hardware interrupt request lines, so I/O devices have to use a shared messaging interface to inform the processor that data is waiting. Solarflare supports 1,024 MSI-X interrupts compared to Intel’s 128. Again, Intel offers 1/8th of the underlying infrastructure necessary for passing high performance data to the host CPU complex. All of these numbers are per port.
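If you are curious how many interrupt vectors a driver has actually registered for an interface on a Linux host, one rough way is to count the matching lines in `/proc/interrupts`. The sketch below assumes the driver names its vectors after the interface (for example `eth0` or `eth0-TxRx-0`); naming varies by driver, and the sample text is made up for illustration.

```python
def count_vectors(proc_interrupts_text: str, ifname: str) -> int:
    """Count interrupt lines whose name column mentions the interface.

    Assumes vector names are either the bare interface name or start
    with "<ifname>-"; real drivers may use other naming schemes.
    """
    count = 0
    for line in proc_interrupts_text.splitlines():
        fields = line.split()
        # Interrupt rows start with "<irq>:"; the CPU header row does not.
        if not fields or not fields[0].endswith(":"):
            continue
        names = [f.rstrip(",") for f in fields[1:]]
        if any(n == ifname or n.startswith(ifname + "-") for n in names):
            count += 1
    return count


# Illustrative sample only, not captured from a real machine.
SAMPLE = """\
           CPU0       CPU1
  0:         42          0   IO-APIC    2-edge      timer
 64:       1200          3   PCI-MSI 524288-edge      eth0-TxRx-0
 65:          7       1185   PCI-MSI 524289-edge      eth0-TxRx-1
 66:          1          0   PCI-MSI 524290-edge      eth0
 67:        310         12   PCI-MSI 1048576-edge      eth1-TxRx-0
"""

if __name__ == "__main__":
    print("eth0 vectors:", count_vectors(SAMPLE, "eth0"))
```

On a live system you would pass in the contents of `/proc/interrupts` instead of `SAMPLE`. The count you see is what the driver chose to allocate, which can be well below the hardware maximum discussed above.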

As computers moved from one processor chip to two, then from single core chips to now 18 cores/chip (Intel), the challenge has always been mapping pathways from I/O devices directly to these processor cores. One of the most efficient mechanisms for linking cores to Ethernet receive queues is a process known as Receive Side Scaling (RSS). On Intel servers PCI Express slots have an affinity for a specific CPU socket. So for optimal performance you align your 10G Ethernet NICs to utilize specific CPUs by the PCI Express slot you install them into. Suppose for example you have a state-of-the-art Haswell dual socket server, and each socket has an 18 core processor. For optimal performance you might install two dual port 10GbE adapters, one in a slot that maps to CPU socket 0, and a second in a slot that maps to CPU socket 1. With this approach you can then achieve peak network performance on your server. There’s a problem though: Intel’s 82599 controller only supports 16 RSS queues per port, so two cores on each of your sockets will receive less than optimal performance, as their traffic has to be routed through other cores. Solarflare’s controller, on the other hand, has 64 RSS queues per port, and can easily spread traffic over multiple paths to every core in your server.
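The core-versus-queue arithmetic above can be sketched in a few lines of Python. This is a simplified model that assumes one port serving the cores of one socket, with at most one dedicated RSS queue per core. The `pick_queue` function uses CRC32 purely as a stand-in hash; real RSS hardware typically uses a Toeplitz hash plus an indirection table, which this sketch does not implement.

```python
import zlib


def cores_without_queue(cores_per_socket: int, rss_queues_per_port: int) -> int:
    """Cores left without a dedicated RSS queue in this simplified model."""
    return max(0, cores_per_socket - rss_queues_per_port)


def pick_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
               num_queues: int) -> int:
    """Map a flow to a queue so that all packets of a flow land together.

    CRC32 is a stand-in for illustration, not the hash a NIC uses.
    """
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % num_queues


if __name__ == "__main__":
    for name, queues in (("Intel 82599", 16), ("Solarflare", 64)):
        idle = cores_without_queue(18, queues)
        print(f"{name}: {queues} RSS queues, {idle} of 18 cores without one")
    print("flow -> queue:", pick_queue("10.0.0.1", "10.0.0.2", 40000, 80, 16))
```

With the figures from the post, the model gives two uncovered cores per socket for 16 queues and none for 64, which is the gap described above.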

Today many computers rely on virtualization to fully utilize all the resources of the server. To do this, network adapters support what are called Physical & Virtual Functions. Physical Functions (PFs) are a method for exposing to the virtual machine’s hypervisor what are essentially complete physical instances of the network adapter. Solarflare supports 16 Physical Functions while Intel only supports two. Virtual Functions (VFs) are a method for creating fully virtualized NICs; Solarflare supports 240 while Intel only has 128. In testing with 32 VMs running, Solarflare has demonstrated that it delivers 18% greater overall performance. Note that these PF & VF numbers are per adapter.
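For readers who want the four comparisons side by side, the short Python sketch below tabulates the figures quoted in this post and expresses Intel's count as a fraction of Solarflare's. The numbers come straight from the text above; the `SPECS` table and `intel_share` function are just illustrative names.

```python
from fractions import Fraction

# (Solarflare SFN7002F, Intel X520-DA2) as quoted in this post.
SPECS = {
    "TX/RX queues per port": (1024, 128),
    "MSI-X vectors per port": (1024, 128),
    "RSS queues per port": (64, 16),
    "Physical Functions per adapter": (16, 2),
    "Virtual Functions per adapter": (240, 128),
}


def intel_share(solarflare: int, intel: int) -> Fraction:
    """Intel's count as an exact fraction of Solarflare's."""
    return Fraction(intel, solarflare)


if __name__ == "__main__":
    print(f"{'Resource':32} {'SF':>5} {'Intel':>5} {'Share':>6}")
    for name, (sf, intel) in SPECS.items():
        share = str(intel_share(sf, intel))
        print(f"{name:32} {sf:>5} {intel:>5} {share:>6}")
```

Keep in mind that the first three rows are per port while the PF and VF rows are per adapter, so the rows are not directly comparable with each other.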

Earlier we mentioned that Solarflare's NICs were designed around latency, yet we’ve not covered latency. Solarflare’s generic kernel device driver that runs on their commodity SFN7002F adapter delivers sub 4 microsecond latency for a half round trip. Note that a half round trip is a single send & receive combined. Intel with their generic driver is more than double this. Furthermore, Solarflare also sells an optional driver called OpenOnload that further reduces latency to under 1.7 microseconds! When it comes to overall server performance, network adapter latency really does matter.
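To put those latency figures in perspective, here is a back-of-the-envelope Python sketch that turns a half round trip time into an upper bound on strictly serial request/response exchanges per second on one connection. It assumes a full round trip is two half round trips and ignores application processing time entirely. The 8 microsecond Intel figure is an assumption taken from "more than double" and is a lower bound on Intel's latency, not a measured value.

```python
def max_serial_round_trips_per_sec(half_round_trip_us: float) -> float:
    """Upper bound on back-to-back ping-pong exchanges per second.

    Assumes one full round trip costs two half round trips and that
    the application itself adds no delay.
    """
    return 1_000_000 / (2 * half_round_trip_us)


if __name__ == "__main__":
    cases = {
        "Solarflare kernel driver (4 us)": 4.0,
        "Solarflare OpenOnload (1.7 us)": 1.7,
        "Intel generic driver (assumed 8 us)": 8.0,
    }
    for name, half_rtt in cases.items():
        rate = max_serial_round_trips_per_sec(half_rtt)
        print(f"{name}: about {rate:,.0f} round trips/sec")
```

Real workloads pipeline requests and spend time in the application, so treat these as ceilings for a single chatty connection rather than throughput predictions.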

So next time you’re selecting components for a server deployment please consider what you’ve learned above about 10GbE NICs, and “Choose Wisely.”

3 comments:

  1. Hi Scott, I just saw this post and really enjoyed it. I got clued into Solarflare when Cloudflare blogged about them, and I'm amazed that the NICs only cost a few hundred bucks – I assumed they'd be MUCH more expensive, given the tie-in to HFT and do-or-die financial firms.

    I wonder if at some point we do away with the separation of NIC and CPU. It seems like if we want the best performance, the CPU should migrate toward the network ports, and ultimately the NIC becomes the CPU or the CPU becomes the NIC. So much of what these servers are doing revolves around networking anyway that networking, or responding to requests, IS the application. I guess the Big Iron 18-core Intel server CPU is too much to put on a NIC at this point, but maybe in ten years... I'd like to see a server that is basically a SFP+ module, kind of like those cool SATA DOMs – the server would basically be wrapped around the network port.

    Take it easy.

  2. Joe, there are several answers to your second paragraph. Intel is focused on bringing the network into the host CPU. Conversely, Mellanox (via BlueField) is bringing the CPU into the NIC, creating a Network Processing Unit (NPU). Note that this is not new; Mellanox bought the IP ultimately from Tilera, which had peddled it unsuccessfully for two decades.

    Look to your car as a good analogy: you have an engine and a transmission as two distinct units for a number of good reasons. One reason is that the same engine that sits in a truck could also be placed in an SUV or perhaps even a passenger vehicle, while the transmissions for each of these should be different. The truck is optimized for power to tow and haul, while the car might be optimized for speed.

    Also it should be noted that the people who design engines are very good at engines, but are likely unfamiliar with all the "tricks" to properly design a highly efficient transmission. While those designing transmissions are very good at that, but would likely make poor engine designers. Combining those into a single unit while it sounds ideal that single unit

    If on the other hand you're designing an engine/transmission for a very specific application only, then bringing both teams together and creating a unified engine/transmission could produce a more efficient solution. Think Tesla, and how their electric motors leverage built in transmissions, and how these units directly power drive wheels. That's one reason why a Tesla has the production car 0-60 acceleration record.

    So the best answer is that on a per-application basis it can be very beneficial to combine processing and networking into a single package, but for general computing use it's often best to separate them.

  3. Thanks Scott, good answer. I think the convergence makes most sense for simple applications. For example, a web server. As far as I can tell, web servers aren't that complicated. Having to implement HTTP/2 in addition to HTTP/1.1 makes them more complicated, but even so what web servers do is pretty straightforward.

    Given that, I'm surprised we don't see web-servers-on-a-chip. Putting things in hardware is supposed to pay off when the thing is something you do a lot, and what these servers do is extremely repetitive and the protocols are fairly stable. People seem satisfied with the performance of nginx and IIS, but a server-on-a-chip embedded in a NIC would probably have attractive energy-use characteristics compared to a full-blown Xeon E5 platform.

    Same drill for things like Redis and memcached – those seem like simple applications that are wasting electricity running on monster Intel chips. Though I think the problem there would be that RAM may not be dense enough to fit the necessary amount into a NIC or SFP+ transceiver form factor.

    Ultimately, I'm imagining a 1U mega-server that looks like a switch – dozens of SFP+ ports (or its future successor), with a standalone server behind each one, including the compute and memory/NVM.
