Friday, January 6, 2017

RoCE versus TCP for Low-Latency Applications

The effectiveness of our communication as a species is one of our defining characteristics. Earlier this week, while waiting in a customer's lobby in Chicago, I noticed four framed posters displaying all the hand signals used in the trading pits of four major markets. Having been focused on electronic trading for the past decade, I found this "ancient" form of communication an instant curiosity worthy of inspection. On reflection I was amazed to think that trillions of dollars in transactions had been conducted over decades purely by people motioning with their hands.

About a decade ago in the High Performance Computing (HPC) market, a precursor market for High Frequency Trading (HFT), there was a dust-up regarding the effectiveness of Remote Direct Memory Access (RDMA). One of Myricom's senior researchers wrote an article for HPCWire titled "A Critique of RDMA" that set off a chain reaction of critical response articles.
At the time Myricom was struggling to establish relevance for its new Myrinet-10G protocol against a competing technology, Infiniband, which was rapidly gaining traction. Now, to be fair, at the time I was in sales at Myricom. The crux of the article was that the one-sided RDMA communications model, which rose from the ashes of the Virtual Interface Architecture (VIA), was still more of a problem than a solution when compared to the existing two-sided Send/Recv model used by four other competing HPC protocols (QsNet, SeaStar, Infinipath & Myrinet Express).

Now RDMA has had a decade to improve as it spread from Infiniband to Ethernet under the name RDMA over Converged Ethernet (RoCE), but it still has performance issues. The origin of RDMA is cast in a closed, lossless, layer-2 Infiniband network with deterministic latency. Let's take a moment and adopt a NASCAR analogy: think of RDMA as the vehicle and Infiniband as the track. One can take a Sprint Cup Series vehicle tuned for the Charlotte Motor Speedway out for a spin on the local roads, but is that really practical (it certainly isn't legal)? Yes, its origin is in the stock car, but how well will it do in stop-and-go traffic, particularly on uphill grades? How about parallel parking? Oh wait, there's no reverse. Tight turns at low speeds, signaling, weather, etc. Sprint Cup Series vehicles are designed for 200 MPH on a closed, extremely well-defined and maintained track. Ethernet, by contrast, is the road driven by everyone else: it's unpredictable, filled with thousands of obstacles, and ever changing.

Those familiar with Ethernet know that losslessness and deterministic latency are not two characteristics normally associated with this network fabric. Some of us have been around the block and lived through Carrier Sense Multiple Access with Collision Detection (CSMA/CD), where packets often collided and random delays before retransmission attempts were common. TCP/IP was developed during these early days, and it was designed with packet loss as a key criterion. In the past three decades Ethernet has evolved considerably from its roots as a shared coax cable utilizing vampire taps to where we are today with dedicated twisted-pair cabling and fiber optics, but on rare occasion packets are still dropped, and performance isn't always deterministic. Today most packet drops are the result of network congestion. As discussed, TCP/IP is equipped to handle this; unfortunately, RoCE is not.

For RoCE to perform properly it requires a lossless layer-2 network, essentially a NASCAR track overlaid onto our public roads. To accomplish this over a routed Ethernet network a new set of protocols was developed: Data Center Bridging (DCB), along with the Data Center Bridging Capabilities Exchange protocol (DCBX). DCB is used at every hop of the network to negotiate and create a lossless layer-2 fabric on top of Ethernet. It achieves this by more tightly managing queue overflows and by adjusting network flow priorities as if they were traversing separate physical media. RoCE traffic is prioritized into essentially its own carpool lane ahead of other traffic in hopes of avoiding drops as a result of congestion. While this all sounds great, in talking with several large Web 2.0 customers who've invested years in RoCE we learned that the vast majority will never deploy it in production. There are far too many challenges to getting and keeping it working, and it suffers under high traffic volumes. Unlike Infiniband HPC clusters, which are stood up as self-contained networks (closed-course race tracks) to address specific computational problems, Ethernets are in a constant state of flux, with servers and switches being added and removed (our public road system) as the needs of the business change. To be clear: TCP/IP is resilient to packet loss, while RoCE is not.

On the latency performance side of things, in the past decade we've achieved roughly one microsecond for a half round trip (a send plus a receive) with both TCP and UDP when using Solarflare's OpenOnload. This is in line with RoCE latency, which is also in the domain of one microsecond. Keep in mind that normal TCP or UDP transactions over 10GbE typically run in the range of 5 to 15 microseconds, so one microsecond is a huge improvement. By now you're likely saying "So what?" For most applications, like file sharing and databases, the difference between one microsecond and even fifteen microseconds is lost in the 10,000+ microseconds a whole transaction might take. It turns out, though, that there are new breeds of network-latency-sensitive applications that depend on technologies like Non-Volatile Memory Express (NVMe), neural networks, and high-volume compound web transactions that can see significant improvements when latency is reduced. When low-latency TCP is applied to these problems the performance gains are both measurable and significant.
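If you're curious how numbers like these are gathered, below is a minimal sketch of the classic ping-pong measurement: bounce a small message off an echo server many times, then divide the elapsed time to get the half-round-trip figure. The host address, port, message size and iteration count are all illustrative assumptions, not a Solarflare benchmark harness.

```c
/* Minimal TCP ping-pong latency sketch (not a Solarflare benchmark).
 * Echoes a small message off a remote server and reports the average
 * half round trip. Host, port, and counts are illustrative assumptions. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 100000
#define MSG_SIZE   64

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    struct sockaddr_in addr = { 0 };
    char buf[MSG_SIZE] = "ping";
    struct timespec t0, t1;

    /* Disable Nagle so each small message goes out immediately. */
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

    addr.sin_family = AF_INET;
    addr.sin_port = htons(9000);                        /* assumed port */
    inet_pton(AF_INET, "192.168.1.10", &addr.sin_addr); /* assumed host */
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERATIONS; i++) {
        send(fd, buf, MSG_SIZE, 0);
        ssize_t got = 0;
        while (got < MSG_SIZE) {                /* wait for the full echo */
            ssize_t n = recv(fd, buf + got, MSG_SIZE - got, 0);
            if (n <= 0) { perror("recv"); return 1; }
            got += n;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Each iteration is one full round trip; divide by two for the
     * half-round-trip (send + receive) figure quoted above. */
    printf("1/2 RTT: %.2f microseconds\n", ns / ITERATIONS / 2.0 / 1000.0);
    close(fd);
    return 0;
}
```

The nice part is that a plain sockets program like this needs no changes to benefit from kernel bypass; launched under OpenOnload the same binary exercises the accelerated path.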

So the next time someone suggests RoCE, ask if they've considered a little-known competing protocol called TCP/IP. While RoCE is the shiny new object, TCP/IP has had several decades of innovation behind it, which explains why it's the underlying "language of the Internet." Consider asking those promoting RoCE what their porting budget is, and whether they've factored in the cost of the new network switches that will be required to support DCB. It's very likely that the application they want to deploy already supports TCP/IP, and if latency and throughput are key factors then consider contacting Solarflare about OpenOnload. OpenOnload accelerates existing sockets-based applications without requiring that they be modified.


Sunday, July 31, 2016

The Fifth Wave in the 10GbE Market

In 2003 we saw the emergence of the 10GbE server adapter market with only a handful of players; we'll call this the first wave. Early products by Neterion and Intel carried extremely high price tags, often approaching $10K. This led to a flood of companies jumping into the market in an effort to secure an early-mover advantage. High Performance Computing (HPC) companies like Myricom, with its Myrinet 2G, and Mellanox, with Infiniband SDR 10G, were viewed by some as possibly having a competitive advantage, as they'd already developed silicon in this area. In August of 2005 I joined Myricom to help them transition from HPC to the wider Ethernet market. By March of 2006 we launched a single-port 10GbE product with a $595 price point: in three years the market price had dropped 10X. That year the 10GbE market had grown to 18 different companies all offering 10GbE server adapters; we'll consider this the second wave. In my 2013 article "Crash & Boom: Inside the 10GbE Adapter Market" I explored what had happened up to that point to take the market from 18 players down to 10; you guessed it, the third wave. Today only six companies remain who are actually advancing the Ethernet controller market forward, and this is perhaps the fourth wave.

Intel is the dominant 10GbE adapter market player. They are viewed by many as the commodity option that checks the majority of the feature boxes while delivering reasonable performance. Both Mellanox and QLogic are the exascale players, as their silicon carries Infiniband-specific features which they've convinced this market are important. In storage Chelsio rules, as they've focused considerable silicon towards offloading the computational requirements of iSCSI. For the low latency and performance over BSD-compliant TCP and UDP sockets sought by the financial traders of the world, Solarflare is king. This leaves one remaining actor, Broadcom; in fact they were acquired by Avago, who also picked up Emulex. The word is they've dramatically cut their Ethernet controller development staff right after having completed their 25GbE controller ASIC, which may be why we've not seen it reach the market.

So as the 10GbE market sees feature and performance gains while the silicon is migrated over the next several years to 25GbE and 50GbE, expect to continue seeing these five players dominate in their respective niches: Intel, Mellanox, QLogic, Solarflare & Chelsio. I view this final phase as the fifth wave.

Wednesday, July 20, 2016

Stratus and Solarflare for Capital Markets and Exchanges

by David Whitney, Director of Global Financial Services, Stratus

The partnership of Stratus, the global standard for fault-tolerant hardware solutions, and Solarflare, the unchallenged leader in application network acceleration for financial services, at face value seems like an odd one. Stratus ‘always on’ server technology removes all single points of failure, which eliminates the need to write and maintain costly code to ensure high availability and fast failover scenarios. Stratus and high performance have rarely been used in the same sentence.

Let's go back further… Throughout the 1980s and '90s Stratus, and their proprietary VOS operating system, globally dominated financial services from exchanges to investment banks. In those days the priority for trading infrastructures was uptime, which was provided by resilient hardware and software architectures. With the advent of electronic trading the needs of today's capital markets have shifted. High Frequency Trading (HFT) has resulted in an explosion in transactional volumes. Driven by the requirements of one of the largest stock exchanges in the world, Stratus realized that critical applications need to be not only highly available, but also extremely focused on performance (low latency) and deterministic (zero jitter) behavior.

Stratus provides a solution that guarantees availability in mission-critical trading systems without the costly overhead associated with today's software-based High Availability (HA) solutions, and without the need for multiple physical servers. You could conceivably cut your server footprint in half by using a single Stratus server where before you'd need at least two physical servers. Stratus is also a "drop and go" solution: no custom code needs to be written, and there is no concept of applications custom-built for Stratus FT. This isn't just for Linux environments; Stratus also has hardened OS solutions for Windows and VMWare as well.

Solarflare brings low-latency networking to the relationship with their custom Ethernet controller ASIC and the Onload Linux operating-system-bypass communications stack. Normally network traffic arrives at the server's network interface card (NIC) and is passed to the operating system through the host CPU. This process involves copying the network data several times and switching the CPU's context from kernel to user mode one or more times. All of these events take both time and CPU cycles. With over a decade of R&D, Solarflare has considerably shortened this path. Under Solarflare's control, applications often receive data in about 20% of the time it would typically take. The savings is measured in microseconds (millionths of a second), typically several or more. In trading, speed often matters most, so a dollar value can be placed on this savings. Back in 2010 one trader valued the savings at $50,000 per microsecond for each day of trading.
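To make that concrete, the hot path in question is ordinary BSD sockets code. Here's a minimal sketch of a busy-polling receive loop of the kind Onload accelerates untouched; the port number is an assumption for illustration.

```c
/* Sketch of a busy-polling receive loop using plain BSD sockets; the
 * port is an assumption. Under the kernel stack every recv() crosses
 * into the kernel; under Onload the very same calls are serviced in
 * user space, avoiding the copies and context switches. */
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);              /* assumed service port */
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 1);

    int fd = accept(lfd, NULL, NULL);
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);   /* spin rather than sleep */

    char buf[2048];
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0) {
            /* handle the market data or request here */
        } else if (n == 0) {
            break;                            /* peer closed the session */
        } else if (errno != EAGAIN && errno != EWOULDBLOCK) {
            break;                            /* genuine error */
        }
        /* nothing arrived yet: poll again immediately */
    }
    close(fd);
    close(lfd);
    return 0;
}
```

Because the bypass stack intercepts these standard calls at the sockets layer, the application sees the same API while skipping the kernel path described above.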

Both Stratus and Solarflare have worked together to dramatically reduce jitter to nearly zero. Jitter is caused by those seemingly inevitable events that distract a CPU core from its primary task of electronic trading. For example, the temperature of a thermal sensor somewhere in the system may exceed a predetermined level, raising a system interrupt. A CPU core is then assigned to handle that interrupt and determine which fan needs to be turned on or sped up. While this event sounds trivial, the distraction of processing the interrupt and returning to trading often results in a delay measured in hundreds of microseconds. Imagine your trading strategy normally executes in tens of microseconds, network latency adds 1-2 microseconds, and then all of a sudden the system pauses your trading algorithm for 250 microseconds while it does some system housekeeping. By the time control is returned to your algo it's very possible that the value of what you're trading has changed. Both Stratus and Solarflare have worked exceedingly hard to remove jitter from the FT platform.
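For the curious, one common jitter mitigation on Linux is to pin the latency-critical thread to a dedicated core so housekeeping interrupts land elsewhere. Below is a minimal sketch; the core number is an assumption, and in a real deployment it would be paired with kernel options like isolcpus and careful interrupt affinity settings.

```c
/* Minimal sketch of pinning a latency-critical thread to one core.
 * Core 3 is an assumption; production systems pair this with kernel
 * options (e.g. isolcpus) and interrupt affinity so housekeeping
 * work never lands on the trading core. Build with -lpthread. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *trading_loop(void *arg)
{
    (void)arg;
    for (;;) {
        /* hot path: poll the feed, run the algo, send orders */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(3, &set);                         /* assumed isolated core */

    pthread_create(&t, NULL, trading_loop, NULL);
    if (pthread_setaffinity_np(t, sizeof(set), &set) != 0)
        fprintf(stderr, "failed to pin trading thread\n");

    pthread_join(t, NULL);                    /* runs until killed */
    return 0;
}
```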

Going forward, Solarflare and Stratus will be adding Precision Time Protocol support to a new version of Onload for the Stratus FT Platform.

Tuesday, July 19, 2016

Black Hat 2016 - Packet Filtering in the NIC


Solarflare wants to talk with you at Black Hat in Las Vegas next month, and we're raffling off a Wifi Pineapple to those who sign up for a meeting. What is a Wifi Pineapple, you ask? Perhaps one of the best tools available for diagnosing wireless security issues.

At Black Hat Solarflare will be talking about their new line of SFN8xxx series adapters that support five-tuple packet filtering directly in hardware. The SFN8xxx series adapters support thousands of filters, plus an additional one thousand counters that can be applied to track filter usage. Along with filtering we'll be discussing the tamper-proof nature of this new line of adapters and its capability to support over-the-wire firmware or filter-table updates via an SSL/TLS link directly to the controller on the adapter.
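For those new to the term, a five-tuple is the source IP, destination IP, source port, destination port and protocol of a packet. The sketch below shows the classification logic in plain C with invented addresses so you can see what's being decided; on the SFN8xxx series the equivalent match and hit counters are evaluated in the adapter itself before traffic ever reaches the host.

```c
/* Software sketch of five-tuple classification with a hit counter.
 * Addresses and ports are invented; on the SFN8xxx adapters the
 * equivalent match and counters are evaluated in hardware. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct five_tuple {
    uint32_t src_ip, dst_ip;      /* IPv4, host byte order for clarity */
    uint16_t src_port, dst_port;
    uint8_t  protocol;            /* 6 = TCP, 17 = UDP */
};

struct filter {
    struct five_tuple match;
    bool     drop;                /* action when matched */
    uint64_t hits;                /* per-filter usage counter */
};

static bool filter_matches(struct filter *f, const struct five_tuple *p)
{
    const struct five_tuple *m = &f->match;
    if (m->src_ip == p->src_ip && m->dst_ip == p->dst_ip &&
        m->src_port == p->src_port && m->dst_port == p->dst_port &&
        m->protocol == p->protocol) {
        f->hits++;                /* track usage, as the adapter counters do */
        return true;
    }
    return false;
}

int main(void)
{
    struct filter f = {
        .match = { 0xC0A80101, 0xC0A80102, 49152, 443, 6 }, /* invented */
        .drop  = true,
    };
    struct five_tuple pkt = { 0xC0A80101, 0xC0A80102, 49152, 443, 6 };

    if (filter_matches(&f, &pkt) && f.drop)
        printf("packet dropped (filter hits=%llu)\n",
               (unsigned long long)f.hits);
    return 0;
}
```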

To learn more or to set up a meeting for Wednesday, August 3rd or Thursday, August 4th at Black Hat, please send an email to scollins@solarflare.com, and you'll be automatically enrolled in our drawing for a Wifi Pineapple.

Friday, July 1, 2016

OCP & the Death of the OEM Mezz Card

Since the early days of personal computing we’ve had expansion cards. The first Apple and Radio Shack TRS-80 micro-computers enabled hackers like myself to buy a foundational system from the OEM, then over time upgrade it with third-party peripherals. For example, my original TRS-80 Model III shipped with 4KB of RAM and a cassette tape drive (long-term data storage, don’t ask). Within a year I’d maxed the system out to 48KB of RAM (16KB per paycheck) and a pair of internal single-sided, single-density 5.25” floppy drives (90KB of storage per side). A year or so later the IBM PC debuted and transformed what was a hobby for most of us into a whole new market: personal computing (PC). For the better part of two decades IBM led the PC market with an open-standards approach; yeah, they brought out MicroChannel Architecture (MCA) and PC Network, but we won’t hold that against them. Then in 2006, as the push towards denser server computing reached a head, IBM introduced the BladeCenter H, a blade-based computing chassis with integrated internal switching. This created an interesting new twist in the market: the OEM-proprietary mezzanine I/O card format (mezz), unique to the IBM BladeCenter H.

At that time I was with another 10Gb Ethernet adapter company managing their IBM OEM relationship. To gain access to the new specification for the IBM BladeCenter H mezz card standard you had to license it from IBM. This required that your company pay IBM a license fee (a serious six-figure sum), or provide them with a very compelling business case for how your mezz card adapter would enable IBM to sell thousands more BladeCenter H systems. In 2006 we went the business case route, and in 2007 delivered a pair of mezz cards and a new 32-port BladeCenter H switch for the High Performance Computing (HPC) market. All three of these products required a substantial amount of new engineering to create OEM-specific products for a captive IBM customer base. Was it worth it? Sure, the connected revenue was easily well into the eight figures. Of course IBM couldn’t be alone in having a unique mezz card design, so soon HP and Dell debuted their blade products with their own unique mezz card specifications. Now having one, two or even three OEM mezz card formats to comply with isn’t that bad, but over the past decade nearly every OEM from Dell through SuperMicro, and a bunch of smaller ones, has introduced various unique mezz card formats.

Customers define markets, and huge customers can really redefine a market. Facebook is just such a customer. In 2011 Facebook openly shared their data center designs in an effort to reduce the industry’s power consumption. Learning from other tech giants, Facebook spun off this effort into a 501(c) non-profit called the Open Compute Project Foundation (OCP), which quickly attracted rock-star talent to its board like Andy Bechtolsheim (SUN & Arista Networks) and Jason Waxman (Intel). Then in April of last year Apple, Cisco, and Juniper joined the effort, and by then OCP had become an unstoppable force. Since then Lenovo and Google have hopped on the OCP wagon. So what does this have to do with mezz cards? Everything. OCP is all about an open system design with a very clear specification for a whole new mezz card architecture. Several of the big OEMs, and many of the smaller ones, have already adopted the OCP specification. In early 1Q17 servers sporting Intel’s Skylake Purley architecture will hit the racks, and we’ll see the significant majority of them supporting the new OCP mezz card format. I’ve been told by a number of OEMs that the trend is away from proprietary mezz card formats and towards OCP. Hopefully this will last for at least the next decade.

Monday, May 23, 2016

Beyond SDN: Micro-Segmentation, Macro-Segmentation or Application-Segmentation Part-2

Large publicly traded companies like Cisco, EMC (VMWare) and Arista Networks are deeply entrenched with their customers, giving them a beachhead from which they can fairly easily launch new products. Since their brands and value are well understood and established, it’s often a matter of just showing up with a product that is good enough to win new business. By contrast, startups like Illumio and Tufin have to struggle to gain brand recognition and work exceptionally hard to secure each and every proof-of-concept (PoC) engagement. For a PoC to be considered successful these new startups have to demonstrate significant value over the entrenched players, as they also need to overcome the institutional inertia behind every buying decision. So how are Illumio or Tufin any different, and what value could they possibly deliver to justify even considering them? While both Illumio and Tufin are focused on making enterprises and the deployment of enterprise applications more secure, they each leverage a dramatically different approach. First we’ll explore Tufin, then Illumio.

Tufin has a feature called the Interactive Topology Map, which enables them to traverse your entire physical network, including your use of hybrid clouds, to craft a complete map of your infrastructure. This enables them to quickly display on a single pane of glass how everything in your enterprise is connected. They then offer visual path analysis, from which you can explore your security and firewall policies across your network. Furthermore, you can use a sophisticated discovery mechanism by which you select two systems, and it explores the path between them and displays all the security policies that might impact data flows between these two systems. In actual practice, as you define an application within Tufin you can leverage this sophisticated discovery or manually define the source, destination and service. Tufin will then show you the status of the connection, at which point you can drill down to see what, if any, components in your infrastructure require a change request. They then have a six-step change ticket workflow: Request, Business Approval, Risk Identification, Risk Review, Technical Design, and Auto Verification. To date they appear to support the following vendors: Cisco, Check Point, Palo Alto Networks, Fortinet, Juniper, F5, Intel Security, VMWare NSX, Amazon Web Services, Microsoft Azure and OpenStack.

By contrast Illumio takes a much different approach: it designs security from the inside out, with no dependencies on infrastructure. They attach an agent to each enterprise application as it is launched. This attached agent then watches over every network flow into and out of the application, and records exactly what the application requires to be effective. From this it computes a security policy for that application that can then be enforced every time that application is launched. It can then adapt to workflow changes, and it also has the capability to encrypt all data flowing between systems. While their videos and data sheets don’t specifically say this, it sounds as though they’ve placed a shim into the network OS stack hosting the application so that they can record all the network traffic flow characteristics; that’s likely how they support on-the-fly encryption between nodes. They do call out that they use IPTables, so it is possible that their code is an extension of this pre-existing security platform. Clearly, though, they sit just above the adapter, and Jimmy Ray confirms in one of his awesome videos that Illumio is based on an "adapter security platform" view. Illumio then provides an enterprise management application to gather the flow data from all these agents to craft and manage its view of the network.
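To illustrate the "record, then enforce" idea, here's a deliberately simplified sketch of deriving a default-deny allow-list from observed flows. To be clear, this is my guess at the shape of the approach, not Illumio's actual implementation; all names and flow data are invented.

```c
/* Deliberately simplified "learn then enforce" sketch: allow exactly
 * the flows observed while watching an application, deny the rest.
 * All data is invented; this is not Illumio's implementation. */
#include <stdint.h>
#include <stdio.h>

struct flow {
    uint32_t peer_ip;   /* IPv4, host byte order */
    uint16_t port;
    uint8_t  proto;     /* 6 = TCP */
};

int main(void)
{
    /* Flows recorded during the observation phase (made-up). */
    struct flow observed[] = {
        { 0x0A000105, 5432, 6 },  /* database tier    */
        { 0x0A000106,  443, 6 },  /* upstream service */
    };
    size_t n = sizeof(observed) / sizeof(observed[0]);

    /* Computed policy: default deny, allow only what was seen. */
    for (size_t i = 0; i < n; i++)
        printf("ALLOW proto=%u peer=0x%08x port=%u\n",
               observed[i].proto, observed[i].peer_ip, observed[i].port);
    printf("DENY  everything else\n");
    return 0;
}
```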

So while Tufin looks into your network from the outside and enumerates what it finds, Illumio looks from the application out. Both are impressive and yield interesting perspectives worthy of evaluating. Moving forward, both are tuned to deliver Application-Segmentation. It will be interesting to see how the market evaluates each; both had strong presences at RSA 2016, but it will ultimately be revenue from customers that determines success.

Monday, May 16, 2016

Beyond SDN: Micro-Segmentation, Macro-Segmentation or Application-Segmentation Part-1

Software Defined Networking (SDN) originally came out of work done to extend the functionality of the Java software framework as early as 1995. In 1998 a number of the key people from Sun Microsystems and Javasoft left to found WebSprocket, the first commercial implementation of SDN. Two years later Gartner recognized SDN as an emerging market and created a new category to track commercial efforts engaged in this space. Now, some 15+ years later, we have cyber warfare, espionage, and financially motivated hackers constantly questing for the chewy center that is our enterprise data. While SDN has addressed some of the deficiencies found in perimeter-only systems, it’s still not comprehensive enough. Some have proposed, and possibly even implemented, setting up zero-trust zones for key enterprise servers where the default policy for access to these systems is "Deny All". Then, as applications are added to a server, specific access controls are added to the switch port of that server to enable the new functionality. The problem with this approach is that it can quickly become very tedious to craft and daunting to maintain. This still leaves several problems.

The smallest practical unit on which security can be applied is the IP address. While a switch can have different policies for each Virtual Machine's (VM) unique IP address, the switch itself will never see VM-to-VM traffic within the same server. Using switch Access Control Lists (ACLs) to enforce security policies at the application and server port layers can quickly tax a switch's Content Addressable Memory (CAM). Finally, we have maintenance: can you quickly resolve why a given switch or firewall has a specific security policy or rule? Most organizations are incapable of knowing the origin of every single rule, as often no centralized, cross-vendor, auditable database exists.

To address some of these issues VMWare and Cisco crafted a new approach they named Micro-Segmentation, which defines a new overlay framework where a software management layer takes control of both the hypervisor's virtual soft switch and the enterprise switching fabric, providing a single management perspective. VMWare branded this NSX, and it offers three major advantages: management down to the virtualized Network Interface Card (NIC), automated deployment of the security policy with the VM, and extension of the management framework to include legacy switching. Including legacy switching isn't just for compliance; it addresses all the issues around deployment across the entire enterprise. Cisco calls this Application Centric Infrastructure (ACI).

Not wanting to be left out of the post-SDN ecosystem, Arista Networks added to this by crafting a Macro-Segmentation view. Rather than using an overlay framework instantiated as a series of services woven into the hypervisor, they are leveraging existing firewall services in both software and hardware. These firewalls can then be easily stood up or reconfigured between servers by leveraging existing software and hardware from well-established firewall vendors like Fortinet and Palo Alto Networks. Actually weaving together a management layer that includes both switching and firewalls is much more coherent.

The best solution, though, resides somewhere between Micro-Segmentation and Macro-Segmentation, and some have called it Application-Segmentation, because in the end all we really care about is the security of the applications we deploy on our infrastructures. So while VMWare, Cisco & Arista have taken a sort of bottom-up approach, a new breed of network security orchestration applications from companies like Illumio and Tufin have entered the fray taking a top-down, Application-Segmentation approach to the same problem. More on this to come in Part-2 of Beyond SDN.