Tuesday morning one of the guys on my team woke me with a text saying a competitor was claiming 1.44 microseconds for a full round trip (RT) using UDP. Two things about this immediately struck me as strange: first, it was reported as a full round trip number, and second, the number (excluding units) was oddly close to what I'd thought the theoretical 1/2 RT limit might be. You see, in the ultra-low latency, high frequency trading market, time is everything. One need only be a few nanoseconds faster than a competitor to win the lion's share of the business. So speed is everything, but in the end physics sets the speed limit.
In an ideal world, if one were to measure the time required for a UDP packet to enter a network server adapter, traverse the Ethernet controller chip, travel the host PCIe bus, pass through the Intel CPU complex, and finally end up in memory, they'd find that this journey takes roughly 730 nanoseconds. This varies across Intel server families and clock rates. We could be off by as much as +/- 100 nanoseconds, since measuring at this level is pretty challenging, but 730 nanoseconds is a reasonable number to start with. It should also be noted that this is with Solarflare's current 7000 series Ethernet Controller ASIC.
Breaking this down further, the most expensive part of this trip is the 500 nanoseconds or so the UDP packet will spend in Solarflare's Ethernet controller chip. This chip is arguably the most popular low latency Ethernet Controller ASIC on the market today. It includes a high performance PHY layer, an L2 switch, and built-in PCIe controller logic, so everything happens within this single chip. Over 1,000 financial trading firms rely on this technology daily. Most of the world's financial exchanges, and nearly all of their high performance customers, depend on Solarflare, and as such they've turned all the dials possible to squeeze out every available nanosecond. Add to this 150 nanoseconds, the time the packet will spend traveling across the PCIe bus using DMA to cache via DDIO (not RAM), and finally another 80 nanoseconds or so to store it in RAM, making your final total 730 nanoseconds to receive a packet to memory. Again, your mileage will vary considerably, so please only use these numbers as rough reference points. For a 1/2RT you'll need to double this number (a receive plus a send), which brings the 1/2RT total to 1,460 nanoseconds, or 1.46 microseconds. It should also be noted that receives and sends have different costs. Sends are often less time consuming, so again your numbers will vary, and the real 1/2RT figure should in fact be somewhat smaller than 1.46 microseconds. That's Solarflare physics. Solarflare has a new 8000 series Ethernet Controller ASIC coming out soon which will further trim down the 500 nanoseconds spent in the ASIC, but by exactly how much is still a closely guarded secret.
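If you'd like to rerun this arithmetic with your own measurements, here is a minimal Python sketch of the budget. The stage names and function names are mine, purely for illustration, and the sketch assumes a send costs the same as a receive. As noted above, sends are often cheaper, so treat the 1/2RT result as an upper estimate rather than a measurement.

```python
# Rough per-stage receive latencies quoted in this post, in nanoseconds.
# These are estimates (roughly +/- 100 ns overall), not measured values.
RECEIVE_STAGES_NS = {
    "ethernet_controller": 500,  # time spent inside the controller ASIC
    "pcie_dma_to_cache": 150,    # PCIe transfer, DMA to cache via DDIO
    "cache_to_ram": 80,          # final store to RAM
}


def one_way_ns(stages=RECEIVE_STAGES_NS):
    """Total time for one pass (wire to memory) across all stages."""
    return sum(stages.values())


def half_round_trip_ns(stages=RECEIVE_STAGES_NS):
    """One receive plus one send.

    Assumption: the send is costed the same as the receive, which
    overstates the total because sends are often cheaper.
    """
    return 2 * one_way_ns(stages)


print(f"Receive to memory: {one_way_ns()} ns")
print(f"1/2RT estimate:    {half_round_trip_ns()} ns")
```

Swap in your own per-stage numbers in the dictionary to see how a different server family or controller shifts the total.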
So is 1.44 microseconds for a conventional (through to user space vs. done completely in an FPGA) full round trip possible today? Well, the PCIe and memory components alone total 920 nanoseconds (150 nanoseconds for the PCIe bus plus 80 nanoseconds for CPU to memory, with both multiplied by four to cover a full round trip). That leaves 520 of the claimed 1,440 nanoseconds to traverse the Ethernet controller logic four times, or 130 nanoseconds for each pass. Considering that the most popular low-latency Ethernet controller chip on the planet requires 500 nanoseconds, doing it in 130 nanoseconds with the same degree of utility is highly unlikely.
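The same sanity check can be written as a few lines of Python. This is only a sketch: the constant and function names are illustrative, and the inputs are the rough estimates quoted in this post, not vendor-published figures.

```python
CLAIMED_FULL_RT_NS = 1440   # the competitor's verbal claim, 1.44 microseconds
PCIE_NS = 150               # estimated PCIe bus time per pass
MEMORY_NS = 80              # estimated CPU-to-memory time per pass
PASSES = 4                  # two receives and two sends in a full round trip
KNOWN_CONTROLLER_NS = 500   # rough time per pass in the 7000 series ASIC


def controller_budget_per_pass_ns(claimed_ns, pcie_ns, memory_ns, passes):
    """Time left for the Ethernet controller on each pass once the
    PCIe and memory costs are subtracted from the claimed total."""
    fixed_ns = (pcie_ns + memory_ns) * passes
    return (claimed_ns - fixed_ns) / passes


budget = controller_budget_per_pass_ns(
    CLAIMED_FULL_RT_NS, PCIE_NS, MEMORY_NS, PASSES
)
print(f"Controller budget per pass: {budget:.0f} ns")
print(f"Speedup needed over {KNOWN_CONTROLLER_NS} ns: "
      f"{KNOWN_CONTROLLER_NS / budget:.1f}x")
```

In other words, the claim only holds if the controller is several times faster per pass than the 500 nanosecond figure used here.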
On checking the competitor's data sheet for this product, we found that they have documented 1.82 microseconds for a UDP 1/2RT using 64 byte packets. Compare this to the 1.44 microseconds they claimed verbally for a full round trip, and one can see that they've significantly stretched the truth. If it sounds too good to be true, it probably is...