In his talk "The Race to Zero" last week at Flagg Management's HPC for Wall Street show, Dr. Greg Rodgers of AMD Research discussed the rise of highly integrated Heterogeneous System Architectures (HSA). For the past six years I've exhibited at both of Russell Flagg's annual shows, and during that time I've seen many different approaches to reducing latency & improving performance for the High Frequency Trading (HFT) market. Many companies have pitched custom FPGA solutions, GPUs, HPC RISC implementations, and ultra-dense Intel solutions, but not until this talk had I heard anything that was truly innovative.

In his brief 15-minute session, Dr. Rodgers proposed a heterogeneous architecture for addressing a wider range of computational problems by tightly integrating several different processing models onto the same chip; that tight integration is the innovation. The concept of a heterogeneous computing environment is not new; in fact, it's been around for at least two decades. While working at NEC in 2004, one of my colleagues at our US Research division demonstrated a new product that loosely coupled several different computing resource pools together. That way, jobs submitted with the tool could easily & efficiently be parceled out to leverage both scalar clusters & massively parallel systems (the Earth Simulator) without having to be broken up and submitted individually to specific systems. What Dr. Rodgers is proposing is a much tighter level of integration, on the same chip.
If this were anyone else I might have easily written off the concept as an intellectual exercise that would never see the light of day, but this was Greg Rodgers. I've known Greg for nearly eight years; when we first met he was carrying a not-yet-announced IBM JS21 PowerPC blade server under his arm between booths at SuperComputing 2005. He was evangelizing the need to build huge clusters using the latest in IBM's arsenal of PowerPC workhorse chips in an ultra-dense form factor. Greg has built many large clusters during his career, and when he believes in an approach it will eventually be implemented in a very large cluster. It may end up at the Department of Energy, a university, or another government lab, but it will happen.
AMD is currently producing an ultra-dense cluster in a box, the SeaMicro SM15000-OP. This 10U enclosure houses 512 cores, each 64-bit x86, at 2.0/2.3/2.8 GHz. To reach 512 cores it uses 64 sockets, each housing a new octal-core Opteron. Each socket supports 64GB of memory, for a total of 4TB of system memory. AMD also provides 10GbE to each socket internally, and exposes 16 10GbE uplinks externally. This is a true HPC cluster in a box, but because it's all x86 cores it's designed for scalar workloads. What Greg is proposing is to shift this architecture from pure x86 to "Accelerated Processing Units" (APUs) that marry a GPU with two x86 cores, caches, and other I/O on the same die (chip). That way memory can be shared and data movement minimized. This would enable data-parallel workloads and serial/task-parallel workloads to coexist within the same chip, sharing memory when appropriate. Furthermore, Greg has proposed the following HSA concepts:
- A unified programming model that enables task parallel and data parallel workloads while also supporting sequential workloads.
- A single unified virtual address space addressable by all compute cores, with well-defined memory regions supporting both global & private access.
- User level queuing between the "Latency Compute Unit" (LCU) and the "Throughput Compute Unit" (TCU) without system calls.
- Preemption & context switching, extending context management to the GPU.
- HSA Intermediate Language (HSAIL), which splits compilation between the front end and the finalizer to improve optimization & portability.
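To make the first three concepts above a little more concrete, here is a toy sketch of how they fit together. This is illustrative only — the class and function names below (`LatencyCore`, `ThroughputCore`, `WorkPacket`) are my own inventions, not HSA runtime APIs: a latency-optimized core runs the serial part of a program and dispatches work packets through a user-level queue (no system calls), while a throughput-optimized core drains the queue and operates on the very same buffer, mimicking a single unified virtual address space with no copies to device memory.

```python
# Conceptual sketch of two HSA ideas: a unified address space and
# user-level work queues between a latency compute unit (LCU, CPU-like)
# and a throughput compute unit (TCU, GPU-like). Illustrative names
# only; these are not HSA runtime APIs.
from collections import deque
from dataclasses import dataclass

@dataclass
class WorkPacket:
    # In HSA a packet carries a kernel plus pointers into the shared
    # virtual address space; here it is just a slice of a shared list.
    buffer: list
    start: int
    end: int

class ThroughputCore:
    """Stands in for the TCU: drains packets from a user-level queue."""
    def __init__(self):
        self.queue = deque()          # user-level queue: no syscalls

    def drain(self):
        while self.queue:
            pkt = self.queue.popleft()
            for i in range(pkt.start, pkt.end):
                pkt.buffer[i] *= 2    # data-parallel step on shared data

class LatencyCore:
    """Stands in for the LCU: runs serial code and dispatches packets."""
    def dispatch(self, tcu, buffer, chunk=4):
        # No copy to 'device memory': both cores see the same buffer,
        # mimicking a single unified virtual address space.
        for start in range(0, len(buffer), chunk):
            end = min(start + chunk, len(buffer))
            tcu.queue.append(WorkPacket(buffer, start, end))

data = list(range(8))
lcu, tcu = LatencyCore(), ThroughputCore()
lcu.dispatch(tcu, data)   # LCU enqueues work referencing shared memory
tcu.drain()               # TCU doubles each element in place
print(data)               # [0, 2, 4, 6, 8, 10, 12, 14]
```

The point of the sketch is the absence of an explicit copy step: because both "compute units" address the same buffer, the LCU only hands the TCU a description of the work, which is exactly the data-movement savings Greg is after.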