We are at the dawn of the biggest change in decades in our global computing infrastructure. Despite the slow death of Moore’s Law, the rate of change in our actual computer and networking systems is accelerating, with a number of discontinuous, revolutionary changes that reverberate throughout every corner of computer system architecture.
As Moore’s Law grinds to an economic halt, the rate of performance improvement of conventional von Neumann processors has slowed, giving rise to a new age of accelerators for high-demand workloads. AI in particular puts impossible demands on processors and memory, and an entirely new industry has emerged from the startup ecosystem to explore new hardware that can outperform conventional CPUs on AI workloads. GPUs, FPGAs, and a new wave of AI-specific devices are competing to boost the performance of inference tasks – from the data center all the way through to the edge and endpoint devices.
But creating a processor that can crush AI tasks at blinding speed on a miserly power budget is only one aspect of the problem. AI does not exist in a vacuum. Any application that requires AI acceleration also includes a passel of processing that isn’t in the AI realm. That means systems have to be created that allow conventional processors to partner well with special-purpose accelerators. The critical factor in those systems lies in feeding those accelerators the massive amount of data they can consume, and keeping that data in sync with what the application processor is doing. It doesn’t matter how fast your accelerator is if you don’t have big enough pipes to keep data going in and out at an appropriate rate.
Obviously, cache-coherent memory interfaces are an attractive strategy for blasting bits back and forth between processors and accelerators. They minimize latency and can scale to enormous data rates, depending on the needs of the application and the structure of the system. In today’s multi-processor systems, cache coherence is already critical to ensure that multiple processors see a consistent view of shared data in their local caches. What we haven’t had is a de facto standard for cache-coherent interfaces between processors and accelerators.
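To see why coherence matters, here is a toy write-invalidate sketch. All class and method names below are invented for illustration, and real protocols (MESI and its descendants) track more states and vastly more corner cases – but the core idea survives the simplification: when one agent writes a cache line, stale copies elsewhere are invalidated, so the next reader sees the new value rather than stale memory.

```python
# Toy write-invalidate coherence model (hypothetical, heavily simplified).
# One shared "bus" connects a CPU cache and an accelerator cache.

class CacheLine:
    def __init__(self):
        self.state = "Invalid"   # "Invalid", "Shared", or "Modified"
        self.value = None

class Cache:
    def __init__(self, name, bus):
        self.name = name
        self.line = CacheLine()
        self.bus = bus
        bus.caches.append(self)

    def read(self):
        if self.line.state == "Invalid":
            # Miss: fetch the current value over the shared interconnect.
            self.line.value = self.bus.fetch()
            self.line.state = "Shared"
        return self.line.value

    def write(self, value):
        # Write-invalidate: every other copy of the line is invalidated
        # before this cache takes exclusive (Modified) ownership.
        self.bus.invalidate_others(self)
        self.line.state = "Modified"
        self.line.value = value

class Bus:
    def __init__(self):
        self.caches = []
        self.memory = 0

    def fetch(self):
        # A Modified copy in some cache is newer than main memory.
        for c in self.caches:
            if c.line.state == "Modified":
                return c.line.value
        return self.memory

    def invalidate_others(self, writer):
        for c in self.caches:
            if c is not writer:
                if c.line.state == "Modified":
                    self.memory = c.line.value  # write back before dropping
                c.line.state = "Invalid"

bus = Bus()
cpu, accel = Cache("cpu", bus), Cache("accel", bus)

cpu.write(42)          # CPU produces data
print(accel.read())    # accelerator observes 42, not stale memory
```

Without the invalidation step, the accelerator could happily keep serving an old value out of its local cache – which is exactly the bug class a cache-coherent interconnect exists to prevent.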
Hmmm… Wonder why that would be?
I’ve been advised never to attribute to malice that which is better explained by incompetence. Or, more relevant in this case, perhaps: never attribute to devious competitive strategy that which is more likely the result of simply tripping over one’s own corporate shoelaces but landing, favorably, on top of your competitors. Specifically, amidst all this chaos in the computing world, Intel is hellbent on defending their dominance of the data center. How important is the data center to Intel? It accounts for over 30% of the company’s revenue and, by some estimates, more than 40% of the company’s value, with the data center market expected to grow to $90 billion by 2022. “Rival” AMD is proud when they gain a point or two of market share, with their share in the 20-30% range against Intel’s 70-80%. So – Intel has, and fully intends to protect, a dominant position in the extremely lucrative data center computing hardware business.
All this disruptive change poses a risk to Intel, however. The company can’t count on just continuing to crank out more Moore’s Law improvements of the same x86 architecture processors while the world shifts to heterogeneous computing with varied workloads such as AI. Nvidia brilliantly noticed this strategic hole a few years ago and moved quickly to fill it with general-purpose GPUs capable of accelerating specialized workloads in the data center. That allowed them to carve out something like a $3B business in data center acceleration – which is almost exactly $3B too high for Intel’s taste. Looming on the horizon at that time were also FPGAs, which clearly had the potential to take on a wide variety of data center workloads, with a power and performance profile much more attractive than GPUs. Intel answered that challenge by acquiring Altera for north of $16B in 2015, giving them a strategic asset to help prevent Xilinx from becoming the next Nvidia.
What does all this have to do with cache-coherent interface standards? One way to think about it: it would not have been to Intel’s advantage to make it super easy for third parties to bring their GPUs and FPGAs into Intel’s servers – at least not while Intel didn’t have their own acceleration strategy in place yet. If Intel didn’t have general-purpose GPUs to compete with Nvidia, or FPGAs to compete with Xilinx, why would they want to give their competitors a head start within their own servers?
Companies like AMD and Xilinx saw the need for a cache coherence standard, however, so in 2016 they set about making one – the Cache Coherent Interconnect for Accelerators (CCIX – pronounced “see-six”) was created by a consortium which included AMD, Arm, Huawei, IBM, Mellanox (subsequently acquired by Nvidia), Qualcomm, and Xilinx. In June 2018, the first version of the CCIX standard, running on top of PCIe Gen 4, was released to consortium members.
Now, we’ll just hit “pause” on the history lesson for a moment.
How do you react if you’re Intel? Well, you could jump on the CCIX train and make sure your next-generation Xeon processors were CCIX-compatible. Then everybody could bring their accelerators in and get high-performance, cache-coherent links to the Intel processors that own something like 80% of the data center sockets. Server customers could mix-and-match and choose the accelerator suppliers they liked best for their workloads. Dogs and cats would live together in peace and harmony, baby unicorns would frolic on the lawn, and everyone would hold hands and sing Ku… Oh, wait, just kidding. None of that would EVER happen.
Instead, of course, Intel went about creating their own, completely different standard for cache-coherent interfaces to their CPUs. Their standard, Compute Express Link (CXL), also uses PCIe for the physical layer. But beyond that, it is in no way compatible with CCIX. Why would Intel create their own, new standard if a broadly supported industry standard was already underway? Uh, would you believe it was because they analyzed CCIX and determined it would not meet the demands of the server environments of the future? Yeah, neither do we.
At this point, many of the trade publications launched into technical analyses comparing CCIX and CXL, noting that CXL is a master-slave architecture where the CPU is in charge and the other devices are all subservient, while CCIX allows peer-to-peer connections with no CPU involved. Intel, of course, says that CXL is lighter, faster than a speeding bullet, and can leap tall technical hurdles in a single clock cycle (OK, maybe that’s not exactly what they say.)
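Strip away the marketing, and the topological difference is simple to sketch. Here’s a toy model (the function names and hop-counting are invented for illustration, not drawn from either spec): in a host-managed design, coherent traffic between two accelerators is resolved through the CPU, while a peer-to-peer design lets any two agents talk directly.

```python
# Toy hop-count model of the two coherence topologies (hypothetical).

def hops_host_managed(src, dst):
    # All coherence traffic is resolved by the host CPU, so a
    # device-to-device transfer costs two link traversals:
    # src -> host -> dst. Transfers involving the host cost one.
    return 1 if "host" in (src, dst) else 2

def hops_peer_to_peer(src, dst):
    # Any pair of coherent agents can exchange traffic directly.
    return 1

print(hops_host_managed("fpga", "gpu"))   # 2: bounced through the host
print(hops_peer_to_peer("fpga", "gpu"))   # 1: direct link
```

The asymmetric model is simpler for the device to implement (the host does the heavy coherence bookkeeping), while the peer-to-peer model pays off only when accelerators genuinely need to share cached data without a CPU in the loop – which, as the next paragraphs argue, is the crux of the whole debate.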
Before we get into that, let’s just point out that it basically does not matter whether CCIX or CXL is a technically superior standard. CXL is going to win. Period.
Whoa, what heresy is this? It’s just simple economics, actually. This may seem obvious, but cache-coherent interfaces are useful only among devices that have caches. What computing devices have caches? Well, there are CPUs and … uh… yeah. That’s pretty much it at this point. So, the CCIX vision where non-CPU devices would interact via a cache-coherent interface is a bit of a solution in search of a problem. Sure, we can contrive examples where it would be useful, but in the vast majority of cases, we want an accelerator to be sharing data with a CPU. Whose CPU would that be? Well, in the data center today, 70-80% of the time, it will be Intel’s.
So, if you’re a company that wants your accelerator chips to compete in the data center, you’re probably going to want to be able to hook up with Xeon CPUs, and – they are apparently going to be speaking only CXL. For a quick litmus test, Xilinx, a founder of the CCIX consortium, is adding CXL support to their devices. It would be foolish of them not to. Ah, but here’s a tricky bit. In an FPGA that already has PCIe interfaces, CCIX support can apparently be added via soft IP. That means you can buy an existing FPGA and use some of the FPGA LUT fabric to give it CCIX support. Not so with CXL. For CXL you need actual changes to the silicon, so companies like Xilinx have to spin entirely new chips to get CXL support. Wow, what a lucky break for Intel, who happens to already be building CXL support into their own FPGAs and other devices. You’d almost think they planned it that way.
So, what we have is a situation similar to the x86 architecture. Few people would argue that the x86 ISA – developed around 1978 – is the ideal basis for modern computing. But market forces are far more powerful than technological esoterics. Betamax was pretty clearly superior to VHS, but today we all use… Hah! Well, neither. And that may be another lesson here. As the data center gorilla, Intel has vast assets it can use to defend its position. There are countless cubby-holes like the CCIX/CXL conundrum where the company can manipulate the game in their favor – some likely deliberate, and others completely by accident. None of those will protect them from a complete discontinuity in the data center where purely heterogeneous computing takes over. Then, it could be the wild wild west all over again.
It will be interesting to watch.