It’s been a year since I last looked at the developing CXL (Compute Express Link) standard (see “The Adventures of CXL Across the 3rd Dimension”) and, conveniently, SNIA (Storage Networking Industry Association) held a one-hour Webinar on the topic in April. The presentation, titled “Unlocking CXL’s Potential: Revolutionizing Server Memory and Performance,” is available to view on BrightTalk. SNIA assembled three CXL experts to discuss the standard’s progress: my good friend Jim Handy (the Memory Guy), Mahesh Natu (Systems and Software Working Group Co-Chair for the CXL Consortium), and Torry Steele (Senior Product Marketing Manager for SMART Modular Technologies). Together, these three experts provided a good overview of where CXL stands today and where it’s going. There has indeed been considerable progress in developing the CXL standard, in producing early computer hardware (memory modules and memory servers) based on the standard, and in gathering performance data from that hardware. From that data, it’s now possible to determine which applications are best suited to CXL-based memory subsystems, and which are not.
Handy led the presentation with an overview of CXL’s status. He began by noting that, much like the elephant being examined by the blind men, CXL has many diverse capabilities, thanks to the features added to the standard through its successive revisions. CXL can be used to:
- Maintain memory coherency among multiple processor types (xPUs).
- Eliminate stranded memory in individual CPUs within a data center.
- Expand the amount of memory that can be attached to a single CPU.
- Increase memory bandwidth to a CPU, a server, or a rack.
- Support persistent memory.
- Hide operational differences among DDR4, DDR5, and DDR6 SDRAM banks.
- Pass messages between xPUs.
Some of these capabilities will be more important than others, depending on the application, as Handy illustrated with these responses from different systems OEMs who might use CXL memory:
- Google: Stranded memory is not important because Google’s VMs are very small and can be packed efficiently into a CPU’s memory.
- IBM and Georgia Tech: DDR is a poor answer because queuing on DDR channels for multi-processor CPUs is less efficient than communicating with CXL memory.
- AI Providers: We need enormous memories and fast loading of HBM storage on GPUs.
- Hyperscalers: We want “any-to-any” xPU connections.
- PC OEMs: CXL is not immediately useful.
Handy also noted that CXL is a relatively new standard. The CXL Consortium released CXL 1.0 and CXL 1.1 in 2019. CXL 2.0, which added the idea of CXL switching to support many host xPUs within a data center rack, appeared at the end of 2020. CXL 3.0, 3.1, and 3.2 – which add several features including multiple switch layers to support connectivity across an aisle of racks – appeared in 2022 through 2024. Natu presented a graphic that illustrates how CXL’s reach has expanded from individual CPU systems through racks and then aisles of racks over the years:
CXL’s reach has expanded from individual CPU systems through racks and then aisles of racks over the years. Image credit: CXL Consortium and SNIA
Natu also showed the usual memory hierarchy chart, illustrating how CXL memory, located in the server and attached to the server system’s network fabric, fits into the hierarchy:
CXL memory fits between main memory in a server (usually SDRAM these days) and storage, in the form of Flash memory SSDs or HDDs. Image credit: CXL Consortium and SNIA
However, based on my previous article, I don’t think this image shows the full picture. These pyramids represent the memory hierarchy for one CPU in a server, yet it’s increasingly clear that CXL makes sense only in a multiserver environment. So, a multidimensional memory hierarchy might look something like this:
Fabric-attached CXL memory fits between a server’s main memory (usually locally connected SDRAM) and Flash memory SSD or HDD storage, but spans across all servers in the CXL network. Image credit: CXL Consortium, SNIA, and Steve Leibson
Clearly, CXL is aimed at large systems of the data center class. It’s therefore unsurprising that PC OEMs show little interest, just as they’re not especially interested in 800Gbps Ethernet ports that are increasingly important for data center architectures.
Based on the varied interests of the above systems developers and the relative lack of support for CXL memory in current operating systems, Handy projects that sales of CXL memory subsystems won’t take off until 2027. Here’s the chart he presented during the SNIA Webinar:
Jim Handy (the Memory Guy) forecasts that CXL-based memory sales won’t take off until there’s software to support CXL’s features. He estimates that won’t be until 2027. Image credit: Jim Handy and SNIA
Despite the immaturity of CXL hardware and the current lack of software to support CXL’s many capabilities, reports are starting to appear that illustrate CXL’s benefits in large systems. Steele’s portion of the presentation provided some insights. His first topic was a direct comparison of the observed latency and bandwidth of DDR memory versus CXL memory. The latency between a DDR memory controller and directly attached DDR SDRAM is about 100 ns. The latency between a CPU’s on-chip CXL memory controller and a CXL memory board or module using the PCIe Gen5 protocol is approximately 170 to 210 ns, which is about double the observed DDR latency. Interpose a CXL switch and that latency becomes 270 to 510 ns. Clearly, using CXL memory adds memory latency.
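To put those figures side by side, here’s a quick back-of-the-envelope sketch in Python that turns the round latency numbers quoted in the Webinar into multipliers relative to locally attached DDR. It uses only the values cited above; actual latencies will vary by platform.

```python
# Latency figures quoted in the SNIA Webinar (round numbers, ns).
DDR_LATENCY_NS = 100                  # directly attached DDR SDRAM
CXL_DIRECT_NS = (170, 210)            # CXL device over PCIe Gen5, no switch
CXL_SWITCHED_NS = (270, 510)          # one CXL switch in the path

def overhead(latency_range, baseline=DDR_LATENCY_NS):
    """Return the min/max latency multiplier relative to local DDR."""
    return tuple(round(ns / baseline, 1) for ns in latency_range)

print("Direct CXL  :", overhead(CXL_DIRECT_NS), "x local DDR latency")
print("Switched CXL:", overhead(CXL_SWITCHED_NS), "x local DDR latency")
# Direct CXL  : (1.7, 2.1) x local DDR latency
# Switched CXL: (2.7, 5.1) x local DDR latency
```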
From a bandwidth perspective, a DDR5-6400 SDRAM DIMM transfers about 51.2 Gbytes/sec, while a CXL memory board connected to a CPU’s on-chip CXL memory controller over a 16-lane PCIe Gen5 connection transfers about 64 Gbytes/sec. The two connections therefore deliver comparable bandwidth, but the CXL connection requires an order of magnitude fewer CPU pins. Given the same pin budget, CXL-centric CPUs could be designed with many more CXL ports than DDR ports, resulting in much better memory bandwidth and direct support for much bigger memory subsystems, again at the expense of latency. Some applications are sensitive to latency; others are less sensitive and simply need more memory bandwidth.
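For readers who want to see where those two bandwidth numbers come from, here’s a minimal Python sketch that reproduces them. The 64-bit DDR5 data bus width and the PCIe Gen5 signaling rate and 128b/130b line coding are my assumptions, not figures from the Webinar, and protocol overhead beyond line coding is ignored.

```python
# DDR5-6400 DIMM: 6400 MT/s across a 64-bit (8-byte) data bus.
DDR5_TRANSFERS_PER_SEC = 6400e6
DDR5_BUS_BYTES = 8
ddr5_gbytes = DDR5_TRANSFERS_PER_SEC * DDR5_BUS_BYTES / 1e9
print(f"DDR5-6400 DIMM : {ddr5_gbytes:.1f} GB/s")            # 51.2 GB/s

# PCIe Gen5 x16: 32 GT/s per lane, per direction, 128b/130b encoding.
PCIE5_GT_PER_LANE = 32e9
LANES = 16
ENCODING = 128 / 130
pcie_raw = PCIE5_GT_PER_LANE * LANES / 8 / 1e9
pcie_payload = pcie_raw * ENCODING
print(f"PCIe Gen5 x16  : {pcie_raw:.0f} GB/s raw, ~{pcie_payload:.0f} GB/s after encoding")
```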
Tests conducted by Micron and AMD and published in a white paper titled “CXL Memory Expansion: A Closer Look on Actual Platform” suggest that CXL-based memory subsystems can provide significant performance benefits, depending on the application. In a test of a Microsoft SQL Server database running the TPC-H benchmark on a system that had been limited by memory capacity, using CXL to expand the system’s memory capacity reduced SSD I/O paging by 44 to 88 percent and boosted application performance by 23 percent. CXL memory more than doubled the performance of a machine-learning test involving Apache Spark, an open-source analytics engine designed for large-scale data processing, running SVM, a supervised machine-learning algorithm. The performance of a CloverLeaf HPC (high-performance computing) application increased by 17 percent when 20 percent of the application’s memory was mapped to CXL memory; in that case, CXL memory delivered 33 percent more memory bandwidth to the application than locally attached DRAM did.
Overall, the SNIA Webinar provided an excellent status report for the CXL standard as of early 2025. Memory subsystems based on the CXL standard are now in production. Testing indicates that CXL memory subsystems can deliver real benefits in some applications. Certain systems developers, such as the data center hyperscalers, will be more interested in CXL memory subsystems than others. Finally, CXL needs at least another year or two to ripen into something that will see widespread use in data centers.
Note: For more in-depth CXL analysis and forecasts, see Jim Handy’s report: “CXL Looks for the Perfect Home.”