Addressing The Memory Guy’s CXL Conundrums

My friend Jim Handy (The Memory Guy) recently published a blog post about Compute Express Link (CXL) and two conundrums he perceives about the technology. CXL is a relatively new cache-coherent memory protocol based on open standards that’s designed to make large memory pools available to multiple processors in large computer systems and data centers. CXL’s central objective, in my opinion, is to help data center architects avoid overprovisioning every CPU in a multiprocessor system, or every server in a large data center, with DRAM. This overprovisioning problem crops up when you don’t know what types of workloads the CPUs in a system will be tasked to execute, so you need to provision DRAM for the worst case. If CPUs can somehow grab a chunk of memory from a central pool on a task-by-task basis, the hope is that those CPUs will not need to be overprovisioned with their own memory. They can request memory on an as-needed basis.
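
To make the overprovisioning argument concrete, here is a minimal Python sketch. It assumes a made-up demand trace (three servers, four time steps, values in GB) and an idealized pool with no latency or fabric limits; the function names and numbers are hypothetical and come from neither Handy’s blog nor any CXL specification.

```python
from typing import Sequence

def per_server_dram_gb(demand: Sequence[Sequence[int]]) -> int:
    """DRAM needed when every server is sized for its own worst case."""
    return sum(max(series) for series in demand)

def pooled_dram_gb(demand: Sequence[Sequence[int]]) -> int:
    """DRAM needed when all servers draw from one idealized shared pool.

    The pool only has to cover the worst total demand at any one time step.
    """
    return max(sum(step) for step in zip(*demand))

# Hypothetical demand in GB: one row per server, one column per time step.
# Each server peaks at a different time, which is what makes pooling pay off.
demand = [
    [64, 16, 16, 32],
    [16, 64, 16, 32],
    [16, 16, 64, 32],
]

local = per_server_dram_gb(demand)
pooled = pooled_dram_gb(demand)
print(f"per-server worst case: {local} GB, idealized pool: {pooled} GB")
```

The gap between the two totals exists only because the servers’ peaks do not coincide. If every server peaked at the same moment, pooling would save nothing, so real savings depend entirely on the workload mix.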

To achieve this feat, CXL memory needs to be more loosely coupled to the CPUs than local memory. CPUs absorbed DRAM controllers several years ago to eliminate the delay and the power consumption associated with external memory controllers for local DRAM. These days, three of the critical figures of merit for a CPU are the number of on-chip DRAM controllers it has, the type of DRAM those controllers can command, and the speed of DRAM transactions between the on-chip memory controllers and the attached DRAM as measured in megatransfers/second (MT/sec) or gigatransfers/second (GT/sec). All of these factors together determine how much local memory you can attach to a CPU and how fast that memory will perform. CXL greatly expands the amount of memory a CPU can access by taking on-chip memory controllers out of the equation.
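
As a rough illustration of how those figures of merit combine, the sketch below computes theoretical peak bandwidth as transfer rate times bus width times channel count. The 64-bit data path and the DDR5-4800 example are assumptions chosen for illustration, and sustained bandwidth in a real system will be lower than this peak.

```python
def peak_bandwidth_gb_per_s(transfer_rate_mt_s: float,
                            bus_width_bits: int = 64,
                            channels: int = 1) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    Each transfer moves bus_width_bits / 8 bytes on every channel.
    """
    bytes_per_transfer = bus_width_bits / 8
    megabytes_per_s = transfer_rate_mt_s * bytes_per_transfer * channels
    return megabytes_per_s / 1000  # MB/s to GB/s

# Assumed example: DDR5 running at 4800 MT/sec with a 64-bit data path.
one_channel = peak_bandwidth_gb_per_s(4800)
eight_channels = peak_bandwidth_gb_per_s(4800, channels=8)
print(f"1 channel: {one_channel:.1f} GB/s, 8 channels: {eight_channels:.1f} GB/s")
```

Capacity scales the same way: it is bounded by the number of on-chip controllers and by what each controller can drive, which is the limit CXL sidesteps.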

CXL 1.0 appeared in 2019 and the CXL Consortium announced the CXL 3.0 specification in August 2022. The CXL protocol adds coherency and memory semantics on top of PCIe’s I/O semantics. CXL 3.0 doubled the protocol’s maximum transfer rate to 64 GT/sec by adopting PCIe 6.0. Because it’s based on PCIe, the industry finds the physical parts of the CXL specification easy to understand and use. Special CXL switches, similar to PCIe switches, connect multiple CPUs with multiple CXL memory subsystems based on CXL-specific memory controllers. The CXL switches permit the programmatic setup and teardown of many interesting CPU/memory system topologies. CXL also makes the transfer of large blocks of data held in memory from one CPU to another much easier and faster. The transfer of an arbitrarily sized block of data simply requires the passing of an address pointer to that data from one CPU to another. CXL essentially combines all the addressable memory in a large system or an entire data center into one large memory pool accessible by every CPU in the system.
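
The pointer-passing idea can be illustrated by analogy with ordinary operating-system shared memory. The Python sketch below is not CXL code and uses no CXL API. It only shows that when two parties can address the same memory, handing over a large block means handing over a small handle. The function name `share_by_handle` is hypothetical.

```python
from multiprocessing import shared_memory

def share_by_handle(payload: bytes) -> int:
    """Write payload into shared memory, then read it back through a handle.

    Returns a simple checksum computed in place by the "consumer".
    """
    size = len(payload)
    producer = shared_memory.SharedMemory(create=True, size=size)
    try:
        producer.buf[:size] = payload
        handle = producer.name  # only this short string needs to be passed

        # The consumer attaches to the same block by name; the data is not resent.
        consumer = shared_memory.SharedMemory(name=handle)
        view = consumer.buf[:size]  # the block may be rounded up to a page size
        try:
            return sum(view)
        finally:
            view.release()
            consumer.close()
    finally:
        producer.close()
        producer.unlink()

if __name__ == "__main__":
    block = bytes(range(256)) * 4096  # a 1 MiB block
    print("checksum read through the handle:", share_by_handle(block))
```

In a real deployment the producer and consumer would be separate processes, or with CXL separate CPUs, but the cost structure is the same: the handle stays small no matter how large the block grows.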

Here is Jim Handy’s excellent description of CXL from his blog:

“CXL’s magic is mainly that it can add memory to systems in a way that doesn’t bog down the processor with significant capacitive loading or make it burn considerable power by adding memory channels. It does this by allowing the CPU to communicate over a highly streamlined PCIe channel with DRAM that is managed by its own controller. With the addition of a CXL switch, multiple processors can access the same memory, allowing memory to be allocated to one processor or another the same way that other resources, like processors and storage, are flexibly allocated. In other words, CXL disaggregates memory.”

Handy’s two CXL conundrums are:

1) Will More or Less Memory Sell with CXL?

2) Where does CXL Fit in the Memory/Storage Hierarchy?

The first conundrum arises from seemingly contradictory statements from DRAM makers and hyperscale data center architects. The second conundrum seems like a simple matter to me, not being a memory guy, but more on that later. Let’s first deal with conundrum number one.

DRAM makers see CXL as a sales bonanza that will allow them to steepen the curve in memory sales. Handy quotes a Micron white paper that says: “CXL will help sustain a higher rate of DRAM bit growth than we would see without it.” Handy has confirmed that opinion with DRAM makers SK hynix and Samsung. At the same time, hyperscaler companies Google and Microsoft recently published a paper titled “Pond: CXL-Based Memory Pooling Systems for Cloud Platforms” that states: “Our analysis shows that we can reduce DRAM needs by 7% with a Pond pool spanning 16 sockets, which corresponds to hundreds of millions of dollars for a large cloud provider.”

So, who is right? The memory makers or the hyperscalers? That’s Handy’s first conundrum. However, I see no conundrum here. I see a repeat of the old parable “Blind men and an elephant.” According to this story, a group of blind men encounter an elephant for the first time. Each man feels a different part of the elephant’s body, but only one part. The person touching the elephant’s trunk says, “This animal is like a thick snake.” The person touching the elephant’s ear thinks the animal resembles some type of fan. The person touching the elephant’s leg declares that the animal is a pillar or a tree trunk. The person touching the elephant’s side says that the animal is like a wall. The person feeling the elephant’s tail describes the animal as “like a rope.” Finally, the person touching the elephant’s tusk says that the animal is hard, smooth, and like a spear. They’re all right, and they’re all wrong. You cannot define a complex thing by describing only a piece of it, yet that is what the DRAM makers and the hyperscalers are doing with CXL.

The hyperscalers are right: CXL will allow them to reduce the DRAM needs of today’s systems by creating a memory pool that’s easily and dynamically distributed amongst multiple CPUs and servers. I emphasize the word “today” because in the entire 78-year history of electronic digital computers, tomorrow’s systems have always needed more memory than today’s systems because we continue to tackle larger and larger problems. Generative AI (GenAI) is the current poster child for this truism because tomorrow’s GenAI algorithms will need more parameters and will require more memory to store those parameters than today’s algorithms. Consequently, the DRAM makers’ position is also correct and, therefore, I submit that there’s no conundrum, merely different perspectives.

Handy’s conundrum number two is more of a question. Where does CXL memory fit? Handy has long used a very – er – handy graph of the computer memory hierarchy that nicely characterizes all the memories we attach to computers. He customized the graph for this article:

Computer Memory Hierarchy with CXL commentary through animal analogy by Jim Handy, The Memory Guy. Image credit: Objective Analysis

Note that the graph shows three levels of memory cache. I’m old enough to remember when there was only L1 cache. (Heck, I’m old enough to remember the early CISC microprocessors that had no cache memory at all.) We don’t ask where L4 cache fits (yes, there is such a thing as L4 cache) because we intuitively know that each cache-level step results in slower and bigger caches that cost less per bit. (In fact, Handy published another blog on April 1, 2021, titled “Putting the Brakes on Added Memory Layers” that includes six more cache layers: Last-Level Cache (LLC), Final-Level Cache (FLC), After-Final-Level Cache (AFLC), This Is It (TII), I Really Meant It (IRMI), and Don’t Ask Me Again (DAMA). Please don’t look for those cache levels in any real CPU specification. The blog is dated April 1.) Similarly, Handy thinks that CXL memory is slower and may cost more per bit than closely coupled DRAM, with a caveat, so hold those comments for just a moment longer.

The caveat is this: although CXL DRAM may have longer latency than local DRAM due to the need for transactions to pass through at least one CXL switch, CXL may also attain higher bandwidth than closely coupled DRAM thanks to those 64 GT/sec CXL channels. It all depends on the CPU, the integrated memory controllers, the CXL memory controllers, the width of the CXL implementation, and the system design. So, The Memory Guy’s graph will need another oval blob for CXL memory, which Handy drew as an elephant for the purposes of this article. (Not too long ago, Handy’s graph had an oval blob for Intel’s Optane memory, positioned between DRAM and SSD, but Intel killed that product and Handy vaporized its blob. So, blobs come and go in Handy’s memory hierarchy, which merely reflects the fluid reality of computer design.)
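
The latency-versus-bandwidth tradeoff can be sketched with a simple first-order model: transfer time equals first-access latency plus block size divided by bandwidth. In the Python sketch below, the 100 ns and 250 ns latencies are placeholders rather than measurements, the local figure assumes a single DDR5-4800 channel, and the CXL figure is the raw one-direction rate of an assumed x16 link at 64 GT/sec, ignoring protocol overhead.

```python
def link_raw_gb_per_s(gt_per_s: float, lanes: int) -> float:
    """Raw one-direction link bandwidth in GB/s (one bit per transfer per lane)."""
    return gt_per_s * lanes / 8

def transfer_time_ns(size_bytes: int, latency_ns: float, gb_per_s: float) -> float:
    """First-access latency plus streaming time (1 GB/s equals 1 byte/ns)."""
    return latency_ns + size_bytes / gb_per_s

# Placeholder figures for illustration only; real systems will differ.
LOCAL = {"latency_ns": 100.0, "gb_per_s": 38.4}  # assumed: one DDR5-4800 channel
CXL = {"latency_ns": 250.0, "gb_per_s": link_raw_gb_per_s(64, 16)}  # assumed: x16 link

for size in (64, 4096, 1 << 20):
    local_ns = transfer_time_ns(size, **LOCAL)
    cxl_ns = transfer_time_ns(size, **CXL)
    winner = "local DRAM" if local_ns < cxl_ns else "CXL"
    print(f"{size:>8} bytes: local {local_ns:9.1f} ns, CXL {cxl_ns:9.1f} ns -> {winner}")
```

Under these assumed numbers, local DRAM wins for cache-line-sized accesses and the wider link wins for large blocks, with the crossover at roughly 8 KB. A CPU with many local memory channels would shift that crossover considerably, which is why the answer depends on the system design.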

Handy’s blog on the topic suggests that CXL memory will be slower and more expensive than DRAM, so his CXL oval (or elephant in this case) appears beneath and to the right of the localized DRAM oval. If Handy is correct, then there’s no good reason to have CXL memory. Perhaps he is correct. However, there’s also the possibility that The Memory Guy’s positioning of that CXL memory elephant/oval is not correct or that his 2D chart doesn’t show all the important considerations for hyperscale data centers. We will need to wait to see which of these alternatives is correct.

Handy promises to release a more complete analysis of CXL memory through his company Objective Analysis “real soon now,” about the same time that this article will appear. I’m sure his report will cover many interesting facets of this new piece of the computer memory hierarchy.

2 thoughts on “Addressing The Memory Guy’s CXL Conundrums”

traneusee says:

January 18, 2024 at 4:03 pm

The speed of light, 300 millimeters/nanosecond in vacuum and 200 millimeters/nanosecond on PCB, may dominate round-trip latency.

Steven Leibson says:

January 20, 2024 at 8:24 am

If latency is the issue, then CXL isn’t the solution as I explained in the article. However, when transferring large blocks of data, the latency is important for the first transfer. Subsequent transfers in the block arrive at bandwidth speeds.
