feature article
Subscribe Now

Jigsaw Juggles Cache Management

MIT Researchers Overturn Ideas About Cache Performance

When is software faster than hardware? And when does the physical location of your electronic bits matter?

When you’re programming a multicore processor, evidently. Two researchers at MIT have been delving into the arcane art of cache management, and they discovered that their software routine can do it better than today’s all-hardware approaches. By simulating billions of instructions’ worth of code on 16-core and 64-core processors, the team has come up with a unique new way to manage on-chip cache. Their software manager, dubbed Jigsaw, can double performance while slashing power consumption by up to 72%. Not bad for a software hack.

Daniel Sanchez, an assistant professor in MIT’s Department of Electrical Engineering and Computer Science, and his student, Nathan Beckmann, have released a research paper (available here) describing Jigsaw and the detailed analysis that went into it. In short, they found that even today’s best hardware-only cache mechanisms start to flail when used in multicore microprocessors. The amount of conflict and interference generated by all those cores accessing a single shared cache is just too much for traditional cache-management circuitry to handle. Instead, Sanchez and Beckmann advocate a hardware-plus-software approach, with the Jigsaw software adjudicating cache-management issues usually left up to the hardware.

Jigsaw does two things, both of them good. It minimizes the amount of conflict among all the CPU cores, which improves cache hit rates and, therefore, performance. It also reduces the chip’s power consumption by minimizing the amount of cross-chip travel for cached data. To do that, Jigsaw physically moves cached data to place it near the CPU that wants it.

In a traditional microprocessor chip with a single CPU core, the on-chip cache is private and always helpful. But with multiple CPU cores, some cache (usually the second- or third-level cache) is shared, so the activity of one CPU can affect the cache, which then affects the activity of the other CPU(s). As the number of CPU cores goes up, the likelihood of interference goes up with it.

First, a quick primer on caches. Caches are normal memory, but they don’t have normal addresses. Instead, each cache byte is “aliased” to a byte of memory, holding the same data, but kept physically closer to the CPU for speed. Because there’s far less cache than memory in a typical system, each cache byte must stand in for several (often several thousand) bytes of memory, a process called mapping. Since the cache-to-memory mapping needs to happen extremely quickly (else what’s the point of the cache) it’s done by a simple hardware lookup. For instance, the middle 10 bits of a memory address might be used to address the cache; low-order bits are ignored and high-order bits are ignored. That means several different memory addresses will map to the same byte of cache, but they’ll be far enough apart that (you hope) they probably won’t be accessed at the same time. If all goes well, the cache will hold a good-sized subset of your program’s data.

If the cache mapping goes badly, however, things can get ugly. Because multiple memory addresses map to the same cache address, it’s possible to “thrash the cache” by accessing the exact memory locations that all map to the same cache line. As soon as one cache entry is loaded, it’s immediately flushed out and replaced with the next mapped address, and so on. The result is a lot of wasted time and energy, with no benefit to performance. In fact, cache thrashing can actually hurt performance, adding insult to injury. 

Most caches get around this problem by allocating two cache entries to the same memory address, instead of just one. That way, if a program unluckily ping-pongs back and forth between two addresses that just happen to map to the same cache entry, they won’t continually evict each other. Four-way caches alleviate this problem even further.

Again, all of this mapping is done in hardware, so it’s literally hard-wired and can’t change on the fly. Microprocessor designers make their best guess, based on years of academic research combined with trace history and personal experience, to come up with a cache-management circuit that should provide the best overall performance. It’s a good solution. But it’s not the best solution, according to Jigsaw’s creators.

For starters, multicore chips make caches work harder. Cache thrashing is complicated enough with just one program running on one CPU. A shared cache in a multicore chip has to accommodate two or more programs running independently of one another. The unpredictability of one program is replaced by the unpredictability or two programs – or more.

Second, the MIT team found that adding concepts of physical distance and locality to the cache manager helped its performance quite a bit. Hard-wired cache managers don’t take into account the actual physical location of data in the cache. Entries are mapped purely by their address tags, not by proximity to the CPU requesting the data. And since third-level caches on multicore chips can get pretty big – more than half of the silicon, in many cases – the distance to the cached data starts to make a difference.

Jigsaw’s software cache manager assigns cache entries based on address mapping combined with physical location. If CPU A requests cacheable data, that data will be loaded and cached physically near to CPU A, whereas data for CPU B might be stored over on the opposite side of the chip, even if it came from the same memory address. Same cache tag, different cache location.

How can a software cache manager keep pace with a hard-wired one? By running speculatively ahead of time. Jigsaw doesn’t wait for the level-three (L3) cache to be accessed before starting up. It begins working on a level-two (L2) cache miss. If the L2 access hits, Jigsaw discards its results; no harm done. If the L2 misses, Jigsaw has already figured out where the L3 data should go.

Sanchez and Beckmann saw an average 18% speedup across all the programs and configurations they tested, with a high of 120% (2.2×) improvement, so clearly there’s a lot of deviation. But in no case did they measure a slowdown, so Jigsaw appears to be beneficial in all cases. As far as energy goes, they saw as much as 72% less on-chip power, which is a huge amount. As the paper describes, most of that came from two sources: increasing the cache hit rate (which minimizes energy-sapping cache misses), and by reducing travel to distant cache locations on the far side of the chip (which wiggles comparatively long, high-speed data lines). So Jigsaw looks like a win all the way around.

Jigsaw constantly monitors and reallocates the cache at runtime, hoping to adapt to changing software behavior. Since the optimization is based on Jigsaw’s observations of the chip’s activity, “it’s the optimal thing to do assuming that the programs will behave in the next 20 milliseconds the way they did in the last 20 milliseconds,” according to Sanchez. “But there’s very strong experimental evidence that programs typically have stable phases of hundreds of milliseconds, or even seconds.”

In addition to improving overall performance, Sanchez and Beckmann theorize that Jigsaw could be tuned to improve real-time performance or to implement quality-of-service controls, so that some tasks or threads run faster or more predictably than others. All in all, an interesting new piece to the cache-management puzzle. 

One thought on “Jigsaw Juggles Cache Management”

  1. To be a bit more precise we are talking here about cache DATA management only, not simply cache.
    The paper title seems to me incomplete if not wrong “Jigsaw: Scalable Software-Defined Caches” should rather be “Jigsaw: Scalable Software-Defined Management of Data Caches”, as they are not proposing to remove the last level of a cache in favour of a software defined data allocation somewhere else.
    For the rest their idea works quite well on the power saving, adding a kind of 3rd parameter, the “physical distance of the requestor” to the well known “temporal distance of last accessed data” or “address distance of last accessed data”. It’s clear that adding a new dimension there is an improvement, but ON TOP of the existing HW structure. We are happy to push some extra work to SW people …

Leave a Reply

featured blogs
Dec 4, 2023
As a titan in creating technological solutions for electronic systems design, Cadence expanded our footprint from electronic design automation (EDA) into molecular design and life sciences simulation when we partnered with OpenEye Scientific. In a strategic move that underlin...
Nov 27, 2023
See how we're harnessing generative AI throughout our suite of EDA tools with Synopsys.AI Copilot, the world's first GenAI capability for chip design.The post Meet Synopsys.ai Copilot, Industry's First GenAI Capability for Chip Design appeared first on Chip Design....
Nov 6, 2023
Suffice it to say that everyone and everything in these images was shot in-camera underwater, and that the results truly are haunting....

featured video

Dramatically Improve PPA and Productivity with Generative AI

Sponsored by Cadence Design Systems

Discover how you can quickly optimize flows for many blocks concurrently and use that knowledge for your next design. The Cadence Cerebrus Intelligent Chip Explorer is a revolutionary, AI-driven, automated approach to chip design flow optimization. Block engineers specify the design goals, and generative AI features within Cadence Cerebrus Explorer will intelligently optimize the design to meet the power, performance, and area (PPA) goals in a completely automated way.

Click here for more information

featured paper

Power and Performance Analysis of FIR Filters and FFTs on Intel Agilex® 7 FPGAs

Sponsored by Intel

Learn about the Future of Intel Programmable Solutions Group at intel.com/leap. The power and performance efficiency of digital signal processing (DSP) workloads play a significant role in the evolution of modern-day technology. Compare benchmarks of finite impulse response (FIR) filters and fast Fourier transform (FFT) designs on Intel Agilex® 7 FPGAs to publicly available results from AMD’s Versal* FPGAs and artificial intelligence engines.

Read more

featured chalk talk

The Next Generation of Switching Regulator
Sponsored by Mouser Electronics and RECOM
Power modules can bring a variety of benefits to electronic system design including reduced board space, shorter time to market and easier sourcing of materials. In this episode of Chalk Talk, Amelia Dalton and Louis Bouche from RECOM discuss the benefits of RECOM’s switching regulators, the details of their advanced 3D power packaging and how you can leverage RECOM’s expertise with your next design.
Jan 9, 2023