feature article
Subscribe Now

Jigsaw Juggles Cache Management

MIT Researchers Overturn Ideas About Cache Performance

When is software faster than hardware? And when does the physical location of your electronic bits matter?

When you’re programming a multicore processor, evidently. Two researchers at MIT have been delving into the arcane art of cache management, and they discovered that their software routine can do it better than today’s all-hardware approaches. By simulating billions of instructions’ worth of code on 16-core and 64-core processors, the team has come up with a unique new way to manage on-chip cache. Their software manager, dubbed Jigsaw, can double performance while slashing power consumption by up to 72%. Not bad for a software hack.

Daniel Sanchez, an assistant professor in MIT’s Department of Electrical Engineering and Computer Science, and his student, Nathan Beckmann, have released a research paper (available here) describing Jigsaw and the detailed analysis that went into it. In short, they found that even today’s best hardware-only cache mechanisms start to flail when used in multicore microprocessors. The amount of conflict and interference generated by all those cores accessing a single shared cache is just too much for traditional cache-management circuitry to handle. Instead, Sanchez and Beckmann advocate a hardware-plus-software approach, with the Jigsaw software adjudicating cache-management issues usually left up to the hardware.

Jigsaw does two things, both of them good. It minimizes the amount of conflict among all the CPU cores, which improves cache hit rates and, therefore, performance. It also reduces the chip’s power consumption by minimizing the amount of cross-chip travel for cached data. To do that, Jigsaw physically moves cached data to place it near the CPU that wants it.

In a traditional microprocessor chip with a single CPU core, the on-chip cache is private and always helpful. But with multiple CPU cores, some cache (usually the second- or third-level cache) is shared, so the activity of one CPU can affect the cache, which then affects the activity of the other CPU(s). As the number of CPU cores goes up, the likelihood of interference goes up with it.

First, a quick primer on caches. Caches are normal memory, but they don’t have normal addresses. Instead, each cache byte is “aliased” to a byte of memory, holding the same data, but kept physically closer to the CPU for speed. Because there’s far less cache than memory in a typical system, each cache byte must stand in for several (often several thousand) bytes of memory, a process called mapping. Since the cache-to-memory mapping needs to happen extremely quickly (else what’s the point of the cache) it’s done by a simple hardware lookup. For instance, the middle 10 bits of a memory address might be used to address the cache; low-order bits are ignored and high-order bits are ignored. That means several different memory addresses will map to the same byte of cache, but they’ll be far enough apart that (you hope) they probably won’t be accessed at the same time. If all goes well, the cache will hold a good-sized subset of your program’s data.

If the cache mapping goes badly, however, things can get ugly. Because multiple memory addresses map to the same cache address, it’s possible to “thrash the cache” by accessing the exact memory locations that all map to the same cache line. As soon as one cache entry is loaded, it’s immediately flushed out and replaced with the next mapped address, and so on. The result is a lot of wasted time and energy, with no benefit to performance. In fact, cache thrashing can actually hurt performance, adding insult to injury. 

Most caches get around this problem by allocating two cache entries to the same memory address, instead of just one. That way, if a program unluckily ping-pongs back and forth between two addresses that just happen to map to the same cache entry, they won’t continually evict each other. Four-way caches alleviate this problem even further.

Again, all of this mapping is done in hardware, so it’s literally hard-wired and can’t change on the fly. Microprocessor designers make their best guess, based on years of academic research combined with trace history and personal experience, to come up with a cache-management circuit that should provide the best overall performance. It’s a good solution. But it’s not the best solution, according to Jigsaw’s creators.

For starters, multicore chips make caches work harder. Cache thrashing is complicated enough with just one program running on one CPU. A shared cache in a multicore chip has to accommodate two or more programs running independently of one another. The unpredictability of one program is replaced by the unpredictability or two programs – or more.

Second, the MIT team found that adding concepts of physical distance and locality to the cache manager helped its performance quite a bit. Hard-wired cache managers don’t take into account the actual physical location of data in the cache. Entries are mapped purely by their address tags, not by proximity to the CPU requesting the data. And since third-level caches on multicore chips can get pretty big – more than half of the silicon, in many cases – the distance to the cached data starts to make a difference.

Jigsaw’s software cache manager assigns cache entries based on address mapping combined with physical location. If CPU A requests cacheable data, that data will be loaded and cached physically near to CPU A, whereas data for CPU B might be stored over on the opposite side of the chip, even if it came from the same memory address. Same cache tag, different cache location.

How can a software cache manager keep pace with a hard-wired one? By running speculatively ahead of time. Jigsaw doesn’t wait for the level-three (L3) cache to be accessed before starting up. It begins working on a level-two (L2) cache miss. If the L2 access hits, Jigsaw discards its results; no harm done. If the L2 misses, Jigsaw has already figured out where the L3 data should go.

Sanchez and Beckmann saw an average 18% speedup across all the programs and configurations they tested, with a high of 120% (2.2×) improvement, so clearly there’s a lot of deviation. But in no case did they measure a slowdown, so Jigsaw appears to be beneficial in all cases. As far as energy goes, they saw as much as 72% less on-chip power, which is a huge amount. As the paper describes, most of that came from two sources: increasing the cache hit rate (which minimizes energy-sapping cache misses), and by reducing travel to distant cache locations on the far side of the chip (which wiggles comparatively long, high-speed data lines). So Jigsaw looks like a win all the way around.

Jigsaw constantly monitors and reallocates the cache at runtime, hoping to adapt to changing software behavior. Since the optimization is based on Jigsaw’s observations of the chip’s activity, “it’s the optimal thing to do assuming that the programs will behave in the next 20 milliseconds the way they did in the last 20 milliseconds,” according to Sanchez. “But there’s very strong experimental evidence that programs typically have stable phases of hundreds of milliseconds, or even seconds.”

In addition to improving overall performance, Sanchez and Beckmann theorize that Jigsaw could be tuned to improve real-time performance or to implement quality-of-service controls, so that some tasks or threads run faster or more predictably than others. All in all, an interesting new piece to the cache-management puzzle. 

One thought on “Jigsaw Juggles Cache Management”

  1. To be a bit more precise we are talking here about cache DATA management only, not simply cache.
    The paper title seems to me incomplete if not wrong “Jigsaw: Scalable Software-Defined Caches” should rather be “Jigsaw: Scalable Software-Defined Management of Data Caches”, as they are not proposing to remove the last level of a cache in favour of a software defined data allocation somewhere else.
    For the rest their idea works quite well on the power saving, adding a kind of 3rd parameter, the “physical distance of the requestor” to the well known “temporal distance of last accessed data” or “address distance of last accessed data”. It’s clear that adding a new dimension there is an improvement, but ON TOP of the existing HW structure. We are happy to push some extra work to SW people …

Leave a Reply

featured blogs
Sep 17, 2021
Dear BoardSurfers, I want to unapologetically hijack the normal news and exciting feature information that you are accustomed to reading about in the world of PCB Design blogs to eagerly let you know... [[ Click on the title to access the full blog on the Cadence Community s...
Sep 16, 2021
I was quite happy with the static platform I'd created for my pseudo robot heads, and then some mad impetuous fool suggested servos. Oh no! Here we go again......
Sep 15, 2021
Learn how chiplets form the basis of multi-die HPC processor architectures, fueling modern HPC applications and scaling performance & power beyond Moore's Law. The post What's Driving the Demand for Chiplets? appeared first on From Silicon To Software....
Aug 5, 2021
Megh Computing's Video Analytics Solution (VAS) portfolio implements a flexible and scalable video analytics pipeline consisting of the following elements: Video Ingestion Video Transformation Object Detection and Inference Video Analytics Visualization   Because Megh's ...

featured video

Accurate Full-System Thermal 3D Analysis

Sponsored by Cadence Design Systems

Designing electronics for the data center challenges designers to minimize and dissipate heat. Electrothermal co-simulation requires system components to be accurately modeled and analyzed. Learn about a true 3D solution that offers full system scalability with 3D analysis accuracy for the entire chip, package, board, and enclosure.

Click here for more information about Celsius Thermal Solver

featured paper

Detect. Sense. Control: Simplify building automation designs with MSP430™ MCU-based solutions

Sponsored by Texas Instruments

Building automation systems are critical not only to security, but worker comfort. Whether you need to detect, sense or control applications within your environment, the right MCU can make it easy. Using MSP430 MCUS with integrated analog, you can easily develop common building automation applications including motion detectors, touch keypads and e-locks, as well as video security cameras. Read more to see how you can enhance your building automation design.

Click to read more

featured chalk talk

Power Profiler II

Sponsored by Mouser Electronics and Nordic Semiconductor

If you are working on a low-power IoT design, you are going to face power issues that can get quite complicated. Addressing these issues earlier in your design process can save you a lot of time, effort, and frustration. In this episode of Chalk Talk, Amelia Dalton chats with Kristian Sæther from Nordic Semiconductor about the details of the new Nordic Power Profiler Kit II - including how it can measure actual current, help you configure the right design settings, and show you a visualized power profile for your next design.

Click here for more information about the Nordic Semiconductor Power Profiler Kit II (PPK2)