When is software faster than hardware? And when does the physical location of your electronic bits matter?
When you’re programming a multicore processor, evidently. Two researchers at MIT have been delving into the arcane art of cache management, and they discovered that their software routine can do it better than today’s all-hardware approaches. By simulating billions of instructions’ worth of code on 16-core and 64-core processors, the team has come up with a unique new way to manage on-chip cache. Their software manager, dubbed Jigsaw, can double performance while slashing power consumption by up to 72%. Not bad for a software hack.
Daniel Sanchez, an assistant professor in MIT’s Department of Electrical Engineering and Computer Science, and his student, Nathan Beckmann, have released a research paper (available here) describing Jigsaw and the detailed analysis that went into it. In short, they found that even today’s best hardware-only cache mechanisms start to flail when used in multicore microprocessors. The amount of conflict and interference generated by all those cores accessing a single shared cache is just too much for traditional cache-management circuitry to handle. Instead, Sanchez and Beckmann advocate a hardware-plus-software approach, with the Jigsaw software adjudicating cache-management issues usually left up to the hardware.
Jigsaw does two things, both of them good. It minimizes the amount of conflict among all the CPU cores, which improves cache hit rates and, therefore, performance. It also reduces the chip’s power consumption by minimizing the amount of cross-chip travel for cached data. To do that, Jigsaw physically moves cached data to place it near the CPU that wants it.
In a traditional microprocessor chip with a single CPU core, the on-chip cache is private and always helpful. But with multiple CPU cores, some cache (usually the second- or third-level cache) is shared, so the activity of one CPU can affect the cache, which then affects the activity of the other CPU(s). As the number of CPU cores goes up, the likelihood of interference goes up with it.
First, a quick primer on caches. Caches are normal memory, but they don’t have normal addresses. Instead, each cache byte is “aliased” to a byte of memory, holding the same data, but kept physically closer to the CPU for speed. Because there’s far less cache than memory in a typical system, each cache byte must stand in for several (often several thousand) bytes of memory, a process called mapping. Since the cache-to-memory mapping needs to happen extremely quickly (else what’s the point of the cache) it’s done by a simple hardware lookup. For instance, the middle 10 bits of a memory address might be used to address the cache; low-order bits are ignored and high-order bits are ignored. That means several different memory addresses will map to the same byte of cache, but they’ll be far enough apart that (you hope) they probably won’t be accessed at the same time. If all goes well, the cache will hold a good-sized subset of your program’s data.
If the cache mapping goes badly, however, things can get ugly. Because multiple memory addresses map to the same cache address, it’s possible to “thrash the cache” by accessing the exact memory locations that all map to the same cache line. As soon as one cache entry is loaded, it’s immediately flushed out and replaced with the next mapped address, and so on. The result is a lot of wasted time and energy, with no benefit to performance. In fact, cache thrashing can actually hurt performance, adding insult to injury.
Most caches get around this problem by allocating two cache entries to the same memory address, instead of just one. That way, if a program unluckily ping-pongs back and forth between two addresses that just happen to map to the same cache entry, they won’t continually evict each other. Four-way caches alleviate this problem even further.
Again, all of this mapping is done in hardware, so it’s literally hard-wired and can’t change on the fly. Microprocessor designers make their best guess, based on years of academic research combined with trace history and personal experience, to come up with a cache-management circuit that should provide the best overall performance. It’s a good solution. But it’s not the best solution, according to Jigsaw’s creators.
For starters, multicore chips make caches work harder. Cache thrashing is complicated enough with just one program running on one CPU. A shared cache in a multicore chip has to accommodate two or more programs running independently of one another. The unpredictability of one program is replaced by the unpredictability or two programs – or more.
Second, the MIT team found that adding concepts of physical distance and locality to the cache manager helped its performance quite a bit. Hard-wired cache managers don’t take into account the actual physical location of data in the cache. Entries are mapped purely by their address tags, not by proximity to the CPU requesting the data. And since third-level caches on multicore chips can get pretty big – more than half of the silicon, in many cases – the distance to the cached data starts to make a difference.
Jigsaw’s software cache manager assigns cache entries based on address mapping combined with physical location. If CPU A requests cacheable data, that data will be loaded and cached physically near to CPU A, whereas data for CPU B might be stored over on the opposite side of the chip, even if it came from the same memory address. Same cache tag, different cache location.
How can a software cache manager keep pace with a hard-wired one? By running speculatively ahead of time. Jigsaw doesn’t wait for the level-three (L3) cache to be accessed before starting up. It begins working on a level-two (L2) cache miss. If the L2 access hits, Jigsaw discards its results; no harm done. If the L2 misses, Jigsaw has already figured out where the L3 data should go.
Sanchez and Beckmann saw an average 18% speedup across all the programs and configurations they tested, with a high of 120% (2.2×) improvement, so clearly there’s a lot of deviation. But in no case did they measure a slowdown, so Jigsaw appears to be beneficial in all cases. As far as energy goes, they saw as much as 72% less on-chip power, which is a huge amount. As the paper describes, most of that came from two sources: increasing the cache hit rate (which minimizes energy-sapping cache misses), and by reducing travel to distant cache locations on the far side of the chip (which wiggles comparatively long, high-speed data lines). So Jigsaw looks like a win all the way around.
Jigsaw constantly monitors and reallocates the cache at runtime, hoping to adapt to changing software behavior. Since the optimization is based on Jigsaw’s observations of the chip’s activity, “it’s the optimal thing to do assuming that the programs will behave in the next 20 milliseconds the way they did in the last 20 milliseconds,” according to Sanchez. “But there’s very strong experimental evidence that programs typically have stable phases of hundreds of milliseconds, or even seconds.”
In addition to improving overall performance, Sanchez and Beckmann theorize that Jigsaw could be tuned to improve real-time performance or to implement quality-of-service controls, so that some tasks or threads run faster or more predictably than others. All in all, an interesting new piece to the cache-management puzzle.