feature article

Attacking Abuses of Power – Part 1

It’s the era of energy. Look at the newspaper and see how many stories ultimately relate to energy. Oil and gas prices are up. Natural gas is no longer a cheap alternative to electricity. Energy companies teach the world that some of them can’t be trusted to operate in a “free” market. Nuclear is given another look. Coal is marketed as “clean.” Food prices go up in parts of the world as food competes with ethanol for grain. Cellulosic ethanol. Biodiesel. Hybrids. Some would argue that a war is being fought over energy. Global warming/climate change. The growth of social awareness of things “green,” ultimately leading to political awareness and sometimes even action.

Within the completely independent realm of semiconductor technology, transistors at the 90-nm and smaller nodes are burning more power. Dynamic power goes up as clock frequencies increase; static power increases as sub-threshold leakage grows, to the point where, at high temperatures, the leakage current can completely dominate the overall current draw of a large chip. This heat increases cooling requirements and reliability concerns. It raises economic concerns if racking and fan requirements grow. A single PC will still survive plugged into your wall, but a roomful of servers will be in a world of hurt if the power demands keep ratcheting up.

It’s unusual when socio-political, technological, and economic pressures converge, but that’s certainly happening now. Regardless of your economic standing, your politics, or your involvement with technology, anyone can find a good reason to reduce power. And while power has typically been traded off against other needs, that need not always be the case: sometimes a better design yields higher performance, lower area, and lower power. In the past, speed and area have taken top billing; no longer, as was made evident, in black and white, on a simple technology-trends slide in an Intel presentation: “Power is the limiter.”

Power has historically been the silent enabler that pervades all. Current comes in through power pins and spreads out through a grid to various locations on the die. It percolates down through circuitry, being decoupled, forked, converted, and switched, moving at times in violent jerks and at other times in a quiet inexorable trickle, to be gathered up in another grid and flushed back out through a different set of pins. It’s like the water we use day to day, coming in through the high-flow shower and leaking out of the garden hose and running unattended while we brush our teeth. We’ve never paid much attention to it. Now we’re starting to.

The power provided through the current is expended through useful work and useless heat. In every step of the design process, from the package pin to the transistor, decisions can affect power dissipation and efficiency. Where there is a hard limit due to battery life or cooling requirements, lower dissipation is the goal. For other designs, better power efficiency – the amount accomplished per Watt – is the goal: if a new chip dissipates the same power as the old one but does twice as much work, you’ve doubled your power efficiency, and that’s progress.

One thing everyone seems to agree on is that you need to design for power from the beginning if you’re serious about reducing it. Tradeoffs made early in the design phase can lower power on the order of 30-50%. Back-end savings are more in the 5-15% range – nice on their own, and particularly so when taken on top of the more substantial savings garnered early. The specific ways designers accomplish these savings are multiplying as they become more sophisticated in their treatment of power.

This is the first of a multi-part article summarizing the opportunities for reducing power in ICs at challenging technology nodes like 45 nm. This first part will focus on higher-level architectural considerations; following that we will take on back-end methods and verification. Architectural techniques include some that are primarily done manually, as well as ones that are supported through a greater level of tool automation.

It’s an area of rapid evolution in EDA. Some vendors have already announced power flows, some are still working on it, and if we try to characterize who’s in and out today, it will be out of date in a week or month or year, whenever you read this. It’s fair to say that all of the major IC EDA vendors, Cadence, Magma, Mentor Graphics, and Synopsys, are focused on power. In addition, there are tools from smaller companies like Calypto, Sequence Design, and Synfora that address portions of the flow. We’ll talk about specific support for specific features below, but Wall Streeters would call this a fast-moving market.

We’ll support the techniques, when possible, with real examples – many seen serendipitously at the recent ISSCC show, although most chips that are in silicon today were likely designed entirely by hand, since software automation has come about more recently, and is still evolving rapidly.

Each technique more or less boils down to this: power is used when something switches or when a transistor has voltage applied. Every signal eliminated is a signal that won’t switch. Every signal left idle is a signal not switching. Every transistor eliminated is a transistor not leaking. And every unused transistor that has no voltage applied is a transistor that can’t leak.
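To put that boiled-down rule in concrete terms, here’s a minimal sketch (in Python, with invented component values – none of these numbers come from a real chip) of the standard first-order power model: dynamic power from switching plus static power from leakage.

```python
# First-order CMOS power model (illustrative values only):
# dynamic power scales with switching activity, capacitance, Vdd^2, and clock
# frequency; static power is the supply voltage times the leakage current.
def chip_power(activity, cap_farads, vdd, freq_hz, leak_amps):
    dynamic = activity * cap_farads * vdd**2 * freq_hz
    static = vdd * leak_amps
    return dynamic + static

# Stopping the switching (activity -> 0) removes only the dynamic term;
# removing the voltage entirely removes the leakage term as well.
busy = chip_power(0.2, 1e-9, 1.0, 2e9, 0.5)
idle = chip_power(0.0, 1e-9, 1.0, 2e9, 0.5)   # clocked but not switching
off  = chip_power(0.0, 1e-9, 0.0, 0.0, 0.5)   # no voltage applied: can't leak
```

With these made-up numbers the idle chip still burns its full leakage power, which is exactly why the power-gating techniques discussed later exist.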

When all else fails, blame software

Systems exist to do work, and, at the highest level, that work is largely done by software. Hardware becomes either a means for executing software or a substitute for software when for some reason software isn’t up to the job. Even the way software is written can affect the power consumption of a system. This will vary a lot by application, but mindfulness of switching signals can help identify ways to reduce power.

For example, data stored in memory in a way that requires multiple retrievals rather than a single cached retrieval will cause more power to be burned: the memory has to be accessed more times, and all of those switching signals burn power. Algorithms that make inefficient and excessive use of a floating point unit (FPU) cause more signals to switch than are needed, burning power. Now a given FPU may consume lots of power whether switching or not, but if software makes better use of it, then future FPUs that can take advantage of efficient code will deliver savings that might not be evident now. Given the pervasiveness of legacy software, the savings you build in now may continue to pay dividends far into the future.
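As a toy illustration of that first example (entirely hypothetical – no real memory system is modeled), the sketch below counts accesses for a naive routine that re-fetches the same word on every iteration versus one that retrieves it once and reuses it:

```python
# Hypothetical sketch: counting how often software goes out to memory.
# Every avoided access is that many address and data lines not switching.
class CountingMemory:
    def __init__(self, data):
        self.data = data
        self.accesses = 0

    def read(self, addr):
        self.accesses += 1          # each read toggles bus and array signals
        return self.data[addr]

def checksum_uncached(mem, addrs):
    # Naive version: re-reads the word at address 0 on every iteration.
    return sum(mem.read(0) + mem.read(a) for a in addrs)

def checksum_cached(mem, addrs):
    # Retrieve the shared word once and reuse it from a local variable.
    header = mem.read(0)
    return sum(header + mem.read(a) for a in addrs)

mem_a = CountingMemory([10, 1, 2, 3])
mem_b = CountingMemory([10, 1, 2, 3])
naive = checksum_uncached(mem_a, [1, 2, 3])   # 6 memory accesses
smart = checksum_cached(mem_b, [1, 2, 3])     # 4 memory accesses
```

Both routines produce the same result; the cached one simply switches fewer signals to get there.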

While there seems to be general agreement that software can affect power, exactly how to do that beyond a few obvious examples is less clear. Many of the techniques rely on specific power-saving capabilities of processors – not to mention the specific processor architectures – making compilers and operating systems critical partners in this endeavor. It feels like there’s still lots of work to be done to help designers here.

Algorithmic improvements can also reduce power simply by reducing the amount of work required to get a result. This is a situation where improving speed, reducing area, and reducing power can work together. It is also completely independent of implementation technique – software or hardware. In one example, a team at the Korea Advanced Institute of Science and Technology (KAIST) was building an SoC for image processing, and they were trying to improve object recognition. Rather than processing the entire visual field to identify the object, they used a “visual attention algorithm” to reduce the field of interest. This limited the number of points that had to be evaluated in the image field. In one example they showed, the number of points fell by 65% from traditional approaches. That’s 65% less work required to get a result.

Divide and conquer

One of the main reasons that processors consume so much power is that their clocks are blazingly fast. In many cases, more processors being clocked more slowly can get just as much work done with lower power consumption. This is what has driven multicore, but it comes at the cost of more complicated partitioning of software. There’s still a debate about how far multicore can be taken, since at some point I/O and memory access issues become really tough. In addition, today’s most prevalent programming models aren’t well-suited to proliferating parallelism. But chips are starting to demonstrate the power-saving potential of multicore. Renesas, for example, has created an eight-core processor where each core can be independently powered down into one of several modes. The compiler works hand-in-glove to schedule the power reductions according to the software being executed.
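The arithmetic behind that tradeoff is easy to sketch. The capacitance, voltages, and frequencies below are invented for illustration; the point is simply that dynamic power scales with C·V²·f, and a slower clock usually permits a lower supply voltage:

```python
# Back-of-the-envelope comparison (illustrative numbers only).
def dynamic_power(cap_farads, vdd, freq_hz):
    return cap_farads * vdd**2 * freq_hz

one_fast_core = dynamic_power(1e-9, 1.2, 2e9)       # one core: 2 GHz at 1.2 V
two_slow_cores = 2 * dynamic_power(1e-9, 0.9, 1e9)  # two cores: 1 GHz at 0.9 V
# Same aggregate cycles per second, but voltage enters as a square, so the
# two slower cores burn noticeably less dynamic power than the one fast core.
```

With these numbers the dual-core arrangement delivers the same total clock throughput for a bit over half the dynamic power – assuming, of course, that the software can actually be split across the cores.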

Memory structures can have a big impact on power. Everyone avoids going to memory to get data because it takes too long. So we have caches, and caches for caches. We’re used to that for general processors, but more narrowly-focused chips can also benefit from the addition of a cache. In a visual sensor SoC done at the National Taiwan University, a cache was used after the on-chip bitplane memory, reducing memory accesses by 94%. In addition, the memory structure itself can make a difference. In that same example, restructuring the bitplane memory and sharing its ports reduced the memory area by 33%. That helps the economics, but reducing transistors also reduces power.

A wide range of other architectural considerations, such as the hardware/software split or whether to do power conversion on-chip or off, can make a difference. Sharing of structures can reduce power: Sun’s CMT-3, for example, shares an instruction cache among four cores and a data cache and FPU between two cores. This is the realm of design space exploration. While today’s tools might not be able to automatically determine the best architecture, they can help analyze the results of different options, helping identify which combination works out best. Many of the tool vendors, from full-suite providers like Mentor, Cadence, Synopsys, and Magma to smaller, more focused companies like Synfora, offer design space exploration capabilities aimed at helping designers evaluate far more options than they could try by hand, increasing the likelihood of picking a closer-to-optimal configuration.

Humans create tools

While the role of tools for validation is strong (more on that later), it is still nascent in the architecture creation space. Some of this is in the space that Electronic System Level (ESL) synthesis tools aim to stake out, once the methodology takes hold. The technology is new, the level of automation may vary from vendor to vendor, and the quality of results may vary as well. There’s a sense that an increasing number of things can be synthesized and laid out automatically, but careful checking on the results would naturally be a good idea until you get a sense of how hands-off you can be.

At the highest level is the ability to take algorithms expressed in ANSI C++ or C and render them automatically into an RTL specification. The promise of C-to-RTL technology has been around for a long time, and, frankly, its early failure to deliver on that promise has made many designers cautious about using it. But its time may be coming. The two most visible players in this space today are Mentor Graphics, whose Catapult C accepts C++, and Synfora, whose PICO Express and just-announced PICO Extreme process C. Unlike the original C-to-RTL tools, the emphasis now is on untimed code, making it easier to work directly with algorithms.

The original idea behind moving software into hardware is that work can potentially be done more efficiently there, making better use of transistors and taking far fewer clock cycles. There’s a good chance an algorithm can be made to run faster in hardware, but there’s no guarantee of saving power; the power implications of different implementations must be carefully evaluated. Both Catapult C and PICO Extreme allow power as one of the dimensions of the design space when evaluating alternative implementations.

Making the power play

A relatively popular power-saving technique is to use multiple supply voltages in different islands of the chip. You’ll sometimes hear this referred to as MVS, for Multiple Voltage Supply. The idea is that you segment the circuit into critical regions and give each one only as much power as it needs. That way a slower or less-congested area doesn’t have to be powered higher than necessary just because some other part of the circuit needs the higher voltage. A lower supply value means less leakage as well as narrower voltage swings, so static and dynamic power benefit.

In some respects this isn’t new, since I/Os have had their own supplies for a long time in order to support legacy I/O standards. What’s changing is the proliferation of different supply voltage values and discrete domains. It sounds simple enough to design, but isolation between domains and appropriate level shifting must be inserted. There is also a choice as to whether the voltages should be generated on-chip or delivered to the chip. There’s an increasing move (especially combined with the techniques we’ll discuss next) to manage all of this on chip.

Once you’ve established power domains, the next logical step is power switching. This allows a power domain to be shut off when the circuitry it supports isn’t being used. Once you take this step, the potential number of power domains isn’t driven just by the number of different supply voltages, but by architectural considerations; this tends to increase the number of domains.

The granularity can vary widely. On one end of the spectrum, complex multimode circuits can shut down large chunks of idle circuitry. At the other end, individual pieces of cores can be powered down, with little micro-islands sprinkled around to keep the scan chain alive even in the middle of a powered-down domain. In between these extremes are chips like the Renesas multicore chip mentioned before, where each core can be powered down and each core’s user RAM block can be independently powered down.

Circuit design becomes more complex when power switching is used. In addition to isolation and level shifting between domains, you’ve got to be sure that signals are well controlled, both in their powered-down state and during the transition to and from a powered-down state. The initial power-up state is also important, so you may need to keep a few retention cells powered up to provide a known controlled state when the domain comes up again. The sizing and placement of the power switches are critical to ensure that voltage doesn’t droop in parts of the circuit due to IR drops in the switches. And the very act of powering up and down can create current surges that need to be managed.
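As an illustration of why that ordering matters, here’s a hypothetical power-down/power-up sequence for one switchable domain. The step names are invented for this sketch, not taken from any real flow; the point is that isolation and retention must be in place before the switches open, and released only after power is stable again:

```python
# Hypothetical sequencing for one switchable power domain.
def power_down_steps():
    return [
        "clamp_outputs",            # isolation: drive safe values into live logic
        "save_retention_state",     # latch state into always-on retention cells
        "open_power_switches",      # only now actually remove power
    ]

def power_up_steps():
    return [
        "close_power_switches",     # staged, to limit the inrush current surge
        "restore_retention_state",  # reload the saved state
        "release_output_clamps",    # de-isolate once signals are valid again
    ]
```

Skipping or reordering any of these steps is how powered-down domains end up corrupting their still-powered neighbors.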

From here we move into somewhat more esoteric terrain, with a couple of techniques that conceptually overlap a bit: dynamic voltage and frequency scaling (DVFS) and adaptive voltage scaling (AVS). These are immature enough that different people seem to have somewhat different definitions of exactly how they work. For our purposes, I’ll consider DVFS to be the real-time adjustment of the clock frequency and/or supply voltage of a particular domain based on activity or task, and AVS to be the variation of the supply voltage based on temperature.

DVFS is the more common of the two to date, although it’s still considered a pretty advanced technique. A power management circuit monitors activity levels and adjusts the supply voltage and/or clock frequency in response. An elaborate example of this is used in Intel’s quad-core Itanium chip of two-billion-transistor fame. It generates a number of clock frequencies and then decides which to use based on a weighted average of the activity of various critical nodes as observed over a time window. That average indexes a lookup table that fetches the appropriate frequency. This is repeated periodically, so the frequency can move constantly over a wide range of values as activity varies. A different design choice is to adjust the supply voltage: as activity increases, the voltage can be reduced to keep overall power down. Of course, you can’t reduce the voltage so far that you kill your critical-path timing; if that’s going to happen, then you’ve also got to reduce the frequency.
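A loose sketch of that table-driven scheme follows; the weights, activity samples, and frequency table are all invented for illustration, and a real controller would of course do this in hardware over a rolling window:

```python
# Illustrative table-driven frequency selection (all values invented).
def pick_frequency(activity_samples, weights, freq_table):
    # Weighted average of activity observed at critical nodes over a window.
    score = sum(a * w for a, w in zip(activity_samples, weights)) / sum(weights)
    # Quantize the average into a table index and fetch the frequency to use.
    index = min(int(score * len(freq_table)), len(freq_table) - 1)
    return freq_table[index]

freq_table = [1.0e9, 1.5e9, 2.0e9, 2.5e9]   # available clock frequencies (Hz)
next_freq = pick_frequency([0.9, 0.5, 0.8], [3, 1, 2], freq_table)
```

Repeating this every window is what lets the operating frequency track the workload instead of sitting at the worst-case value.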

With AVS, power values can be varied to compensate for temperature increases. If temperatures are high, this can provide some needed headroom, since leakage can be dramatic at the hot end of the range. To date, this technique is apparently relatively rare.

Most of these power management techniques are supported by automatic synthesis tools. The tools suites from Synopsys, Magma, and Cadence all support the automatic implementation of multiple power domains and switching, automatically inserting isolation, level shifters, retention cells, and power switches. In addition, Sequence Design’s CoolPower tool can analyze a design and automatically insert, delete, move, and resize power switches.

Cadence, Magma, and Synopsys also claim automatic synthesis of DVFS; Cadence synthesizes AVS as well. However, given that these techniques aren’t yet common, by definition there’s a long way to go on the experience curve, and it’s likely that best practices have yet to settle out when it comes to algorithms and controller implementation.

From here we can move down in abstraction to lower-level techniques. That will be done in the next installment of this article. Two other important questions must also be answered: how do you express your power intent, and how do you verify all of this stuff, not only from a performance standpoint, but from a robustness, reliability, and power standpoint? In addition, we can take a look at the mixes of techniques used on some specific circuit examples and understand the level of savings they provided.
