A few weeks ago, we started looking at ways of reducing power consumption when designing SoCs. We divided the world into the front-end, where the big payoff is, and the back-end, with useful techniques that have less dramatic impact. We looked at architecture and system design, hardware/software allocation and C-to-RTL, multicore, Multi-Voltage Supply (MVS), power switching, Dynamic Voltage/Frequency Scaling (DVFS), and Adaptive Voltage Scaling (AVS). These are techniques that can give power savings in the range of 30-50%. Having addressed those, there are numerous back-end techniques that can give more modest, but nonetheless valuable, power savings. We’ll look at some of those here, not necessarily in any specific order. The savings from these techniques will vary widely by application but will generally be in the 5-15% range.
Design it right from the beginning
One technique that has been used for quite a while is to provide transistors with different thresholds in the design kit, an approach known as multi-VT design. Low-threshold transistors are faster but also leak more. Not all transistors need to be the same speed; in fact, a majority of the transistors are not likely to be in the critical path, so higher-VT transistors can be used there. While in the past extra speed meant extra breathing room, today extra speed means wasted power. So if a path is faster than it needs to be, it can be slowed down by swapping in higher-VT transistors (among other things).
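To make the slack-for-leakage trade concrete, here is a minimal Python sketch of a multi-VT swap pass. It is not how any particular tool does it: the Cell class, the 15 ps delay penalty, and the leakage figures are all made-up assumptions, and each cell's slack is treated independently, whereas a real flow would re-time shared paths after every swap.

```python
from dataclasses import dataclass

# Hypothetical library numbers, chosen only for illustration.
HIGH_VT_DELAY_PENALTY_PS = 15.0           # extra delay when a cell moves to high-VT
LEAKAGE_NW = {"low": 50.0, "high": 5.0}   # leakage per cell for each VT flavor


@dataclass
class Cell:
    name: str
    slack_ps: float   # timing slack on the worst path through this cell
    vt: str = "low"   # current threshold flavor: "low" or "high"


def swap_to_high_vt(cells, margin_ps=5.0):
    """Move a cell to high-VT when its slack covers the delay penalty plus a margin."""
    for cell in cells:
        if cell.vt == "low" and cell.slack_ps >= HIGH_VT_DELAY_PENALTY_PS + margin_ps:
            cell.vt = "high"
            cell.slack_ps -= HIGH_VT_DELAY_PENALTY_PS
    return cells


def total_leakage_nw(cells):
    """Sum the leakage of all cells using the per-flavor table above."""
    return sum(LEAKAGE_NW[c.vt] for c in cells)


cells = [Cell("u1", 0.0), Cell("u2", 10.0), Cell("u3", 40.0), Cell("u4", 100.0)]
before = total_leakage_nw(cells)
swap_to_high_vt(cells)
after = total_leakage_nw(cells)
print(before, after)
```

Only the two cells with generous slack are swapped; the cells on or near the critical path keep their fast, leaky transistors.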
Similar to using different transistors, some tools will allow the use of faster or slower flip-flops, according to the needs of the critical path. While the different flip-flops might make use of transistors with different thresholds, this also allows other techniques to be used in combination to provide faster or lower-power flip-flops.
Clock-tree synthesis (CTS) can have a significant impact on power. Clock trees must be carefully generated to balance the skew along all paths of the clock. It’s not so important that all clock nets are high speed (although that’s nice), but all branches must have the same delay, so if one is slower and can’t be sped up, then the others must be slowed down to minimize the skew. The points at which a given clock net splits can be close to the clock source, creating longer branches with loads distributed across many buffers, or further in, providing shorter branches but a heavier load on the main buffers. Both the clock nets and buffers must be optimized for dynamic power, since the coupling of the wires and the cross-over current of the switching buffers contribute significantly to dynamic power.
Clock gating is one of the most effective techniques for reducing dynamic power, since it allows unused portions of the circuit to have their clocks shut off. Gates of course add delay, and therefore skew, to the clock path and so must be compensated for during CTS. Designers can specify blocks to be gated, or tools can optimize clock gating automatically. Calypto, for example, can analyze sequential logic and gate all levels of a state machine or pipeline automatically so that the flip-flops are on only when doing useful work; the tool can infer this timing deterministically from the state machine specification. Clock gates can be re-used as necessary to minimize added circuitry; Calypto claims that it has seen no results with an area impact above 1% due to added clock gating circuitry. Because this is done before CTS, CTS can balance the overall tree in the presence of the gates.
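As a rough feel for what gating buys, the sketch below applies the standard switching-power approximation, P = activity × C × V² × f, to a clock network that runs only part of the time. The capacitance, voltage, frequency, and enable fraction are illustrative assumptions, and the model ignores the power and area of the gating cells themselves.

```python
def dynamic_power_w(cap_f, vdd_v, freq_hz, activity=1.0):
    """Switching power in watts: activity * C * V^2 * f."""
    return activity * cap_f * vdd_v ** 2 * freq_hz


def gated_clock_power_w(cap_f, vdd_v, freq_hz, enable_fraction):
    """Clock power for a block whose clock runs only `enable_fraction` of the time."""
    if not 0.0 <= enable_fraction <= 1.0:
        raise ValueError("enable_fraction must be between 0 and 1")
    return dynamic_power_w(cap_f, vdd_v, freq_hz, activity=enable_fraction)


# Assumed numbers: 20 pF of clock load, 1.0 V supply, 500 MHz clock,
# and a block that does useful work a quarter of the time.
ungated = dynamic_power_w(20e-12, 1.0, 500e6)
gated = gated_clock_power_w(20e-12, 1.0, 500e6, enable_fraction=0.25)
print(f"ungated: {ungated * 1e3:.2f} mW, gated: {gated * 1e3:.2f} mW")
```

In this simple model the saving scales directly with idle time, which is why inferring tight enable conditions from the state machine matters.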
Another option available for reducing leakage is to bias the substrate of a transistor more negatively than ground. This requires adding isolating wells so that the transistors using this technique can have their substrates at a different voltage than the bulk of the substrate, which is biased at ground. Even more sophisticated is the ability to vary the back bias according to temperature, reducing leakage dynamically as needed.
We’ve talked about sophisticated power supply techniques ranging from multiple voltage sources to dynamic voltage/frequency scaling, but even the basic design of the power grid can make a difference. Power grid synthesis is now becoming power-aware so that sizing and placement of buses and vias can be optimized. In addition, decoupling caps are now often integrated into the chip but can contribute to leakage unless high-κ dielectrics are used (which apparently isn’t common today); proper sizing can reduce unnecessary leakage.
Finally, the most obvious thing: power-aware place-and-route of signals. Power has become a cost function for place and route so that, in addition to making performance, designs can be biased towards lower power. This can involve things like the clustering of related logic to minimize nets and the use of less circuitous routing.
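One simple way to picture power as a cost function is to weight each net's estimated wirelength by how often it switches, since wire capacitance grows with length and dynamic power grows with activity times capacitance. The sketch below uses half-perimeter wirelength as the length estimate; the pin coordinates and activity factors are invented, and real place-and-route cost functions are considerably more elaborate.

```python
def hpwl(pins):
    """Half-perimeter wirelength of a net given its (x, y) pin locations."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def power_weighted_wirelength(nets):
    """Sum of activity * HPWL over all nets; each net is a (pins, activity) pair."""
    return sum(activity * hpwl(pins) for pins, activity in nets)


# Two hypothetical placements with the same total wirelength (12 units).
# Placement A gives the busy net (activity 0.5) the long route.
placement_a = [([(0, 0), (10, 0)], 0.5), ([(0, 0), (2, 0)], 0.1)]
# Placement B gives the busy net the short route instead.
placement_b = [([(0, 0), (2, 0)], 0.5), ([(0, 0), (10, 0)], 0.1)]

print(power_weighted_wirelength(placement_a), power_weighted_wirelength(placement_b))
```

A wirelength-only cost cannot tell these two placements apart, while the activity-weighted cost prefers the one that keeps the busy net short.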
Cadence appears to have been pursuing some of these techniques the longest, claiming that each of the items above can be synthesized automatically. Magma says they can handle low-power clock tree synthesis, multi-VT, static back bias generation, power grid optimization, and power-aware place-and-route. Synopsys claims to have those and low-power flip-flop substitution under their belts. But this is a rapidly evolving area. The different toolsets are at varying levels of maturity, and different techniques may be more or less mature within a given toolset, so benchmarking is likely a good idea to get a snapshot of who’s where at any given time.
The goal of the big guys’ tools is eventually (if they don’t have it already) to make all steps of a complete flow power-aware so that each step can contribute to a reduction in power. There are other point tools that can take an existing design and then optimize what has already been done, as was described above with Calypto. Likewise, Sequence Design’s tools will pore over a design and substitute higher-VT transistors where slack and noise permit (and they claim to be the only ones to take noise into account in such substitution); change the drive strengths of transistors; automatically insert, place, and size power gating circuitry; and optimize integrated decoupling capacitors.
Trust… but verify
The next big question is how to verify designs that use these techniques. For many of the techniques, the real craft is in the design; the result has no impact on verification. For example, assuming a verification tool can handle a transistor of any threshold, then it doesn’t care whether the design contains one kind of transistor or multiple kinds. On the other hand, domain isolation for clock buffering and power switching, for example, can add a new wrinkle to simulation, since an old-school simulator that relies on a single power and clock domain might not know what to do with multiple domains, isolation circuits, and retention registers.
Verification at a higher level early in the design can help give an idea of power consumption as front-end tradeoffs are made. Being able to estimate power using transaction-level modeling (TLM) helps to inform key decisions at the point where the impact of those decisions is the greatest. Of course, the accuracy of these power estimates is somewhat lower than when simulating at the gate or transistor level, but they still provide useful information in guiding the direction of the design. Power validation of large, complex SoCs can also be improved through the use of power assertions. This is important for ensuring that critical corners of multi-domain circuits are tested out.
Even at the lowest levels, elements like the power grid must be carefully checked to ensure that there are no excessive IR drops or thermal issues across the chip. The integrity of power switches, level shifters, isolation wells, drivers, and retention cells with various power levels, including off, must be proven.
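For intuition on what an IR-drop check looks for, here is a minimal sketch that models one power rail as a chain of resistive segments fed from a single pad, with a current tap at the end of each segment. The segment resistances, tap currents, and the 5% drop limit are assumptions for illustration; real grid analysis works on a full two-dimensional mesh with extracted parasitics.

```python
def rail_voltages(vdd_v, seg_res_ohm, tap_currents_a):
    """Voltage at each tap on a rail fed from one end.

    Segment i connects the previous tap (or the pad, for i = 0) to tap i
    and carries the current of tap i plus every tap beyond it.
    """
    if len(seg_res_ohm) != len(tap_currents_a):
        raise ValueError("need one segment resistance per tap")
    voltages = []
    v = vdd_v
    downstream = sum(tap_currents_a)
    for r, i_tap in zip(seg_res_ohm, tap_currents_a):
        v -= r * downstream      # drop across this segment
        voltages.append(v)
        downstream -= i_tap      # this tap's current leaves the rail here
    return voltages


def ir_violations(voltages, vdd_v, max_drop_fraction=0.05):
    """Indices of taps whose voltage falls below the allowed fraction of VDD."""
    limit = vdd_v * (1.0 - max_drop_fraction)
    return [k for k, v in enumerate(voltages) if v < limit]


# Assumed rail: 1.0 V supply, three 0.5-ohm segments, taps drawing 10/20/30 mA.
volts = rail_voltages(1.0, [0.5, 0.5, 0.5], [0.01, 0.02, 0.03])
print(volts, ir_violations(volts, 1.0))
```

The taps farthest from the pad see the largest cumulative drop, which is why grid sizing and via placement matter most at the far end of a rail.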
Test vector generation and grading are also important. Test vectors typically originate in validation and are then supplemented for full test coverage. In a power-sensitive design, it’s critical to ensure that the vectors reflect real-world conditions, since nonsense vectors might push the power out of bounds. For example, in a cell phone, random vectors might turn on both CDMA and GSM modes when in fact only one would normally be on at a time. The catch is that with automatically generated vectors, there are lots of don’t-care signals that must be assigned to some value. If that value is random, then it may reflect an infelicitous nonsense condition.
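The don’t-care problem can be sketched in a few lines of Python. Here a vector is a mapping from signal name to "0", "1", or "X" for don’t-care; the signal names (cdma_en, gsm_en, data0), the idle-state defaults, and the exclusion group are all hypothetical. Filling from a known-legal idle state is just one possible policy, shown alongside random fill for contrast.

```python
import random


def fill_random(vector, rng):
    """Assign each don't-care ('X') a random 0 or 1; care bits are kept."""
    return {s: (v if v != "X" else rng.choice("01")) for s, v in vector.items()}


def fill_from_defaults(vector, defaults):
    """Assign each don't-care ('X') the value it has in a known-legal state."""
    return {s: (v if v != "X" else defaults[s]) for s, v in vector.items()}


def exclusive_violations(vector, exclusive_groups):
    """Return every group in which more than one signal is driven to '1'."""
    return [g for g in exclusive_groups if sum(vector[s] == "1" for s in g) > 1]


# Hypothetical test cube: the generator only cares that CDMA is enabled.
cube = {"cdma_en": "1", "gsm_en": "X", "data0": "X"}
idle_state = {"cdma_en": "0", "gsm_en": "0", "data0": "0"}
groups = [("cdma_en", "gsm_en")]  # at most one radio mode on at a time

safe = fill_from_defaults(cube, idle_state)
risky = fill_random(cube, random.Random())
print(exclusive_violations(safe, groups), exclusive_violations(risky, groups))
```

The default-based fill never enables both modes for this cube, while the random fill does so whenever gsm_en happens to land on 1. Note that if the care bits themselves violate a constraint, no fill policy can repair the vector.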
In yet another twist, traditional vector generators may try to optimize the vectors by packing as much testing into as few vectors as possible. If two sets of signals and the circuits they drive are orthogonal, then their test vectors can often be combined. This can have an impact on power, though, if exercising the two circuits simultaneously isn’t a real-world condition. For these reasons, vectors must be carefully generated in a power-aware manner, or at the very least graded to ensure that the test environment doesn’t end up drawing excessive power, ultimately damaging or failing good chips.
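Grading can be as simple as counting how many bits flip between consecutive vectors and flagging any transition above a budget. The sketch below does exactly that; the bit-string vectors and the budget of two toggles are invented, and toggle count is only a crude proxy for switching power, since real grading would weight each node by its capacitance.

```python
def toggles(prev, curr):
    """Number of bit positions that differ between two equal-length vectors."""
    if len(prev) != len(curr):
        raise ValueError("vectors must have the same width")
    return sum(a != b for a, b in zip(prev, curr))


def grade_vectors(vectors, max_toggles):
    """Indices of vectors whose transition from the previous vector
    flips more than `max_toggles` bits."""
    return [
        i
        for i in range(1, len(vectors))
        if toggles(vectors[i - 1], vectors[i]) > max_toggles
    ]


# Hypothetical compacted vector set; the third vector flips every bit at once.
vectors = ["0000", "0001", "1110", "1111"]
print(grade_vectors(vectors, max_toggles=2))
```

A flagged vector could then be split back into its original, less aggressive pieces or reordered so that fewer signals change at once.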
As with synthesis tools, validation tools from the large companies will be more or less power-aware at the levels of abstraction offered, and they can sometimes work with the synthesis tools in a feedback loop that helps optimization. There are also point tools that can be used, like Sequence Design’s “Watt-bots”, which can scan a design for inefficient structures prior to synthesis and report back on opportunities for improvement.
There’s one last ingredient that has proven important for low-power design automation: the ability of a designer to specify his or her intent with respect to power. You might think of these as power constraints. Simple enough in concept, this is a problem that has been solved, but the solution is a tad messy. In fact, there’s not just one standard for expressing power intent: there are two competing standards.
While there is ardent disagreement about how we got to this stage, there does seem to be agreement that Cadence was first with their Common Power Format (CPF). The reason for a second format is hotly debated as being either the result of competitors ganging up or the result of other companies not being given access to CPF and having no choice but to do a different format. We certainly won’t settle that debate here, but the result is the competing Unified Power Format (UPF). CPF has been standardized through the Si2 organization; UPF was developed jointly through Accellera and is now being reviewed as IEEE P1801, with the hope that it will be completed in the first half of this year.
What does this mean for users? Well, there are pretty much three camps as far as is evident now. There are those big companies that will support only CPF; there are those that will support only UPF; and then there are the small guys getting squeezed, who will support whatever their customers want. To some extent that means CPF today, since it was out first and has a jump on installed base; Cadence’s tools all have been CPF-aware since the beginning of 2007. But smaller companies are open to supporting UPF as well, as demand materializes. It remains to be seen whether the big guys will eventually accept both formats. Except for the fact that there are two standards, such a file can act as a unifying element across an entire toolchain for both synthesis and validation.
The bottom line on all of this is that each phase of the design, from architectural experimentation down to the last design-for-manufacturing (DFM) tweaks, must be done on a power-aware basis. Everything affects power. Power has risen in importance to the level of timing, occasionally surpassing it. Soon enough it will be the rare designer who can manage to do a design blissfully unaware of the power impact of his or her decisions. The good news is that designers are getting more and more tools that improve their ability to achieve the goal of low power, and new tools will continue to become available for the foreseeable future.