Power Exploration in High-Level Synthesis

Area optimization and timing closure have long been considered the most common challenges in mainstream digital IC design. Much has been analyzed and documented on how to solve these issues at the various design levels – from RTL to gate to layout. In recent times, however, as design applications have become more portable and power sensitive, power exploration and smart design practices for optimizing power have taken centre stage.

Abstraction Facilitates Design Optimization

First, let’s review the benefits of high-level synthesis. As with the optimizations for area and timing, the earlier power challenges are tackled, the more flexibility the designer has for achieving an optimal solution. Now, with the availability of next-generation high-level synthesis tools that take ANSI C++ as input, power optimization can be approached more efficiently at a much higher level than RTL.

The speed and ease with which high-level synthesis tools can generate RTL from an ANSI C++ source give designers the flexibility to compare multiple architectural solutions and arrive at an optimal design. Specifically, high-level synthesis technology allows a user to efficiently explore area, performance and power consumption by trying different algorithms and generating different design architectures. The designer can examine performance, area and power across implementations ranging from completely parallel to fully pipelined. The number of permutations can range from tens to hundreds of possible designs. Attempting this kind of exploration with traditional manual RTL methods, even for a handful of permutations, is impractical given today’s tight design schedules.

Working at a higher level of abstraction, the designer enjoys a significant boost in productivity (anywhere from 10x to 100x) versus RTL or hardware C languages, which over-constrain the source by hard-coding concurrency, timing and structure directly in the source language.

By nature, an ANSI C++ source is a purely functional description of the algorithm and therefore independent of the target architecture and technology. The user can apply synthesis directives to specify the target technology (ASIC or FPGA), the amount of parallelism, and desired performance.

A design flow based on high-level synthesis enables designers to tune the design to exactly match the performance required for a specific application, including latency, throughput, power consumption and frequency, avoiding the common problem of overbuilding the hardware. Since the C++ representation is completely abstracted from the final implementation, designers can later use the “soft” constraints in high-level synthesis to easily re-target the same representation for different micro-architectures and ASIC/FPGA implementations.

Equally important, the quality of the RTL source code is greatly enhanced. Since the lower-level code is automatically generated from the system specification, fewer bugs are introduced into the design (up to 60% fewer). By eliminating errors that invariably crop up during manual RTL generation, high-level synthesis shortens the verification effort and thereby moves a design to completion faster.

For those bugs that stem from design-related decisions, the same high-level description can be used to automatically create a consistent verification environment, including high-speed system models. Advanced high-level synthesis tools automatically create SystemC wrappers, allowing designers to verify their designs 20X to 100X faster than with traditional register transfer level (RTL) simulation. A test bench can also be generated that automatically compares the ANSI C/C++ input to the RTL output, providing debug information for specific synchronization points in the case of a simulation mismatch.

Focus on Power Optimization

Until recently, designers of algorithmic-based applications were primarily concerned with either improving their design’s performance (throughput, latency, frequency) or reducing silicon area to lower manufacturing costs. But the dominance of battery-powered consumer applications (such as cell phones, PDAs and MP3 players) that rely on power-efficient algorithms has brought power optimization to the fore. Typical RTL methods for power-efficient design involve well-known schemes such as clock gating, optimizing memory accesses, controlling clock rates and changing state machine encoding. Most of these design techniques can be automatically generated with high-level synthesis tools. On a broader level, a high-level synthesis user can also make effective design tradeoffs based on all three key metrics – timing, area and power.

Figure 1: Power Efficient Design Starts at the System Level

Typically, the accuracy of performance estimation is inversely proportional to a design’s abstraction level (Figure 1). In other words, the higher up you go in your design methodology, the lower the accuracy of your estimation of area, delay or power. Fortunately, at a higher level, one is more interested in relative power estimates than in absolute values. The ability to compare power consumption estimates for the various algorithms, or for the micro-architectures within an algorithm, gives designers an invaluable advantage and allows them to make a significantly positive impact on power-related decisions earlier in the design process.

The easiest way to demonstrate the power and flexibility of power exploration at a higher level of abstraction is to walk through a simple example. In this article we will discuss a Finite Impulse Response (FIR) filter to highlight the various tradeoffs possible between power and performance. The FIR filter is one of the most common filters used in everyday signal processing applications, where it is used to restore the clarity of digital signals as they travel through a transmission medium.

The FIR filter is a simple algorithm that can be implemented in multiple ways. Most of these approaches have been developed to either maximize performance or to ensure the most efficient use of valuable silicon real estate. The example design is an 8-tap FIR filter with a performance requirement of 400 MHz and is targeted to a 90nm ASIC technology.

Figure 2: Direct form implementation of a 4-tap FIR filter

One of the most common structures of a FIR filter is the literal implementation, or direct form implementation (Figure 2), where the data is moved through a shift-register based delay line and each register’s output is multiplied by the corresponding coefficient. The outputs of all the multipliers are summed to create the filter’s output. Typically, this implementation delivers the highest throughput. As most FIR filter coefficients are symmetrical, this traditional architecture can also be optimized by folding the structure, thus reducing the number of multipliers required. Figure 3 shows an RTL schematic view of a pipelined implementation of a direct form FIR filter.

Figure 3: RTL schematic of a pipelined implementation of a direct form FIR filter
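
As a rough illustration of the kind of untimed C++ source a high-level synthesis tool might start from, the direct form structure described above can be sketched as follows. This is a minimal sketch and not the code behind the results in this article: the names Taps, shift_in, fir_direct and fir_folded, the 32-bit sample width and the 64-bit accumulator are all assumptions chosen for illustration. The folded variant assumes symmetric coefficients, so each mirrored pair of samples is added before a single multiplication.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kTaps = 8;  // 8 taps, as in the example design
using Taps = std::array<std::int32_t, kTaps>;  // hypothetical sample type

// Shift a new sample into the delay line; index 0 holds the newest sample.
inline void shift_in(Taps& delay, std::int32_t sample) {
    for (std::size_t i = kTaps - 1; i > 0; --i) {
        delay[i] = delay[i - 1];
    }
    delay[0] = sample;
}

// Direct form: multiply every delay-line register by its coefficient
// and sum the products (eight multiplications per sample).
inline std::int64_t fir_direct(Taps& delay, const Taps& coeffs,
                               std::int32_t sample) {
    shift_in(delay, sample);
    std::int64_t acc = 0;
    for (std::size_t i = 0; i < kTaps; ++i) {
        acc += static_cast<std::int64_t>(coeffs[i]) * delay[i];
    }
    return acc;
}

// Folded direct form. Assumption: coeffs[i] == coeffs[kTaps - 1 - i].
// Mirrored samples are added first, halving the multiplications.
inline std::int64_t fir_folded(Taps& delay, const Taps& coeffs,
                               std::int32_t sample) {
    shift_in(delay, sample);
    std::int64_t acc = 0;
    for (std::size_t i = 0; i < kTaps / 2; ++i) {
        const std::int64_t pair =
            static_cast<std::int64_t>(delay[i]) + delay[kTaps - 1 - i];
        acc += coeffs[i] * pair;
    }
    return acc;
}
```

With symmetric coefficients, both functions return the same output for the same input stream, but the folded loop performs four multiplications per sample instead of eight. How the loops map onto hardware (fully unrolled, partially unrolled or pipelined) would be decided by synthesis directives rather than by the source itself.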

Another implementation of a FIR filter, typically used when the filter has a low number of taps, is a structure in which the taps are rotated through a shift register with only the end tap being indexed. This implementation typically results in a lower-area structure. Figure 4 shows a schematic view of a register-based rotate implementation of a 4-tap FIR filter. Of course, there are many other popular, logically equivalent FIR implementations, such as the transposed form or a circular buffer using memory (for a larger number of taps), and it is up to the designer to choose the one that best fits their performance needs. In this article, we will experiment with the direct form and register-based rotate implementations of the FIR filter and examine them with respect to power consumption.

Figure 4: RTL schematic of a register-based rotate implementation of a FIR filter
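
One possible reading of the register-based rotate structure can be sketched in C++ as follows. This is an interpretation offered for illustration, not the source used for the results in this article: the names Taps and fir_rotate and the data widths are assumptions. The sketch keeps the samples in a register array, reads only the end register for each multiply-accumulate, and rotates the array between accumulations so that a single multiplier can be shared across all taps.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kTaps = 8;  // 8 taps, as in the example design
using Taps = std::array<std::int32_t, kTaps>;  // hypothetical sample type

// Rotate-based FIR sketch. Invariant on entry: the end register
// regs[kTaps - 1] holds the oldest sample, which the new sample replaces.
inline std::int64_t fir_rotate(Taps& regs, const Taps& coeffs,
                               std::int32_t sample) {
    regs[kTaps - 1] = sample;  // newest sample now sits in the end register

    std::int64_t acc = 0;
    for (std::size_t k = 0; k < kTaps; ++k) {
        // Only the end register is read, so one multiplier can be shared.
        acc += static_cast<std::int64_t>(coeffs[k]) * regs[kTaps - 1];

        // Rotate left by one so the next-older sample moves to the end.
        // The final rotation is skipped, which leaves the oldest sample
        // in the end register, ready to be overwritten on the next call.
        if (k + 1 < kTaps) {
            std::rotate(regs.begin(), regs.begin() + 1, regs.end());
        }
    }
    return acc;
}
```

Because each output is accumulated over eight sequential steps, a structure like this would be expected to trade throughput for fewer multipliers, which is consistent with the lower-area behavior described above.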

Using an advanced high-level synthesis tool, such as Catapult Synthesis from Mentor Graphics, one can rapidly create various micro-architectures for any given algorithm. For example, a traditional or direct form implementation of the FIR filter can be designed with minimal resources or as a parallel, fully pipelined system. Though similar in functionality, these implementations differ considerably in performance, especially with respect to power, as can be clearly seen in Figure 5. The fully pipelined solution runs with the highest throughput rate, but also has larger area and higher estimated power usage.

Similar experimental implementations can be created for the register-based rotate version of the FIR filter algorithm. As expected, this algorithm’s implementation uses less area than the shift-register based version and also consumes less power.

Figure 5: Power consumption for different implementations of a FIR filter algorithm

Using high-level synthesis, in a very short period of time, the design space for the two algorithms can be thoroughly explored and multiple implementations created. Figure 5 shows a snapshot of the various micro-architectures with corresponding area, latency, throughput and estimated power consumption data. Automated RTL-level power estimation was achieved through the use of Sequence Design’s Power Theater™ tool.

As can be seen from Figure 5, based on the criteria for the implementation, the designer has a wide variety of choices. As expected, area and estimated power consumption are highest for the fully pipelined solution (DIRECT_FORM_THRUPUT_1) that reads and writes data every clock cycle. Other solutions offer differing area, throughput rates and corresponding power usages. A high-level synthesis user can then make tradeoff decisions based on this data and choose the appropriate solution for their design needs.
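
To make the throughput side of this tradeoff concrete, the following sketch enumerates a few idealized candidates for an 8-tap filter. It is not the data shown in Figure 5 and says nothing about actual area or power. It assumes that the 400 MHz requirement is the clock frequency, that each multiplier performs one multiplication per clock cycle, and that control overhead is ignored. The names Candidate and enumerate_candidates are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// One idealized micro-architecture candidate (illustrative model only).
struct Candidate {
    std::size_t multipliers;        // hardware multipliers allocated
    std::size_t cycles_per_sample;  // idealized initiation interval
    double msamples_per_sec;        // throughput at the given clock
};

// Enumerate candidates with 1, 2, 4, ... multipliers up to one per tap.
// Assumption: one multiplication per multiplier per clock cycle.
inline std::vector<Candidate> enumerate_candidates(std::size_t taps,
                                                   double clock_mhz) {
    std::vector<Candidate> out;
    for (std::size_t m = 1; m <= taps; m *= 2) {
        const std::size_t ii = (taps + m - 1) / m;  // ceil(taps / m)
        out.push_back({m, ii, clock_mhz / static_cast<double>(ii)});
    }
    return out;
}
```

Calling enumerate_candidates(8, 400.0) yields four candidates, from one multiplier at eight cycles per sample (50 Msamples/s) to eight multipliers at one cycle per sample (400 Msamples/s). Under this simplified model, the multiplier count is only a crude proxy for area; the area and power figures for each candidate still have to come from synthesis and power estimation.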

Taking a Higher View of Power

This simple example shows the usefulness of architectural exploration at a higher level with respect to power consumption. The level of design space exploration possible at a higher abstraction is immensely valuable in determining the optimal design architecture to adopt. Until the advent of high-level synthesis tools, however, it simply was not practical to explore a range of different architectures using traditional RTL design methods. Now designers can thoroughly explore a multitude of different architectures to find the one that best fits their area, performance and especially power consumption requirements. The result is better-optimized designs for even the most complex applications, without sacrificing time to market.