feature article

Power Exploration in High-Level Synthesis

Area optimization and timing closure have long been the most common challenges in mainstream digital IC design. Much has been analyzed and documented on how to solve these issues at the various design levels, from RTL to gate to layout. In recent times, however, as design applications have become more portable and power sensitive, power exploration and smart design practices for optimizing power have taken centre stage.

Abstraction Facilitates Design Optimization

First, let’s review the benefits of high-level synthesis. As with the optimizations for area and timing, the earlier power challenges are tackled, the more flexibility the designer has for achieving an optimal solution. Now, with the availability of next-generation high-level synthesis tools that work from ANSI C++, power optimization can be approached more efficiently at a much higher level than RTL.

The speed and ease with which high-level synthesis tools can generate RTL from an ANSI C++ source gives designers the flexibility to fully compare multiple architectural solutions to achieve an optimal design. Specifically, high-level synthesis technology allows a user to efficiently explore area, performance and power consumption using different algorithms and generating different design architectures. The designer can examine performance, area and power across implementations ranging from completely parallel to fully pipelined. The range of permutations that can be created runs anywhere from tens to hundreds of possible designs. Attempting this kind of exploration for even a handful of permutations with traditional manual RTL methods is impossible given today’s tight design schedules.

Working at a higher level of abstraction, the designer enjoys a significant boost in productivity, anywhere from 10-100x, versus RTL or hardware C languages, which over-constrain the source by hard-coding concurrency, timing and structure directly in the source language.

By nature, an ANSI C++ source is a purely functional description of the algorithm and therefore independent of the target architecture and technology. The user can apply synthesis directives to specify the target technology (ASIC or FPGA), the amount of parallelism, and desired performance.
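To make this concrete, here is a minimal sketch of what such an untimed ANSI C++ source might look like. The moving-average function and its window size are illustrative assumptions, not part of the article's design, and a real HLS flow would typically use bit-accurate types (such as ac_int) rather than plain int.

```cpp
#include <array>

// Purely functional, untimed description: no clocks, resets, or
// concurrency constructs appear anywhere. Parallelism, pipelining,
// and the ASIC/FPGA target are chosen later via synthesis directives.
constexpr int WINDOW = 4;

int moving_average(int sample, std::array<int, WINDOW>& window) {
    // Shift the newest sample into the window.
    for (int i = WINDOW - 1; i > 0; --i)
        window[i] = window[i - 1];
    window[0] = sample;
    // Sum the window; a directive could unroll this loop into a
    // parallel adder tree or leave it rolled to share a single adder.
    int sum = 0;
    for (int i = 0; i < WINDOW; ++i)
        sum += window[i];
    return sum / WINDOW;
}
```

Because nothing in the source commits to a schedule or structure, the same function can be synthesized into very different hardware purely by changing the directives applied to it.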

A design flow based on high-level synthesis enables designers to tune the design to exactly match the performance required for a specific application, including latency, throughput, power consumption and frequency, avoiding the common problem of overbuilding the hardware. Since the C++ representation is completely abstracted from the final implementation, designers can later use the “soft” constraints in high-level synthesis to easily re-target the same representation for different micro-architectures and ASIC/FPGA implementations.

Equally important, the quality of the RTL source code is greatly enhanced. Since the lower-level code is automatically generated from the system specification, fewer bugs are introduced into the design, up to 60% fewer. By eliminating errors that invariably crop up during manual RTL coding, high-level synthesis shortens the verification effort, moving a design to completion faster.

For those bugs that stem from design-related decisions, the same high-level description can be used to automatically create a consistent verification environment, including high-speed system models. Advanced high-level synthesis tools automatically create SystemC wrappers, allowing designers to verify their designs 20X to 100X faster than traditional register transfer level (RTL) simulation. A test bench can also be generated that automatically compares the ANSI C/C++ input to the RTL output, providing debug information for specific synchronization points in the case of a simulation mismatch.
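The flavor of such a self-checking comparison can be sketched in plain C++. This is an illustrative harness, not tool-generated output: golden_model and dut_model are hypothetical stand-ins for the original C/C++ function and the synthesized RTL, which a real flow would co-simulate through the generated wrappers.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Stand-ins: the golden C/C++ reference and the device under test.
// In a real flow, dut_model would wrap a simulation of the RTL output.
int golden_model(int x) { return 3 * x + 1; }
int dut_model(int x)    { return 3 * x + 1; }

// Drive both models with the same stimulus and compare results at each
// synchronization point, reporting the first mismatch for debug.
bool run_selfcheck(const std::vector<int>& stimulus) {
    for (std::size_t i = 0; i < stimulus.size(); ++i) {
        int expect = golden_model(stimulus[i]);
        int actual = dut_model(stimulus[i]);
        if (expect != actual) {
            std::printf("Mismatch at sample %zu: expected %d, got %d\n",
                        i, expect, actual);
            return false;
        }
    }
    return true;
}
```

The value of the generated test bench is precisely this kind of automatic, sample-by-sample correspondence check between abstraction levels.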

Focus on Power Optimization

Until recently, designers of algorithmic-based applications were primarily concerned with either improving their design’s performance (throughput, latency, frequency) or reducing silicon area to lower manufacturing costs. But the dominance of battery-powered consumer applications that rely on power-efficient algorithms, such as cell phones, PDAs and MP3 players, has brought power optimization to the fore. Typical RTL methods for power-efficient design involve well-known schemes such as clock gating, optimizing memory accesses, controlling clock rates and changing state-machine encoding. Most of these design techniques can be generated automatically by high-level synthesis tools. On a broader level, a high-level synthesis user can also make effective tradeoff decisions based on all three key metrics: timing, area and power.


Figure 1: Power Efficient Design Starts at the System Level

Typically, the accuracy of performance estimation is inversely proportional to a design’s abstraction level (Figure 1). In other words, the higher up you go in your design methodology, the lower the accuracy of your estimation of area, delay or power. Fortunately, at a higher level, one is more interested in relative power estimation than its absolute value. The ability to compare power consumption estimates for the various algorithms, or micro-architectures within an algorithm, gives the designer an invaluable advantage and allows them to make a significantly positive impact on power-related decisions earlier in the design process.

The easiest way to demonstrate the power and flexibility of power exploration at a higher level of abstraction is to walk through a simple example. In this article we will discuss a Finite Impulse Response (FIR) filter to highlight the various tradeoffs possible between power and performance. The FIR filter is one of the most common filters in everyday signal processing applications, where it restores the clarity of digital signals as they travel through a transmission medium.

The FIR filter is a simple algorithm that can be implemented in multiple ways. Most of these approaches have been developed either to maximize performance or to ensure the most efficient use of valuable silicon real estate. The example design is an 8-tap FIR filter with a performance requirement of 400 MHz, targeted at a 90nm ASIC technology.


Figure 2: Direct form implementation of a 4-tap FIR filter

One of the most common structures of a FIR filter is the literal, or direct form, implementation (Figure 2), in which the data moves through a shift-register-based delay line and each register’s output is multiplied by its corresponding coefficient. The outputs of all the multipliers are summed to create the filter’s output. Typically, this implementation delivers the highest throughput. Because most FIR filter coefficients are symmetrical, this traditional architecture can also be optimized by folding the structure, thus reducing the number of multipliers required. Figure 3 shows an RTL schematic view of a pipelined implementation of a direct form FIR filter.
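A behavioral sketch of the direct form structure in ANSI C++ might look as follows. The symmetric coefficient values are illustrative assumptions only; the article's actual 8-tap design does not specify them, and a production source would use bit-accurate types.

```cpp
#include <array>

constexpr int TAPS = 8;
// Symmetric coefficients, chosen purely for illustration.
constexpr std::array<int, TAPS> coeffs = {2, -3, 5, 7, 7, 5, -3, 2};

int fir_direct(int sample, std::array<int, TAPS>& delay) {
    // Shift-register delay line: move every sample down one position
    // and insert the newest sample at the front.
    for (int i = TAPS - 1; i > 0; --i)
        delay[i] = delay[i - 1];
    delay[0] = sample;
    // One multiply per tap; the products are summed into the output.
    int acc = 0;
    for (int i = 0; i < TAPS; ++i)
        acc += coeffs[i] * delay[i];
    return acc;
}
```

Unrolling the multiply-accumulate loop yields the fully parallel, highest-throughput structure of Figure 2; keeping it rolled time-shares the hardware at lower throughput.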


Figure 3: RTL schematic of a pipelined implementation of a direct form FIR filter

Another implementation of a FIR filter, typically used when the filter has a low number of taps, is a structure in which the taps rotate through a shift register with only the end tap being indexed. This implementation typically results in a lower-area structure. Figure 4 shows a schematic view of a register-based rotate implementation of a 4-tap FIR filter. Of course, there are many other logically equivalent FIR implementations, such as the transposed form or a memory-based circular buffer (for larger numbers of taps), and it is up to the designer to choose the one that best fits their performance needs. In this article, we experiment with the direct form and register-based rotate implementations of the FIR filter and examine them with respect to power consumption.
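A behavioral sketch of this rotate structure, again with illustrative coefficients rather than the article's actual design values, shows why it saves area: only the end position of the register is ever read, so a single multiplier serves all taps, at the cost of several internal cycles per output sample.

```cpp
#include <algorithm>
#include <array>

constexpr int TAPS = 8;
// Illustrative coefficients, matching none of the article's figures.
constexpr std::array<int, TAPS> coeffs = {2, -3, 5, 7, 7, 5, -3, 2};

int fir_rotate(int sample, std::array<int, TAPS>& delay) {
    // The previous call parks the oldest sample at the end position,
    // where the new sample overwrites it.
    delay[TAPS - 1] = sample;
    int acc = 0;
    for (int i = 0; i < TAPS; ++i) {
        // Only the end tap is ever indexed: one shared multiplier.
        acc += coeffs[i] * delay[TAPS - 1];
        // Rotate right between taps so the next-older sample reaches
        // the end; skipping the last rotation leaves the oldest sample
        // at the end, ready to be overwritten by the next call.
        if (i < TAPS - 1)
            std::rotate(delay.rbegin(), delay.rbegin() + 1, delay.rend());
    }
    return acc;
}
```

Fed an impulse, this produces the same coefficient sequence as the direct form, confirming the two structures are logically equivalent despite their very different hardware cost.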


Figure 4: RTL schematic of a register-based rotate implementation of a FIR filter

Using an advanced high-level synthesis tool, such as Catapult Synthesis from Mentor Graphics, one can rapidly create various micro-architectures for any given algorithm. For example, a traditional or direct form implementation of the FIR filter can be designed with minimal resources or as a parallel, fully pipelined system. Though similar in functionality, these implementations differ markedly in performance, and especially in power, as can be clearly seen in Figure 5. The fully pipelined solution runs at the highest throughput rate, but also has the largest area and the highest estimated power usage.

Similar experimental implementations can be created for the register-based rotate version of the FIR filter algorithm. As expected, this implementation uses less area than the shift-register-based version and also consumes less power.


Figure 5: Power consumption for different implementations of a FIR filter algorithm

Using high-level synthesis, the design space for the two algorithms can be thoroughly explored and multiple implementations created in a very short period of time. Figure 5 shows a snapshot of the various micro-architectures with corresponding area, latency, throughput and estimated power consumption data. Automated RTL power estimation was performed with Sequence Design’s Power Theater™ tool.

As can be seen in Figure 5, the designer has a wide variety of choices depending on the criteria for the implementation. As expected, area and estimated power consumption are highest for the fully pipelined solution (DIRECT_FORM_THRUPUT_1), which reads and writes data every clock cycle. Other solutions offer differing area, throughput rates and corresponding power usage. A high-level synthesis user can then make tradeoff decisions based on this data and choose the appropriate solution for their design needs.

Taking a Higher View of Power

This simple example shows the usefulness of architectural exploration at a higher level with respect to power consumption. The level of design space exploration possible at a higher abstraction is immensely valuable in determining the optimal design architecture to adopt. Until the advent of high-level synthesis tools, however, it simply was not practical to explore a range of different architectures using traditional RTL design methods. Now designers can thoroughly explore a multitude of architectures to find the one that best fits their area, performance and, especially, power consumption requirements. The result is more optimal designs for even the most complex applications without sacrificing time to market.
