Need to accelerate the creation of technology-independent DSP hardware?

The massive increase in processing required for next-generation compute-intensive applications, such as wireless communication and image processing, has created a gap between off-the-shelf DSP performance and market needs. In many cases, discrete DSPs are simply running out of steam to serve the new communications, multimedia, and consumer applications. In recent years, users have increasingly looked toward alternative solutions ranging from ultra-high-performance full-custom ASICs to highly flexible general-purpose CPUs. Somewhere in the middle are FPGAs, which provide a cost-effective balance (Figure 1) between programmability and high performance. With processing flexibility that ranges from serial to parallel computing, and with highly specialized DSP macros and memories now on chip, FPGAs have the potential to become an attractive platform for implementing DSP algorithms.

Figure 1: When it comes to DSPs, many designers are forced to compromise between the programmability of discrete devices and the performance of custom FPGA or ASIC implementations.

Each platform has certain benefits and limitations. At one extreme, the pure software approach implemented in discrete DSPs is mature, flexible, and relatively easy to use but offers limited instruction-level parallelism. At the other extreme, ASIC implementations offer custom performance and high-volume pricing benefits but traditionally demand a much greater design effort and incur soaring non-recurring engineering (NRE) costs. Offering some of the value of both extremes, FPGA hardware supports reprogrammability and architectural flexibility in terms of spatial and temporal parallelism (via repetition and pipelining). It lacks ease of programming, however, since design entry is in a register-transfer level (RTL) hardware description language rather than the ANSI C/C++ of the DSP programming domain.

The catch-22 situation is that designers want the programming flexibility of the discrete DSPs and the performance flexibility available in FPGAs. How can they combine the best of both worlds? And, more importantly, what are their options if the application calls for the use of an ASIC implementation? Optimal implementation of DSP algorithms, therefore, requires a serious rethinking about how to approach the overall design flow when transforming algorithms into hardware, via either the ASIC or FPGA route. In the end, choosing the path of technology independence could mean the difference between success and failure.

Algorithmic Synthesis Bridges the Design Gap

To create hardware implementations of complex DSP algorithms in RTL, design teams must iterate through several steps, including micro-architecture definition, hand-written RTL, and area/speed optimization through iterative RTL synthesis. This manual process is slow and, through misinterpretation of the original specification, introduces up to 60 percent of the bugs found in RTL. In the final result, both the micro-architecture and the technology characteristics become hard-coded into the RTL description. This effect severely limits RTL reuse or retargeting in real applications, and leads to overbuilt designs and wasted silicon.

New DSP-specific flows enable algorithmic design at a higher level of abstraction than RTL. Although high-level synthesis tools have been available for some time, none has delivered the necessary ease of use and quality of results until now. A new breed of “algorithmic synthesis” tools offers a faster path to custom DSP hardware. The best algorithmic synthesis tools take industry-standard pure ANSI C++ as input and automatically produce RTL based on user-defined design goals. This approach closes the conceptual gap between algorithm designers modeling in pure ANSI C or C++ and hardware designers working at the RTL abstraction level (Figure 2).
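To make the idea of a pure C++ algorithmic source concrete, here is a minimal sketch of a direct-form FIR filter written as untimed, technology-independent code. It is an illustration only, not the required input style of any particular tool; the names `kTaps`, `fir_step`, and `fir_block` are hypothetical, as is the choice of 16-bit data and a 4-tap filter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical filter length, chosen only for illustration.
constexpr std::size_t kTaps = 4;

// Computes one output sample of a direct-form FIR filter.
// The source says nothing about clocks, registers, or multiplier count;
// those micro-architecture decisions are left to the synthesis step.
std::int32_t fir_step(std::int16_t sample,
                      std::int16_t (&delay)[kTaps],
                      const std::int16_t (&coeff)[kTaps]) {
    // Shift the delay line by one position and insert the new sample.
    for (std::size_t i = kTaps - 1; i > 0; --i) {
        delay[i] = delay[i - 1];
    }
    delay[0] = sample;

    // Multiply-accumulate across all taps in a 32-bit accumulator.
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < kTaps; ++i) {
        acc += static_cast<std::int32_t>(delay[i]) * coeff[i];
    }
    return acc;
}

// Filters n input samples, writing one output per input.
void fir_block(const std::int16_t* in, std::int32_t* out, std::size_t n,
               std::int16_t (&delay)[kTaps],
               const std::int16_t (&coeff)[kTaps]) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t k = 0; k < n; ++k) {
        out[k] = fir_step(in[k], delay, coeff);
    }
}
```

In a sketch like this, the multiply-accumulate loop is the natural place for a tool to choose between a sequential implementation that reuses one multiplier and an unrolled implementation that instantiates several, without any edit to the source.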

By using a technology-independent ANSI C++ source, these tools enable designers to choose between ASIC and FPGA implementations, and give them a means to incrementally explore and optimize the implementation architecture. The end result is a design architecture and RTL implementation tuned to the device and system requirements, delivered up to 20X faster and with 60 percent fewer bugs than hand-coded RTL.

Figure 2: Algorithmic synthesis methodologies based on pure ANSI C++ offer a faster path to custom DSP hardware, enabling high performance implementations in less time.

More importantly, the ability to select fundamentally superior platform-independent micro-architectural alternatives enables designers to create hardware designs of better quality than traditional RTL methods. Using this methodology, hardware designers can easily perform “what if” tradeoffs evaluating area, latency, throughput, and clock frequency for each micro-architecture, all the while leaving the original pure ANSI C/C++ source unchanged.
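The flavor of such a “what if” comparison can be sketched with a deliberately simple cost model. Everything here is an assumption made for illustration: the model (one multiplier per unrolled loop iteration, one multiply-accumulate per multiplier per clock cycle) and the names `ArchEstimate`, `estimate_fir`, and `smallest_unroll_meeting` are hypothetical and do not come from any synthesis tool, which would also account for clock frequency, memories, and control logic.

```cpp
#include <cassert>
#include <cstddef>

// Toy estimate for one candidate micro-architecture of an N-tap filter loop.
struct ArchEstimate {
    std::size_t multipliers;        // rough proxy for area
    std::size_t cycles_per_sample;  // rough proxy for latency and throughput
};

// Assumed model: unrolling the tap loop by `unroll` instantiates that many
// multipliers, so `taps` products take ceil(taps / unroll) clock cycles.
ArchEstimate estimate_fir(std::size_t taps, std::size_t unroll) {
    assert(taps > 0 && unroll > 0 && unroll <= taps);
    return {unroll, (taps + unroll - 1) / unroll};
}

// Returns the smallest unroll factor (least area in this model) that still
// produces one output within `max_cycles` clock cycles.
std::size_t smallest_unroll_meeting(std::size_t taps, std::size_t max_cycles) {
    assert(taps > 0 && max_cycles > 0);
    for (std::size_t u = 1; u <= taps; ++u) {
        if (estimate_fir(taps, u).cycles_per_sample <= max_cycles) {
            return u;
        }
    }
    return taps;  // Not reached: full unrolling always needs a single cycle.
}
```

Under this model an 8-tap filter can be built with one multiplier and eight cycles per sample, four multipliers and two cycles, or eight multipliers and one cycle. The algorithm source is the same in every case; only the architectural constraint changes.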

Larger, faster designs are increasingly common in the DSP realm, which implies prolonged simulation and synthesis cycles. It has become imperative to fix as many code errors as possible prior to simulation and synthesis, using the design checking capabilities in interactive HDL visualization tools. Moreover, verification takes significantly longer than design development because of the limited speed of RTL simulators and the time needed to manually create an RTL test bench. Advanced design verification flows, with support for industry-standard simulation tools, are now addressing rapid algorithm validation and verification by mixing the high-speed characteristics of pure ANSI C/C++ with the HDL-like modeling benefits found in SystemC and SystemVerilog.
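One way to picture C++-level validation is a small test bench that checks a bit-accurate integer model against a floating-point reference before any RTL exists. The sketch below is a generic illustration, not the flow of any specific tool; the four-sample averaging function and the names `average4_reference`, `average4_fixed`, and `count_mismatches` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Floating-point reference ("golden") model of a four-sample average.
double average4_reference(const std::int16_t (&x)[4]) {
    return (static_cast<double>(x[0]) + x[1] + x[2] + x[3]) / 4.0;
}

// Integer model of the same function, as it might be handed to synthesis.
// Integer division truncates toward zero, so it can differ from the
// reference by less than one least significant bit.
std::int16_t average4_fixed(const std::int16_t (&x)[4]) {
    const std::int32_t sum = static_cast<std::int32_t>(x[0]) + x[1] + x[2] + x[3];
    return static_cast<std::int16_t>(sum / 4);
}

// Test bench: runs every stimulus vector through both models and counts
// the vectors whose absolute error exceeds the given tolerance.
std::size_t count_mismatches(const std::int16_t (*vectors)[4], std::size_t n,
                             double tolerance) {
    assert(vectors != nullptr);
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const double error =
            std::fabs(average4_reference(vectors[i]) - average4_fixed(vectors[i]));
        if (error > tolerance) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Because both models are ordinary compiled C++, a bench like this can run a large number of vectors in far less time than an RTL simulation would need, and the same stimulus can later be reused to check the generated RTL.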

Choosing the Right Implementation Technology

To be fully effective, algorithmic synthesis must also take into consideration the technology-specific characteristics of RTL synthesis. For example, algorithmic synthesis must be aware of the high-performance operations available in some FPGAs, such as dedicated block multipliers, multiply/accumulate macros, pipelined operations, and special memory architectures. For ASICs, algorithmic synthesis must leverage the wide range of operator architectures available in RTL synthesis, ranging from high-performance Booth-encoded parallel multipliers to area-efficient bit-serial multipliers.
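The difference between these two multiplier styles can be illustrated with behavioral C++ models. This is a software sketch of the arithmetic only, not generated hardware: the function names and the 16-bit operand width are assumptions. The shift-and-add model examines one multiplier bit per step, as a bit-serial unit would do over 16 clock cycles, while the radix-4 Booth model recodes the multiplier into 8 signed digits, halving the number of partial products that a parallel multiplier has to sum.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kShiftAddSteps = 16;  // one partial product per multiplier bit
constexpr int kBoothSteps = 8;      // one partial product per radix-4 digit

// Unsigned shift-and-add multiply: models a bit-serial unit that adds one
// shifted copy of `a` per step when the current bit of `b` is set.
std::uint32_t shift_add_multiply(std::uint16_t a, std::uint16_t b) {
    std::uint32_t acc = 0;
    for (int i = 0; i < kShiftAddSteps; ++i) {
        if ((b >> i) & 1u) {
            acc += static_cast<std::uint32_t>(a) << i;
        }
    }
    return acc;
}

// Maps the overlapping bit triplet (b[2i+1], b[2i], b[2i-1]) to a radix-4
// Booth digit in the set {-2, -1, 0, 1, 2}.
int booth_digit(unsigned triplet) {
    assert(triplet < 8u);
    static const int kDigits[8] = {0, 1, 1, 2, -2, -1, -1, 0};
    return kDigits[triplet];
}

// Signed radix-4 Booth multiply: sums 8 partial products, each equal to
// 0, +/-a, or +/-2a, weighted by successive powers of four.
std::int32_t booth_radix4_multiply(std::int16_t a, std::int16_t b) {
    // Sign-extend b to 32 bits, then append the implicit b[-1] = 0 bit.
    const std::uint32_t bx =
        static_cast<std::uint32_t>(static_cast<std::int32_t>(b)) << 1;
    std::int32_t acc = 0;
    for (int i = 0; i < kBoothSteps; ++i) {
        const int digit = booth_digit((bx >> (2 * i)) & 0x7u);
        acc += digit * static_cast<std::int32_t>(a) * (1 << (2 * i));
    }
    return acc;
}
```

In hardware terms, the first style trades time for area by reusing a single adder, whereas the second spends more logic to finish sooner, which is the kind of operator choice the synthesis flow needs to understand.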

The key is knowledge-based synthesis tailored to the RTL synthesis tool. As such, algorithmic synthesis must be keenly aware of the inherent characteristics of RTL synthesis tools. Tight integration between algorithmic synthesis and RTL synthesis ensures timing closure in the back-end as well as accurate up-front area, performance and power estimates in the front-end.

Challenges and Opportunities Abound

When all is said and done, there are still limitations and challenges ahead. While FPGA devices are bigger than ever before, they are still constrained by size, and the largest algorithms will not fit onto current FPGAs. FPGA cost and power consumption also remain major issues in consumer applications, where DSP has a major impact. Technology-independent solutions such as algorithmic C synthesis provide the inherent flexibility to target critical DSP algorithms across discrete DSP, ASIC, and FPGA implementations. That flexibility is a critical success factor, since application segments dictate market cost, performance, and flexibility requirements. Using the innovative, technology-independent solutions now becoming available, the design community can stay ahead of the competitive curve and fully exploit the unprecedented opportunities ahead.