feature article

A Perfect DSP Storm

BDTi + High Level Synthesis + FPGA

For years, we’ve discussed how, for high-performance algorithmic design, FPGAs are capable of performance, efficiency, and cost-effectiveness orders of magnitude better than alternative solutions like DSP processors*.  There!  Did you see that?  The asterisk?  You know what that means.  Somewhere, down the page, hidden in the fine print at the bottom is the caveat.  This time, however, we won’t bury the caveat.  We’ll pull it up in plain view, attack it, and make it go away permanently… well, almost. 

For years, we’ve also discussed how, for algorithmic design, high-level synthesis (HLS) tools can take your C or C++ algorithm and automatically create optimized, parallelized hardware implementations that approach the quality of hand-coded RTL*.  Yep, another asterisk.  We’ll tackle that one too.

On the surface, it seems like a DSP expert should be able to take some C code, an HLS tool, and an FPGA – and instantly be transported to the land of perpetual performance paradise, where magic programmable hardware architectures smartly synthesized by self-aware tools permit us to retire our clunky-old DSP processors to the time-forgotten wasteland of 8-track tapes and rotary-dial telephones.  On the surface, that is. 

In the real world, we need to tackle those asterisks.  Here’s the first one:  FPGAs can actually beat DSP processors in performance, power efficiency, and cost.  However, writing the RTL required to make them do that requires a lot of expertise in hardware design and hardware description languages (HDLs) and a lot of time and effort.  Even for an HDL expert, creating a complex algorithm in RTL requires easily 5x-10x the amount of work that it takes a software expert to develop a C or C++ implementation of the same algorithm.  The bottom line is that FPGAs have more than a 10x advantage over DSPs in performance, power, and cost, and it requires more than 10x the expertise and effort to achieve that with RTL design.

The second asterisk pertains to the high-level synthesis tools.  HLS seems like a good candidate for poster child of the old Gartner Hype Cycle.  After years of snake-oil-quality hype (see our “System Level Sideshow” article – 7/11/2006), most of the engineering audience decided that high-level synthesis was the technology of the future – and always would be. The claims of the tool vendors were outrageous and all over the map, and the credibility of HLS and ESL tools reached a low (known in the Gartner Hype Cycle as the “Trough of Disillusionment”) that has rarely been seen since the debut of Cold Fusion (not the web-development platform – the kind where hydrogen atoms magically and exothermically merge into helium in an ordinary glass of drinking water).  The key sticking points were the wildly implausible assertion that C and C++ code could be magically transformed into efficient hardware architectures without human intervention and the fact that EDA companies held a death grip on the technology at price points that were the software equivalent of printer ink.

A little at a time, however, high-level synthesis has made its way out of the trough and started the slow slog up toward actual usefulness.  In the high-flying world of all-you-can-eat EDA tool deals, giant companies with giant-er tool budgets have spent the past few years enjoying a decided competitive advantage in their ASIC (and sometimes FPGA) designs with HLS tools that weigh in at tens to hundreds of thousands of dollars per seat per year.  They have been quietly shouting (in lightly-attended, well-policed suite presentations at the Design Automation Conference) that their project productivity has hit orders-of-magnitude improvement with RTL-comparable results.   

For the vast majority of engineering teams doing DSP or FPGA work, however, even evaluating these tools to figure out if they were feasible for a project was a walk on the career-limiting wild side of the risk/reward continuum.  Until now. 

Enter BDTi, the “Consumer Reports” of signal processing.  These guys make a living telling the truth about DSP and related technologies.  For almost two decades, Berkeley Design Technology Inc. (BDTi) has carefully benchmarked and evaluated just about every DSP processor to come down the pike and has reported their real-world results to their clients.  If you were trying to cut through the layers of marketing misdirection surrounding any new offering in the signal processing arena and come up with practical expectations, BDTi was the “go-to” resource.  In recent years, BDTi has also evaluated FPGAs as DSP platforms and, just as we said back in the first paragraph, reported that FPGAs have astonishing potential.

Now, BDTi has turned their benchmarking lasers on the problem of high-level synthesis.  Specifically, they have created a program for certification of high-level synthesis tools used to synthesize real-world high-performance signal processing algorithms into FPGA-based hardware architectures from C and C++ code.  In this process, and in typical BDTi fashion, they are benchmarking every aspect of the HLS/FPGA/DSP process that a modern engineering team would care about – learning curve, design effort, and quality of results.  They are also comparing those results to the same design done using conventional DSP processors and methods, as well as hand-coded RTL implemented in the same FPGA.  Intrigued?  Yeah, we are too! 

Leading with the punchline, BDTi has concluded that the combination of HLS and FPGA can, in fact, produce an implementation of a typical, real-world, high-performance DSP design with far better performance and lower cost than DSP processors and tools at a comparable level of engineering effort.  How much better?  How about the 30x-40x range for both cost and performance advantage? 

While impressive, this 30x-40x advantage of FPGAs over DSPs is not new.  BDTi has been telling us that for years in their “FPGAs for DSP” reports.  What is new is the “comparable level of engineering effort” being brought to the party by the HLS tools – without a corresponding reduction in quality of results.  This is big news.  In short, the HLS hype that EDA marketing gave us in 1999 was proven true by independent benchmarking in 2009.

BDTi’s methodology starts with two designs or “workloads.”  These designs were chosen because they are typical of real-world scenarios where FPGA-based DSP acceleration will be considered a viable or attractive option by commercial design teams.  The first, the “BDTi Optical Flow Workload,” is a highly-parallelizable video motion-analysis algorithm operating on a high-definition video stream.  The goal is to achieve the highest frame rate possible – comparing a design done with HLS/FPGA against the same design done with a conventional DSP processor.  The second design – a wireless communications receiver baseband application called the “BDTi DQPSK Receiver Workload” – is used to compare a design done with HLS/FPGA against the same design done as hand-coded RTL targeting the same FPGA.

For the FPGA platform, BDTi uses the Xilinx Spartan-3A DSP FPGA combined with Xilinx ISE and EDK tools.  The Xilinx tools are used for the RTL portion of the design (as the HLS tools being tested start with C/C++ source and generate synthesizable RTL code as their output).  For the DSP platform, the team uses the 594 MHz Texas Instruments TMS320DM6437 DSP processor with the TI Code Composer Studio tools and the TI Digital Video Evaluation Module. 

BDTi allows an HLS vendor to first do a “self-evaluation” using the designs and specifications provided by BDTi.  In the event the vendor wants to pursue BDTi certification, the vendor then supplies tools and training to BDTi, and BDTi uses the vendor’s methodology to create and evaluate the target designs just as a typical customer would.   

So far, BDTi has certified two HLS tools – AutoPilot from AutoESL, and PICO Extreme FPGA from Synfora.  The company says more certifications/benchmarks will be forthcoming.  AutoESL was spun out of the prolific UCLA program of Prof. Jason Cong – long-time FPGA advocate and expert.  AutoESL’s AutoPilot takes C, C++, and SystemC as inputs.  AutoPilot uses the popular and capable LLVM (low-level virtual machine) compiler, so adding new high-level languages should be a comparatively simple matter.  The high-level synthesis “guts” are somewhat decoupled from the language front-end, so the scheduling and allocation magic can be more or less language independent.  In this phase of high-level synthesis adoption, where input language is still a question mark, that’s a good strategy.

“Our company has been focused more on technology development than on marketing so far,” comments Atul Sharan, CEO of AutoESL.  “Nonetheless, we’ve had a great number of customer engagements, and in every case we’ve achieved results equal to or better than hand-coded RTL.”

The company has also targeted their HLS tools to both ASIC and FPGA design.  COO Devadas Varma thinks FPGAs make a good match for HLS-based methodologies, however.  “FPGA developers are more likely to come from areas where they have experience in algorithmic-level design.  They understand the idea of writing algorithms and using micro-architectures as building blocks.  Embedded and DSP developers already write in C.  Hardware designers tend to want to start in something like SystemC, but, despite different entry points, both converge on the same design flow.” 

Asked if FPGA developers have a hard time making the adjustment to the prices of HLS tools, Varma doesn’t think tool pricing is the issue.  “Large companies doing FPGA design are focused on time-to-market, which is why they turned to FPGAs in the first place.  If you look at the time to write RTL for a single micro-architecture and get it to the ‘golden RTL’ stage – the time from a C model is generally several months.  If you can save even one or two months of engineering time getting to that RTL, it more than justifies our cost.” 

Andy Haines, VP of Marketing at Synfora, agrees.  “Our PICO High Level Synthesis customers can achieve results comparable to hand-coded RTL in a fraction of the time.  For designs targeting FPGAs, there is a very good fit.”  Haines explains that high-level synthesis combines well with FPGA technology to bring orders of magnitude more performance on algorithmic design than conventional or DSP processors can achieve, with comparable levels of development effort.  Synfora has focused for a number of years on bringing high-level synthesis capability into the video, imaging, and wireless areas, where metrics like raw performance, power, and performance per cost drive design teams to explore new methodologies and target technologies that can trump traditional DSP design.

Playing quietly behind the scenes in the BDTi certification efforts is Xilinx.  Both of the first two companies certified list Xilinx among their investors, and the certification process is being conducted targeting Xilinx devices and using Xilinx tools for the RTL flow.  Clearly, the findings for the efficacy of high-level synthesis are not Xilinx-specific, however.  “High-level synthesis tools bring the capabilities of FPGAs to a wider audience,” explains Tom Hill, product manager at Xilinx.  “This certification process shows that current commercial HLS tools can generate RTL that is on par with, or better than, hand-coded RTL in terms of quality-of-results, with much less time and expertise required than conventional RTL design flows.”

While HLS reduces the need for HDL/RTL expertise on your design team, BDTi says you shouldn’t expect to use HLS to turn software developers into hardware designers without additional training.  “The hardest steps for our team were getting through the RTL flow after HLS,” notes Jeff Bier, General Manager of BDTi.  “It pays to have access to designers with FPGA and/or HDL experience to handle the RTL part of the design process — and, even for the HLS part, an understanding of hardware concepts like pipelining is required to get the best results.”

The BDTi team brings years of DSP experience and expertise, but its members don’t count themselves as black-belt HDL/RTL developers.  To handle that part of the process — particularly to get the comparisons with “hand-coded” RTL — they enlisted the help of experts at Xilinx to establish the baseline for what “hand-coded” could achieve.  Looking at the situation as a whole, the HLS tools actually started at a disadvantage.  BDTi was already expert at developing and optimizing software for the DSP processors that served as the comparison standard on one side, and Xilinx experts brought considerable skill to the table in optimizing hand-coded RTL for maximum performance in Xilinx FPGAs.  Neither group entered as experts in HLS technology.  The HLS tools prevailed, however, and fared very well in the team’s subjective ratings of ease-of-use and ease-of-adoption – scoring just about on par with the well-established tool suites coming from the DSP vendor.

A number of HLS myths were busted in the process of this certification.  First, we have conclusive proof that HLS tools — in the hands of regular designers — can achieve results equivalent to or better than hand-optimized RTL written by experts.  Second, we have documentation that the amount of design effort to develop optimized FPGA hardware for a real-world algorithm is on par with the effort required to develop optimized software running on a DSP processor.  Third, we have strong evidence that the FPGA/HLS design will have orders of magnitude better performance, and better cost-per-performance, than the resulting optimized DSP design. 

A fourth finding of the process might shock many HLS skeptics.  Conventional wisdom was always that considerable code modification was required to get C or C++ to be synthesizable into optimized hardware.  While the BDTi team did have to make some minor modifications to the “pure algorithmic” descriptions in C, they actually had to make MORE modifications to the code to get it optimized on the DSP processor.  Let’s say that again – the original source code had to be changed more for a DSP processor than it did to synthesize into FPGA hardware using HLS.  I have a number of bets that need to be paid off now — you know where to send the cash… 
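As a concrete (and hypothetical) illustration of what a “minor modification” for HLS can look like: keeping loop bounds fixed at compile time and adding a tool directive is often all a small compute kernel needs.  The pragma name below is invented for illustration – each HLS tool of that era had its own directive syntax – and an ordinary C compiler simply ignores the unknown pragma, so the function still runs unchanged as software:

```c
#include <stdint.h>

#define TAPS 8

/* Pure "algorithmic" C: an 8-tap FIR dot product.  The fixed trip count
   plus the directive below are the kind of minor tweak HLS wants.  The
   pragma name is invented for illustration (real tools each define their
   own), and plain C compilers ignore it. */
int32_t fir(const int16_t coeff[TAPS], const int16_t window[TAPS])
{
    int32_t acc = 0;
    for (int i = 0; i < TAPS; i++) {
        #pragma illustrative_hls unroll  /* hint: replicate the loop body */
        acc += (int32_t)coeff[i] * (int32_t)window[i];
    }
    return acc;
}
```

Contrast that with optimizing the same kernel for a VLIW DSP, where squeezing out cycles typically means manual restructuring – intrinsics, software pipelining hints, data alignment – that touches far more of the source.  That asymmetry is exactly the surprise BDTi reported.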

Conspicuous by its absence in at least this round of the BDTi certification process is Mentor Graphics Catapult C.  Why isn’t the market-leading HLS tool in the lineup?  “The report validates what we’ve seen repeatedly over numerous customer experiences; namely, HLS tools like Catapult C offer real benefit to hardware designers,” replies Shawn McCloud, Product Line Director for High Level Synthesis at Mentor Graphics.  “Catapult is a clear market leader, as Gary Smith EDA’s recent market share numbers indicate. Catapult has grown to a 51% share of the ESL synthesis market while the overall market grew an impressive 35%.  We chose not to invest technical resources in this process, preferring rather to focus on new technology such as our recent control-logic and low-power synthesis capabilities – nevertheless, these [HLS] products are real and can deliver sizeable productivity gains (of 4x-10x).  They bring significant advantage to ASIC and FPGA development teams, not just by reducing the time to develop hardware, but by reducing the verification burden as well.” 

So what does the BDTi certification process tell us that’s new about the reality of the HLS+FPGA combo for DSP designs?  A lot.  It tells us that for high-performance designs, we’re bordering on crazy if we don’t consider today’s commercially-available HLS tools targeting FPGAs instead of stacking up rows of conventional DSP processors to meet our performance needs.  It tells us that we will need additional expertise to realize that vision – an airplane can get you across the country much faster than a car, and with a comparable amount of workload for the driver/pilot, but you will need to learn how to fly an airplane or hire a pilot before you can take advantage of that yourself.  The same thing is true of the HLS/RTL flow required to reach the productivity/performance point BDTi found with these HLS tools and FPGAs.  Since you’re designing hardware, somebody on the team will need to understand hardware design.  HLS makes that easier, but it does not eliminate the need for knowledge.

A final thing to keep in mind is that the economics of these tools are not the same as those of conventional FPGA tools.  HLS tools are complex, difficult and expensive to develop, and they serve a comparatively small audience.  They are not subsidized by the silicon vendors the way place-and-route, synthesis, and simulation tools in the standard FPGA vendor suites are.  The companies marketing these tools actually plan to make a business selling them, and their value proposition is strong – eliminating a huge amount of engineering effort (and cost) from your project and getting your product to market faster.  Be ready to pay tens to hundreds of thousands of dollars to use one of these tools for a year.  Be ready to earn that back with interest on your first design project. 

For the full results of the BDTi certification of AutoPilot and PICO Extreme FPGA, click here

