
Mad MACs

Who’s Got the Best DSP Accelerators?

Maybe you saw them in science fiction movies when you were a kid. Possibly your Hot Wheels toy car collection contained a few as well. You may have even drawn your own on your school book covers. They looked like normal race cars, but a single engine just would not do. Sometimes four, six, or eight huge power plants graced the foredeck, each with eight big straight pipes sticking out the sides like stocky legs on some steel-bodied spider, complete with a Roots blower abdomen and air scoop head. You were never able to understand quite how all those engines might work together to make the car go fast, but the message of plenteous performance was clearly conveyed by the visual impact of this plethora of pipes and plenums.

A few years later, you were in engineering school and the magic spell was broken. You learned that the complexities of coordinating so many motors for the task of moving a single vehicle caused too many problems, and the promised performance of parallel power plants fell forever into the chasm of misguided engineering fantasies.

Today, however, the process repeats itself. We all know that the performance bottleneck for most digital signal processing (DSP) algorithms is the multiply operation. The most recognized benchmark of DSP performance is, in fact, a count of multiply-accumulate operations per second, expressed as MMACs (millions of multiply-accumulates per second) or GMACs (billions of multiply-accumulates per second). DSP processors sometimes have several multiply-accumulate (MAC) units attached to a high-performance processor that runs the overall algorithm and dispatches the expensive multiply operations to the available MACs.
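To see why the MAC is the unit of currency here, consider the workhorse of DSP, the FIR filter: producing one output sample costs exactly one multiply-accumulate per tap. A minimal C sketch (the tap count and data here are purely illustrative):

```c
#include <stddef.h>

/* Direct-form FIR filter: each output sample costs NTAPS
 * multiply-accumulate (MAC) operations, which is why DSP
 * throughput is quoted in MACs per second. */
#define NTAPS 4

double fir_sample(const double coeff[NTAPS], const double delay[NTAPS])
{
    double acc = 0.0;
    for (size_t i = 0; i < NTAPS; i++)
        acc += coeff[i] * delay[i];   /* one MAC per tap */
    return acc;
}
```

Run a long filter at a high sample rate and the MAC count dominates everything else the processor does.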

How cool, then, if instead of a paltry four MAC units, your chip could have as many as four or five hundred fancy number-crunchers under the hood, ready to calculate at your command. You’d be the envy of every DSP kid on the block. While they poked along with their draggy DSP processors at 1 or 2 GMACs, you’d fly by with your fancy FPGA at speeds up to five hundred times faster.

Is this starting to sound familiar? Will piling on a passel of multipliers really make your algorithm go faster or, like the multi-engine concept car, will the reality of the silicon fail to deliver what the datasheet promised? Like most of the rhetorical questions we ask in this forum, the answer is no, yes, and maybe.

First, let’s tackle the GMACs issue. Zealous marketers at your favorite FPGA company counted the multipliers on their device, multiplied that by the theoretical maximum clock frequency and… voila! A gazillion GMACs go down on the datasheet as the official marketing specification. Your trillion-tap FIR filter can now be implemented in a few million FPGAs with world-class results. In reality, however, you will almost never find an application that can fully utilize the theoretical performance implied by the GMACs metric. As my college differential equations professor would say, the Tomato Theorem applies. The Tomato Theorem for DSP performance on FPGAs states that if you put all the possible DSP algorithms on a wall and threw tomatoes at that wall all day, you’d never hit an algorithm that would fully and efficiently utilize a modern FPGA’s DSP resources.
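The marketing arithmetic itself is easy enough to reproduce. A sketch in C, using hypothetical device numbers (no real part is implied):

```c
/* Datasheet peak GMACs = multiplier count x theoretical maximum
 * clock frequency. The figure of merit is simply:
 *     GMACs = N_mult * Fmax(MHz) / 1000                         */
double peak_gmacs(int multipliers, double fmax_mhz)
{
    return multipliers * fmax_mhz / 1000.0;   /* MMACs -> GMACs */
}
/* e.g. a hypothetical part with 512 multipliers at 500 MHz gets
 * advertised at peak_gmacs(512, 500.0) = 256 GMACs -- a peak
 * that routing, memory bandwidth, and the algorithm's own
 * structure conspire to keep out of reach in practice.         */
```

The formula is correct as far as it goes; the Tomato Theorem is about everything the formula leaves out.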

Although DSP performance in most FPGAs will probably never live up to the expectations set by PowerPoint, there is plenty of reason to rejoice. If your DSP algorithm would overtax a typical DSP processor, perhaps requiring two or more to get the job done, FPGAs may save your day, giving you better performance, less power consumption, and a smaller footprint for a lower price than the DSP processor solution.

While many would claim that these advantages have been available with FPGAs all along, the real boost in benefits came with the advent of the hard-wired embedded multiplier. The back room discussion may have gone something like: “Hey Gordon, if somebody makes just a few soft multipliers, it uses almost our whole dang device, so what-say we throw down a few hand-optimized ones to save ourselves some real estate and kick up the performance while we’re at it.”

Sadly, however, the debut of these blocks was less than spectacular, as most synthesis tools didn’t catch on to the idea that “C <= A*B;” might be a good place to use a hard-wired multiplier. The carefully crafted hardware sat idly on the chip sipping leakage current while vast networks of LUTs strained under the load of shifting around partial products during unintended multiply operations.
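What those LUT networks were actually doing is classic shift-and-add multiplication: sum a shifted copy of one operand for every set bit of the other. A simplified behavioral model in C of an unsigned 18x18 soft multiplier (real fabric implementations differ in structure, but the partial-product work is the same):

```c
#include <stdint.h>

/* A soft multiplier built from LUTs effectively sums shifted
 * partial products -- the work early synthesis tools left in
 * the fabric instead of mapping "C <= A*B;" onto the hard
 * blocks. Simplified unsigned 18x18 behavioral model.        */
uint64_t soft_mult18(uint32_t a, uint32_t b)
{
    uint64_t product = 0;
    for (int bit = 0; bit < 18; bit++)
        if (b & (1u << bit))
            product += (uint64_t)a << bit;  /* add shifted partial product */
    return product;
}
```

Eighteen conditional adds of up-to-36-bit partial sums is a lot of LUTs and a lot of carry-chain delay, which is exactly why the hand-optimized hard multiplier was worth the silicon.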

Luckily, design tools mature quickly, and so do silicon architectures. Today, your favorite design flow will almost certainly make good use of the DSP resources no matter how ham-handed you are with your HDL code. Also, the DSP resources you’re going to be targeting are now several generations better than those early multipliers. At least three major FPGA vendors, Xilinx, Altera, and Lattice, now include DSP-specific hard cores on some of their FPGA lines. Let’s take a look at what’s hiding between the LUTs and see how these math modules stack up.

Xilinx offers DSP acceleration on just about all of their FPGA lines, from the low-cost Spartan-3 series through their newly announced Virtex-4. The 90nm Spartan-3 line includes from 4 to 32 18X18 multipliers, distributed and block RAM, and Shift Register Logic, all of which can be coordinated to pull off some impressive DSP performance at a minimal cost. Jumping up to Virtex-4, the big guns come out. Virtex-4 devices have from 32 to (ulp!) 512 “DSP48 Slices”. A DSP48 includes an 18X18 multiplier, a three-input 48-bit adder/subtracter, and provisions for just about every type of cascading, carrying, and propagating you can imagine. The Xilinx blocks are cleverly designed so that many common DSP algorithms can be implemented without using any of the normal FPGA fabric. Also, although not usually listed as a DSP feature, the MicroBlaze RISC processor fills in as a controller and can execute lower performance portions of your algorithm with a minimal hardware and power expenditure.
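Functionally, the core of the DSP48 datapath is an 18x18 signed multiply feeding a 48-bit add. A behavioral sketch in C of that basic multiply-add path (cascade, carry, and pipeline features deliberately omitted; this is an illustration, not a cycle-accurate model):

```c
#include <stdint.h>

/* Wrap a result to a signed 48-bit range, as a 48-bit
 * hardware adder would. */
static int64_t wrap48(int64_t x)
{
    uint64_t u = (uint64_t)x & 0xFFFFFFFFFFFFULL;  /* keep low 48 bits  */
    if (u & 0x800000000000ULL)                     /* sign-extend bit 47 */
        u |= 0xFFFF000000000000ULL;
    return (int64_t)u;
}

/* Behavioral sketch of a DSP48-style multiply-add:
 * P = A*B + C, with an 18x18 signed multiply feeding
 * a 48-bit adder. Cascade/carry logic omitted.       */
int64_t dsp48_madd(int32_t a18, int32_t b18, int64_t c48)
{
    int64_t product = (int64_t)a18 * (int64_t)b18;  /* 18x18 -> 36 bits  */
    return wrap48(c48 + product);                   /* 48-bit accumulate */
}
```

Feed the C input back from P and the same datapath becomes a MAC; chain the cascade ports (not modeled here) and it becomes a systolic FIR tap.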

Over in the Altera camp, their money-miser Cyclone family missed the multiplier madness, but their soon-to-be-shipping 90nm Cyclone II line corrects that deficiency with DSP-specific hardware ranging from 13 18X18 multipliers on the smallest device to 150 on the EP2C70. Like Xilinx, their flagship Stratix II family ups the ante with a complex DSP block that includes multipliers, adders, subtracters, and an accumulator, enabling more efficient implementations of many DSP algorithms. Stratix II has from 12 DSP blocks on the EP2S15 to 96 on the EP2S180. Each DSP block has the equivalent of four 18X18 multipliers, so comparing Stratix II DSP blocks with Xilinx’s DSP48 slices calls for something like a 4:1 ratio.

Diagram courtesy of Lattice Semiconductor.

Lattice’s newly announced ECP-DSP low-cost DSP-biased FPGAs boast a DSP block very similar to Altera’s Stratix II, but in an economy-priced device. With from four to ten of these “sysDSP” blocks (yielding from 16 to 40 18X18 multipliers), the Lattice devices deliver an impressive DSP capability at a very low price point. Lattice’s DSP architecture allows the blocks to be configured in 9-, 18-, or 36-bit widths and to implement multiply, multiply-accumulate, multiply-add, or multiply-add-sum operations. Lattice is likely to get some solid market traction with this device, as someone evidently forgot to tell them that only the high-end FPGAs are supposed to have sophisticated MAC-based DSP hardware.

If you’re looking for structured ASIC options, there are fewer choices at this point with dedicated DSP hardware. This is partially because the ASIC-like technologies reduce the benefit gained from dedicated DSP modules, and partially because DSP is not a primary target application for most structured ASIC lines. One exception to this rule is Leopard Logic, which offers 64 embedded 18X18 MACs in its Gladiator CLD6400 “Configurable Logic Device”.

Multipliers and MACs aren’t the only goodies you’ll need to get the most DSP performance out of your FPGA. It’s also important to look at quantity and types of memory, efficiency in connecting the DSP blocks together, soft-processor cores for command and control of the DSP algorithm and sequencing of the DSP-specific hardware, and I/O options for getting all that signal data on and off chip quickly.

Features in this space are evolving rapidly, so stay tuned. It has been only a couple of years since the first hard multipliers tested the water, and already we’re knee-deep in MAC mania at almost every FPGA vendor. Watch for more and higher-performance DSP-specific hardware migrating ever lower in the product line as vendors continue to redefine our expectations at each price level. Also, as we discussed in our DSP tools article, look for this hardware to become increasingly convenient to use. Are they as much fun as the Hot Wheels car with the four supercharged hemi engines? Probably not, but they are considerably more practical, as applications like digital video, image processing, and software-defined radio increase our demands on signal processing performance.
