feature article

Mad MACs

Who’s Got the Best DSP Accelerators?

Maybe you saw them in science fiction movies when you were a kid. Possibly your Hot Wheels toy car collection contained a few as well. You may have even drawn your own on your school book covers. They looked like normal race cars, but a single engine just would not do. Sometimes four, six, or eight huge power plants graced the foredeck, each with eight big straight pipes sticking out the sides like stocky legs on some steel-bodied spider, complete with a Roots blower abdomen and air scoop head. You were never able to understand quite how all those engines might work together to make the car go fast, but the message of plenteous performance was clearly conveyed by the visual impact of this plethora of pipes and plenums.

A few years later, you were in engineering school and the magic spell was broken. You learned that the complexities of coordinating so many motors for the task of moving a single vehicle caused too many problems, and the promised performance of parallel power plants fell forever into the chasm of misguided engineering fantasies.

Today, however, the process repeats itself. We all know that the performance bottleneck for most digital signal processing (DSP) algorithms is the multiply operation. The most recognized benchmark of DSP performance is, in fact, a count of multiply-accumulate operations per second, quoted in MMACs (millions of multiply-accumulates per second) or GMACs (billions of multiply-accumulates per second). DSP processors sometimes have several multiply-accumulate (MAC) units attached to a high-performance processor that runs the overall algorithm and sequences the expensive multiply operations out to the available MACs.
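To see why the MAC is the unit of currency, it helps to count them in a simple filter. The sketch below is only an illustration, assuming a direct-form FIR filter; the function names `fir_filter` and `required_mmacs` are hypothetical and not taken from any vendor tool.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR filter that also counts multiply-accumulates.

    Every output sample costs one MAC per tap (coefficient).
    """
    history = [0] * len(coeffs)   # delay line, newest sample first
    outputs = []
    mac_count = 0
    for x in samples:
        history = [x] + history[:-1]      # shift the new sample in
        acc = 0
        for c, h in zip(coeffs, history):
            acc += c * h                  # one multiply-accumulate
            mac_count += 1
        outputs.append(acc)
    return outputs, mac_count


def required_mmacs(num_taps, sample_rate_hz):
    """MMACs needed to run a num_taps FIR filter at the given sample rate."""
    return num_taps * sample_rate_hz / 1e6


if __name__ == "__main__":
    out, macs = fir_filter([1, 2, 3], [1, 1])
    print(out, macs)
    print(required_mmacs(64, 1_000_000))
```

The workload grows as taps times sample rate, which is why a long filter at a high sample rate quickly outruns a processor with only a handful of MAC units.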

How cool, then, if instead of a paltry four MAC units, your chip could have as many as four or five hundred fancy number-crunchers under the hood, ready to calculate at your command. You’d be the envy of every DSP kid on the block. While they poked along with their draggy DSP processors at 1 or 2 GMACs, you’d fly by with your fancy FPGA at speeds up to five hundred times faster.

Is this starting to sound familiar? Will piling on a passel of multipliers really make your algorithm go faster or, like the multi-engine concept car, will the reality of the silicon fail to deliver what the datasheet promised? Like most of the rhetorical questions we ask in this forum, the answer is no, yes, and maybe.

First, let’s tackle the GMACs issue. Zealous marketers at your favorite FPGA company counted the multipliers on their device, multiplied that by the theoretical maximum clock frequency and… voilà! A gazillion GMACs go down on the datasheet as the official marketing specification. Your trillion-tap FIR filter can now be implemented in a few million FPGAs with world-class results. In reality, however, you will almost never find an application that is able to fully utilize the theoretical performance implied by the GMACs metric. As my college differential equations professor would say, the Tomato Theorem applies. The Tomato Theorem for DSP performance on FPGAs states that, if you put all the possible DSP algorithms on a wall, and threw tomatoes at that wall all day, you’d never hit an algorithm that would fully and efficiently utilize a modern FPGA’s DSP resources.
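The datasheet arithmetic described above is easy to reproduce. The following is a rough sketch, not any vendor's published method: the clock frequencies and the number of multipliers actually used are hypothetical values chosen for illustration, and only the 512-multiplier count comes from figures quoted in this article.

```python
def peak_gmacs(num_multipliers, clock_mhz):
    """Datasheet-style peak: every multiplier busy on every clock cycle."""
    return num_multipliers * clock_mhz / 1000.0


def effective_gmacs(used_multipliers, achieved_clock_mhz):
    """What a real design delivers: only the multipliers it actually uses,
    at the clock rate the placed-and-routed design actually achieves."""
    return used_multipliers * achieved_clock_mhz / 1000.0


if __name__ == "__main__":
    # Hypothetical numbers: 512 multipliers at an assumed 400 MHz maximum.
    peak = peak_gmacs(512, 400)
    # Hypothetical design: 100 multipliers in use, closing timing at 250 MHz.
    real = effective_gmacs(100, 250)
    print(f"peak: {peak:.1f} GMACs")
    print(f"achieved: {real:.1f} GMACs ({100 * real / peak:.0f}% of peak)")
```

Both inputs to the peak figure are best cases, so the gap between the two numbers is the Tomato Theorem expressed in arithmetic.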

Although DSP performance in most FPGAs will probably never live up to the expectations set by PowerPoint, there is plenty of reason to rejoice. If your DSP algorithm would overtax a typical DSP processor, perhaps requiring two or more to get the job done, FPGAs may save your day, giving you better performance, less power consumption, and a smaller footprint for a lower price than the DSP processor solution.

While many would claim that these advantages have been available with FPGAs all along, the real boost in benefits came with the advent of the hard-wired embedded multiplier. The back room discussion may have gone something like: “Hey Gordon, if somebody makes just a few soft multipliers, it uses almost our whole dang device, so what-say we throw down a few hand-optimized ones to save ourselves some real estate and kick up the performance while we’re at it.”

Sadly, however, the debut of these blocks was less than spectacular, as most synthesis tools didn’t catch on to the idea that “C <= A*B;” might be a good place to use a hard-wired multiplier. The carefully crafted hardware sat idly on the chip sipping leakage current while vast networks of LUTs strained under the load of shifting around partial products during unintended multiply operations.

Luckily, design tools mature quickly, and so do silicon architectures. Today, your favorite design flow will almost certainly make good use of the DSP resources no matter how ham-handed you are with your HDL code. Also, the DSP resources you’re going to be targeting are now several generations better than those early multipliers. At least three major FPGA vendors, Xilinx, Altera, and Lattice, now include DSP-specific hard cores on some of their FPGA lines. Let’s take a look at what’s hiding between the LUTs and see how these math modules stack up.

Xilinx offers DSP acceleration on just about all of their FPGA lines, from the low-cost Spartan-3 series through their newly announced Virtex-4. The 90nm Spartan-3 line includes from 4 to 32 18X18 multipliers, distributed and block RAM, and Shift Register Logic, all of which can be coordinated to pull off some impressive DSP performance at a minimal cost. Jumping up to Virtex-4, the big guns come out. Virtex-4 devices have from 32 to (ulp!) 512 “DSP48 Slices”. A DSP48 includes an 18X18 multiplier, a three-input 48-bit adder/subtracter, and provisions for just about every type of cascading, carrying, and propagating you can imagine. The Xilinx blocks are cleverly designed so that many common DSP algorithms can be implemented without using any of the normal FPGA fabric. Also, although not usually listed as a DSP feature, the MicroBlaze RISC processor fills in as a controller and can execute lower performance portions of your algorithm with a minimal hardware and power expenditure.
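The bit widths quoted for the DSP48 (18X18 multiply, 48-bit add) can be explored with a small behavioral model. This is a simplified sketch under stated assumptions, not a model of the actual Xilinx block: it ignores pipeline registers, operating modes, and cascade paths, and the helper names `to_signed` and `mac18x18` are hypothetical.

```python
def to_signed(value, bits):
    """Wrap an integer into the two's-complement range of `bits` bits."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def mac18x18(acc, a, b):
    """One multiply-accumulate step: signed 18-bit operands, 48-bit result.

    The product of two 18-bit signed values fits in 36 bits, so a 48-bit
    accumulator leaves 12 bits of headroom before it wraps around.
    """
    a = to_signed(a, 18)
    b = to_signed(b, 18)
    return to_signed(acc + a * b, 48)


if __name__ == "__main__":
    acc = 0
    for a, b in [(3, -4), (100, 200), (-7, -7)]:
        acc = mac18x18(acc, a, b)
    print(acc)
```

The wide accumulator is the point of the exercise: long sums of products can stay inside the block without spilling into general-purpose fabric for extra bits.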

Over in the Altera camp, their money-miser Cyclone family missed the multiplier madness, but their soon-to-be-shipping 90nm Cyclone II line corrects that deficiency with DSP-specific hardware ranging from 13 18X18 multipliers on the smallest device to 150 on the EP2C70. Like Xilinx, their flagship Stratix II family ups the ante with a complex DSP block that includes multipliers, adders, subtracters, and an accumulator, making more efficient implementations of many DSP algorithms. Stratix II has from 12 DSP blocks on their EP2S15 to 96 on the EP2S180. Each DSP block has the equivalent of four 18X18 multipliers, so we’d have to use something like a 4:1 ratio comparing Stratix II DSP blocks with Xilinx’s DSP48 Slices.

Diagram courtesy of Lattice Semiconductor.

Lattice’s newly announced ECP-DSP low-cost DSP-biased FPGAs boast a DSP block very similar to Altera’s Stratix II, but in an economy-priced device. With four to ten of these “sysDSP” blocks (yielding 16 to 40 18X18 multipliers), the Lattice devices deliver an impressive DSP capability at a very low price point. Lattice’s DSP architecture allows the blocks to be configured in 9-, 18-, or 36-bit widths and can be implemented as multiply, multiply-accumulate, multiply-add, or multiply-add-sum. Lattice is likely to get some solid market traction with this device, as someone evidently forgot to tell them that only the high-end FPGAs are supposed to have sophisticated MAC-based DSP hardware.
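Because each vendor counts its DSP hardware in different units, a fair comparison means converting everything to 18X18 multiplier equivalents. The sketch below does that using only the block counts and per-block ratios quoted in this article; the table layout and the function name `multiplier_equivalents` are illustrative assumptions, and raw multiplier count says nothing about clock speed or how well the blocks cascade.

```python
# (blocks on the device, 18x18 multipliers per block), as quoted in the article
DEVICES = {
    "Xilinx Virtex-4 (largest)": (512, 1),   # DSP48 slices, one multiplier each
    "Altera Stratix II EP2S180": (96, 4),    # DSP blocks, four multipliers each
    "Altera Stratix II EP2S15": (12, 4),
    "Altera Cyclone II EP2C70": (150, 1),    # plain embedded multipliers
    "Lattice ECP-DSP (largest)": (10, 4),    # sysDSP blocks, four multipliers each
    "Lattice ECP-DSP (smallest)": (4, 4),
}


def multiplier_equivalents(blocks, multipliers_per_block):
    """Convert a vendor's block count into 18x18 multiplier equivalents."""
    return blocks * multipliers_per_block


if __name__ == "__main__":
    for name, (blocks, per_block) in DEVICES.items():
        total = multiplier_equivalents(blocks, per_block)
        print(f"{name:28s} {total:4d} x 18x18")
```

Normalized this way, 96 Stratix II DSP blocks work out to 384 multipliers, which is the 4:1 adjustment needed before setting them beside a DSP48 slice count.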

If you’re looking for structured ASIC options, there are fewer choices at this point with dedicated DSP hardware. This is partially because the ASIC-like technologies reduce the benefit gained from dedicated DSP modules, and partially because DSP is not a primary target application for most structured ASIC lines. One exception to this rule is Leopard Logic, which offers 64 embedded 18X18 MACs in its Gladiator CLD6400 “Configurable Logic Device”.

Multipliers and MACs aren’t the only goodies you’ll need to get the most DSP performance out of your FPGA. It’s also important to look at quantity and types of memory, efficiency in connecting the DSP blocks together, soft-processor cores for command and control of the DSP algorithm and sequencing of the DSP-specific hardware, and I/O options for getting all that signal data on and off chip quickly.

Features in this space are evolving rapidly, so stay tuned. It has been only a couple of years since the first hard multipliers tested the water, and already we’re knee-deep in MAC mania at almost every FPGA vendor. Watch for more and higher-performance DSP-specific hardware migrating ever lower in the product line as vendors continue to redefine our expectations at each price level. Also, as we discussed in our DSP tools article, look for this hardware to become increasingly convenient to use. Are they as much fun as the Hot Wheels car with the four supercharged hemi engines? Probably not, but they are considerably more practical, as applications like digital video, image processing and software-defined radio increase our demands on signal processing performance.
