Mad MACs

Maybe you saw them in science fiction movies when you were a kid. Possibly your Hot Wheels toy car collection contained a few as well. You may have even drawn your own on your school book covers. They looked like normal race cars, but a single engine just would not do. Sometimes four, six, or eight huge power plants graced the foredeck, each with eight big straight pipes sticking out the sides like stocky legs on some steel-bodied spider, complete with a Roots blower abdomen and air scoop head. You were never able to understand quite how all those engines might work together to make the car go fast, but the message of plenteous performance was clearly conveyed by the visual impact of this plethora of pipes and plenums.

A few years later, you were in engineering school and the magic spell was broken. You learned that the complexities of coordinating so many motors for the task of moving a single vehicle caused too many problems, and the promised performance of parallel power plants fell forever into the chasm of misguided engineering fantasies.

Today, however, the process repeats itself. We all know that the performance bottleneck for most digital signal processing (DSP) algorithms is the multiply operation. The most recognized benchmark of DSP performance is, in fact, a count of multiply-accumulate operations per second, quoted in MMACs (millions of multiply-accumulates per second) or GMACs (billions of multiply-accumulates per second). DSP processors sometimes have several multiply-accumulate (MAC) units attached to a high-performance processor that runs the overall algorithm and sequences the expensive multiply operations out to the available MACs.
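For readers who want to see what is actually being counted, here is a minimal Python sketch of a direct-form FIR filter that tallies its own multiply-accumulates. The function names and the sample figures in the comments are illustrative assumptions, not taken from any vendor library or datasheet.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR filter. Returns (outputs, mac_count).

    Each inner-loop step is one multiply-accumulate (MAC):
    a multiply followed by an add into a running sum.
    """
    outputs = []
    mac_count = 0
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:  # skip taps that reach before the first sample
                acc += c * samples[n - k]
                mac_count += 1
        outputs.append(acc)
    return outputs, mac_count


def required_mmacs(num_taps, sample_rate_hz):
    """MAC rate (in MMACs) needed to run a FIR filter in real time.

    Steady state needs one MAC per tap per input sample.
    """
    return num_taps * sample_rate_hz / 1e6


# Hypothetical example: a 64-tap filter on a 1 MHz sample stream
# needs 64 MMACs before any control overhead is counted.
```

The point of the sketch is only that the MAC count scales as taps times sample rate, which is why the metric became the yardstick it is.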

How cool, then, if instead of a paltry four MAC units, your chip could have as many as four or five hundred fancy number-crunchers under the hood, ready to calculate at your command. You’d be the envy of every DSP kid on the block. While they poked along with their draggy DSP processors at 1 or 2 GMACs, you’d fly by with your fancy FPGA at speeds up to five hundred times faster.

Is this starting to sound familiar? Will piling on a passel of multipliers really make your algorithm go faster or, like the multi-engine concept car, will the reality of the silicon fail to deliver what the datasheet promised? Like most of the rhetorical questions we ask in this forum, the answer is no, yes, and maybe.

First, let’s tackle the GMACs issue. Zealous marketers at your favorite FPGA company counted the multipliers on their device, multiplied that by the theoretical maximum clock frequency and… voilà! A gazillion GMACs go down on the datasheet as the official marketing specification. Your trillion-tap FIR filter can now be implemented in a few million FPGAs with world-class results. In reality, however, you will almost never find an application that is able to fully utilize the theoretical performance implied by the GMACs metric. As my college differential equations professor would say, the Tomato Theorem applies. The Tomato Theorem for DSP performance on FPGAs states that, if you put all the possible DSP algorithms on a wall, and threw tomatoes at that wall all day, you’d never hit an algorithm that would fully and efficiently utilize a modern FPGA’s DSP resources.
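The datasheet arithmetic described above is simple enough to sketch in a few lines of Python. The function names are made up for illustration, and the clock rates and the number of multipliers actually used are hypothetical placeholders rather than figures from any datasheet; only the 512-multiplier count comes from the device sizes discussed in this article.

```python
def peak_gmacs(num_multipliers, fmax_mhz):
    """Datasheet-style peak: every multiplier busy on every clock at fmax."""
    return num_multipliers * fmax_mhz / 1000.0


def sustained_gmacs(used_multipliers, actual_mhz):
    """What a real design delivers: only the multipliers it actually
    keeps busy, at the clock rate the design actually achieves."""
    return used_multipliers * actual_mhz / 1000.0


def utilization(num_multipliers, fmax_mhz, used_multipliers, actual_mhz):
    """Fraction of the datasheet peak that the design really gets."""
    return (sustained_gmacs(used_multipliers, actual_mhz)
            / peak_gmacs(num_multipliers, fmax_mhz))


# Hypothetical example: 512 multipliers at an assumed 500 MHz fmax gives a
# 256 GMACs headline. A design that keeps 64 of them busy at an assumed
# 250 MHz sustains 16 GMACs, or about 6% of the headline number.
```

However rough the inputs, the gap between the two functions is the Tomato Theorem in numeric form.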

Although DSP performance in most FPGAs will probably never live up to the expectations set by PowerPoint, there is plenty of reason to rejoice. If your DSP algorithm would overtax a typical DSP processor, perhaps requiring two or more to get the job done, FPGAs may save your day, giving you better performance, less power consumption, and a smaller footprint for a lower price than the DSP processor solution.

While many would claim that these advantages have been available with FPGAs all along, the real boost in benefits came with the advent of the hard-wired embedded multiplier. The back room discussion may have gone something like: “Hey Gordon, if somebody makes just a few soft multipliers, it uses almost our whole dang device, so what-say we throw down a few hand-optimized ones to save ourselves some real estate and kick up the performance while we’re at it.”

Sadly, however, the debut of these blocks was less than spectacular, as most synthesis tools didn’t catch on to the idea that “C <= A*B;” might be a good place to use a hard-wired multiplier. The carefully crafted hardware sat idly on the chip sipping leakage current while vast networks of LUTs strained under the load of shifting around partial products during unintended multiply operations.
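To get a feel for why a soft multiplier eats so much fabric, consider a software model of the shift-and-add scheme. This is only an illustrative Python sketch with a made-up function name; it is not how any particular synthesis tool builds its multipliers, and it assumes unsigned operands for simplicity.

```python
def shift_add_multiply(a, b, width=18):
    """Multiply two unsigned width-bit integers by summing shifted
    partial products, one per set bit of b.

    Returns (product, partial_products_used).
    """
    mask = (1 << width) - 1
    a &= mask  # truncate both operands to the multiplier width
    b &= mask
    product = 0
    partial_products = 0
    for bit in range(width):
        if (b >> bit) & 1:
            product += a << bit  # one shifted copy of a per set bit
            partial_products += 1
    return product, partial_products
```

In hardware there is no loop to reuse: a fully parallel 18X18 multiplier needs logic for up to 18 partial products and the adders to combine them, all built from LUTs unless a hard multiplier picks up the job.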

Luckily, design tools mature quickly, and so do silicon architectures. Today, your favorite design flow will almost certainly make good use of the DSP resources no matter how ham-handed you are with your HDL code. Also, the DSP resources you’re going to be targeting are now several generations better than those early multipliers. At least three major FPGA vendors, Xilinx, Altera, and Lattice, now include DSP-specific hard cores on some of their FPGA lines. Let’s take a look at what’s hiding between the LUTs and see how these math modules stack up.

Xilinx offers DSP acceleration on just about all of their FPGA lines, from the low-cost Spartan-3 series through their newly announced Virtex-4. The 90nm Spartan-3 line includes from 4 to 32 18X18 multipliers, distributed and block RAM, and Shift Register Logic, all of which can be coordinated to pull off some impressive DSP performance at a minimal cost. Jumping up to Virtex-4, the big guns come out. Virtex-4 devices have from 32 to (ulp!) 512 “DSP48 Slices”. A DSP48 includes an 18X18 multiplier, a three-input 48-bit adder/subtracter, and provisions for just about every type of cascading, carrying, and propagating you can imagine. The Xilinx blocks are cleverly designed so that many common DSP algorithms can be implemented without using any of the normal FPGA fabric. Also, although not usually listed as a DSP feature, the MicroBlaze RISC processor fills in as a controller and can execute lower performance portions of your algorithm with a minimal hardware and power expenditure.

Over in the Altera camp, their money-miser Cyclone family missed the multiplier madness, but their soon-to-be-shipping 90nm Cyclone II line corrects that deficiency with DSP-specific hardware ranging from 13 of the 18X18 multipliers on the smallest device to 150 on the EP2C70. As Xilinx does with Virtex-4, Altera ups the ante on its flagship Stratix II family with a complex DSP block that includes multipliers, adders, subtracters, and an accumulator, making for more efficient implementations of many DSP algorithms. Stratix II has from 12 DSP blocks on the EP2S15 to 96 on the EP2S180. Each DSP block has the equivalent of four 18X18 multipliers, so we’d have to use something like a 4:1 ratio when comparing Stratix II DSP blocks with Xilinx’s DSP48 Slices.

Diagram courtesy of Lattice Semiconductor.

Lattice’s newly announced ECP-DSP low-cost DSP-biased FPGAs boast a DSP block very similar to Altera’s Stratix II, but in an economy-priced device. With four to ten of these “sysDSP” blocks (yielding from 16 to 40 18X18 multipliers), the Lattice devices deliver an impressive DSP capability at a very low price point. Lattice’s DSP architecture allows the blocks to be configured in 9-, 18-, or 36-bit widths and can be implemented as multiply, multiply-accumulate, multiply-add, or multiply-add-sum. Lattice is likely to get some solid market traction with this device, as someone evidently forgot to tell them that only the high-end FPGAs are supposed to have sophisticated MAC-based DSP hardware.
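Since each vendor counts its DSP hardware in different units, it helps to normalize everything to 18X18 multiplier equivalents before comparing. The Python sketch below does that using only the block counts and ratios quoted in this article; the dictionary layout and function name are illustrative assumptions, and the comparison deliberately ignores the adders, accumulators, and cascade paths that differ from block to block.

```python
# (min_blocks, max_blocks, 18X18 multipliers per block), as quoted in this article
FAMILIES = {
    "Xilinx Virtex-4 (DSP48 slices)": (32, 512, 1),
    "Altera Stratix II (DSP blocks)": (12, 96, 4),
    "Lattice ECP-DSP (sysDSP blocks)": (4, 10, 4),
}


def multiplier_equivalents(min_blocks, max_blocks, mults_per_block):
    """Convert a family's block-count range into 18X18 multiplier equivalents."""
    return min_blocks * mults_per_block, max_blocks * mults_per_block


if __name__ == "__main__":
    for name, spec in FAMILIES.items():
        low, high = multiplier_equivalents(*spec)
        print(f"{name}: {low} to {high} 18X18 multiplier equivalents")
```

Counted this way, the largest Stratix II comes to 384 multiplier equivalents against 512 for the largest Virtex-4, a far closer race than the raw block counts of 96 and 512 would suggest.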

If you’re looking for structured ASIC options, there are fewer choices with dedicated DSP hardware at this point. This is partially because the ASIC-like technologies reduce the benefit gained from dedicated DSP modules, and partially because DSP is not a primary target application for most structured ASIC lines. One exception to this rule is Leopard Logic, which offers 64 embedded 18X18 MACs in its Gladiator CLD6400 “Configurable Logic Device”.

Multipliers and MACs aren’t the only goodies you’ll need to get the most DSP performance out of your FPGA. It’s also important to look at quantity and types of memory, efficiency in connecting the DSP blocks together, soft-processor cores for command and control of the DSP algorithm and sequencing of the DSP-specific hardware, and I/O options for getting all that signal data on and off chip quickly.

Features in this space are evolving rapidly, so stay tuned. It has been only a couple of years since the first hard multipliers tested the water, and already we’re knee-deep in MAC mania at almost every FPGA vendor. Watch for more and higher-performance DSP-specific hardware migrating ever lower in the product line as vendors continue to redefine our expectations at each price level. Also, as we discussed in our DSP tools article, look for this hardware to become increasingly convenient to use. Are they as much fun as the Hot Wheels car with the four supercharged hemi engines? Probably not, but they are considerably more practical, as applications like digital video, image processing and software-defined radio increase our demands on signal processing performance.