Domesticating DSP

In the good old days (those would be the EARLY 2000s), digital signal processing (DSP) was a well-behaved wild animal. It stayed outside in the pasture, grazed off the land, and never harmed the house pets. DSP didn’t disturb the neighbors and didn’t bite unless provoked. If we had a big, complex system, we often hired a specialist, a kind of DSP whisperer, to handle the care and feeding of our little DSP. He knew all sorts of tricks and techniques for training and taming the little fellows. He spoke MATLAB. He was fluent in DSP processor assembly. He was one with the s-plane.

Lately, though, we’ve needed DSP more often in our day-to-day design lives. We’ve been inviting DSP into the yard, and even into the house occasionally. Increased integration, tighter power budgets, greater cost consciousness, and more performance-hungry algorithms have rallied us to rethink and de-segregate the DSP-like functions in many of our system designs. The tighter relationship between core applications and massive streams of data in areas such as video and wireless has caused us to call into question the practice of throwing down a special-purpose processor and phoning in the DSP guy. We have to learn to handle DSP ourselves.

If you’re a digital designer, you may have forked away from the path of the continuous-looking mathematical function at about the same time that Laplace transforms kicked in. Your RTL would never betray you like that. Logic design was domesticated, domain and range both under control, discretion assured – nothing imaginary going on. You could work away with your Karnaugh maps, truth tables, and bubble diagrams safe from those scary squiggly lines of frequency response and protected from your polynomial paranoia.

Now, however, DSP’s cornucopia of coefficients is invading your well-organized binary abode, threatening to thwart your processor with mounds of mathematical chores and paralyze your peripheral bus with a deluge of data. You have to be ready. You need to understand your options, to own a roadmap, to have a strategy. Unfortunately, as fast as you figure out what’s going on, the rules of the game change. Processors get faster. FPGAs get cheaper. Video resolution increases. Power budgets drop. It’s difficult to nail anything down for a starting point. That’s why we’re here today.

If you want to learn the language of DSP, you should probably start by introducing yourself to The MathWorks. As unofficial keepers of all things related to automated mathematical analysis, they are also widely regarded as the go-to company for the first stage of any DSP project. Their franchise application, MATLAB, is to DSP what Excel is to the accountant, what Photoshop is to the photographer, what Word is to the technology journalist. That’s right, it’s the program whose name you are screaming the moment before your expensive 24″ flat-screen monitor flies out through your fourth-story window. OK, just kidding about that part. The point is, MATLAB is where, more than likely, all the dirty work of your DSP project begins.

Not content to simply help you with the math homework portion of your project, The MathWorks created a product called Simulink that allows graphical assembly of your algorithm from pre-constructed models. By plugging together various modules like filters, FFTs, or some of their newer specialized blocks for tasks like video processing and image analysis, you can create an accurate simulation of your math-intensive algorithm that can then be converted to software or hardware by the method of your choice.

If you’re planning to take your algorithm to something that works better in fixed point than floating point (like hardware, for example), Simulink is also designed to let you experiment with that conversion and see the effects on various aspects of system-level performance – bit error rate in communications systems, non-linearities or other distortions in conventional signal processing, or other types of fidelity loss depending on the nature of your application. Simulink allows you to play what-if and iterate quickly to see the results of various bit width optimizations.
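To get a feel for that what-if game outside of any particular tool, here is a minimal Python sketch. It is not Simulink's method, just plain arithmetic under stated assumptions: a signed fixed-point format with one sign bit and the rest fraction bits, round-to-nearest with saturation, and a made-up test tone. The function names `quantize` and `sqnr_db` are hypothetical, invented for this illustration.

```python
import math

def quantize(x, frac_bits, total_bits):
    """Round x to a signed fixed-point value, saturating at the format limits."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))        # most negative integer code
    hi = (1 << (total_bits - 1)) - 1     # most positive integer code
    code = max(lo, min(hi, round(x * scale)))
    return code / scale

def sqnr_db(signal, total_bits):
    """Signal-to-quantization-noise ratio in dB for a signal within [-1, 1)."""
    frac_bits = total_bits - 1           # assumed format: 1 sign bit, rest fraction
    noise = [s - quantize(s, frac_bits, total_bits) for s in signal]
    p_signal = sum(s * s for s in signal)
    p_noise = sum(n * n for n in noise)
    return 10 * math.log10(p_signal / p_noise)

# Hypothetical test tone: 0.9-amplitude sine, 1000 samples
signal = [0.9 * math.sin(2 * math.pi * 0.0123 * n) for n in range(1000)]

for bits in (8, 12, 16):
    print(bits, "bits:", round(sqnr_db(signal, bits), 1), "dB")
```

Each extra bit should buy roughly 6 dB of signal-to-noise ratio, which is the kind of trade you would weigh against the hardware cost of a wider datapath.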

Once your algorithm is developed and refined, there are several strategies with differing levels of processing power available depending on your application’s number-crunching needs. At the simplest level, you can just load software routines onto your primary processor. If you’re using a multi-tasking RTOS, you have processor bandwidth to spare, and your signal processing needs are very modest, you can probably handle the whole problem with a nice, tightly-coded software routine. This approach gives you the most flexibility (software) and the least additional cost (zero), and it is the easiest to implement, assuming you have the cycles available from your processing engine.

Of course, if you could get by with lame advice like that, you wouldn’t really need articles like this, would you? Most of the applications that are actually interesting require more processing power than your processors have left over. This is where the choices start to get interesting. In recent years, the difference between specialized processors for DSP and conventional processors has grown increasingly narrow. Companies like ARM have encroached on that gap even more by adding DSP-specific options to their conventional processor cores. For example, in 2004 ARM announced what it calls NEON technology – a 64/128-bit SIMD (single instruction multiple data) instruction set designed to provide acceleration for data-intensive signal and video processing applications. Available in their new Cortex-A8 processor, the NEON engine gives a significant boost to DSP-like applications.

If an extended conventional processor won’t do your trick, you may be in the increasingly narrow band where a traditional DSP processor offers the best mix of capability, cost, and power consumption. Conventional DSP is being squeezed from both sides these days, however, with increased performance capabilities like NEON boosting conventional processors up into the low end of DSP’s traditional turf, and lower cost points with easier design flows bringing hardware-accelerated solutions like FPGAs down into DSP’s high end.

Because the traditional DSP is the road most traveled, we’ll gloss over it for now. There’s more than enough literature on the market documenting your path from algorithm to working system using these highly-capable devices. However, when you start to stress your standalone DSP, it may be time to start thinking about accelerating your algorithm in hardware. That’s where our discussion becomes particularly exciting.

For several years now, FPGA companies have been boosting the capability of their devices by adding dedicated math hardware aimed at accelerating DSP applications. Because the tall pole in the math tent is traditionally multiplication or, more often yet, multiplication feeding an accumulator, FPGA companies began dropping droves of hardware multipliers and/or multiply-accumulate (MAC) units into their programmable platforms. By allowing these math functions to be heavily parallelized, functions like FIR filters could be accelerated to dozens of times the speeds achievable with a pure software approach using standard or even special-purpose DSP processors.
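To see why the MAC is the tall pole, consider a minimal Python sketch of a direct-form FIR filter. It is a software illustration only, not any vendor's implementation; the function name `fir_filter` and the integer tap values are hypothetical. Every output sample costs one multiply-accumulate per tap. A processor grinds through that inner loop one tap at a time, while an FPGA with enough MAC units can perform all the taps at once.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR: each output sample is one multiply-accumulate per tap."""
    history = [0] * len(coeffs)          # delay line, newest sample first
    out = []
    for x in samples:
        history = [x] + history[:-1]     # shift the new sample into the delay line
        acc = 0
        for h, s in zip(coeffs, history):
            acc += h * s                 # one MAC; parallel hardware does all taps at once
        out.append(acc)
    return out

coeffs = [1, 2, 3, 2, 1]                 # hypothetical integer taps
impulse = [1, 0, 0, 0, 0, 0]
print(fir_filter(impulse, coeffs))       # an impulse input replays the taps
```

With five taps, software needs five sequential MACs per sample; with 64 or 256 taps the gap between the serial loop and a fully parallel datapath grows accordingly.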

Altera and Xilinx threw the first punches in this battle to woo the DSP guy onto the chip. Each company added math hardware to their high-end FPGAs. Then, in the spirit of friendly competition for which they’re both well known, they each began boosting the capability of that hardware, adding more of it, and marketing the heck out of the idea that theirs was the biggest, baddest, fastest, meanest DSP platform on the block. Not content with battling just for the high end, they both carried the battle even into their value-based low-cost offerings, now fighting for signal-processing supremacy with lines like Spartan-3 and Cyclone II in addition to chips starting with Virtex- or Stratix-. Not to be left out of a fast-paced party, Lattice Semiconductor rolled out their new 90nm lines with the DSP hardware all loaded into the low-cost LatticeECP2. Now, if you’re looking for low-cost DSP power in an FPGA, you’ve got at least three vendors to choose from.

This super-duper parallelized datapath brings more than a few disadvantages with it, however. First, being structured as hardware instead of software, it requires a completely different design approach. Floating point functions typically have to be replaced with fixed point functions, and an understanding of hardware microarchitectures combined with a good helping of patience is required to get your design up and running.

Luckily, tool and IP vendors’ eyes light up with dollar signs any time they spot a new engineering challenge. Numerous companies have flocked to the field of competition, and you can now whip up a DSP design for your FPGA hardware using a variety of mechanisms, from model-based graphical assembly in Simulink and similar environments to behavioral-level programming in C, MATLAB’s M language, or in various flavors of hardware description languages.

Each FPGA vendor partners with third parties such as Synplicity, Mentor Graphics, Celoxica, and The MathWorks for tools to facilitate the fabrication of hardware to scorch through your signal processing problem. Synplicity’s Synplify DSP starts with a Simulink specification (created using Synplicity’s library of models) and performs high-level optimizations resulting in optimized FPGA hardware. Mentor Graphics’ Catapult-C starts with algorithmic ANSI-standard C and creates synthesizable RTL code that can be taken into FPGA design at the synthesis step. Celoxica’s tools start with Handel-C or SystemC and synthesize down to FPGA hardware as well.

Each FPGA vendor also has its own tools to assist you with signal and video processing. Xilinx, Altera, and Lattice all offer model-based design flows, and both Xilinx and Altera have recently announced high-level synthesis approaches as well – Xilinx via their acquisition of AccelChip, and Altera through their newly announced C2H compiler to complement their Nios II soft-processor environment.

Remember that there’s more to the hardware acceleration picture than simple parallelism, however. Before you get hypnotized by hype about the gazillions of gigaflops you can gain by parallelizing your process across millions of multipliers, keep in mind that you also have to design I/O hardware capable of getting data into and out of that newly created number cruncher. If you’re creating a simple hardware accelerator for a conventional processor, you can quickly hit a bottleneck with the data going through the processor bus to get to the accelerating hardware. If you’re moving a lot of data, you’ll want to opt for some kind of direct FIFO or other shared memory structure instead of spoon feeding the DSP function from the main controller. Be sure to balance the performance you design into your custom accelerator with the bandwidth of the connections you’re using to feed it, or you’ll spend a lot on hardware that’s idle most of the time.
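A back-of-the-envelope check makes the point concrete. The Python sketch below is a deliberately crude model, not a measurement of any real device: it assumes the accelerator needs a fixed number of fresh bytes per MAC with no data reuse, and every number in the example (64 MACs per cycle, 100 MHz, 2 bytes per MAC, a 100 MB/s link) is hypothetical. The function name `accelerator_utilization` is invented for this illustration.

```python
def accelerator_utilization(macs_per_cycle, clock_hz, bytes_per_mac, link_bytes_per_s):
    """Fraction of time the datapath can stay busy given the link feeding it.

    Crude model: every MAC needs bytes_per_mac of fresh data (no reuse),
    so the link caps how many MACs per second can actually be fed.
    """
    demand_bytes_per_s = macs_per_cycle * clock_hz * bytes_per_mac
    return min(1.0, link_bytes_per_s / demand_bytes_per_s)

# Hypothetical accelerator: 64 MACs per cycle at 100 MHz, 2 bytes per MAC
spoon_fed = accelerator_utilization(64, 100e6, 2, 100e6)      # 100 MB/s bus
direct_fifo = accelerator_utilization(64, 100e6, 2, 12.8e9)   # link matched to demand

print("spoon-fed over the bus:", spoon_fed)
print("matched direct link:   ", direct_fifo)
```

Under these assumptions the bus-fed accelerator sits idle more than 99 percent of the time, which is a lot of multipliers to pay for and not use. Real designs reuse data and coefficients, so the true demand is usually lower, but the balancing exercise is the same.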

Also remember, this technology is changing fast. The approach that gave the best results a year ago is almost certainly not what will serve you best today, and today’s approach will probably begin to be obsolete before the next round of articles is published. Just in the past couple of years the tools for DSP have improved dramatically in performance, ease of use, and quality of results. Conventional processors have stepped up their DSP capability considerably, and the cost and power consumption of DSP-capable FPGA hardware have dropped dramatically. With the underlying variables changing this fast, the coefficients of your equation for signal processing success must also be regularly re-evaluated. Like the wild animal that is being domesticated for the first time, DSPs require you to be cautious. This breed can turn on you.