GMACs GAP

Multiplication is our big problem. We need to multiply. We need to multiply integers. We need to multiply fixed-point numbers. We need to multiply floating-point numbers. We need to multiply complex numbers. But we also need to multiply other things. In designing embedded systems, we constantly need to multiply our productivity.

Over the past 40 years, Moore’s Law has given us exponential improvement on the three “Ps”: Power, Performance, and Price. Transistor counts have reached the billions, frequencies have raced into the Gigahertz, and the power consumed for each bit of data we process has dropped exponentially as well. All of these things have been multiplied for us by the semiconductor industry. As embedded systems designers, we just coast along for the ride.

Unfortunately, this creates a problem for us as electronic system designers. We need a Moore’s Law for productivity as well. With every new process node, we face the challenge of designing more capability into our products on basically the same schedule. In order for our productivity to keep up with demands, we need to multiply ourselves as well.

The way that digital designers have always multiplied productivity is to raise the level of design abstraction. We’ve gone from designing with individual transistors to stitching together vast networks of logic gates to building register-transfer-level descriptions with larger blocks (like multipliers), and finally on to assembling really large IP blocks like processors, memories, peripherals, and interconnect structures.

Using this kind of abstraction, we are able to keep up with Moore’s Law for most classes of problems. We can buy a pre-fabricated COTS board with everything we need, and just add software. We can throw down a few discrete devices on a board and add some code for a more custom solution. Using today’s programmable logic devices, we can even build a whole system on an FPGA without leaving our desk. With a few mouse clicks, we can create a complete system-on-chip in minutes that comprises millions of gates of logic, including a processor (or processors), memory, and peripherals. We can probably even download many of the software components we need – operating systems, middleware like I/O stacks, even application frameworks, GUI builders, and databases.

Everything is on track… until we hit that little asterisk that means multiplication.

Why is multiplication the big problem? Because it falls into an inconvenient place on the performance/productivity line. For many classes of algorithms, the raw processing performance measurement is “GMACs” – billions of multiply-accumulate operations per second. Multiplication is the most expensive operation in most of these algorithms. If you measure the time spent in multiplication, or the power consumed by multiplication operations, or the chip area consumed by hardware dedicated to multiplication, it is much larger than the other elements. We tend to get algorithms that look something like this:
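As a minimal sketch of such a kernel, assume a simple FIR filter in 16-bit fixed point. The function names, the data widths, and the choice of filter are illustrative assumptions rather than part of any particular design; the point is only that every pass through the inner loop is one multiply-accumulate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: one output sample of an N-tap FIR filter.
 * Each loop iteration is one multiply-accumulate (MAC), so one output
 * sample costs num_taps MACs. A 64-bit accumulator is assumed so the
 * sum of 16-bit x 16-bit products cannot overflow for realistic N. */
int64_t fir_sample(const int16_t *coeffs, const int16_t *history,
                   size_t num_taps)
{
    assert(coeffs != NULL && history != NULL);

    int64_t acc = 0;
    for (size_t i = 0; i < num_taps; i++) {
        acc += (int64_t)coeffs[i] * (int64_t)history[i];  /* one MAC */
    }
    return acc;
}

/* MAC rate this filter demands, in GMACs:
 * (MACs per sample) * (samples per second) / 1e9. */
double required_gmacs(size_t num_taps, double samples_per_second)
{
    return (double)num_taps * samples_per_second / 1e9;
}
```

Under these assumptions, a 256-tap filter running at 100 million samples per second would need 25.6 GMACs for a single channel.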

There is a gap in the computing performance spectrum for this class of design. For algorithms that require a small number of GMACs, conventional software development techniques can be used. If we need a little more, we can rely on very fast processors or multi-core to get the job done. In those cases, all that complex IP we talked about earlier makes us productive enough to keep up with Moore’s Law. When we get a beast like this that has to produce something approaching triple-digit GMACs, however, everything pivots around multiplication. Our productivity accelerators grind to a halt.

When we need GMACs in the upper ranges, we have to jump over the gap from software solutions to hardware acceleration. Traditionally, this means that we have much longer design cycles and much more challenging design problems. Our engineers now need hardware expertise – they have to create microarchitectures complete with datapath and control, often with complex constructs like pipelines and resource sharing, and implement them in register-transfer-level descriptions in languages like VHDL or Verilog. The whole solution has to be hosted on something like an ASIC or FPGA.

Traditionally, projects that have faced this problem have fallen back on tried-and-true techniques of digital design – employing experts at RTL to create a hardware solution that parallelizes the algorithm sufficiently to break the GMACs barrier. Usually, anywhere from dozens to hundreds of hardware multipliers or multiply-accumulate units are laid down on a chip, and the design team sets about creating a datapath with the correct amount of parallelism, pipelining, resource sharing, and chaining to get the computation done in the required time, with the required power consumption, and at the required cost. Given the correct hardware resources, a skilled designer can work miracles balancing the level of parallelism and latency against the design requirements. The performance and efficiency of traditional von Neumann architectures can be exceeded by several orders of magnitude.

The big penalty here is the loss of productivity. Creating a custom hardware accelerator can require literally ten to a hundred times the engineering effort of a software implementation. In order to keep our productivity leverage, we need techniques and tools that will allow us to span the hardware/software performance gap without incurring the software/hardware productivity penalty. This is exactly what a class of so-called “ESL” tools is designed to do.

Using high-level synthesis tools (also historically called behavioral synthesis, hardware compilers, ESL synthesis, algorithmic synthesis, architectural exploration tools, and a variety of other names) we take an algorithm and try to automate the generation of the detailed hardware architecture. While this sounds simple, it’s not as easy as running some C code through a special compiler that pumps out optimized hardware. If it were, a lot of hardware engineers could head right on home and get their real-estate licenses, because their services would no longer be required.

These tools take a variety of approaches to the algorithm acceleration problem. Some ask you to code your algorithm in a parallel language or in a conventional language with different semantics or additions that allow you to specify parallelism explicitly. Others allow you to code in conventional languages like C or C++, but have extensions that allow you to guide the tool in making hardware architecture decisions. Still others allow you to use straight software-like sequential descriptions of the algorithm and then control the architectural decisions from within a GUI.

The reasons for all this hedging and add-on stuff are simple. There is no “optimal” solution for any given algorithm. The best hardware architecture depends on a large number of design goal variables like available chip area, latency required, throughput desired, power consumption allowed, and so forth. If, for example, you’ve got a nested loop that passes through multiplication two thousand times, you probably don’t want a fully-parallel implementation that requires two thousand separate hardware multipliers on your chip – although that might be the optimal solution for maximum throughput and minimum latency. On the other hand, a single multiplier shared between all those multiplications would likely give a very cost-effective solution at the expense of latency and throughput. Between those two solutions are an enormous number of potential architectures in the tradeoff space between chip area, latency, throughput, and power consumption.
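The two extremes and the space between them can be put into rough numbers. The sketch below assumes an idealized model in which every multiplier finishes one multiplication per clock cycle and all the multiplications are independent; it ignores pipeline fill, data dependencies, and memory stalls, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Cycles needed to perform total_mults multiplications when
 * num_multipliers hardware multipliers are shared among them.
 * Idealized: one result per multiplier per cycle, no stalls. */
size_t cycles_needed(size_t total_mults, size_t num_multipliers)
{
    assert(num_multipliers > 0);
    /* ceiling division: a partly used final cycle still counts */
    return (total_mults + num_multipliers - 1) / num_multipliers;
}

/* Smallest number of multipliers that finishes total_mults
 * multiplications within cycle_budget cycles, under the same model. */
size_t multipliers_for_budget(size_t total_mults, size_t cycle_budget)
{
    assert(cycle_budget > 0);
    return (total_mults + cycle_budget - 1) / cycle_budget;
}
```

With the two thousand multiplications from the example, this model gives 2000 cycles for a single shared multiplier, 1 cycle for two thousand multipliers, and 100 cycles for twenty. Each step along that line trades chip area against latency, which is why the tools need guidance about which point you actually want.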

Using one of these tools, you typically identify the area of your application that is the performance bottleneck. From that point, you code that portion according to whatever is acceptable to your ESL tool. Typically, this will be anything from a few lines to a few hundred lines of code. Most ESL tools take that code (and the directives you’ve embedded) and create a synthesizable hardware block that you drop into your hardware design. The better tools let you explore the various architectural options and fine-tune the output to meet your particular design goals. Don’t let the small size of the loop code fool you. A tight little nested “for” loop can generate thousands to tens of thousands of lines of HDL and can easily build a hardware design with millions of gates.
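To give a feel for what choosing between architectural options means, the sketch below hand-writes the same dot product two ways in C. The first form maps naturally onto one shared multiplier; the second is unrolled by four with independent accumulators, which is roughly the shape that would suit four multipliers working side by side. This only illustrates the kind of restructuring involved. It is not the input or output format of any particular ESL tool, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sequential form: one MAC per iteration. */
int64_t dot_sequential(const int16_t *a, const int16_t *b, size_t n)
{
    assert(a != NULL && b != NULL);

    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int64_t)a[i] * b[i];
    }
    return acc;
}

/* Unrolled by four: the four MACs in the loop body do not depend on
 * each other, so hardware could evaluate them in the same cycle. */
int64_t dot_unrolled4(const int16_t *a, const int16_t *b, size_t n)
{
    assert(a != NULL && b != NULL);

    int64_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        acc0 += (int64_t)a[i]     * b[i];
        acc1 += (int64_t)a[i + 1] * b[i + 1];
        acc2 += (int64_t)a[i + 2] * b[i + 2];
        acc3 += (int64_t)a[i + 3] * b[i + 3];
    }
    /* leftover elements when n is not a multiple of four */
    for (; i < n; i++) {
        acc0 += (int64_t)a[i] * b[i];
    }
    /* combine the partial sums as a small adder tree */
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Both functions return the same result for the same inputs; what changes is how many multipliers the structure can keep busy at once, and therefore the area and latency of the hardware that would implement it.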

At this point, you have a microarchitecture (datapath plus control) that accelerates your algorithm as you want, but you still have a serious issue to deal with – how to keep that hardware busy and supplied with data. Often, the bottleneck shifts from computation to keeping the pipeline full. If your accelerator is only supplementing a larger application running on a conventional processor, you have a plethora of options on how to connect the two. Some processor architectures support the addition of custom instructions. Other approaches include asynchronous communication between processor and accelerator for control and shared memory access to pass data and results back and forth. There are almost as many approaches as there are design requirements.
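A back-of-the-envelope calculation shows why feeding the datapath becomes the hard part. The sketch below assumes the worst case, in which every operand of every MAC is fetched from memory with no reuse; real designs usually hold coefficients or recent samples in local registers and need far less. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Data bandwidth, in gigabytes per second, needed to sustain a given
 * MAC rate if every operand is fetched fresh (no local reuse):
 * GMACs * (operands per MAC) * (bytes per operand). */
double required_bandwidth_gbytes(double gmacs,
                                 unsigned operands_per_mac,
                                 unsigned bytes_per_operand)
{
    assert(gmacs >= 0.0);
    return gmacs * (double)operands_per_mac * (double)bytes_per_operand;
}
```

Under that assumption, 100 GMACs on two 16-bit operands per MAC works out to 400 gigabytes per second, which is why buffering and data reuse close to the multipliers matter as much as the multipliers themselves.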

The use of ESL tools has been increasing dramatically in recent years. A wide range of well-funded engineering teams in a variety of application areas have seen dramatic productivity improvements (claims go as high as 90% to 95% reduction in design time) for DSP algorithms moved into hardware when compared with conventional methods. The initial design time is only part of the story, however. Because your algorithm stays in a compact, architecture-free format, it’s easy to re-run the ESL tool and choose a different implementation if your design needs change in a future version. This kind of flexibility continues to pay dividends way down the road as your product line evolves.

ESL technology is far from finished. The tools on the market today continue to evolve, and the technology behind extracting parallel implementations from sequential algorithms is incredibly complex. It is likely that these tools will make substantial improvements in the coming years in quality of results, ease of use, and flexibility in the types of inputs they can take. If your embedded design has some component that’s pushing the envelope on performance, or if power consumption is out of control because you’re using a much faster processor than most of your application requires (except that one pesky routine…), you may want to investigate ESL synthesis.