Most of us engineers are at least closet hot-rodders. It’s in our DNA. No matter how good a contraption is from the factory, we just can’t resist the temptation to tweak a few things in our own special way, and often that’s all about speed.
FPGA design, it turns out, is a big ol’ blank canvas for hot-rodding. Even though we (fortunately) don’t have glossy convenience-store magazines adorned with scantily clad models standing next to the latest tricked-out dev boards, FPGAs have all the tools we need to rev our creative motors in the never-ending quest for that extra little bit of personalized performance.
But, where do we start? Do FPGAs have a set of go-to hop-ups? Is there a “chopping and channeling” baseline for programmable logic design?
It turns out the answer is “yes.” And, just to get you started, here are five tips for turning up the boost on your next project:
Architecture is Everything – The most important part of designing for maximum performance is the architecture you choose. While implementation optimization tricks may be able to give you ten or twenty percent here or there, choosing the right architecture for your logic design can sometimes buy you a factor of ten or twenty.
Architecture decisions come first, before you start pouring concrete with a bunch of carefully crafted RTL. Once you lay down a lot of HDL code, there’s little going back, as most of us don’t have the stomach to throw away thousands of lines of hard-earned logic design just because we came up with a slicker idea for our architecture.
While there is a popular myth that good RTL (particularly IP blocks) can be dragged and dropped from design to design without regard for the underlying technology, it just ain’t so – at least if you care about performance. You need to think through your whole problem space, considering design objectives like latency, throughput, and power consumption. Remember that designing for an FPGA is different from designing arbitrary logic. For example, creating structures that match the widths (or multiples) of your FPGA’s basic logic elements (LUTs) will yield much better results than building things one bit wider.
While the good old, tried and true combo of back-of-napkin mixed with Excel spreadsheets is still the most popular means of evaluating our architectures, high-level design tools like HLS (high-level synthesis) have actually come a long way and can prove to be extremely powerful allies in optimizing your architecture. Using these tools, you can describe your overall algorithm in a high-level language such as C/C++, and then use the tool to quickly explore a variety of architectural options. Usually, there is a wide range of potential solutions that trade off parallelism versus resource sharing, pipelining, latency versus throughput, etc. Even if you don’t use the HLS tool’s generated code as your final implementation, the architectural exploration capabilities can be invaluable in getting you on the right track.
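As a concrete (if hedged) illustration, here’s the flavor of C++ an HLS tool might consume – a toy four-tap FIR filter whose architecture the tool can fold onto one shared multiplier or unroll into four parallel ones. The function name, the coefficients, and the pragma spellings (shown as comments, in Vitis-HLS style) are all assumptions for illustration, not a recipe for any particular tool.

```cpp
#include <cstdint>

// Toy 4-tap FIR filter written in an HLS-friendly style. The same C++
// can be synthesized many ways: fully unrolled (four parallel
// multipliers, maximum throughput) or folded onto one shared multiplier
// (a quarter of the resources, four times the latency per sample).
int32_t fir4(int16_t sample, int16_t shift_reg[4]) {
    // Hypothetical coefficients, chosen only for illustration.
    static const int16_t coeff[4] = {3, -1, 4, 1};

    // Shift the new sample into the delay line.
    for (int i = 3; i > 0; --i) {
        shift_reg[i] = shift_reg[i - 1];
    }
    shift_reg[0] = sample;

    int32_t acc = 0;
    // #pragma HLS UNROLL    -> one multiplier per tap (go wide)
    // #pragma HLS PIPELINE  -> overlap successive samples
    for (int i = 0; i < 4; ++i) {
        acc += static_cast<int32_t>(shift_reg[i]) * coeff[i];
    }
    return acc;
}
```

Flipping between an unrolled and a folded schedule is exactly the parallelism-versus-resource-sharing tradeoff described above, and the HLS tool can report the area and timing of each option in minutes rather than after a full RTL rewrite.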
Know your Tools – The heart and soul of design performance, particularly in FPGA design, is the loop from synthesis through layout and timing analysis. Most synthesis tools have an enormous list of optimization options, and you could probably spend the rest of your life tweaking and tuning synthesis to get that last little bit of speed. Place and route is the same story – a bazillion knobs and dials that could keep your server farm busy for weeks looking for the combination that gives the best results for your particular design.
The best and most important weapon is therefore to know your tools well. Take the time to get acquainted with the various options and settings and to understand what they do. Remember that the behavior of optimization options tends to be very design specific. The trick that gave you 20% better results last time may do nothing at all on your next design, or may even make things worse.
It’s important to remember that synthesis by itself gives only a rough estimate of performance. If you really want to know how much negative slack is in that path, you’ll need to do timing analysis after layout so that parasitic delays are accurately factored in. Sometimes the story changes a lot after place and route, and the paths that synthesis was working so hard to optimize end up being less critical than others that it almost ignored.
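For intuition about what the timing analyzer is reporting, here’s a toy model of the slack arithmetic – my sketch, not any vendor’s API. Real tools derive the arrival times from the actual routed parasitics, which is why the numbers move after place and route.

```cpp
#include <algorithm>
#include <vector>

// Toy model of static timing analysis: for each path,
// slack = required_time - arrival_time. Negative slack means the path
// misses timing; the worst negative slack (WNS) is the minimum slack
// over all paths in the design.
struct Path {
    double arrival_ns;   // data arrival time (post-layout, with parasitics)
    double required_ns;  // requirement derived from the clock constraint
};

double worst_slack(const std::vector<Path>& paths) {
    double wns = 1e9;  // large positive sentinel
    for (const Path& p : paths) {
        wns = std::min(wns, p.required_ns - p.arrival_ns);
    }
    return wns;
}
```

The path that sets the WNS after layout is frequently not the one synthesis predicted, which is why timing must be closed on the post-route numbers.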
Synthesis and place-and-route are both very compute intensive, and the more computer power you have at your disposal, the more iterations you can attempt. Many FPGA design tool suites have built-in features that allow you to automatically execute a large number of tool runs using an array of servers. You can queue up a batch of runs with various tuning options and head home for the night. The next day you’ll have more data than you imagined possible waiting for you. It might even include the best implementation of your design.
Go Wide – For the past several generations of FPGAs, you may notice that the maximum clock frequency (Fmax) hasn’t changed all that much. Fmax jumped off the Moore’s Law train several nodes ago and really hasn’t gotten back on again. When the designers of FPGAs were faced with the hard tradeoffs, they seem to have decided that frequencies were high enough, and they’d take their next-node gains in power efficiency and density instead. So, while densities have continued on the exponential up-trend, Fmax has stalled.
What that means to us as designers is that we need to go wider instead of faster to get performance. We can take advantage of a much greater degree of parallelism, thanks to the vastly increased amount of logic available on today’s devices, rather than just trying to crank up the clock. Wider design has power benefits as well, as clocking logic more slowly almost always burns less juice.
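Here’s a hedged sketch of what “going wide” means, using plain C++ as stand-in pseudocode for hardware: each loop iteration models one clock cycle, and the four-lane version does the same total work in a quarter of the cycles, so the clock can run slower for the same throughput.

```cpp
#include <cstddef>
#include <cstdint>

// Narrow datapath: one sample per "clock" (loop iteration).
int64_t sum_narrow(const int32_t* data, size_t n) {
    int64_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += data[i];
    }
    return acc;
}

// Wide datapath: four samples per "clock". In an FPGA this is four
// parallel adders fed from a four-word-wide bus. Same answer, a quarter
// of the cycles. (n is assumed to be a multiple of 4 for brevity.)
int64_t sum_wide4(const int32_t* data, size_t n) {
    int64_t lane0 = 0, lane1 = 0, lane2 = 0, lane3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        lane0 += data[i];
        lane1 += data[i + 1];
        lane2 += data[i + 2];
        lane3 += data[i + 3];
    }
    // One extra reduction stage combines the lanes at the end.
    return (lane0 + lane1) + (lane2 + lane3);
}
```

The cost of width is resources and a final reduction stage; the payoff is that Fmax stops being the bottleneck.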
So, back up in step one, when you’re considering your architecture, think about going wider rather than faster.
Harness the Hard – Today’s FPGAs come with a wealth of hardened resources. In just about every case, you’ll get dramatically better performance (not to mention power efficiency) by using the hardened assets rather than LUT-based implementations. The big win here is usually finding ways to harness the “DSP” blocks (mostly hard-wired multiply-accumulate units). If FPGA hot-rodding has a supercharger, it is the DSP block. DSP blocks perform significantly better than the same functions built from LUTs, but you have to design with your FPGA’s specific DSP units in mind. Typically, they are fixed-width arithmetic units, and you may have to trade off some precision or do some creative quantization of fixed-point functions in order to take full advantage of them.
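As a hedged sketch of that quantization step, here’s how a floating-point coefficient might be squeezed into a fixed-point format narrow enough for a hard multiplier. The Q1.15 format and 16-bit width are illustrative choices to keep the C++ types simple, not a property of any particular DSP block – many blocks take 18-bit or wider operands.

```cpp
#include <cmath>
#include <cstdint>

// Quantize a floating-point coefficient into 16-bit Q1.15 fixed point
// so it fits a fixed-width hard multiplier. Values outside [-1, 1)
// saturate to the representable range.
int16_t quantize_q15(double x) {
    double scaled = std::round(x * 32768.0);  // scale by 2^15
    if (scaled > 32767.0) scaled = 32767.0;   // saturate high
    if (scaled < -32768.0) scaled = -32768.0; // saturate low
    return static_cast<int16_t>(scaled);
}

// Fixed-point multiply: the product of two Q1.15 values is Q2.30;
// shifting right by 15 returns it to Q1.15. In hardware this maps onto
// the DSP block's multiplier plus a wired shift, at the cost of a small
// precision loss relative to floating point.
int16_t mul_q15(int16_t a, int16_t b) {
    int32_t prod = static_cast<int32_t>(a) * static_cast<int32_t>(b);
    return static_cast<int16_t>(prod >> 15);
}
```

The engineering work is in choosing the format: enough fractional bits that the quantization error stays below your accuracy budget, narrow enough that each product still fits one hard multiplier.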
Other hardened resources come in handy as well – particularly in the IO realm. Many chips have built-in Ethernet, PCIe, and other useful units that are far superior to other options for those functions.
Use the Whole Chip – There are no points awarded for the resources on your FPGA that you don’t use. While most FPGAs are difficult or even impossible to populate 100% (place and route will roll over and play dead), there are usually creative things you can do with those extra LUTs, DSPs, and IOs that are way better than letting them sit there just leaking current.
While your design tools dutifully work to give you the most space-efficient implementation of your design, sometimes you can get better performance by duplicating part of the circuit. Or, you can usually turn those unused LUTs into more on-chip memory. Saving a trip off-chip for storing some of that critical data can give a huge performance boost, and if the LUTs were gonna be just sitting there anyway, why not? If you have enough left-over space on the FPGA, consider integrating some other function from your board. It’ll reduce your BOM, even if it doesn’t actually lead to higher performance.
It’s All About You – Most of us spend far too much time analyzing and dissecting datasheets for FPGAs. The truth is, your skill as a designer has a much bigger impact on the performance of your design than any datasheet differences you’ll find. Put down the datasheets, pick a chip, and get to work. The time you spend learning and tuning will pay much bigger dividends than all that worrying about whether one device is slightly better than another.