
Go-Fast FPGA Design

Helpful Hot-Rodding Hints

Most of us engineers are at least closet hot-rodders. It’s in our DNA. No matter how good a contraption is from the factory, we just can’t resist the temptation to tweak a few things in our own special way, and often that’s all about speed. 

FPGA design, it turns out, is a big ol’ blank canvas for hot-rodding. Even though we (fortunately) don’t have glossy convenience-store magazines adorned with scantily clad models standing next to the latest tricked-out dev boards, FPGAs have all the tools we need to rev our creative motors in the never-ending quest for that extra little bit of personalized performance.

But, where do we start? Do FPGAs have a set of go-to hop-ups? Is there a “chopping and channeling” baseline for programmable logic design? 

It turns out the answer is “yes.” And, just to get you started, here are five tips for turning up the boost on your next project: 

Architecture is Everything – The most important part of designing for maximum performance is the architecture you choose. While implementation optimization tricks may be able to give you ten or twenty percent here or there, choosing the right architecture for your logic design can sometimes buy you a factor of ten or twenty. 

Architecture decisions come first, before you start pouring concrete with a bunch of carefully crafted RTL. Once you lay down a lot of HDL code, there’s little going back, as most of us don’t have the stomach to throw away thousands of lines of hard-earned logic design just because we came up with a slicker idea for our architecture.

While there is a popular myth that good RTL (particularly IP blocks) can be dragged and dropped from design to design without regard for the underlying technology, it just ain’t so – at least if you care about performance. You need to think through your whole problem space, considering design objectives like latency, throughput, and power consumption. Remember that designing for an FPGA is different from designing arbitrary logic. For example, creating structures that match the widths (or multiples of the widths) of your FPGA’s basic logic elements (LUTs) will yield much better results than building things one bit wider.

While the good old, tried and true combo of back-of-napkin mixed with Excel spreadsheets is still the most popular means of evaluating our architectures, high-level design tools like HLS (high-level synthesis) have actually come a long way and can prove to be extremely powerful allies in optimizing your architecture. Using these tools, you can describe your overall algorithm in a high-level language such as C/C++, and then use the tool to quickly explore a variety of architectural options. Usually, there is a wide range of potential solutions that trade off parallelism versus resource sharing, pipelining, latency versus throughput, etc. Even if you don’t use the HLS tool’s generated code as your final implementation, the architectural exploration capabilities can be invaluable in getting you on the right track.
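To make that concrete, here is the kind of plain-C algorithm description an HLS tool can consume, sketched as a generic FIR filter. The tap count is arbitrary, and the pragmas (shown as comments) follow the conventions of Xilinx’s Vitis HLS; other tools use similar directives to steer the architectural trade-offs.

```c
#include <stdint.h>

#define TAPS 8  /* arbitrary tap count, for illustration only */

/* Behavioral FIR filter: the sort of plain-C description an HLS tool
 * can turn into hardware. The pragmas steer architecture exploration:
 * PIPELINE trades area for throughput, UNROLL trades area for latency. */
int32_t fir(const int16_t sample, const int16_t coeff[TAPS],
            int16_t shift_reg[TAPS])
{
    int32_t acc = 0;
    /* #pragma HLS PIPELINE II=1   -- accept one sample per clock     */
    for (int i = TAPS - 1; i > 0; i--) {
        /* #pragma HLS UNROLL      -- compute all taps in parallel    */
        shift_reg[i] = shift_reg[i - 1];
        acc += (int32_t)shift_reg[i] * coeff[i];
    }
    shift_reg[0] = sample;
    acc += (int32_t)sample * coeff[0];
    return acc;
}
```

Flipping those two pragmas on and off (and varying II) is exactly the kind of quick what-if exploration that would take days to try by rewriting RTL.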

Know your Tools – The heart and soul of design performance, particularly in FPGA design, is the loop from synthesis through layout and timing analysis. Most synthesis tools have an enormous list of optimization options, and you could probably spend the rest of your life tweaking and tuning synthesis to get that last little bit of speed. Place and route is the same story – a bazillion knobs and dials that could keep your server farm busy for weeks looking for the combination that gives the best results for your particular design.

The best and most important weapon is therefore to know your tools well. Take the time to get acquainted with the various options and settings and to understand what they do. Remember that the behavior of optimization options tends to be very design specific. The trick that gave you 20% better results last time may do nothing at all on your next design, or may even make things worse.

It’s important to remember that synthesis by itself gives only a rough estimate of performance. If you really want to know how much negative slack is in that path, you’ll need to do timing analysis after layout so that parasitic delays are accurately factored in. Sometimes the story changes a lot after place and route, and the paths that synthesis was working so hard to optimize end up being less critical than others that it almost ignored. 
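The arithmetic behind that slack number is simple; what changes after place and route is the routing term. A minimal sketch, with made-up delay values for illustration:

```c
/* Setup slack for a register-to-register path:
 *   required_time = clock_period - setup_time
 *   slack         = required_time - data_arrival_time
 * Negative slack means the path fails timing. The numbers fed to this
 * function are illustrative; real values come from the post-route
 * timing database, not from synthesis estimates. */
double setup_slack(double clock_period_ns, double setup_ns,
                   double clk_to_q_ns, double logic_ns, double route_ns)
{
    double arrival  = clk_to_q_ns + logic_ns + route_ns;
    double required = clock_period_ns - setup_ns;
    return required - arrival;
}
```

With a 4 ns clock, 0.2 ns setup, 0.4 ns clock-to-Q, and 2.4 ns of logic, a path whose routing synthesis guessed at 0.5 ns shows +0.5 ns of slack; if routing actually comes in at 1.3 ns after place and route, that same path flips to –0.3 ns.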

Synthesis and place-and-route are both very compute intensive, and the more computer power you have at your disposal, the more iterations you can attempt. Many FPGA design tool suites have built-in features that allow you to automatically execute a large number of tool runs using an array of servers. You can queue up a batch of runs with various tuning options and head home for the night. The next day you’ll have more data than you imagined possible waiting for you. It might even include the best implementation of your design. 

Go Wide – For the past several generations of FPGAs, you may notice that the maximum clock frequency (Fmax) hasn’t changed all that much. Fmax jumped off the Moore’s Law train several nodes ago and really hasn’t gotten back on again. When the designers of FPGAs were faced with the hard tradeoffs, they seem to have decided that frequencies were high enough, and they’d take their next-node gains in power efficiency and density instead. So, while densities have continued on the exponential up-trend, Fmax has stalled. 

What that means to us as designers is that we need to go wider instead of faster to get performance. We can take advantage of a much greater degree of parallelism, thanks to the vastly increased amount of logic available on today’s devices, rather than just trying to crank up the clock. Wider design has power benefits as well, as clocking logic more slowly almost always burns less juice.
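The arithmetic of going wide is worth spelling out. Throughput is lanes times clock rate, so a modest drop in frequency is easily bought back with parallelism; the frequencies below are illustrative, not from any datasheet:

```c
/* Back-of-the-envelope model of "go wide, not fast": a datapath that
 * processes `lanes` samples per clock delivers lanes * fmax samples
 * per second. */
double throughput_msps(unsigned lanes, double fmax_mhz)
{
    return (double)lanes * fmax_mhz;
}

/* One lane straining at 500 MHz:       500 Msamples/s
 * Eight lanes at a relaxed 300 MHz:   2400 Msamples/s
 * Nearly 5x the throughput, with easier timing closure and better
 * energy per sample. */
```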

So, back up in step one, when you’re considering your architecture, think about going wider rather than faster.

Harness the Hard – Today’s FPGAs come with a wealth of hardened resources. In just about every case, you’ll get dramatically better performance (not to mention power efficiency) by using the hardened assets rather than LUT-based implementations. The big win here is usually finding ways to harness the “DSP” blocks (mostly hard-wired multiply-accumulate units). If FPGA hot-rodding has a supercharger, it is the DSP block. These blocks perform their functions significantly faster than LUT-based logic can, but you have to design with your FPGA’s specific DSP units in mind. Typically, they are fixed-width arithmetic units, and you may have to trade off some precision or do some creative quantization of fixed-point functions in order to take full advantage of them.

Other hardened resources come in handy as well – particularly in the IO realm. Many chips have built-in Ethernet, PCIe, and other useful units that are far superior to other options for those functions.

Use the Whole Chip – There are no points awarded for the resources on your FPGA that you don’t use. While most FPGAs are difficult or even impossible to populate 100% (place and route will roll over and play dead), there are usually creative things you can do with those extra LUTs, DSPs, and IOs that are way better than letting them sit there just leaking current.

While your design tools dutifully work to give you the most space-efficient implementation of your design, sometimes you can get better performance by duplicating part of the circuit. Or, you can usually turn those unused LUTs into more on-chip memory. Saving a trip off-chip for storing some of that critical data can give a huge performance boost, and if the LUTs were gonna be just sitting there anyway, why not? If you have enough left-over space on the FPGA, consider integrating some other function from your board. It’ll reduce your BOM, even if it doesn’t actually lead to higher performance.

It’s All About You – Most of us spend far too much time analyzing and dissecting data sheets for FPGAs. The truth is, your skill as a designer has a much bigger impact on the performance of your design than any datasheet differences you’ll find. Put down the datasheets, pick a chip, and get to work. The time you spend learning and tuning will pay much bigger dividends than all that worrying about whether one device is slightly better than another.

9 thoughts on “Go-Fast FPGA Design”

  1. A few synthesis-specific things I neglected to mention in the article:
    – Multiple synthesis tools: If you have access to more than one synthesis tool, use all of them. Even the lousy one that seems like it never works will win on some designs. The only way to know which tool will give the best results on any given design is to try them all.
    – Set reasonable constraints: Resist the urge to set unrealistically high frequency constraints in an attempt to trick the tools into doing extra optimization. It WON’T WORK! Your tool will start doing all kinds of crazy things like duplicating logic in a failed attempt to reach impossible constraints. It will run all night and all day, and then it will fail. You want to give the most accurate constraints possible – and pay particular attention to properly flagging multi-cycle, multi-clock, and false paths. That will also save your tools a lot of futile work.
    – Use “fast” mode: Many synthesis tools have a quick-and-dirty mode that you can use for the majority of your design and debug cycle. Fast mode is your friend. You don’t need to burn hours and hours trying to optimize timing until the end of your design process. Faster iterations during initial coding and debug means more time to optimize when it counts.
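    For the multi-cycle and false-path point, the industry-standard way to flag them is SDC-style constraints. A minimal sketch; the cell and clock names here are made up for illustration:

    ```tcl
    # A datapath that is only sampled every other cycle:
    set_multicycle_path -setup 2 -from [get_cells u_ctrl/acc_reg*] \
                                 -to   [get_cells u_ctrl/out_reg*]
    set_multicycle_path -hold  1 -from [get_cells u_ctrl/acc_reg*] \
                                 -to   [get_cells u_ctrl/out_reg*]

    # An asynchronous clock-domain crossing handled by a synchronizer:
    set_false_path -from [get_clocks sys_clk] -to [get_clocks mem_clk]
    ```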

    What tips would you add to the list?

  2. I’d add “Make room for heatsinks.” Especially if you’re packing the whole FPGA with logic and cranking up the clock, you will need help taking the heat from the chip. I’ve never regretted allowing extra room for a big heatsink, including board-level mounting holes.

