Timing is Everything

Timing closure is the not-so-fine-print of FPGA design.

PowerPoint presentations paint the process as almost trouble free. FPGA design is simple, right? You just code up some HDL, drop it into the vendor-supplied tool suite, press the magic button – and zzzzzip! Your dev board will spring to life – blinking LEDs and detecting button presses with glee and aplomb. You even try it with the supplied sample code. Yep, sure enough. It’s like microwaving a burrito. Pop off the wrapper, run it through the process, and it’s ready to eat.

Emboldened, you embark on your first “real” design work. This takes some time, of course. You select from a nice assortment of pre-designed IP blocks, stitch them together with the vendor-supplied whizzy-GUI tool, and things are lookin’ good. You run that portion of your design through the tools and you are still on track. Except for a couple of things you hooked up wrong between blocks, the miracle of field-programmable custom logic is your oyster.

The only thing you need to do now is finish up your own custom block. Since that requires you to write and debug HDL, you saved that for last. Software safety goggles up – in you go! HDL can be a bit frustrating, and it takes some time to get all your signals and nets sorted out, your entities all properly architectured, and your processes all processing. Once you morph your brain into the world of non-sequential programming, things come into focus. You make friends with your simulator and together you conquer the beast. Your custom code is ready to roll! Just one final time through the tools, and… wait, what’s this? 1,235 timing violations?!?

Uh Oh.

Welcome to the dark side – the seamy underbelly of FPGA design – the place the FPGA vendors would just as soon you never knew about – at least until after you have signed that final PO. As much as people want to paint FPGA design as a nice, repeatable, predictable process, that’s true only in the most benign cases (meaning – not in a design where you’re trying to do any real, interesting work.) When you get just about any FPGA filled to a reasonable percentage of its capacity, you’re likely to run into at least some timing problems.

When you write RTL (as the acronym “Register Transfer Level” implies), you are describing your design in terms of what happens between registers. “What happens,” of course, is combinational logic. In FPGAs, that combinational logic is implemented as networks of look-up tables (LUTs). Data is waiting patiently at the output of some registers. It flows at top speed through a combinational network of LUTs until a stable result is achieved and is presented to the inputs of the next register. Then, the next clock edge arrives and clocks that result safely into the register.

That is how it is supposed to work, anyway.

What often happens in practice is that the data takes too long to propagate through our combinational logic, and it isn’t ready by the time the clock edge arrives to clock the data into the next register. Instead, what gets clocked in is some transient result that we almost certainly will not like. Congratulations – you have earned your first timing violation. Luckily, the design tools can estimate how long data will take to wind its way through that combinational logic and arrive at those inputs, and they will let us know that our design has a problem. They simply add up the expected delay from all those LUTs (and the wiring that interconnects them) and compare that with the clock period. If the combinational delay is less than the clock period, the difference is called the “slack.” Slack is good. When the combinational delay is more than the clock period, we have “negative slack.” Negative slack is bad.
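The slack arithmetic is simple enough to sketch in a few lines of Python. This is only an illustration of the sum-and-compare idea described above. The function names and all of the delay numbers are made up for the example, and a real timing engine also accounts for things like register setup time and clock skew, which are left out here.

```python
def path_delay_ns(lut_delays_ns, wire_delays_ns):
    """Total combinational delay: every LUT plus every wire segment on the path."""
    return sum(lut_delays_ns) + sum(wire_delays_ns)


def slack_ns(clock_period_ns, lut_delays_ns, wire_delays_ns):
    """Positive: data arrives with time to spare. Negative: timing violation."""
    return clock_period_ns - path_delay_ns(lut_delays_ns, wire_delays_ns)


# Illustrative numbers only, not taken from any real device.
clock_period_ns = 10.0  # a 100 MHz clock

# Three LUT levels with short wires: 1.5 ns of logic + 4.5 ns of routing.
ok_path = slack_ns(clock_period_ns, [0.5, 0.5, 0.5], [1.0, 1.5, 2.0])

# Four LUT levels with long wires: 2.0 ns of logic + 9.5 ns of routing.
bad_path = slack_ns(clock_period_ns, [0.5, 0.5, 0.5, 0.5], [2.0, 2.5, 3.0, 2.0])

print(f"ok_path slack:  {ok_path:+.1f} ns")
print(f"bad_path slack: {bad_path:+.1f} ns")
```

With these assumed numbers, the first path has 4.0 ns to spare and the second misses the clock edge by 1.5 ns.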

So, let’s say we have some of this creepy negative slack. What caused it, and what do we do about it? Well, do you see up in that last paragraph how “(and the wiring that interconnects them)” is in parentheses? That’s completely wrong. That sentence should have been written “The delay from all those LUTs – BUT ESPECIALLY THE WIRING THAT INTERCONNECTS THEM!!!” There. Now we’ve told the truth. We would have done it that way at first, but we didn’t want to scare you. Today, the interconnect accounts for a huge portion of the delay in our combinational logic. The distance between the LUTs in a particular combinational network has a huge impact on the resulting delay after the design is routed. When we write our RTL, we tend to think only of the number of levels of combinational logic we have stacked between registers, but the way that logic is ultimately placed by the software will have the dominant impact on timing. If the tools place things too far apart, the combinational delay goes up and we risk negative slack.
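To see why placement matters so much, here is a toy model in Python. It assumes a fixed delay per LUT and a wire delay proportional to the grid distance between consecutive LUTs. Both constants and the grid coordinates are invented for illustration. Real routing delay depends on the device's actual interconnect, not on a simple distance formula.

```python
LUT_DELAY_NS = 0.5             # assumed delay through one LUT
WIRE_DELAY_PER_TILE_NS = 0.25  # assumed delay per grid tile of routing


def placed_path_delay_ns(placement):
    """Delay of a chain of LUTs given their (column, row) grid positions."""
    lut_delay = LUT_DELAY_NS * len(placement)
    wire_delay = 0.0
    # Each hop costs time in proportion to its Manhattan distance.
    for (x0, y0), (x1, y1) in zip(placement, placement[1:]):
        wire_delay += WIRE_DELAY_PER_TILE_NS * (abs(x1 - x0) + abs(y1 - y0))
    return lut_delay + wire_delay


# The same four levels of logic, placed two different ways.
tight = [(0, 0), (1, 0), (1, 1), (2, 1)]
spread = [(0, 0), (8, 2), (1, 9), (12, 4)]

tight_delay = placed_path_delay_ns(tight)
spread_delay = placed_path_delay_ns(spread)

print(f"tight placement:  {tight_delay:.2f} ns")
print(f"spread placement: {spread_delay:.2f} ns")
```

Both placements have exactly four levels of logic, yet in this model the spread-out version is more than four times slower, and almost all of the difference is wire.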

The people who make FPGA tools put an incredible amount of effort into handling timing automatically. Synthesis tools have built-in timing engines that guess at the LUT and routing delays and try to optimize the logic they’re generating. They can do things like minimize the logic in different ways to reduce the number of levels, replicate logic so that groups of LUTs can be packed closer together for better timing, or even move some logic across register boundaries to balance delay in adjacent clock cycles. Without any information on how the design will be placed and routed, however, synthesis is mostly shooting in the dark. All it can do at that point is assume all routing delays will be more or less the same and try to minimize the overall delay on what appear to be the most critical paths.
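The idea of moving logic across a register boundary can be illustrated with a small Python sketch. It treats a path as a chain of logic elements with one register somewhere in the middle, and tries every register position to find the most balanced split. The function names and delays are hypothetical, and a real retiming engine has to preserve the design's behavior across fan-out, resets, and feedback, none of which this toy model considers.

```python
def worst_stage_delay_ns(delays_ns, split):
    """Slowest of the two stages when the register sits after `split` elements."""
    return max(sum(delays_ns[:split]), sum(delays_ns[split:]))


def best_register_position(delays_ns):
    """Try every position for the middle register and keep the most balanced one."""
    return min(
        range(1, len(delays_ns)),
        key=lambda split: worst_stage_delay_ns(delays_ns, split),
    )


chain = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]  # six logic elements, made-up delays
as_written = 5                          # five elements before the register, one after
retimed = best_register_position(chain)

print(f"as written: worst stage {worst_stage_delay_ns(chain, as_written):.1f} ns")
print(f"retimed:    worst stage {worst_stage_delay_ns(chain, retimed):.1f} ns")
```

As written, one clock cycle has to absorb 5.0 ns of logic while the next one loafs along with 1.0 ns. Moving the register to the middle leaves both stages at 3.0 ns, so the same logic can tolerate a shorter clock period.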

If synthesis has access to some placement or pre-placement information, it can do a better job optimizing the logic. In many modern tool flows, synthesis and place-and-route are closely integrated and work in iterative cycles, with rapid placement information being fed back to logic synthesis where critical paths are re-optimized. Placement and routing still have the final say, of course, as the most accurate estimate of timing is available only after the design has been completely placed and routed. Here too, the tools have built-in iteration cycles where paths that have been flagged as having negative slack are re-optimized, re-placed, re-routed, and re-timed repeatedly.

Unfortunately, when timing is tight in a design, this becomes a bit like the proverbial squeezing on a water balloon. The tools optimize one path – only to cause a new violation somewhere else. They fix that one and the fix causes three more to pop up. This iterative cycle can get into a mode where it does not converge, and we never get “timing closure.” While all this is happening, we are off quietly sipping our triple-nonfat-latte and wondering why the tool has been running for 24 hours. Is it hung? We check the process stats again. Hmmm… CPU usage near 100%, things seem to be still happening – we’ll give it another hour before we kill it.

When the tool finally gives up and hands us a timing violation report, there are still a lot of things we can do. First, and perhaps most important, is to check our timing constraints. In most complex designs, there are several situations that can throw the timing engine off and thus prevent the synthesis and place-and-route tools from doing their best work. For starters, some paths between registers may not need to finish within one clock cycle. We may have written the RTL so that some data has several clock cycles to flow through a big pile of combinational logic and arrive at the next register before we need to clock it in and use it for real. We need to identify those multi-cycle paths in our timing constraints. If we don’t, the tools may work really hard trying to make that logic complete within one clock cycle. As a result, they may even do ugly things to other paths in our design trying to optimize this bogus multi-cycle path (so even if your multi-cycle path isn’t one of your timing violations, it may be in the background quietly causing some of the others).
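The effect of declaring a multi-cycle path is easy to model. The Python sketch below simply gives the path more than one clock period to settle before computing slack. The function name and the numbers are illustrative assumptions. In SDC-style constraint files this kind of exception is typically expressed with a command such as set_multicycle_path, and the details of how to write it vary by tool.

```python
def multicycle_slack_ns(path_delay_ns, clock_period_ns, cycles_allowed=1):
    """Slack when the path is given `cycles_allowed` clock periods to settle."""
    return clock_period_ns * cycles_allowed - path_delay_ns


clock_period_ns = 10.0          # a 100 MHz clock
big_multiplier_delay_ns = 18.0  # made-up delay for a large block of logic

# Without the constraint, the tool assumes one cycle and sees a violation.
unconstrained = multicycle_slack_ns(big_multiplier_delay_ns, clock_period_ns)

# With the path declared as two cycles, the same logic has time to spare.
constrained = multicycle_slack_ns(
    big_multiplier_delay_ns, clock_period_ns, cycles_allowed=2
)

print(f"treated as single-cycle: {unconstrained:+.1f} ns")
print(f"declared as two-cycle:   {constrained:+.1f} ns")
```

Nothing about the logic changed between the two calls. The only difference is what we told the timing engine, which is exactly why a missing constraint can send the tools off chasing a problem that does not exist.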

Some paths in our design may feed results from one clock domain into another. Imagine how the poor timing tools feel about that! If you don’t properly identify those paths for the tools, they can drive themselves crazy trying to understand and optimize the weird timing that can happen across domains. Save your tools some time on the therapists’ couch. Give them the constraints they need to identify paths that cross clock domains. Similarly, some paths identified by the timing engine may never happen in real life because they’d require combinations of data that will never occur. These “false paths” can ruin your tools’ day as well. Be kind to your tools and to yourself. Spend some time getting your timing constraints and your RTL right so that timing optimization can do its job.
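As a rough sketch of what these exceptions do to a timing report, the Python below filters a list of paths down to the violations that actually deserve attention. The TimingPath class, the path names, and the clock names are all hypothetical. The sketch also assumes that cross-domain paths are handled separately by proper synchronizing logic, so dropping them from the single-clock analysis is safe.

```python
from dataclasses import dataclass


@dataclass
class TimingPath:
    name: str
    launch_clock: str   # clock of the register that sends the data
    capture_clock: str  # clock of the register that receives it
    slack_ns: float
    is_false_path: bool = False  # marked by the designer as never exercised


def real_violations(paths):
    """Negative-slack paths that stay in one clock domain and are not false paths."""
    return [
        p for p in paths
        if p.slack_ns < 0
        and p.launch_clock == p.capture_clock
        and not p.is_false_path
    ]


# Made-up report entries for illustration.
paths = [
    TimingPath("alu_carry_chain", "clk_core", "clk_core", -0.8),
    TimingPath("uart_rx_to_core", "clk_uart", "clk_core", -3.2),
    TimingPath("test_mux_bypass", "clk_core", "clk_core", -1.1, is_false_path=True),
    TimingPath("fifo_write_ptr", "clk_core", "clk_core", 0.6),
]

violations = [p.name for p in real_violations(paths)]
print(violations)
```

Three of the four paths show negative slack, but only one of them is a real problem once the domain crossing and the false path are identified. Without those constraints, the tools would spend their effort on the other two.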

There are some commercial tools available that can help with this process. Both Atrenta and Blue Pearl have offerings that claim to be able to analyze your RTL design and give you all kinds of useful information – including a careful check of your timing constraints.

If your constraints are right and you still have timing errors, there are still plenty of options at your disposal. Most FPGA design tools have a dizzying number of (often cryptic) optimization options that you can tweak and adjust to try to get timing closed automatically. These include increased levels of “effort” in iterative timing closure, running repeated placement optimizations hoping for a “better” one to happen, re-optimizing the logic in various ways to try to get the magic answer, and increased replication and/or re-timing across register boundaries to try to shoehorn that last little bit of logic in between the boundaries.
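The "run it again and hope for a better placement" approach amounts to a loop over starting seeds. The Python sketch below shows the shape of such a sweep. The fake_place_and_route function is purely a stand-in that returns a made-up worst slack for each seed. In practice that step would be a full run of your vendor's tools, and the way to set a placement seed differs from tool to tool.

```python
import random


def fake_place_and_route(seed):
    """Stand-in for a real tool run: returns a made-up worst slack in ns."""
    rng = random.Random(seed)
    return round(rng.uniform(-1.0, 0.5), 3)


def sweep_seeds(seeds):
    """Run once per seed and report which seed gave the best worst-case slack."""
    results = {seed: fake_place_and_route(seed) for seed in seeds}
    best_seed = max(results, key=results.get)
    return best_seed, results


best_seed, results = sweep_seeds(range(1, 9))

for seed, slack in results.items():
    marker = "  <- best" if seed == best_seed else ""
    print(f"seed {seed}: worst slack {slack:+.3f} ns{marker}")
```

The design choice here is to compare runs by their worst slack, since a single failing path is enough to keep timing from closing. Each iteration of a real sweep costs a full place-and-route run, which is where all that CPU time goes.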

Even after exhausting the tool options and a lot of CPU time, sometimes our designs won’t close. Of course, we can go back and manually re-structure our RTL – trying to reduce the levels of logic in critical areas. Sometimes, we can manually pre-place or “floorplan” our design and give the tools enough of a hint that they can close timing automatically (although manual floorplanning can sometimes cause more timing problems than it solves).

There are some other options that might seem obvious, but that many people fail to consider. Sometimes, we choose clock frequencies somewhat arbitrarily. Do all of your clocks really need to be running as fast as they are? Sometimes slowing just one (and not necessarily the one with the timing violations) can put enough leeway back into your design to allow timing closure to complete. Of course, the FPGA vendors would be happy to sell you a faster/more expensive part as well. Moving up a speed grade has been the last-chance option for many designs through the years. Even a larger part will often solve timing problems. When placement and routing are having a difficult time fitting everything on the device and getting it routed, timing suffers. When there are more resources available on the chip, the tools have more options for creating shorter paths.
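The relationship between the slowest path and the fastest workable clock is just a reciprocal, sketched here in Python. The function names and the 12.5 ns delay are assumptions for illustration, and as elsewhere the model ignores setup time and clock skew.

```python
def max_frequency_mhz(critical_path_delay_ns):
    """Fastest clock the slowest path can support (a period of 1 ns is 1000 MHz)."""
    return 1000.0 / critical_path_delay_ns


def meets_timing(clock_mhz, critical_path_delay_ns):
    """True when the requested clock is no faster than the critical path allows."""
    return clock_mhz <= max_frequency_mhz(critical_path_delay_ns)


critical_path_delay_ns = 12.5  # made-up worst path after place and route

fmax = max_frequency_mhz(critical_path_delay_ns)
print(f"fmax: {fmax:.1f} MHz")
print("100 MHz ok?", meets_timing(100.0, critical_path_delay_ns))
print(" 80 MHz ok?", meets_timing(80.0, critical_path_delay_ns))
```

If the 100 MHz target was picked somewhat arbitrarily, dropping to 80 MHz closes timing in this example without touching the RTL or the part.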

Altera, Xilinx, and the other FPGA vendors have rich libraries of white papers that can help you with timing closure. These are definitely worth a read as the tricks of the tools vary widely from vendor to vendor. With experience, you can master timing closure with your vendors’ tools.

Interestingly, timing problems occur most often in hand-written designs. We like to think of manually writing RTL as being the ultimate performance weapon, but we humans often can’t keep everything in our heads at once. RTL that has been automatically generated by IP generators or by higher-level design tools (such as high-level synthesis tools) is usually better behaved when it’s time to place and route, and the occurrence of timing violations in pre-written or auto-generated code is much lower than in code we have written ourselves.

One new vendor has also taken some interesting steps to try to engineer away the timing closure problem altogether. Tabula’s “spacetime” architecture (which we have discussed before but is far too complex to attempt to explain here) works in conjunction with Tabula’s specialized design tools to eliminate timing problems by effectively moving timing boundaries around. In addition to the “user” clocks that we have in our design, Tabula’s devices have another, much faster clock operating in the background. If a combinational logic path in our design doesn’t produce an answer in time for our clock edge to catch it, Tabula’s tools can effectively “fudge” (our word, not theirs) the clock edge so that our combinational logic has time to complete. This brings up the question – if a clock boundary moves in the forest, and there’s nobody there to see it…

With today’s enormous FPGA designs (and tomorrow’s even more enormous ones) it isn’t likely that the problem of timing closure will go away any time soon. Although tools have improved dramatically over the years, the difficulty of our design challenges has increased as well. If you’re going to be doing FPGA design, it pays to become an expert on timing closure with the tools and devices of your primary FPGA supplier. Your project success will probably end up depending on it.