Clock Watching

In the mythical “good old days” when many FPGA designs were nothing more than simple state machines, clocking was simple. Your average design had a single ticker oscillating merrily away at single-digit megahertz. Skew was something you did to vegetables when barbecuing shish kabobs, gating was an activity that applied to upscale housing developments, false paths were something you only ran into while hiking, and derived clocks were the two-dollar wristwatches you gave away at trade shows with your company logo on them.

Now that the revolution has come, new alarms are starting to sound. Clocks are no longer to be taken for granted. The division and subdivision of time that creates the choreography of synchronous design is an elaborate symphony-on-silicon with a skilled designer as composer and arranger. While old-school FPGA work was an Fmax drag race with the singular goal of reaching the maximum frequency your device could muster, today’s design is a trapeze act where subtle and sensitive timing adjustments mean the difference between success and failure.

When it comes to clocking, ASIC design has always been a blank canvas. Anything was possible, and everything was attempted. Any number of clock lines could be routed to any number of destinations, and each clock could be subdivided, multiplied, re-phased, gated, inverted, or aligned. As a designer, you were free to create as many problems as you were willing to solve. ASIC timing analysis tools dutifully reported your progress, and buffering, re-placement and tuning would eventually lead to a solution where the data mostly arrived ahead of the clock edge. Messing with clocks was also a near panacea for power problems, so those trying to cut back on the juice generally spent a liberal portion of their design schedule putting tight controls on which flip-flops got flopped.

With FPGA design, however, the canvas has never been blank. Implementing complex clocking schemes can be an engineering exercise in patience, trying to balance geographically limited routing resources with clocking requirements for multiple domains, frequencies and phases. FPGA vendors have responded with outstanding, feature-laden solutions like Xilinx’s DCMs and Altera’s PLLs that can generate an almost infinite variety of multiplied, divided, phase-shifted, time-corrected clocks for both on- and off-chip use. In many ways, this simplifies the design process considerably when compared with the ASIC problem. For the novice (or the migrating ASIC designer), however, there are still a number of constraints and limitations to be aware of before jumping in and trying to synthesize your thousand-clock magnum opus of synchronicity.

The primary limiting factor in FPGA clocking is the scarcity of routing resources. You wouldn’t want to pay for an FPGA that included enough routing for every possible clock to be routed to every available logic element. FPGA vendors have to make compromises and tradeoffs, and that means they need to make some guesses about what constitutes reasonable maximums for clock lines. For most FPGA devices, they’ve decided that you won’t need more than 16 or 32 global clocks (clocks that can go anywhere on the device) and 4 or so regional clocks (clocks that are constrained to a specific local area on the chip). While this may still sound like a plethora of possibilities, you have to keep in mind that, unlike in an ASIC, every derived clock is essentially a separate line that needs to be routed and that consumes one of your precious routing resources. Also, these clock lines aren’t just for clocks, but for any high-fanout control lines like clock enables or synchronous and asynchronous resets.

When it comes to generating clock signals, the FPGA vendors’ clock blocks offer an intimidating list of configuration options. Altera’s PLLs and Xilinx’s DCMs have options to add enables and resets, compensate for on- and off-chip delays, switch between input clocks, multiply and divide (even by non-integer factors), arbitrarily phase shift, and control the duty cycle. These clock blocks also support advanced features like spread spectrum clocking to reduce electromagnetic interference (EMI). For asynchronous transfers of data between clock domains, vendors also provide FIFO macros that buffer the data transfer. Just consult the data sheets and application notes for your favorite FPGA to find out which capabilities that particular device offers.

If you’re doing straight FPGA design from scratch, you probably won’t have any difficulties if you exercise a modicum of moderation. The rich feature sets of the vendors’ clock blocks and the reasonable routing resources will handle all but the most pathological cases, and synthesis software will do most of the heavy lifting for you in inferring the right clocking and control resources. If you follow the vendors’ guidelines and use your tools correctly, it’s fairly easy to steer clear of clocking trouble.

“We’ve simplified and automated the clock problem significantly,” says Gael Paul, Director of Product Architecture at Synplicity. “Synplify-Pro does automatic assignment to clock buffers. It takes all high-fanout signals including clocks, enables, asynchronous set/reset, and synchronous set/reset and assigns them to the appropriate clock trees. It also tags those signals for place-and-route to use the built-in high-fanout nets. We also allow designers to force assignments through attributes. This can be useful if you have, for example, a high frequency clock that demands low-skew routing.”

If you’re moving an ASIC design to FPGA, however, or trying to prototype your ASIC in an FPGA for verification purposes, the problem can be much stickier. “ASIC designs moving to FPGA can generate a lot of unexpected clocks,” Paul continues. “From an FPGA perspective, every generated clock requires a separate clock net. Gated clocks [common in ASIC for power optimization] are an even bigger problem. A clock that’s gated at 250 locations in an ASIC is still only one clock. In FPGA, this turns into 250 separate clocks and you cannot do it. In Synplify-Pro and in Certify we automatically convert all parts of a gated clock to the correct clock, and convert the clock gating into enables. This is a completely automatic structural extraction that requires no user intervention and preserves the exact function of the original design while working with available FPGA resources.”
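To get a feel for why a gated clock can be replaced by a clock enable without changing behavior, here is a minimal Python sketch. It is not a description of how Synplify-Pro or Certify perform the conversion. It simply models one flip-flop two ways and compares the outputs, under the assumption (labeled in the code) that the enable only changes while the clock is low. All function names are hypothetical.

```python
def make_samples(per_cycle):
    """Expand (enable, data) pairs into half-cycle samples of (clk, en, d).

    Assumption: en and d change only during the low phase of the clock
    and are held stable through the high phase.
    """
    samples = []
    for en, d in per_cycle:
        samples.append((0, en, d))  # low phase: en and d settle
        samples.append((1, en, d))  # high phase: clk rises here
    return samples


def run_gated_clock(samples):
    """ASIC style: the flop is clocked by (clk AND en)."""
    q, prev_gclk, trace = 0, 0, []
    for clk, en, d in samples:
        gclk = clk & en
        if gclk and not prev_gclk:  # rising edge of the gated clock
            q = d
        prev_gclk = gclk
        trace.append(q)
    return trace


def run_clock_enable(samples):
    """FPGA style: the flop is clocked by clk and loads only when en is high."""
    q, prev_clk, trace = 0, 0, []
    for clk, en, d in samples:
        if clk and not prev_clk and en:  # rising edge of the shared clock
            q = d
        prev_clk = clk
        trace.append(q)
    return trace


stimulus = make_samples([(1, 1), (0, 0), (1, 0), (0, 1), (1, 1)])
gated_trace = run_gated_clock(stimulus)
enable_trace = run_clock_enable(stimulus)
print(gated_trace == enable_trace)
```

The second model uses a single clock net no matter how many registers are “gated,” which is the property that makes it a better fit for an FPGA’s limited clock routing.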

In the case of ASIC verification, maintaining the fidelity of the FPGA version with respect to the ASIC version is critical. Given the inherent differences in the two technologies, however, it is impossible to maintain 100% correlation. The clocking space in particular is problematic, and capabilities such as those built into Synplicity’s and Synopsys’s FPGA prototyping tools can be a big help.

Once you’ve got all those clocks being generated properly and your design is passing through synthesis, it’s time to look at your timing analysis to see how all those elaborate edges are lining up. With only one clock, this was a simple matter of seeing if all the logic you’d stacked between registers (plus the routing) could fit into a single clock period. With multiple clocks, however, logic paths crossing between synchronous clocks are sensitive to the alignment of those clocks. You can get into trouble here if you glibly rounded off the periods just to make things “easy”. If the periods of your synchronous clocks don’t share a reasonable common divisor, you can end up with launch and capture edges only picoseconds apart. Your synthesis tool will often dutifully try to cram some logic into that space and (surprise!) won’t manage to solve it. The solution is to be sure that the clocks you specify within the same group have periods in an exact, simple ratio (4:1, for example), so that the spacing between their edges is the one you intended.

This example from Synplicity’s Gael Paul illustrates this common mistake:

Let’s assume two clocks, synchronous to each other (that is, in the same clock group):
– clk280 at 280MHz
– clk70 at 70MHz, which is simply clk280 divided by 4 (generated clock)
Let’s first assume the user enters the clock constraints in MHz. The tool automatically calculates the delay between these two clocks, and here it is simply one clock period of the fast clock, or 3.571ns, as reported in the log file:

Clock Relationships
*******************

However, if the user enters the clock constraints in ns, he/she will be tempted to round the numbers:
– 280MHz = 3.571428… ns => probably rounded to 3.57ns
– 70MHz = 14.285714… ns => probably rounded to 14.29ns
Now, what is the greatest common divisor of 357 and 1429? Well, it is 1, because the ratio is not 4 anymore! In that case, the calculated delay between these two clocks is…0.010ns!

Here, the user should use 14.28ns for clk70, which would clearly exhibit a nice 4/1 ratio.

The moral of the story is:
– it is a good idea to calculate the clock relationships by hand, and to ensure that the MHz or ns numbers do exhibit the mathematical relationship
– it is a good idea to check the log file for the calculated relationship.
_______________________________________________________________
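The arithmetic in the example above is easy to check before you ever run synthesis. The following Python sketch assumes both clocks start with aligned edges at time zero and that periods are entered in ns with at most three decimal places. Under those assumptions, the smallest nonzero gap between an edge of one clock and an edge of the other is the greatest common divisor of the two periods. The function names are hypothetical, and whether a given timing tool computes the relationship exactly this way is an assumption, so the log file remains the final authority.

```python
from decimal import Decimal
from math import gcd


def ns_to_ps(period_ns: str) -> int:
    """Convert a period given as a decimal string in ns to integer picoseconds."""
    return int(Decimal(period_ns) * 1000)


def min_edge_separation_ns(period_a_ns: str, period_b_ns: str) -> Decimal:
    """Smallest nonzero gap between edges of two clocks aligned at t = 0.

    Edge-to-edge differences are always multiples of gcd(a, b), and the
    gcd itself is reached, so it is the tightest spacing that can occur.
    """
    a_ps = ns_to_ps(period_a_ns)
    b_ps = ns_to_ps(period_b_ns)
    return Decimal(gcd(a_ps, b_ps)) / 1000


# Carelessly rounded periods: 3.57 ns and 14.29 ns
rounded = min_edge_separation_ns("3.57", "14.29")

# Periods rounded so that they keep the 4:1 ratio: 3.57 ns and 14.28 ns
consistent = min_edge_separation_ns("3.57", "14.28")

print("rounded:", rounded, "ns")
print("consistent:", consistent, "ns")
```

With the careless rounding the result collapses to a hundredth of a nanosecond, while the consistent pair gives back one full period of the fast clock.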

It’s also important to flag false paths for logic paths that cross (or appear logically to cross) between asynchronous clock domains. You don’t want your timing optimization and analysis tools scratching their heads trying to make sense of the timing, and you don’t want a 10,000-page report listing all your timing violations. Unfortunately, even a small number of unrelated clock domains can generate a huge number of false paths. Fortunately, however, some timing analysis tools can recognize the situation for you and automatically generate false paths for logic that spans known asynchronous clock domains. In this case, you only have to specify which clocks should be grouped together in synchronous groupings and the tool does the dirty work.
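As a rough illustration of what that dirty work amounts to, the sketch below takes a set of synchronous clock groups and lists every ordered clock pair that crosses between groups. Each such pair is a candidate for blanket false-path treatment. This is only a model of the idea, not any tool’s actual algorithm, and the group names and the clocks clk_io and clk_mem are hypothetical.

```python
from itertools import permutations


def cross_group_pairs(clock_groups: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Return every (launch_clock, capture_clock) pair spanning two groups.

    Clocks inside one group are treated as synchronous and are left for
    normal timing analysis; clocks in different groups are treated as
    asynchronous to each other.
    """
    group_of = {clk: name for name, clks in clock_groups.items() for clk in clks}
    return [
        (src, dst)
        for src, dst in permutations(group_of, 2)
        if group_of[src] != group_of[dst]
    ]


# Hypothetical grouping: clk280 and clk70 are related, the others are not.
groups = {
    "core": ["clk280", "clk70"],
    "io": ["clk_io"],
    "mem": ["clk_mem"],
}

pairs = cross_group_pairs(groups)
print(len(pairs), "cross-domain clock pairs")
```

Even with only four clocks, most of the ordered pairs land in the asynchronous bucket, which is why describing the groups once beats listing the false paths by hand.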

On the subject of helpful constraints, you’ll also want to clue your tools into which paths in your design deliberately require more than one clock cycle to transition. By identifying these as multi-cycle paths, you’ll save your timing analysis tool a lot of work examining these paths for errors, yourself a lot of work reading an enormous timing report, and your place-and-route software a lot of work trying to optimize timing that is probably not optimizable.
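The effect of a multi-cycle constraint on the timing budget can be sketched with one line of arithmetic. The Python below is a deliberately simplified model that ignores setup time, skew and clock uncertainty. The function name, the 6000ps path delay and the choice of a two-cycle allowance are all hypothetical illustration values.

```python
def setup_slack_ps(path_delay_ps: int, period_ps: int, cycles: int = 1) -> int:
    """Simplified setup slack: time allowed (cycles * period) minus path delay.

    A negative result means the path would be reported as a violation.
    Setup time, skew and uncertainty are ignored in this sketch.
    """
    return cycles * period_ps - path_delay_ps


PERIOD_PS = 3571        # roughly one period of a 280MHz clock
PATH_DELAY_PS = 6000    # hypothetical slow path that is sampled every other cycle

single_cycle = setup_slack_ps(PATH_DELAY_PS, PERIOD_PS)           # default analysis
two_cycle = setup_slack_ps(PATH_DELAY_PS, PERIOD_PS, cycles=2)    # declared multi-cycle

print("single-cycle slack:", single_cycle, "ps")
print("two-cycle slack:", two_cycle, "ps")
```

Without the constraint the path shows up as a large violation that the tools will try (and fail) to fix. With it, the same path has comfortable margin and drops out of the report.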

After you’ve mastered your FPGA vendor’s clock generation macros and understand the capabilities of your design software (including synthesis, place-and-route, and timing analysis), clocking won’t seem like such an intimidating problem. When in doubt, it pays to get some graph paper and draw out a plain, undergraduate-style timing diagram. If you can’t draw it, chances are you won’t be able to sort it out from timing reports either. Once you get a feel for the very reasonable clocking limitations of the modern FPGA, you’ll be able to turn your attention to the more interesting (and profitable) aspects of your design.