The Impact of Timing Exceptions on FPGA Performance

FPGA designers typically work with prototype designs that have little synthesis history, so on the first pass they will not have developed a set of false path and multi-cycle path constraints. FishTail’s Focus tool can generate false path and multi-cycle path timing exceptions for the FPGA designer before the first synthesis run. These timing exceptions can improve FPGA QoR by relaxing constraints on the timing paths of the design, potentially allowing the FPGA to run faster. In this paper we study the impact of timing exceptions on nine designs using Synplify_pro from Synplicity for logic synthesis, Xilinx tools for place and route, and Focus from FishTail to generate false and multi-cycle path timing exceptions. We compare each design’s maximum clock frequency after synthesis and after place and route, with and without timing exceptions. In about one quarter of these designs, the timing exceptions yield an improvement in FPGA performance after place and route equivalent to one speed grade.

FishTail’s Focus is designed to identify the timing exceptions in a design. It requires only data that is normally already available: synthesizable RTL, clock definitions, and boundary constraints. Focus generates false and multi-cycle paths in industry-standard constraint formats. Focus also provides the ability to verify the generated timing exceptions through assertions that can be checked as part of functional simulation.

Synplify_pro is a full-featured logic synthesis tool which reads constraints for false paths and multi-cycle paths as well as IO and clock constraints. This set of constraints is used in initial logic synthesis and affects the resulting EDIF file generated for Xilinx place and route tools. Synplify_pro also prepares a translated constraint file to feed forward the same timing requirements to place and route.

The Xilinx tool set of ngdbuild, map, and par reads the EDIF netlist and constraint file from Synplify_pro and creates a routed design, final timing reports, and a device programming file.

Synthesis Flow

Figure 1. Synthesis Flow

Figure 1 shows the flow from RTL synthesis through place and route. Two cases were run for each design clock: the unconstrained case, with a clock constraint but no timing exceptions, and the constrained case, in which the Focus-generated timing exceptions were included in addition to the clock constraint. The basic design flow is:

1. Run Focus to generate timing exceptions.

2. Synthesize the RTL with Synplify_pro to create an EDIF file and an NCF constraint file containing the constraints translated to match the EDIF netlist, in a Xilinx-compatible format. Default settings were used for logic synthesis, with the exception of the “auto-constrain frequency” option, which was turned off.

3. Run Xilinx ngdbuild, map, and par to create a routed netlist and post-route timing report.

Additional scripting is added to the flow to evaluate the impact of Focus-generated timing exceptions on maximum clock frequency. The operating frequency estimates are recorded for the constrained and unconstrained cases after synthesis and after place and route. A frequency sweeping script for logic synthesis is used to obtain the best EDIF netlist to carry forward into place and route. The script invokes Synplify_pro in batch mode, examines the timing report, raises the target frequency by writing a “1-clock only SDC file” that contains a “define_clock” constraint with the new target frequency, and then runs the tool again. The first synthesis uses a relaxed clock requirement of 15 MHz, and each resulting clock frequency estimate is used to set a clock frequency goal 5% faster than that result. When the resulting performance falls backward or fails to improve, the iterations stop and the best results are preserved. The outputs from the Synplify_pro iterations are the EDIF design netlist and the translated constraints file in NCF format. The best EDIF file is passed forward to the Xilinx tool set.

On designs with multiple clocks, each clock was swept higher independently while the others were held at an initial low value. Each clock sweep produces a unique netlist in which the other clocks are not optimized. This was done so that we could establish the maximum frequency for each clock in the design separately.

The Xilinx place and route tool set is next started with an initially low frequency goal placed in the “1-clock only UCF” file. For each iteration of the Xilinx tool chain, the goal frequency is set to 3% faster than the last achieved frequency. When the achieved frequency calculated by the post-route timing analysis is no better than the previous run, the iterations stop and the best results are recorded. The maximum frequency of the unconstrained designs is recorded. The entire process, starting from the initial low frequency targets, is repeated with the Focus timing exceptions file applied for the constrained case.
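The two sweep loops described above (5% steps for synthesis, 3% steps for place and route) share the same shape and can be sketched as follows. This is a minimal illustration, not the script used in the study: `run_flow` is a hypothetical callable that runs one tool pass at a target frequency and returns the achieved frequency from the timing report, and `max_runs` is an added safety limit that the paper does not mention.

```python
def sweep_max_frequency(run_flow, start_mhz, step, max_runs=100):
    """Raise the clock goal until the achieved frequency stops improving.

    run_flow is a hypothetical callable: it takes a target frequency in MHz,
    runs the tool chain once, and returns the achieved frequency in MHz.
    step is the fractional increase per iteration (0.05 or 0.03 in the text).
    """
    best_mhz = run_flow(start_mhz)  # relaxed first pass
    for _ in range(max_runs):
        # Next goal is a fixed percentage above the best result so far.
        target_mhz = best_mhz * (1.0 + step)
        achieved_mhz = run_flow(target_mhz)
        if achieved_mhz <= best_mhz:
            # Performance fell back or failed to improve: keep the best run.
            break
        best_mhz = achieved_mhz
    return best_mhz
```

Under these assumptions the synthesis sweep would be called as `sweep_max_frequency(run_synthesis, 15.0, 0.05)` and the place and route sweep with a step of `0.03`, where `run_synthesis` stands in for a wrapper around the batch tool invocation.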

A key difference between the unconstrained and constrained cases is that the NCF file produced by Synplify_pro at the end of synthesis is used only for the constrained cases; in the unconstrained cases the file is deleted before place and route. This NCF file is written with both clock frequency goals and timing exceptions. The place and route iteration script uses the UCF file to control the clock frequency, so the frequency goals are removed from the NCF file before proceeding, leaving only the timing exceptions. The Xilinx tools load both the NCF file and the UCF file to constrain the run.
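The NCF clean-up step might look like the sketch below. It rests on an assumption that is not stated in the paper: that the clock frequency goals appear in the NCF file as semicolon-terminated statements containing the `PERIOD` keyword, as they do in common Xilinx constraint syntax. The function name is hypothetical, and comment lines containing semicolons are not handled.

```python
import re


def strip_period_constraints(ncf_text):
    """Drop statements that set a clock PERIOD; keep everything else.

    Assumption: every statement ends with ';' and clock frequency goals
    are exactly the statements that contain the PERIOD keyword.
    """
    kept = []
    for statement in ncf_text.split(";"):
        if not statement.strip():
            continue  # skip empty fragments such as the text after the last ';'
        if re.search(r"\bPERIOD\b", statement):
            continue  # clock goal: the UCF file controls frequency instead
        kept.append(statement.strip() + ";")
    return "\n".join(kept) + "\n"
```

Splitting on semicolons rather than on newlines keeps multi-line statements, such as a TIMESPEC that wraps onto a second line, intact.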

Constraining Synplicity and Xilinx Tools

Focus generates timing exceptions in SDC format. A Perl script was used to convert these constraints into Synplicity format. The constraints for the place and route tools were generated by the synthesis tool in the form of an NCF file. The NCF format breaks all buses into individual bits. Below is an example MCP constraint and how it translates from Focus, through Synplify_pro, and into Xilinx P&R.

Focus SDC Output format
set_multicycle_path -from [get_cells { wishbone/RxPointerRead_reg }] \
-to [get_cells { wishbone/ram_addr_reg[*] } ] \
-setup 2 -end

Synplify_pro SDC format
define_multicycle_path -from { wishbone.RxPointerRead } -to { wishbone.ram_addr[*] } 2
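The translation from the Focus form to the Synplify_pro form can be sketched as below. The study used a Perl script that is not reproduced in this paper; this Python sketch is a stand-in whose rules are inferred only from the single example above (drop the `get_cells` wrapper, strip the `_reg` suffix, replace `/` with `.`), and its function names are hypothetical. Other constraint shapes are not handled.

```python
import re

# Matches only the set_multicycle_path shape shown in the example above.
_MCP_RE = re.compile(
    r"set_multicycle_path\s+"
    r"-from\s+\[get_cells\s*\{\s*(?P<src>\S+)\s*\}\s*\]\s+"
    r"-to\s+\[get_cells\s*\{\s*(?P<dst>\S+)\s*\}\s*\]\s+"
    r"-setup\s+(?P<cycles>\d+)"
)


def to_synplify_name(sdc_name):
    """Strip a trailing _reg (before any bus index) and use '.' as separator."""
    name = re.sub(r"_reg(?=(\[[^\]]*\])?$)", "", sdc_name)
    return name.replace("/", ".")


def convert_mcp(sdc_text):
    """Convert one SDC multi-cycle path constraint to define_multicycle_path."""
    flat = sdc_text.replace("\\\n", " ")  # join backslash-continued lines
    match = _MCP_RE.search(flat)
    if match is None:
        raise ValueError("unsupported constraint form")
    return "define_multicycle_path -from { %s } -to { %s } %s" % (
        to_synplify_name(match.group("src")),
        to_synplify_name(match.group("dst")),
        match.group("cycles"),
    )
```

Applied to the Focus output above, `convert_mcp` returns the Synplify_pro line shown.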

NCF file Format
INST "wishbone/ram_addr[0]" TNM = "wishbone_ram_addr_0_";
… bits 1 thru 6 …
INST "wishbone/ram_addr[7]" TNM = "wishbone_ram_addr_0_";
TIMESPEC "TS_wishbone_RxPointerRead_wishbone_ram_addr_0_" = FROM
"wishbone_RxPointerRead" TO "wishbone_ram_addr_0_" 18.622;

Notice that we started with an MCP of 2 clock periods and ended with a path timing requirement of 18.622ns. The synthesis tool determined a clock period goal of 9.311ns for this clock domain and translated the MCP constraint into a fixed path timing requirement. The fixed time value poses a hazard in the final timing report: place and route may achieve a clock period shorter than the assumed 9.311ns, in which case the path requirement that actually needs checking is two times the final clock period. The last stage of Xilinx place and route performs a timing check and reports on the requirements that were present in the UCF and NCF files, so in a production flow an additional timing report should be generated with the MCP time requirements recalculated from the operational clock period.
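The recalculation is simple arithmetic, sketched below. The function names and the example path delay are hypothetical; only the 2-cycle MCP and the 18.622ns requirement come from the example above.

```python
def mcp_requirement_ns(cycles, clock_period_ns):
    """Path requirement for an MCP expressed in clock cycles."""
    return cycles * clock_period_ns


def recheck_mcp_path(path_delay_ns, cycles, fixed_requirement_ns, achieved_period_ns):
    """Compare the fixed NCF check with one recalculated from the achieved clock."""
    recalculated_ns = mcp_requirement_ns(cycles, achieved_period_ns)
    return {
        "recalculated_ns": recalculated_ns,
        # What the place and route timing report checks (fixed value from the NCF).
        "passes_fixed_check": path_delay_ns <= fixed_requirement_ns,
        # What must hold if the device runs at the achieved clock period.
        "passes_recalculated_check": path_delay_ns <= recalculated_ns,
    }
```

For instance, with a hypothetical achieved period of 8.5ns, a 17.5ns path would pass the fixed 18.622ns check while failing the recalculated 17.0ns requirement, which is exactly the case the additional timing report is meant to catch.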

Testcase Designs

Nine designs were chosen, ranging from 42 flops to 1678 flops. Most are publicly available OpenCores designs. These designs were chosen because they can be shared between EDA vendors, and because they are small enough to run the sweeping iterations in a reasonable time. The FPGA performance characteristics of the designs were unknown when they were chosen.

The design chart in Table 1 lists the number and type of timing exceptions Focus found in the design. MCP is multi-cycle path and FP is false path. These designs had numerous state machines and divided clocks and so tended to have more MCPs. Columns for LUT count and Flop count give us an idea of the overall size of the design. The target part type is also listed. The part selections were targeted to the Virtex 4 family except where the RTL instantiated RAM cells specific to another family. Since the same part was used for constrained and unconstrained cases, the part selection and utilization variables of the trials are as controlled as possible. The designs tested cover the area range from very small to moderate in size.

Most of the designs are actually cores that are not intended as stand-alone devices. The designs with high IO counts forced the selection of rather large packages for the implementation, and thus the utilization is low. We might expect timing constraints to have a more significant effect on a tightly packed device. Also, the mapping of IO cells onto a core will slow the overall timing. The IO cells were necessary for this study in order to preserve all the logic that Focus found in the RTL and used to prepare timing constraints. If the IO cells are not mapped, then the place and route tool can remove the logic tied to those ports. With IO cells inserted, a constraint relaxing timing on an IO path will have a greater impact on performance than if it were an internal signal, because of the larger delays that IO cells carry.

Results

Table 2 compares the clock frequency for the unconstrained design versus the constrained design. The rightmost column of Table 2 indicates the overall performance gain when the timing exceptions are applied. Each clock of a design was treated separately regardless of how many of the original timing exceptions actually applied to that clock. The uncertainty of the final place and route frequency is approximately 3%.

Significant performance improvement can be seen when a timing exception relaxes the critical path of a design. In the multipath design, a state machine controls a combinational SIN/COS lookup table. A ‘start’ signal initiates a lookup in which the registered angle drives the lookup table, the table outputs are registered 3 cycles later, and a ‘done’ flag then indicates completion. Comments in the source code indicate that an MCP of 4 should be set from the angle register to the output register. Focus, however, determined that the MCP was 3, and a careful look verifies that 3 is correct; perhaps some editing revised the code but not the comments. The path delay through the lookup table is nearly 40ns, so without constraints the tools report 1-cycle timing of about 24 MHz, but when told that this path is a 3-cycle MCP, they can optimize the entire device for 86 MHz.

Table 2. Frequency Results

A more typical case is the m_dve design. The post-route timing report shows that the critical timing path does not have an MCP or false path applied. However, if we compare the path delays, the constrained case found better routing, for a 9% path improvement:

Constrained: Total 6.654ns (2.627ns logic, 4.027ns route)
Unconstrained: Total 7.312ns (2.632ns logic, 4.680ns route)
Path SYNTHESIZER/subc_phase_accumulator[24] to
DATAPATH/v_mult/p_out_2[15:0]
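The comparison above can be reproduced from the two report lines with a short sketch. The regular expression is written only for the delay format shown here, and the function names are hypothetical.

```python
import re

# Matches lines such as "Total 6.654ns (2.627ns logic, 4.027ns route)".
_DELAY_RE = re.compile(
    r"Total\s+(?P<total>[\d.]+)ns\s+"
    r"\((?P<logic>[\d.]+)ns logic,\s+(?P<route>[\d.]+)ns route\)"
)


def parse_delay(line):
    """Return total, logic, and route delay in ns from one report line."""
    match = _DELAY_RE.search(line)
    if match is None:
        raise ValueError("unrecognized delay line")
    return {key: float(value) for key, value in match.groupdict().items()}


def percent_faster(slow_ns, fast_ns):
    """Reduction in delay, as a percentage of the slower value."""
    return 100.0 * (slow_ns - fast_ns) / slow_ns
```

Applied to the two lines above, the total delay improves by about 9.0%. The logic delay is nearly unchanged (2.627ns versus 2.632ns), so the gain comes almost entirely from routing, which improves by roughly 14%.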

In the m_dve design it looks like the relaxing of constraints on non-critical paths allowed the router to use better routes for the critical paths. Notice that the synthesis phase by Synplify_pro found no gains from the timing exceptions.

The designs can_top and eth_top (mrx clock) lost 2% and 5% respectively. For the can_top design, a comparison of the critical path of the slower constrained case shows that the logic delays are equal and that the routing in the unconstrained case was 5% faster. We can speculate that the constraints did modify routing priorities, but that the randomness of the routing algorithm led to a different local minimum. The eth_top_mrx design has a similar story: for the critical path of the slower constrained case, the routing delay was about 5% slower than the routing for the same network in the unconstrained case.

The multi-clock designs such as eth_top were duplicated for each clock name as indicated in Table 2. All of the available constraints were applied for synthesis, but the aggressive clock goal was set only on the target clock. This method indicates whether the full set of constraints hurts or helps the target clock domain. The eth_top design gains are near the 3% uncertainty of the iterations, but the pextop, usbf_top, and xsv_fpga_top designs all show gains that might allow the designer to select a slower speed grade and still meet the design requirements. In practice, synthesis and routing optimize multiple clocks simultaneously.

Issues and Refinements

A few of the designs presented special problems to the tool sets. The dct8x8 design had 2-dimensional arrays of registers, for which the Focus constraints would not attach to the Synplify_pro naming style without manual edits. One design had hundreds of false paths using “-through” points, and the Xilinx tool seemed to use excessive memory because of those entries. The tool ran out of memory (32-bit Linux) after warning that too many constraints were applied. When the constraints with “-through” were removed, the remaining constraints were processed correctly.

The NCF file format, which passes the constraints from synthesis into place and route, currently converts n-clock MCPs into fixed time delay requirements. If the router exceeds the performance of the requested clock period, its timing report will not properly detect a timing violation related to the fixed-time MCP. For example, if an MCP of 2 clocks translates to 18ns for a 9ns clock target, but the router achieves an 8ns clock period, then operating at 8ns requires those MCP paths to meet 16ns, not 18ns. MCP constraints would best be communicated to the timing engine in units of clocks.

Another flow issue is that a small percentage of the NCF format constraints were flagged as errors by the Xilinx ngdbuild program. These constraints were generated by Synplify_pro along with the EDIF netlist, but in processing the netlist the Xilinx tools appear to do some optimization that eliminates some nets. The NCF had to be edited to remove the offending terms in order to move forward. For example, Focus placed a constraint on net “X1”, and the EDIF netlist contained nets “X1” and “X1_fast”, created for fanout purposes. The NCF file placed the constraint on both nets, but the Xilinx tool reported that the “X1_fast” net does not exist and stopped with an error. The “X1_fast” net was presumably collapsed back into “X1” by the Xilinx tools.

Applying many false path and multi-cycle path constraints to nets that already have plenty of timing slack appears to have a design-dependent influence that may either hurt or help performance. Focus has a utility to filter the SDCs if provided a list of critical path endpoints. A critical endpoints list could be extracted from FPGA timing reports; filtering against it would cut down the number of constraints and relieve some of the memory usage concerns in the Xilinx tools. For an optimum result the designer could try synthesis with both the full and the filtered sets of SDC constraints.

Summary

The application of automatically generated false path and multi-cycle path timing exceptions to a collection of FPGA designs improves the maximum clock frequency of many of the fully routed designs. QoR improvements can be seen in improved routing of the truly critical paths in a routed netlist timing analysis. The frequency gains were design dependent and often on the order of one speed grade of an FPGA part. EDA tools are currently available that allow the generation and application of timing exceptions from RTL through place and route for a typical FPGA flow.

Acknowledgments

We would like to acknowledge the help provided by Joe Gianelli at Synplicity and Hitesh Patel at Xilinx in providing us tools and assistance in conducting this study.