Does Single-pass Physical Synthesis Work for FPGAs?

As mask prices and NRE costs rise to exorbitant levels, the ASIC route becomes increasingly unrealistic for many applications, especially in low- to medium-volume production quantities. ASIC design starts have plummeted from a high of over 11,000 in 1997 to below 4,000 in 2003 (Source: Gartner Dataquest). With the advent of innovative FPGA architectures incorporating embedded processors, memory blocks and DSP functions, many designers who depended on ASIC methodologies are turning to FPGAs for new generations of complex designs. Increasingly, the same designer works on both platforms: one month the target is an ASIC, and the next it may be an FPGA. Yet there are key differences between the two types of silicon platforms, and these mandate specific features in the EDA tools needed to develop and implement designs on the latest generation of FPGAs. This paper makes that case with respect to advances in the physical synthesis space.

ASICs/FPGAs: Similarities and Differences

ASICs are conceived from scratch, while FPGAs have a predefined architecture for a given family of devices. This means designers must follow different HDL coding guidelines for each type of platform. Complex FPGA design shares some commonality with ASIC design, in the sense that both sets of designers must account for timing, power, and other performance specifications. Designers on both platforms perform synthesis, run RTL simulation and generate test benches. But under the hood, many steps are fundamentally different. The predetermined nature of FPGAs drives a “use or lose” approach to features/capabilities. FPGA design, more often than ASIC design, must match functional requirements with the device architecture.

ASIC design consists of many disparate design tasks that are not part of an FPGA design flow. For example, the FPGA vendor has already taken care of clock-tree synthesis and boundary scan. FPGA designers also need not perform silicon verification or scan-chain insertion for test. Since most FPGAs power up in a known state, FPGA designers do not have to initialize memory bits, latches or flip-flops. To their advantage, FPGAs can also have embedded logic analysis capability for debugging a design.

As high-end FPGAs encroach on ASIC performance, many advanced ASIC techniques are being adapted for FPGA design. The introduction of high-performance, multimillion-gate FPGAs has forced designers to turn to physical synthesis and hierarchical floorplanning (commonly used methods within the ASIC design flow) to achieve design goals and to support incremental design changes without long place-and-route (P&R) run times. Coarse floorplanning alone will no longer suffice—both ASICs and high-performance FPGAs need placement-based models to achieve timing closure.

Lessons from the ASIC World

The ASIC design community has relied on tools that integrate RTL and physical synthesis to achieve design goals for many years. Clearly, we cannot force the same ASIC methodologies on FPGA designers. Nor can we expect similar levels of success by applying the same ASIC tools on new and emerging programmable platforms. We can, however, examine traditional physical synthesis challenges faced in the ASIC space to leverage that experience and optimize FPGA synthesis techniques moving ahead.

First-generation ASIC synthesis tools used the now primitive fanout-based approach, which worked fine since most of the delay from the cell/wire combination came from the cell. Moving into deep submicron (DSM) ASIC technologies (130 nm and below), however, the traditional separation between logical (synthesis) and physical (place and route) design methods created a critical problem. Designs no longer met their performance goals, giving rise to what is now notoriously known as the “timing closure” problem. As geometries kept shrinking, circuit delays were increasingly influenced by net delays and the wire topology, so that floorplanning and cell placement drastically affected the circuit timing. The traditional fanout-based wire load models used for estimating interconnect delay during synthesis were rendered inaccurate and eventually broke down. This is still the key factor driving the lack of timing predictability between post-synthesis and post-layout results. Timing closure is still one of the biggest areas of concern for ASIC performance-oriented designs.
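The breakdown of fanout-based wire load models can be illustrated with a toy comparison. The sketch below is only an illustration: the table values, the per-unit delay constant, and the function names `fanout_delay` and `placed_delay` are all assumptions made for this example, not data from any real library or tool. The placement-aware estimate uses the half-perimeter of the pin bounding box, a common rough proxy for wire length.

```python
# Toy comparison of a fanout-based wire load estimate with a
# placement-aware estimate. All constants are made-up illustrative values.

# Hypothetical wire load table: fanout -> estimated wire delay (ns).
WIRE_LOAD_TABLE = {1: 0.05, 2: 0.09, 3: 0.14, 4: 0.20}

# Assumed wire delay per grid unit of half-perimeter wire length (ns).
DELAY_PER_UNIT_LENGTH = 0.02


def fanout_delay(fanout):
    """Estimate net delay from fanout alone (fanout must be >= 1)."""
    max_known = max(WIRE_LOAD_TABLE)
    if fanout <= max_known:
        return WIRE_LOAD_TABLE[fanout]
    # Extrapolate linearly past the end of the table.
    slope = WIRE_LOAD_TABLE[max_known] - WIRE_LOAD_TABLE[max_known - 1]
    return WIRE_LOAD_TABLE[max_known] + slope * (fanout - max_known)


def placed_delay(driver, sinks):
    """Estimate net delay from the half-perimeter of the placed pins."""
    xs = [driver[0]] + [s[0] for s in sinks]
    ys = [driver[1]] + [s[1] for s in sinks]
    half_perimeter = (max(xs) - min(xs)) + (max(ys) - min(ys))
    return DELAY_PER_UNIT_LENGTH * half_perimeter


# Two nets with the same fanout but very different placements.
near_net = ((0, 0), [(1, 0), (0, 1)])
far_net = ((0, 0), [(40, 0), (0, 30)])

for name, (driver, sinks) in (("near", near_net), ("far", far_net)):
    print(name, fanout_delay(len(sinks)), placed_delay(driver, sinks))
```

Under these assumed numbers, the fanout model assigns both nets the same 0.09 ns, while the placement-aware estimate separates them by more than an order of magnitude (about 0.04 ns versus about 1.4 ns). That gap is the post-synthesis versus post-layout mismatch described above.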

Many solutions emerged to surmount this daunting challenge in ASIC design and implementation, including custom wire load models, enhanced floorplanning and physical synthesis. Some of these methods work better than others, but no single solution has adequately addressed all types of designs. A combination of all these techniques may be required to fully achieve timing closure. Custom wire load models often failed as the placement engines yielded unforeseen results, while larger designs and complicated floorplans produced inferior results compared to the other techniques. Physical synthesis, which allows logic restructuring based on placement, works extremely well over a broad range of designs, and is the block-level timing closure technology of choice. However, one of its main limitations is that it often creates un-routable designs due to its lack of real wire topology prediction. In terms of the flow, real-wire topologies are not realized until after timing is “closed,” so the advantage of using wires to drive mapping is not truly utilized. Evidently, synthesis and placement technologies must be even more tightly integrated to create properly placed and routable designs that meet future ASIC, as well as FPGA, performance goals.

Ideally, interaction between the logic synthesis and the physical implementation would be contained in one environment. This allows synthesis-based optimizations, including restructuring of the logic and optimization of the critical paths, to occur with real-wire topologies, so that the associated effects can be considered simultaneously. This will greatly reduce the current dependence on wire load models, as accurate wire information will be available from the very beginning of the timing closure cycle. To reiterate, ASICs and FPGAs require different implementation strategies, a distinction that becomes paramount when addressing the growing ASIC-like physical synthesis challenges in the FPGA world.

Proximity Does Not Imply Better Timing

In today’s DSM chips, interconnect delay dominates performance for both ASIC and FPGA platforms. But on the ASIC front, routing is not predetermined, so designers can do intelligent pre-placement floorplanning to reduce wire lengths and thus minimize this delay. Buffer sizing methods can also be deployed to improve performance. This does not hold true for FPGAs, which have very structured predetermined rules regarding placement and routing for a given programmable fabric. Most FPGAs have embedded buffers, so that option does not exist.

Moreover, fanout-based delay estimates are exactly what they are—“estimates”—inaccurate at best. Optimization decisions based on a wire-load estimate will often result in timing-inefficient netlists, leaving a significant amount of performance on the table. FPGAs have complex placement- and packing-sensitive routing structures. Consequently, proximity is not always directly related to delay, as evidenced by the examples in figures 1 and 2. Therefore, floorplanning based on the adage that “proximity yields better timing” falls painfully short in complex FPGA design.

Figure 1: Proximity model does not guarantee timing convergence in FPGAs, where packing or placement cell by cell is a problem. Coarse floorplanning does not solve this problem, since delays are not a regular function of distance at the detailed placement level.

Figure 2: Regardless of the device type or vendor, FPGAs have complex placement- and packing-sensitive routing structures. Due to these nondeterministic and nonlinear routing delays, load and distance may not necessarily be related.
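Why distance and delay can decouple on a segmented routing fabric can be sketched with a small model. Everything here is hypothetical: the segment names, spans, and delays in `SEGMENTS` are invented for illustration and do not describe any vendor's device. The sketch assumes a route along one row must be covered exactly by fixed-length segments, and it finds the cheapest combination.

```python
# Toy model of a segmented FPGA routing fabric along one row.
# Segment names, spans (in tiles) and delays (ns) are invented values.
SEGMENTS = {
    "single": (1, 0.30),
    "hex": (6, 0.50),
    "long": (24, 0.90),
}


def route_delay(distance):
    """Minimum delay to cover exactly `distance` tiles with the segments."""
    # best[d] holds the cheapest delay found for a route of d tiles.
    best = [0.0] + [float("inf")] * distance
    for d in range(1, distance + 1):
        for span, delay in SEGMENTS.values():
            if span <= d:
                best[d] = min(best[d], best[d - span] + delay)
    return best[distance]


for tiles in (5, 6, 23, 24):
    print(tiles, round(route_delay(tiles), 2))
```

In this model a sink 6 tiles away is reached with one hex segment (0.5 ns), while a sink only 5 tiles away needs five single segments (1.5 ns). Likewise, 24 tiles costs less than 23. Moving a cell closer can therefore make its path slower, which is why a proximity-based floorplan is a poor predictor of FPGA timing.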

Currently, a few companies offer “physical synthesis” alternatives based solely on technology borrowed from the ASIC implementation space, such as delay estimation based on coarse placement. In reality, importing an ASIC methodology (and mentality) into the FPGA world does not work. Success will continue to elude these approaches because they essentially try to “outsmart” the vendor placement. These tools guess what the vendor’s tool might do in any given scenario. Based on these estimates, they then calculate approximate delays and make decisions on whether or not to pass this data to the P&R tool. Some available tools may show promise in certain instances, but most cannot match the performance of a tool that leverages post-P&R information to provide “true” physically aware synthesis.

As device complexities rise, we must review traditional approaches to timing convergence to see where they fall short. Current solutions using standalone logical synthesis are iterative and non-deterministic by nature. Designers typically write and rewrite RTL code, provide guidance to the place and route (P&R) tools by grouping cells together, and possibly attempt some floorplanning. An alternative is to simply do numerous P&R runs. These cannot be considered “solutions,” since they treat timing as an afterthought. Even then, the HDL code or constraints are modified without the faintest notion of whether timing will actually improve in the next iteration. Designers should not have to iterate through P&R, the most time-consuming step in FPGA design, before gaining any visibility into whether the changes made were a step in the right direction (or only served to exacerbate the problem). This unpredictability impacts the bottom line, negating the reduced costs and time-to-market advantages of using programmable logic in the first place.

Physical Synthesis for FPGAs

For designs with physical geometries at 130 nm and below, “simple” RTL synthesis will not suffice. To reduce the number of design iterations and achieve development goals, interconnect delay and physical effects must be considered up front in the flow, which is why physical synthesis becomes critical. To be highly efficient, however, next-generation physical synthesis tools for complex FPGAs must not only tightly integrate the logical and physical aspects of a design within a single data model, they must also respect the results delivered by the FPGA vendor’s P&R tools. Vendor tools are no doubt optimized to do a superior job of placement on their own technology, since in co-designing their hardware and software offerings, the vendors have devoted great attention to tuning their solutions to match the specific architecture, packing and placement rules. An ideal physical synthesis tool should take vendor placement into account up front, and only then begin to manipulate and optimize the design to quickly converge on timing. In recent benchmark tests, this approach of using the vendor’s post-P&R netlist data has demonstrated superior results.

By tying logical synthesis to physical synthesis in a unified data model, designers get unprecedented flexibility and productivity gains. For this reason, these two capabilities should not exist in isolation from each other. Since many fixes are made at the RTL level, a good cross-probing capability helps correlate information from the physical realm back to the logic realm. This can be very powerful, since it allows designers to easily switch views from a “hot spot” in the physical design all the way into RTL and thus gain visibility into possible fixes and recommendations. Some optimization algorithms are common to logical and physical synthesis. Even so, synthesis decisions made using post-P&R knowledge of the target device’s physical layout will achieve the best timing results. Popular routines used by physical synthesis tools are re-timing, replication, re-synthesis, and placement optimization.

Re-timing balances the positive and negative slacks found throughout a design. Once the node slacks are calculated, registers are moved across combinatorial logic to balance positive and negative slacks.
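The slack-balancing idea behind re-timing can be sketched on a minimal case: a chain of combinational gates with one register somewhere inside it. This is a simplified illustration under stated assumptions, not a production algorithm. The gate delays, the clock period, and the helper names `stage_delays`, `slacks` and `retime` are hypothetical, and real re-timing must also handle fanout, feedback loops and register initial states.

```python
# Toy re-timing on a linear chain of gates split by a single register.
# reg_pos = number of gates placed before the register.


def stage_delays(gate_delays, reg_pos):
    """Return (delay before register, delay after register)."""
    return sum(gate_delays[:reg_pos]), sum(gate_delays[reg_pos:])


def slacks(gate_delays, reg_pos, clock_period):
    """Return the slack of each stage: clock period minus stage delay."""
    before, after = stage_delays(gate_delays, reg_pos)
    return clock_period - before, clock_period - after


def retime(gate_delays):
    """Pick the register position that minimizes the slower stage."""
    positions = range(len(gate_delays) + 1)
    return min(positions, key=lambda p: max(stage_delays(gate_delays, p)))


gates = [1.0, 1.0, 1.0, 3.0, 2.0]  # assumed gate delays in ns
period = 5.0                       # assumed clock period in ns

print("before:", slacks(gates, 1, period))
best = retime(gates)
print("after: ", best, slacks(gates, best, period))
```

With the register after the first gate, the stage slacks are +4.0 ns and -2.0 ns. Moving it to sit after the third gate gives +2.0 ns and 0.0 ns, so the positive slack of one stage has been spent to remove the negative slack of the other.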

Replication is especially effective in breaking up long interconnect. Here, registers are duplicated and moved to reduce interconnect delays.
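A minimal sketch of replication follows, assuming a register that drives two distant clusters of sinks. The coordinates, the helper names, and the grouping heuristic are all hypothetical. The `delay` argument defaults to Manhattan distance only to keep the example short. In line with the argument of this article, a real tool would plug in delays taken from the placed-and-routed design, and would also check that each copy lands on a legal site.

```python
from itertools import combinations


def manhattan(a, b):
    """Stand-in delay model: Manhattan distance between two sites."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def worst_delay(reg, sinks, delay=manhattan):
    """Worst delay from a register site to any of its sinks."""
    return max(delay(reg, s) for s in sinks)


def centroid(points):
    """Integer centroid of a group of sites (floor division)."""
    n = len(points)
    return (sum(p[0] for p in points) // n, sum(p[1] for p in points) // n)


def replicate(sinks, delay=manhattan):
    """Split sinks into two groups and place one register copy per group."""
    # Seed the two groups with the pair of sinks that are farthest apart.
    seed_a, seed_b = max(combinations(sinks, 2), key=lambda p: delay(*p))
    group_a = [s for s in sinks if delay(s, seed_a) <= delay(s, seed_b)]
    group_b = [s for s in sinks if s not in group_a]
    return (centroid(group_a), group_a), (centroid(group_b), group_b)


register = (10, 0)
sinks = [(0, 0), (1, 1), (20, 0), (21, 1)]

print("single register:", worst_delay(register, sinks))
for copy_site, group in replicate(sinks):
    print("copy at", copy_site, "worst:", worst_delay(copy_site, group))
```

In this example the single register sees a worst-case distance of 12 units, while each duplicate sees 2 units to its own group. The cost is one extra register plus extra load on whatever drives the register's input, which the sketch ignores.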

Re-synthesis uses the physical data to make local optimizations to critical paths. This includes operations like logic restructuring and substitution to improve timing and routing. The goal is to move logic from critical paths to non-critical portions of the design.

Placement optimization optimizes the logic and modifies the placement. However, it is critical that the tool understands the FPGA vendor’s placement, clocking and packing rules to ensure that each move is legal and does not cause issues during final routing.

Conclusion

Initial placement helps in the ASIC world since it is based on a proximity model. The closer two cells are, the more easily routing tracks can be built to justify that proximity placement. In FPGAs, this does not necessarily work because routing is segmented and hierarchical. Many physical synthesis solutions taken from the ASIC world fail in the FPGA space since they do not look at real wire delays and topologies. To be truly effective for complex FPGA design, new physical synthesis tools should combine the benefits of ASIC-strength algorithms with the advantages of using post-P&R data up front. Utilizing wire delays in this manner to drive the mapping is a huge benefit in physical synthesis for FPGAs. It is very difficult to correlate the post-layout data of the vendor tool’s timing engine with that of the physical synthesis tool. Any tool that can provide this functionality can truly make use of actual physical data in real time and ensure the highest accuracy.

In summary, an ideal FPGA synthesis tool should consider vendor placement results as soon as possible, and only then begin to manipulate the design using physical synthesis, tightly integrated with logical synthesis, to converge on timing and thus meet design goals.
