Blaming the Button

We FPGA designers work hard to get our RTL ready to rumble. We round up our IP, mull over the microarchitecture, sweat over the simulation, and finally get things lined up well enough to push the big GO buttons for synthesis and place-and-route. After that, the design is mostly out of our hands, right? The tools do their job, and, unless we have some critical paths that need optimizing, some LUTs hanging around loose after placement, or some routes that ended up unrouted, we just sit back and wait for the system to tell us that everything is hunky-dory. When the A-OK signal comes back, we grab our new bitstream and head off to programming paradise, unaware that we may have just left staggering quantities of design excellence on the table.

How staggering? According to a recent paper published by Jason Cong and Kirill Minkovich of UCLA, the optimality of logic design alone can be off by as much as 70-500X. That’s not a percent sign, boys and girls; that big X means that your design may be taking up 70 to 500 times as many LUTs as the best possible solution. The UCLA study compared synthesis results from academic and commercial synthesis tools with known-optimal solutions for a variety of circuits. The paper says that the synthesis results were 70 times larger in area on average than the known optimal solutions in the test study.
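To make the percent-versus-X distinction concrete, here is a small arithmetic sketch. The 1,000-LUT optimal size is a made-up number chosen for easy math, not a figure from the UCLA paper.

```python
def luts_used(optimal_luts: int, ratio: float) -> int:
    """LUTs consumed when a result is `ratio` times the optimal size."""
    return round(optimal_luts * ratio)

optimal = 1_000  # hypothetical known-optimal LUT count (assumption)

seventy_percent_over = luts_used(optimal, 1.70)  # 70% larger than optimal
seventy_x = luts_used(optimal, 70)               # 70X the optimal size
five_hundred_x = luts_used(optimal, 500)         # 500X the optimal size

print(seventy_percent_over, seventy_x, five_hundred_x)
```

Under that assumption, a result that is 70% too large costs a few hundred extra LUTs, while a result that is 70X too large costs tens of thousands.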

While these examples are admittedly academic, and real-world designs may not suffer as badly, this is still a staggering difference. Area is only one part of the story, however, and many of the timing optimization techniques employed by modern synthesis tools involve replicating logic to improve timing. That kind of behavior is not going to help the area score much. Many times, if the tool had done a better job of logic minimization initially, the subsequent replications would not have been necessary. They’re often attempts to make up for having too many levels of logic in the first place.

Area woes aside, there’s an even bigger design demon arising that threatens to dwarf the suboptimal synthesis problem. As geometries continue to shrink, with some FPGA companies now talking about 65nm products, the contribution of routing and interconnect as an overall percentage of the delay in any given path is increasing. This contribution is already large enough that routing, rather than logic, is the dominant factor in delay on most real paths.

Since your garden-variety logic synthesis tool (the one that’s busy making your design 70X too large) doesn’t know anything about the placement that will eventually be the big determiner of your timing success or failure, the speed of your design is left almost entirely in the hands of an auto-placer. If you think logic synthesis is an inexact science, you should stop by and check out the state of the art in placement technology. Optimality is not in the cards anytime soon. In most cases, OK-imality doesn’t happen all that often either. Placement is pseudo-algorithmic, empirical, NP-complete, heuristic-based, woo woo. Anyone who tells you differently is trying to sell you a placer.
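For a feel of what “heuristic-based” means here, consider the toy sketch below. It is not how any commercial placer works. The four-cell netlist, the single row of sites, and the function names are all invented for illustration, and a simple greedy swap stands in for the far more elaborate methods real tools use.

```python
from itertools import combinations

# Hypothetical netlist: each net connects two cells (names are made up).
NETS = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("b", "d")]

def wirelength(placement, nets=NETS):
    """Total Manhattan distance over all two-pin nets."""
    total = 0
    for u, v in nets:
        (x1, y1), (x2, y2) = placement[u], placement[v]
        total += abs(x1 - x2) + abs(y1 - y2)
    return total

def greedy_swap(placement, nets=NETS):
    """Keep swapping pairs of cells for as long as a swap shortens the wiring."""
    placement = dict(placement)  # work on a copy
    improved = True
    while improved:
        improved = False
        for u, v in combinations(sorted(placement), 2):
            before = wirelength(placement, nets)
            placement[u], placement[v] = placement[v], placement[u]
            if wirelength(placement, nets) < before:
                improved = True  # keep the swap
            else:
                # Undo: the swap did not help.
                placement[u], placement[v] = placement[v], placement[u]
    return placement

# Hypothetical starting placement on a single row of four sites.
initial = {"a": (0, 0), "b": (3, 0), "c": (1, 0), "d": (2, 0)}
final = greedy_swap(initial)
print(wirelength(initial), "->", wirelength(final))
```

A greedy pass like this stops at the first arrangement that no single swap can improve, which on a larger design need not be the best arrangement. It also optimizes only wirelength and knows nothing about timing, which is a big part of why real placement is so much harder.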

So, what does this all mean? Are the big, green GO buttons that have always been your best friends – the ones that signaled the end of a job well done – the ones that told you “go home, get some rest, we’ve got it under control now…” THOSE buttons? Are those buttons quietly betraying us? Slacking off on the job? Quietly pretending to be competent while delivering real-world results that are possibly an order of magnitude worse than what is theoretically possible? Well, in short, yes – they are. I know this is the part in the article where we usually say “nah, just kidding, it’s not really as bad as all that.” Not this time.

The logic synthesis, placement, and routing tools collaborate to perform a tightly interlocked, highly choreographed algorithmic dance on your design. Synthesis is first, and it is the final arbiter of the specific components and interconnect that will form your netlist. Placement analyzes the topology of that netlist and attempts to create an arrangement that honors design constraints while giving the router the best possible chance to succeed. Routing plays a complicated game of connect-the-dots and hit-the-timing-constraints, but when it can’t meet timing even with the fastest available path, the solution to the problem has to bump back up to the placement level. If placement tries to move a device to improve one path, but finds that moving it would ruin another path, sometimes logic replication is required, which (functionally at least) bumps the problem back up to the logic synthesis level. Logic synthesis, of course, initially works in the dark because it has no idea what the eventual placement and routing will be. Therefore, most of its delay information (by which it determines critical paths) is just plain wrong.

Does all this sound complicated, iterative, and possibly like one of those “moving the bubble in a water balloon” problems that may never converge? That ain’t the half of it, bub. In the happy, fantasy scenario above, synthesis, placement and routing were all working nicely together, handing off vital information and coordinating their efforts to come up with the best possible design for you – while you sleep. The traditional reality of the situation is far, far worse.

In the real world (most of it before now, at least), you ran a logic synthesis tool from Vendor A. That tool toiled away, trying to meet your timing constraints, ultimately concluding that one of your logic paths had negative slack. You went to bed with the “timing optimization” and “maximum effort” boxes checked, though, didn’t you? Well, what your little automated wonder of a synthesis tool didn’t consider was that its timing estimates were off by 50% for this path because placement and routing were fated to do a much-better-than-average job on this part of your design. Not knowing that, however, it set about burning copious CPU cycles restructuring the logic of this “false alarm” path, replicating registers and logic, and, in the end, just about doubling the size of that portion of your design – all for nothing.

That “optimized” netlist was now handed to Vendor B’s placement and routing tools. Those tools had no problem meeting timing on your little problem path, but four more paths had serious difficulties. It seems your little logic synthesis tool had under-estimated their delay by significant margins, and had thus failed to do any optimization on them at all. Furthermore, all the area and routing resources occupied by your “false alarm” critical path in synthesis were making the job of fixing the remaining “real” critical paths impossible for the placement and routing tools.
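The mismatch in this scenario can be sketched in a few lines. Every number below is invented to mirror the story: a 10 ns clock constraint, one path that synthesis over-estimates by 50%, and four paths it under-estimates. The path names and the `failing` helper are hypothetical too.

```python
CLOCK_PERIOD_NS = 10.0  # hypothetical timing constraint

# Hypothetical paths: (name, delay estimated by synthesis, delay after
# place-and-route), both in nanoseconds.
PATHS = [
    ("false_alarm", 12.0, 8.0),   # estimate is 50% above the real delay
    ("path_1", 8.5, 11.0),
    ("path_2", 9.0, 10.8),
    ("path_3", 7.5, 10.4),
    ("path_4", 9.5, 12.1),
    ("easy_path", 6.0, 6.5),
]

def slack(delay_ns, period_ns=CLOCK_PERIOD_NS):
    """Positive slack meets timing; negative slack fails it."""
    return period_ns - delay_ns

def failing(paths, column):
    """Names of paths with negative slack, using the given delay column."""
    return [p[0] for p in paths if slack(p[column]) < 0]

flagged_by_synthesis = failing(PATHS, 1)  # what synthesis chose to optimize
failing_after_layout = failing(PATHS, 2)  # what actually misses timing

wasted_effort = [n for n in flagged_by_synthesis if n not in failing_after_layout]
missed = [n for n in failing_after_layout if n not in flagged_by_synthesis]

print("optimized for nothing:", wasted_effort)
print("never optimized:", missed)
```

With these made-up delays, synthesis spends its effort on the one path that would have been fine and touches none of the four that really fail.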

Fortunately, the folks that develop design tools have recognized this problem for a while and have put together something of a solution sandwich to address it. This sandwich is the deli type – you can choose which ingredients suit your particular tastes and budget. Starting with the top bun, there is floorplanning. While the value of floorplanning is the subject of much animated debate at engineering cocktail parties (the ones at trade shows that are 95% male, and where they give you only one drink coupon), Xilinx has found good applications for the PlanAhead product they acquired from HierDesign a couple of years ago. Floorplanning tools allow you to provide some intelligent guidance to the placement of your design, designating specific chip areas for timing-critical portions, dividing up space for various engineers on the team, or (this is the most fun) setting aside protected areas for a potential partial reconfiguration.

Moving down past the mayonnaise and hybrid pre-placement capabilities, we find integrated placement and synthesis technologies such as Synplicity’s Premier. This approach essentially performs most of the process described in our earlier scenario, putting placement and logic synthesis together and doing a “global route” to get pretty darn accurate estimates of the timing associated with the final routing. This process is the most automated of the bunch, and it can give consistent improvement in results over stand-alone, separated logic synthesis and place-and-route.

On down toward the bottom bun, we have placement optimization. Products employing this approach include most of the rest of the “physical synthesis” field, including the FPGA vendors’ internal physical synthesis (like that offered by Altera in their Quartus II tool suite), Magma’s Palace/Blast FPGA, and Mentor Graphics’s Precision Physical Synthesis. These tools mainly let the normal synthesis and place-and-route processes run their respective courses and then sort out the mess at the back end using real timing information from the actual layout. These tools can apply a variety of remedies including re-placement, replication, re-synthesis, register retiming and a few other handy tricks to get the negative slack under control and improve the final performance of the design. Some of these also include manual/interactive graphical interfaces that let the designer’s intuition solve some of the more subtle issues that brute-force algorithms can’t always fix.

None of these products address the core of Professor Cong’s 70X issue, however. That solution remains a prize for academia and future economics. Today, we can improve the performance and capacity of our FPGAs every couple of years with a jump to the next process node. Those improvements are getting smaller each time, however, and the cost and complexity of following that progress are increasing almost exponentially. There should come a time when incremental investment in design tool software technology will pay larger dividends in design performance than continuing to pour millions into next-node mania. There are signs, in fact, that that day is coming sooner rather than later. Next time you talk with a Xilinx or Altera executive, ask them about the relative amounts of R&D budget that they’re applying to hardware platform engineering versus design tool engineering. The answer may surprise you.

Physical synthesis also offers a huge opportunity for the EDA companies pursuing the FPGA market. The technology provides a place where suppliers can prove significant return, justifying the investment in their tools. Developing physically-aware tools is very difficult, however. Besides requiring an entirely new set of highly-specialized algorithm experts, the design data required to drive these tools is much more detailed than what is needed for logic synthesis. The EDA company has to have very close personal knowledge of the inner architecture of the FPGA. We’re not talking about “Hi, my name is Joe and I’m an FPGA” kinda’ knowledge. We’re talking about “Hi, my name is Joe. I’ve got a Garfield tattoo on my left shoulder and I was late with my student loan payments last year. I also got two parking tickets, forgot my Mom’s last birthday, and I secretly wish I could still wear a mullet at work…” It’s hard to get an FPGA company to be forthcoming with THAT level of technical data.

There may, then, be a day when the company with the “best” FPGA offering is the company with the most powerful design tools. Being on the latest process node, having the highest LUT count, and claiming the loftiest theoretical Fmax will all lose their luster if your competitor can consistently synthesize their design into half the LUTs your tools use, or beat your real-world operating frequency by 50% or more. If the company that owns such technology is not an FPGA vendor, but instead a third-party EDA company, FPGA devices themselves could suddenly become much more commoditized, and the landscape of the industry could change significantly.

Keep pushing the button.

Expect more.