Incremental Design Moves Towards Mainstream

I have this recurring nightmare. I’m supposed to write a chapter for a book. I’ve pretty much got it done, doing some final editing on the last paragraph, and then on review realize that the first paragraph has changed mysteriously. So I fix it, but then another paragraph changes. I never seem to be able to get all the paragraphs right. And then someone else submits his chapter, and for some reason my chapter gets all screwed up. Of course, that’s about the time I also realize that I forgot that I had signed up for a college class that, of course, I never attended, and the final is tomorrow, and I wake up in a cold sweat.

While I can sigh my sigh of relief that it was all a bad dream, this can be the real-life nightmare of the FPGA designer trying to achieve timing closure on a tight design. Each tweak on one path screws up another path, all too often in a manner that doesn’t seem to converge. This whack-a-mole problem is but one of a number of challenges that can be addressed by a so-called incremental design flow. The goal of an incremental flow is to be able to lock down the things that are complete and change only the things that need to be changed. While the promise has been around for a while, it appears that much of this is now real – but of course, there are some caveats, not to mention that not all tools do all things.

Working on Autopilot

The latest buzz on the block is a hands-off “guided” capability (so called because the last fit guides the next fit), where the tools decide what is affected by a design change and then intentionally leave everything else alone. It can start in synthesis, with the synthesis tool deciding how far to propagate changes. It does this based on a global view of the timing constraints, changing only enough to implement the design change and meet the constraints. At that point the place-and-route (PAR) tools from the FPGA vendors take over and try to implement the new logic while perturbing as little of the design as possible.

Changes are detected in the logic by looking at things like timestamps, text, and logic structure. The synthesis folks have worked to remove the apparently whimsical and unrepeatable nature of naming in the netlists so that today unchanged elements should retain their names. (In fact, according to Xilinx, NeoCAD had an automatic flow back in 1992, but synthesis name changes torpedoed its effectiveness.) Even if names do change, the structural checks can confirm whether a name change means anything more than just a name change.
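To make the idea of a structural check concrete, here is a minimal sketch in Python. It is not how any vendor's tool works; the netlist representation (a dictionary mapping each net to a gate type and its input nets), the function name `structural_signature`, and the example nets are all assumptions made up for illustration. It also assumes purely combinational, loop-free logic. The point is only that two netlists with different internal names can produce the same signature, while a real logic change does not.

```python
import hashlib

def structural_signature(netlist, output):
    """Hash the logic cone feeding `output`, ignoring internal net names.

    `netlist` maps a net name to (gate_type, [input nets]). Any net that
    is not a key is treated as a primary input, whose name is kept because
    it is part of the block's interface. Assumes no combinational loops.
    """
    def sig(net):
        if net not in netlist:
            return "in:" + net
        gate_type, inputs = netlist[net]
        body = gate_type + "(" + ",".join(sig(i) for i in inputs) + ")"
        return hashlib.sha256(body.encode()).hexdigest()
    return sig(output)

# Hypothetical netlists: same logic, one internal net renamed by synthesis.
before = {"n1": ("AND", ["a", "b"]), "q": ("OR", ["n1", "c"])}
renamed = {"tmp_7": ("AND", ["a", "b"]), "q": ("OR", ["tmp_7", "c"])}
# Hypothetical netlist with a genuine logic change (AND became XOR).
changed = {"n1": ("XOR", ["a", "b"]), "q": ("OR", ["n1", "c"])}

same_after_rename = structural_signature(before, "q") == structural_signature(renamed, "q")
same_after_change = structural_signature(before, "q") == structural_signature(changed, "q")
```

Here `same_after_rename` comes out true and `same_after_change` false, which is the distinction a name-only comparison cannot make.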

So the result is a flow where you can make low-impact design changes and have a better chance of maintaining the timing of portions of the design unrelated to the changes. But there’s a catch: in fact, it’s the very ability of the tools to perform a wide range of sophisticated optimizations that gets in the way. They can propagate far and wide, changing large swaths of the design and rendering the incremental feature more or less moot. In fact, the kinds of changes that can make effective use of this flow are pretty small: minor logic changes, adding a register, perhaps state machine changes. There’s no hard line, but the reality is that this is an end-of-the-design-flow feature for engineering change orders (ECOs) and last-minute tweaks. It can also be effective for ASIC prototyping, where the FPGA is not likely to be stretched to capacity.

If challenges occur earlier in the evolution of the design, where wholesale changes occur, and the moles are already popping their heads out, the automatic flow isn’t likely to be effective. This is when the alternative is actually a more established technology: partitioning (also called block design). Of course there are a couple of catches here as well: up-front planning and quality of results (QoR).

Blocking and Tackling

Partitioning scares the bejesus out of many designers because it’s the kind of up-front work that can delay the real joy of designing, and on top of that, it can be hard to do. But the difficulty of partitioning really depends on the size of the design. If you’re trying to partition a state machine, you’re probably going to have a hard time finding the optimal partition. On the other hand, if you have a larger design, then the state machine as a whole might be one of the partitions; with large designs, such blocks become more evident. And it’s the large designs that benefit more from a partitioned flow, so oddly enough, there is no contradiction here: the larger the design, the more you want to partition, and the easier it will be. (OK, what’s wrong with this picture??) In fact, users of this flow talk about needing a week or maybe two of up-front time to plan out the partitions, which isn’t nearly as big a penalty as some might fear – and it can easily be recouped in the timing closure cycle.

This flow works differently from the automatic flow in that the designer determines where the partitions are going to be; it’s a manual flow. Certain files in the design file hierarchy will be tagged as partitions, meaning that the tagged file and anything underneath it will form a partition. When the design is changed, the tool checks to see which design files have changed. Any design file change means that the enclosing partition will be recompiled. Exactly what constitutes a design change varies by vendor. Some use a text “diff” to see if a file has changed, and some use timestamp checking. Here again it has been important for the synthesis guys to minimize the changing of internal names so that unchanged partitions continue to look unchanged.

At this point, no one seems to use the kinds of checks being used in the automatic flow to detect structural changes in the design files for a partitioned flow. This means that changing comments (or perhaps making a change, saving, and then reversing the change) may cause a recompile. But assuming you have a reasonably coherent design process, this shouldn’t be a huge issue. In addition to HDL changes, constraint and effort-level changes – any of the various settings that can impact how fitting is done – will cause recompilation in some of the tools; in other tools these can be set on a per-partition basis.
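The file-level check described above can be sketched roughly as follows. This is an illustrative Python model, not any vendor's implementation: the partition map, the file names, and the functions `snapshot` and `dirty_partitions` are hypothetical, and file contents are held in memory as strings to keep the example self-contained. It uses a content digest as a stand-in for a text "diff"; a timestamp check would be structured the same way.

```python
import hashlib

def snapshot(sources):
    """Record a content digest per design file, as of the last compile."""
    return {name: hashlib.sha256(text.encode()).hexdigest()
            for name, text in sources.items()}

def dirty_partitions(partitions, last_digests, sources):
    """Return partitions enclosing at least one file whose text changed.

    `partitions` maps a partition name to the files at and below its
    tagged point in the hierarchy (hypothetical representation).
    """
    current = snapshot(sources)
    dirty = [name for name, files in partitions.items()
             if any(current.get(f) != last_digests.get(f) for f in files)]
    return sorted(dirty)

# Hypothetical design: two partitions, three files.
partitions = {"uart": ["uart_top.v", "uart_fifo.v"], "dsp": ["fir.v"]}
sources = {
    "uart_top.v": "module uart_top(); endmodule",
    "uart_fifo.v": "module uart_fifo(); endmodule",
    "fir.v": "module fir(); endmodule",
}
last = snapshot(sources)

# A comment-only edit: the logic is untouched, but the text differs.
sources["uart_fifo.v"] = "// tidy up\nmodule uart_fifo(); endmodule"
to_recompile = dirty_partitions(partitions, last, sources)
```

In this sketch `to_recompile` contains only `uart`: the comment-only edit is enough to mark the enclosing partition, while `dsp` is left alone.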

But in the tradition of banning free lunches, one beneficial fitter feature can get in the way: cross-boundary optimization. This allows the tool to move portions of a critical path between partitions in order to better balance the path and achieve timing. But if you want to lock a partition down, you can’t have the tool sneaking in and making changes. The standard answer is to register the signals at the partition boundaries to ward off cross-border incursions during PAR (kind of like making it harder to get through immigration at the airport, only effective). The downside of doing this, of course, is that you limit the kinds of optimizations that can be done, so you may be sacrificing QoR. The unfortunate fact is, the harder you’re trying to push the silicon, the more you need the QoR – and yet, the more likely you are to want to lock things down to achieve closure. There’s no solution to that one in the offing… But the good news is that as designs get larger, with individual blocks being more autonomous, cross-boundary optimization will become less important. Estimates on where partitioning becomes important range from 50K to over 100K logic cells.

Dividing up Turf

Partitioning can be taken one step further through floorplanning. Without floorplanning, different tools also approach silicon allocation differently. Some try to cluster logic. Some allow good relative placement so you can move a block around as a unit on the silicon without losing fit – if the logic array is uniform. But some try to spread stuff out to reduce congestion. Especially in the latter case, floorplanning allows better control of layout and reduces the scope of placement changes. It’s pretty much required for team design if timing is going to be tight.

Effective floorplanning does take more planning, both in estimating the amount of resources required for each block and in arranging the blocks. If data is flowing from the left of a chip to the right of a chip, then placing the first block on the right is a great way to annoy the fitter. Sizing and placing partitions for minimal block interconnect and close proximity can make life a lot easier for you and your tools. Many other tricks and tips can improve results, but they are detailed and specific to each chip and technology. In fact, one of the things that could make the designer’s life easier would be more information more quickly as new chips are released, so that early adopters don’t have to use trial and error to help the vendors figure out the best partitioning and placing strategies.
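A back-of-the-envelope way to see why block arrangement matters is to estimate interconnect length for a candidate floorplan. The Python sketch below is only an illustration: the block names, coordinates, bus widths, and the function `interconnect_cost` are invented, and a weighted Manhattan distance between block centers is a crude stand-in for what a real fitter has to deal with.

```python
def interconnect_cost(placement, connections):
    """Sum of Manhattan distances between connected block centers,
    weighted by the number of signals on each connection."""
    total = 0
    for src, dst, width in connections:
        (x1, y1), (x2, y2) = placement[src], placement[dst]
        total += width * (abs(x1 - x2) + abs(y1 - y2))
    return total

# Hypothetical pipeline: data enters on the left and leaves on the right.
connections = [
    ("pins_in", "stage1", 32),
    ("stage1", "stage2", 32),
    ("stage2", "pins_out", 32),
]

# Blocks arranged in the direction of data flow.
with_flow = {"pins_in": (0, 5), "stage1": (3, 5),
             "stage2": (6, 5), "pins_out": (9, 5)}
# Same blocks with the two stages swapped, so data doubles back.
against_flow = {"pins_in": (0, 5), "stage1": (6, 5),
                "stage2": (3, 5), "pins_out": (9, 5)}

cost_with = interconnect_cost(with_flow, connections)        # 32 * (3 + 3 + 3)
cost_against = interconnect_cost(against_flow, connections)  # 32 * (6 + 3 + 6)
```

With these made-up numbers the arrangement that follows the data flow costs 288 units against 480 for the swapped one, which is the kind of gap that shows up later as congestion and missed timing.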

There are also degrees of lockdown. You can lock placement and allow routing to be flexible, or you can lock placement and routing. None of this is as black and white as it might sound; there are effort levels that can be set, you have the interplay between the synthesis engine and the fitter, and each tool will have its own knobs to turn. Note that you generally have to specify partitions separately in the synthesis tool and the fitter – you could probably have some real fun by using one partition for synthesis and another for the fitter and watching the tools squabble as they work at cross-purposes, but assuming you’re being paid to be productive, you have to make sure you set them to be the same.

Team Play

The focus so far has been on achieving timing closure, but in fact there is another obvious key benefit to both the automatic and partitioned flows: reduced compile time, by numbers like 75% (depending on the size of the change, of course). And for the partitioned flow, there are two additional benefits: incremental bottom-up design and team design. These two are pretty much the same except that in the first case one person is doing everything and in the other the effort is shared.

Team design has really been the primary motivator of the partitioned flow for obvious reasons. You can avoid planning if you’re doing all the work yourself, but you can’t if you’ve got to parcel out the work. There are different theories on the best way to implement team design. One is that the team members all work at the HDL synthesis and functional verification level, and then hand things over to an “integrator” who does the place and route. This will be tough, however, if one or more blocks need a lot of physical hand-tuning: the integrator might not be the best person to do that since he or she didn’t design the block. Some tools allow greater independence among team members, like ignoring changes on your teammates’ blocks so you can get a faster compilation of your block without being impacted by changes they may have recently checked in. And you can even have each block independently compiled on a different computer, all in parallel, with a final compilation that fits only the interconnect between the blocks, dramatically cutting overall compile time.

There’s one area where designers of even modestly-sized designs can benefit from incremental design – or they may even be using it already without realizing it: virtual logic analyzers. Altera calls theirs SignalTap; Xilinx calls theirs ChipScope. Both involve the addition of logic and memory to probe and record what’s going on inside a problematic design. They obviously take more resources from the FPGA and need to be compiled in. But if that compilation process changes the main design, you might lose subtle behaviors that happen without the analyzer. So it’s important to lock down the original design to preserve it, trying to fit the analyzer around it. If the device is really full, that might be a problem, but if you don’t do it, then you may end up with phantom problems that disappear when you try to spy on them. Apparently Altera does this automatically, creating a separate partition for the analyzer and a new top-level file, locking the original top-level file to preserve the implementation.

Comparing specific tools mostly involves getting completely lost in minutiae; not every tool does everything described here. But at a high level, Xilinx, Lattice, and Actel have automatic flows. Altera claims that they used to have an automatic capability, but optimizations ballooned anything but trivial changes, so they no longer support it. All support the partitioned flow. Synplicity and Mentor support incremental PAR by controlling namespace changes; Mentor has announced automatic incremental synthesis capability.

As to the future, a number of things are likely to evolve over the next year or two:

The automatic flows will be tuned as they try to balance change propagation against the need to meet timing constraints. Hopefully this results in larger changes being supported in the automatic flow.
The automatic and partitioned flows should start to merge, with the intelligence developed for the automatic flow being used to do a better job of figuring out which partitions have changed. So far these flows have been developed in isolation from each other, and effort will be made to bring them closer together.
In some tools, changing constraints can invalidate all partitions, causing a complete recompile even if the constraint affects only one block. Corralling this will be the focus of some work and should appear in future releases.
Some tools will work towards better decoupling of partitioning from floorplanning, allowing partitions to “float” over the silicon more effectively.
Ease of use will be addressed both for successful implementation of partitions and floorplanning and for reporting and help if things aren’t working.

It does appear that over the long haul, even with improvements to guided flows, partitioning will gain more traction for effective implementation of large designs. Of course, what everyone wants is an effective automatic intelligent partitioning algorithm. But no one is even pre-announcing a plan to start thinking about working on a possible future solution to that, so it’s not something to look for any time soon.
