No one wants to sit around waiting for software to run. Whether it’s simulation, FPGA fitting, timing analysis, or place-and-route (PnR), run time is the enemy. The worse it is, the fewer design turns can be made and the fewer “what-if” scenarios can be played. And, of course, we all care about the bottom line: the longer it takes the software to run, the longer it takes to get a product out the door.
So everyone works hard to speed up their software. It’s no accident that performance improvements feature prominently on headlines announcing a new release. And it’s kind of a losing game too, because at the same time these guys are speeding up their software, users are applying them to ever larger design problems, meaning the speed-ups might not actually speed things up – they just might make them slow down less with the larger design. It’s a race to see if the software makers can get ahead of the design density curve.
When speeding up a tool, there are a number of areas that can be exploited for better performance. First is the simple fact that when you are releasing a new software capability for the first time, the main job is just to get it working. Speed is good, but fast software that doesn’t work doesn’t pay many bills. So realistically, once software has been out there for a release or two, there are probably many opportunities to make things more efficient without the pressure of “just get it working!” Speed-ups in computers have historically also helped: the tools could ride – somewhat – on the coattails of improving processor speeds. But that’s stopped because most of the new performance gains are coming from more cores, not faster processors.
And that’s where a key opportunity for performance comes in: multi-processing. This can be done at a local level, spawning threads that exploit multiple processing cores, or it can be distributed more widely, spawning processes on multiple computers. Practical issues like network access and data marshalling may make some of the biggest gains available within a single computer, with one shared memory for a main database. And, in fact, exactly how a piece of software can best be sped up by parallelism depends a lot on the nature of the algorithms being accelerated.
But as we’ve all seen, more cores is no magic bullet. You’ve got to parallelize the algorithm and make use of the cores, and Amdahl’s law says that you will always be limited by the parts of the code that can’t be parallelized. It would be nice to be able to split off huge chunks of the program (a “course-grained” partition) and send them off to different cores to run, but life isn’t always so simple. Mentor’s Sudhakar Jilla points out that while some aspects of PnR can be parallelized, it hasn’t been attempted yet with the core timing and optimization engines, due to their complexity and dependencies. Those engines could account for 25-30% of the run time for 45-nm layouts, and that’s growing with the number of modes and scenarios that have to be optimized, not to mention with the growing size of designs. Altera’s Phil Simpson describes Altera’s efforts to parallelize their tools as finding points in the code where they could split things out – and those might be localized in the code – after which they come back together again until the next time they can split out. This is kind of a “bursty parallelism” done on a finer-grained scale rather than a coarse-grained dispatching of large threads.
Which right answer do you want?
Given the ability to parallelize a program, there’s another potential gotcha: if the nature of the parallelization depends on how things are split up and ordered on a given run, the results could be different depending on the number of cores being used in a multicore computer. This brings us to a key characteristic of some kinds of algorithms. Most programs give one right answer because there is only one right answer. If you are running a simulator and an event occurs at a point in time, there is only one right answer to what the circuit response is (assuming deterministic stimuli). If you are running timing analysis, then there’s only one right delay for a given net (for a given level of precision). However you handle the parallelism with these kinds of tools, if they don’t give the right answer, then they’re wrong, plain and simple.
Then there are tasks like PnR, FPGA fitting, and automatic test pattern generation (ATPG), to name a few. With these there may be many, many correct answers. There could be tons of ways to route a given circuit or fit a given FPGA design. There may be many different ways to create a test pattern that covers all the intended faults. Some may be better or worse, but at the end of the process, if you have a result that meets the intended goals – no design rule check (DRC) violations for PnR, performance met for FPGA fitting, adequate coverage achieved for ATPG – then you have a solution that’s “good enough.” Problem is, because there are so many good-enough solutions, you have to be sure you get the same good-enough solution each time you run the tool, made harder if the algorithm uses any randomness, such as seeding a simulated annealing process.
So repeatability is one problem, and fixing seeds and such can help solve that. But what if parallelizing itself creates some inconsistencies? Here’s a simple example: let’s say you’re running PnR, and, on a single core, route A gets routed first and uses some net X; afterwards, route B gets routed and picks some other net Y because net X is already used. Now, with two cores, let’s say that routes A and B get sent to separate threads on separate cores, and route B gets net X before route A can get to it. Route A might be perfectly happy to use net Y since net X is no longer available, and a perfectly viable solution is found. Problem is, the valid result found when running the tool on a single-core computer is different from the valid result found when running on a dual-core machine.
Among the different tools that might have multiple right answers, PnR appears to be the area of biggest concern here; all conversations about this topic in general lead to a discussion of parallelization of PnR. Here “good enough” is actually “as good as you can,” since for really difficult designs you may end up with DRC violations that have to be addressed manually or with some other tool afterwards. And one simple indicator of whether you’re getting the same result on each computer is the number of DRCs found on each run. And if you’re trying to fix DRC problems and the DRCs are different depending on which machine you’re running on, you’re gonna tear your hair out.
Large PnR runs can take hours or even days to run, so there’s lots of incentive to make it faster. Potential inconsistencies are worst when congestion is high, above 80% or so, according to Pyxis’s Phil Bishop. But while there appears to be some anecdotal evidence of such core-count-dependent results, everyone that was part of this conversation claimed that either they gave consistent results independent of core count or had a mode where they could be made consistent.
How to split up the pie
The key questions are how the job is partitioned up for parallelization (you might hear this called “decomposition”) and how that might be affected by the way the partitions are dispatched. One way that was mentioned during a casual conversation was to use the floorplanning partition done by a designer. As it turns out, this is not done by any of the companies I talked to, so it’s a misconception. While several companies do use area-based partitions (also referred to as geometric partitions), they create the partitions automatically themselves, and never use the designer’s floorplanning partition. So let’s look at how various companies said they approach this problem.
Magma’s Talus tool suite uses an internally-created geometric partition. Magma’s Patrick Groeneveld says that the results of PnR depend on the order in which the individual partitions are run. And, in fact, simply running on a different computer could cause things to run in a different order, which might in some cases affect the result (and the number of these cases is diminishing with further R&D). But they have provided a “repeatability” mode that forces the order of the partitions so that the same result can be guaranteed each time.
Mentor’s Olympus-SoC tool partitions by area, generating that partition internally, and using a variety of synchronization strategies to provide deterministic results. Mentor’s PV Srinavas emphasized that having a deterministic approach is critical to parallelizing any PnR engine.
Pyxis is a newer kid on the block, with a specific focus on improving the routing in PnR. As described by Mr. Bishop, they also use a geometric partition that can be created automatically by the tool or by the user. A challenge of area-based partitioning is the boundary where partitions meet: you have to make sure that, at the very least, routes crossing the boundary join up properly. Pyxis tries to partition intelligently to minimize boundary crossings, and they run all partitions concurrently. Their approach to solving the challenges of boundary crossings involves a patented technique for partition overlap so that the regions where the partitions meet can more effectively be routed in a repeatable manner. They tout their invariability regardless of core count as a key differentiator.
Synopsys actually partitions by individual nets, using shared memory for communication between threads. According to Synopsys’ Steve Smith, they apply their own proprietary techniques for resource allocation to avoid conflicts when multiple threads try to snag the same routing resource. However, the question of ordering remains. An attempt to answer why ordering isn’t a problem required digging a bit deeper into their technology than he was comfortable doing publicly, so he couldn’t provide an explanation of how that problem is solved. He does aver, however, that the results of PnR for the Zroute tool do not vary with the number of computing cores.
So, having seen all of this, there still appears to be something of a disconnect between the anecdotal experiences of some users as reported by some providers with respect to their competitors’ tools. Is this a case of a false issue being raised? Of false assurances being made about tools? Perhaps a legacy of earlier less-repeatable versions? User error? Hard to tell; I’ll assume that no one is out-and-out lying; I’m optimistic like that. What remains is the fact that variability of results when changing the number of cores is obviously undesirable and has risen onto the radar of both vendors and some users. If your computing environment is such that you might run afoul of any such variations, you should talk hard with your prospective vendor to make sure you are comfortable that their tools will give you the stable results you need.