feature article

## Reservations Required

Mentor Parallelizes Analysis and Optimization

Man at the head of the line: Table for three please?

Maitre D’: Bien sur. Ees your party all ‘ere?

Man: Yup.

Maitre D’: Alors, we are full now, so I put your name on ze leest. I cannot say whahn your name weell be call; it dehpond on the size of your partee, available tehbell, ‘ow many tehbell are full at each stehshahn, etceterah. I call your name whahn ready; I guarontee you get in tonight. [patronizing smile]

[Party goes off to the side to wait.]

Woman now at the head of the line: Three for dinner please.

Maitre D’: Tres bien. Ees your party all ‘ere?

Woman: Almost; we’re only missing one.

Maitre D’: Bien, alors, I cannot put you on ze leest onteel your partee ees all ‘ere. I don’t want to call your name and find your partee ees not yet complete. So, as soon as ze last person est arrivé, you tell me and I put you on ze leest.

Woman: Hmmm… well, it’s not that easy. You see, that party of three you just put on the list? Well, the odd person out is the wife; her husband is at home now with their child, and he’s going to eat with us. He can’t come until she’s done and gets back home to take over babysitting. We figured that since she’s ahead of us, that this should all work out.

Maitre D’: Oh la la… Ees not so easy. I cannot say whahn zeir name will come up, and eef you go on ze leest, your name may come up before zeirs. But you dehpond on zem to feeneesh ze meal first, and I cannot ‘old your tehbell emptee eef I call you before zey’re done. And zey might be ze last ones I call tonight, which mean you have a big problème.

So… I tell you what I do. I start a leest for tomorrow and put you on eet. I don’t know whahn ze ozers feeneesh today, but I know zey feeneesh today, so ees safe to put you on ze leest for tomorrow. Go ‘ome today, and I guarontee you a tehbell tomorrow.

Bien, à demain…

[Party moves out dejectedly]

Next person in line: Do you have a table for three or four?

Maitre D’: Mais oui. Ah, but wait… sree or four?

Person: Well, there’s one person that may or may not show.

Maitre D’: [rolls eyes] I cannot put you on ze leest onteel everyone ees ‘ere. Eef ze person weell come, zen you must wait. Eef ze person weell not come, zen you can go on ze leest.

Person: Oh man, OK, well, that may take a while just to get on the list then. You know that group right before us? Well one of the guys in that group knows the woman who might join us. She’s really really picky about where she eats, and when she heard he was going, she wanted to hear back from him whether it was worth coming or not. So we won’t know if she’s coming until he’s finished and reports back.

Maitre D’: Sacre bleu, zat beecom even more complicate. You see, zey are on ze leest for tomorrow because zey cannot eat onteel someone ‘ere tonight eat. And you cannot eat onteel zey eat. So… I start a leest for ze day after tomorrow. You weell be on zat leest, and weel be guarontee a tehbell because zat group ees guarontee to eat tomorrow.

Making EDA tools faster is always a good thing, and a popular trend right now is using parallelization to take advantage of multicore computers. We’ve talked about it before and we’ll talk about it in the future. Lots going on here.

Some things are easier to parallelize than others; place and route has received a lot of attention, but only at the gross level. The fine detailed stuff and optimizations have remained elusive because more of the old code has to be ripped up and replaced with new algorithms.

The efficiency of parallel code depends heavily on how effectively the processing and/or the data are broken up. Two key artifacts of parallel programming play a large role in determining how well parallel code runs: blocking and locking. (Kind of like blocking and tackling in the multicore world.)

Two different processes may run in different threads on different cores, but if one depends on the results of the other, then it may have to wait – it’s blocked – until the first one finishes, or at least reaches a point where the necessary result is available. When you have lots of dependencies, someone’s got to manage this and make sure that a process is blocked until all necessary dependencies are met.

This may be provided by the programming language. In “fork/join” constructs, several threads may be created by the “fork” portion of the construct, but the “join” is where blocking is specified: a “join any” will proceed when any of the threads completes; a “join all” will block until all forked threads complete. If you’re using a language that doesn’t have such management built in, then you have to create it yourself.

Obviously, sitting around waiting for threads to complete isn’t a good use of time. An operating system may be able to swap out a thread that’s blocked, but swapping threads itself takes time, and there’s no deterministic way to guarantee that any thread will be put back into play as soon as its dependencies are resolved; you may end up thrashing threads while looking for a thread that can actually do something.

The other concern is locking, and this may refer to shared data or critical sections of code that access shared data. Reading of shared data typically isn’t an issue (unless there’s a multi-step transaction with nonsensical values in the middle of the transaction that shouldn’t be read), but writing can be a big problem if multiple writers are sparring.

The typical approach here is to take out locks of one sort or another; if one thread locks a resource or a critical section of code, then no other thread can use it until the lock is released. This can get complicated if there are numerous locks: you can end up in deadlock situations where no one can proceed; at the very least, someone ends up sitting around waiting for access to the data. Again, not a good use of time.

Eschewing blocking and locking

So it would be really nice to be able to create an algorithm that’s what I’ll call “blockless and lockless”, meaning that, by the very nature of how the algorithm is built, no time is wasted in a blocked state or waiting for a lock, and no time is wasted managing blocked threads and locks. Mentor proposes to have created just such an algorithm in what appears to be the first full parallelization of static timing analysis (STA), power analysis (PA), signal integrity (SI), RC extraction (RCE), and optimization (O). That is to say, not only do these all run in parallel, but each of these has itself been parallelized.

Most of the interesting nature of this relates to the blockless characteristic of the algorithm. But to understand it, it helps to know the basis of thread creation. There are two levels of parallelization: what they call fine-grained and ultra-fine-grained. If you consider a mesh of nodes that constitute a circuit, each node of which is a point that must be calculated, then this provides the first level of parallelization. Each node can become a thread, and for each node the five operations (STA, PA, SI, RCE, and O) can each become an individual thread.

Then for each of these, multi-corner multi-mode (MCMM) rears its ugly head. This has been the cause of much of the increased processing time in recent years. Large SoCs have various blocks that may or may not be exercised at any given time; each different configuration (sleeping, GSM mode, CDMA mode, etc.) is a mode that must be run through a complete validation. The corners have multiplied as different blocks operate at different voltages, and as those voltages change independently of each other, such that all combinations must be considered.

Throw in things like Dynamic Voltage/Frequency Scaling (DVFS) or Adaptive Voltage Scaling (AVS) and you have blocks of circuitry with a supply voltage whose value may be along a near continuum depending on operating conditions, all of which must be considered in the analysis, and you’ve got lots of computing to do. You start multiplying it out and it isn’t hard to get over 40 different analyses required for each operation at each node on a not-particularly-exotic chip. And that number will be going up. The ultra-fine-grain parallelization assigns each of those corners its own thread so that they can run in parallel.

So clearly this algorithm can spawn lots of threads, all of which can operate independently of the others. Almost. The key thing to remember is that simply having more threads doesn’t help anything by itself. Some threads will depend on other threads. If the threads are just thrown willy-nilly at cores, then inevitably some thread will end up waiting for some other thread to finish. You’ll see the inefficiency by looking at core utilization: if all the cores aren’t maxed out while the algorithm is running, then you’re wasting something.

A chunky approach

Mentor put together an algorithm that ensures that no thread will be submitted for execution until all dependencies are resolved. They do this by assigning threads to what they call “chunks” – essentially, the lists being made by the maitre d’. This is done with knowledge of the underlying work being done – it’s not a generic algorithm that can be simply used anywhere independent of the computing function, although presumably the concept can be extended to other algorithms. (In full mindfulness, of course, of that patent thingy that sometimes interferes…)

In the context of a path with numerous nets along it, each of which needs to be calculated, you start at one end and move along the path. The first net doesn’t depend on anything because it’s first; the second net depends on the first for some things (like time). So when the threads are spawned, the thread for the first net – and any other net that depends on nothing – is put in the first “chunk.” The thread for the second net – and any other thread that depends on a thread in the first chunk – is placed in the second chunk. Any thread that depends on a thread in the second chunk is put in a third chunk. And so forth.

These chunks are then submitted for execution in order. In this manner, even though the tasks in the first chunk may execute in an unpredictable order, they have no mutual dependencies, so that’s not an issue. None of the tasks in the second chunk will be submitted for execution until all tasks in the first chunk complete. By definition, this guarantees that no thread will be hanging around waiting for something else to finish. All cores should be operating full out. Operation is blockless.

This doesn’t do anything for locking, of course. But with knowledge of how shared data is accessed, they’ve resolved this simply by ensuring that, while threads are executing in parallel, shared data is only read. Writing happens only at known synchronization points. Operation is lockless.

The results they are reporting from this are such that running on eight cores provides up to a seven-fold reduction in timing analysis run time and up to an overall four-fold decrease in design closure time. The difference between those gains, of course, is due to the fact that not everything in the closure flow is parallelized (especially the engineer’s decisions on how to tweak things), and Amdahl’s Law says that the overall gain will be limited by any remaining serial code (and, presumably, serial engineers). Getting a 7X improvement with 8 cores is actually pretty good, suggesting that not only are the cores being highly utilized, but also that the overhead for the chunking of threads is nominal.

So while some of our patrons may not be happy having to wait until another day to dine, at least they aren’t simply waiting around for two or three days. They can use much of that time to go get something else done. Like, maybe, finding a new babysitter, or perhaps finding a less picky friend.

featured blogs
Aug 2, 2024
In which an innocent quest in search of an electronic video leads to an unexpected plunge into a musical rabbit hole....