
Reservations Required

Mentor Parallelizes Analysis and Optimization

Man at the head of the line: Table for three please?

Maitre D’: Bien sûr. Ees your party all ‘ere?

Man: Yup.

Maitre D’: Alors, we are full now, so I put your name on ze leest. I cannot say whahn your name weell be call; it dehpond on the size of your partee, available tehbell, ‘ow many tehbell are full at each stehshahn, etceterah. I call your name whahn ready; I guarontee you get in tonight. [patronizing smile]

 [Party goes off to the side to wait.]

Woman now at the head of the line: Three for dinner please.

Maitre D’: Très bien. Ees your party all ‘ere?

Woman: Almost; we’re only missing one.

Maitre D’: Bien, alors, I cannot put you on ze leest onteel your partee ees all ‘ere. I don’t want to call your name and find your partee ees not yet complete. So, as soon as ze last person est arrivé, you tell me and I put you on ze leest.

Woman: Hmmm… well, it’s not that easy. You see, that party of three you just put on the list? Well, the odd person out is the wife; her husband is at home now with their child, and he’s going to eat with us. He can’t come until she’s done and gets back home to take over babysitting. We figured that since she’s ahead of us, that this should all work out.

Maitre D’: Oh la la… Ees not so easy. I cannot say whahn zeir name will come up, and eef you go on ze leest, your name may come up before zeirs. But you dehpond on zem to feeneesh ze meal first, and I cannot ‘old your tehbell emptee eef I call you before zey’re done. And zey might be ze last ones I call tonight, which mean you have a big problème.

So… I tell you what I do. I start a leest for tomorrow and put you on eet. I don’t know whahn ze ozers feeneesh today, but I know zey feeneesh today, so ees safe to put you on ze leest for tomorrow. Go ‘ome today, and I guarontee you a tehbell tomorrow.

Bien, à demain…

[Party moves out dejectedly]

Next person in line: Do you have a table for three or four?

Maitre D’: Mais oui. Ah, but wait… sree or four?

Person: Well, there’s one person that may or may not show.

Maitre D’: [rolls eyes] I cannot put you on ze leest onteel everyone ees ‘ere. Eef ze person weell come, zen you must wait. Eef ze person weell not come, zen you can go on ze leest.

Person: Oh man, OK, well, that may take a while just to get on the list then. You know that group right before us? Well one of the guys in that group knows the woman who might join us. She’s really really picky about where she eats, and when she heard he was going, she wanted to hear back from him whether it was worth coming or not. So we won’t know if she’s coming until he’s finished and reports back.

Maitre D’: Sacré bleu, zat beecom even more complicate. You see, zey are on ze leest for tomorrow because zey cannot eat onteel someone ‘ere tonight eat. And you cannot eat onteel zey eat. So… I start a leest for ze day after tomorrow. You weell be on zat leest, and weel be guarontee a tehbell because zat group ees guarontee to eat tomorrow.

Making EDA tools faster is always a good thing, and a popular trend right now is using parallelization to take advantage of multicore computers. We’ve talked about it before and we’ll talk about it in the future. Lots going on here.

Some things are easier to parallelize than others; place and route has received a lot of attention, but only at the gross level. The fine-grained detailed work and the optimization steps have remained elusive because more of the old code has to be ripped up and replaced with new algorithms.

The efficiency of parallel code depends heavily on how effectively the processing and/or the data are broken up. Two key artifacts of parallel programming play a large role in determining how well parallel code runs: blocking and locking. (Kind of like blocking and tackling in the multicore world.)

Two different processes may run in different threads on different cores, but if one depends on the results of the other, then it may have to wait – it’s blocked – until the first one finishes, or at least reaches a point where the necessary result is available. When you have lots of dependencies, someone’s got to manage this and make sure that a process is blocked until all necessary dependencies are met.

This may be provided by the programming language. In “fork/join” constructs, several threads may be created by the “fork” portion of the construct, but the “join” is where blocking is specified: a “join any” will proceed when any of the threads completes; a “join all” will block until all forked threads complete. If you’re using a language that doesn’t have such management built in, then you have to create it yourself.
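The canonical EDA example of such built-in management is SystemVerilog's fork…join construct, with its join_any and join_all variants. In a general-purpose language you get the same semantics from a thread-pool library; here is a minimal sketch using Python's standard concurrent.futures module (the task names and delays are made up for illustration):

```python
from concurrent.futures import (ThreadPoolExecutor, wait,
                                FIRST_COMPLETED, ALL_COMPLETED)
import time

def task(name, delay):
    time.sleep(delay)
    return name

with ThreadPoolExecutor(max_workers=3) as pool:
    # "fork": spawn one thread per task
    futures = [pool.submit(task, n, d)
               for n, d in [("a", 0.03), ("b", 0.01), ("c", 0.02)]]

    # "join any": unblocks as soon as at least one thread finishes
    done_any, not_done = wait(futures, return_when=FIRST_COMPLETED)

    # "join all": blocks until every forked thread completes
    done_all, _ = wait(futures, return_when=ALL_COMPLETED)
    results = sorted(f.result() for f in done_all)
```

The `return_when` argument is exactly the join-any/join-all distinction: the first `wait` may return while work is still pending; the second guarantees every dependency has completed.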

Obviously, sitting around waiting for threads to complete isn’t a good use of time. An operating system may be able to swap out a thread that’s blocked, but swapping threads itself takes time, and there’s no deterministic way to guarantee that any thread will be put back into play as soon as its dependencies are resolved; you may end up thrashing threads while looking for a thread that can actually do something.

The other concern is locking, which may apply to shared data or to critical sections of code that access shared data. Reading shared data typically isn’t an issue (unless a multi-step update leaves nonsensical intermediate values that shouldn’t be read mid-transaction), but writing can be a big problem if multiple writers are sparring.

The typical approach here is to take out locks of one sort or another; if one thread locks a resource or a critical section of code, then no other thread can use it until the lock is released. This can get complicated if there are numerous locks: you can end up in deadlock situations where no one can proceed; at the very least, someone ends up sitting around waiting for access to the data. Again, not a good use of time.
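The classic deadlock arises when two threads acquire the same pair of locks in opposite orders; each grabs its first lock and then waits forever for the other's. One standard remedy is to impose a global lock ordering, as in this sketch (the account names and `transfer` function are hypothetical, purely for illustration):

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
balances = {"a": 100, "b": 0}

def transfer(amount):
    # Every thread acquires the locks in the same order (A, then B),
    # so a circular wait -- and hence deadlock -- cannot occur.
    with lock_a:
        with lock_b:
            balances["a"] -= amount
            balances["b"] += amount

threads = [threading.Thread(target=transfer, args=(10,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# After five transfers of 10: balances == {"a": 50, "b": 50}
```

Ordering prevents deadlock, but note it doesn't prevent waiting: threads still queue up behind the locks, which is exactly the wasted time a lockless design tries to eliminate.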

Eschewing blocking and locking

So it would be really nice to be able to create an algorithm that’s what I’ll call “blockless and lockless”, meaning that, by the very nature of how the algorithm is built, no time is wasted in a blocked state or waiting for a lock, and no time is wasted managing blocked threads and locks. Mentor claims to have created just such an algorithm in what appears to be the first full parallelization of static timing analysis (STA), power analysis (PA), signal integrity (SI), RC extraction (RCE), and optimization (O). That is to say, not only do these all run in parallel, but each of these has itself been parallelized.

Most of what’s interesting here relates to the blockless characteristic of the algorithm. But to understand it, it helps to know how threads are created. There are two levels of parallelization: what they call fine-grained and ultra-fine-grained. If you consider a mesh of nodes that constitute a circuit, each node of which is a point that must be calculated, then this provides the first level of parallelization. Each node can become a thread, and for each node the five operations (STA, PA, SI, RCE, and O) can each become an individual thread.

Then for each of these, multi-corner multi-mode (MCMM) rears its ugly head. This has been the cause of much of the increased processing time in recent years. Large SoCs have various blocks that may or may not be exercised at any given time; each different configuration (sleeping, GSM mode, CDMA mode, etc.) is a mode that must be run through a complete validation. The corners have multiplied as different blocks operate at different voltages, and as those voltages change independently of each other, such that all combinations must be considered.

Throw in things like Dynamic Voltage/Frequency Scaling (DVFS) or Adaptive Voltage Scaling (AVS) and you have blocks of circuitry with a supply voltage whose value may be along a near continuum depending on operating conditions, all of which must be considered in the analysis, and you’ve got lots of computing to do. You start multiplying it out and it isn’t hard to get over 40 different analyses required for each operation at each node on a not-particularly-exotic chip. And that number will be going up. The ultra-fine-grain parallelization assigns each of those corners its own thread so that they can run in parallel.
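To see how the count climbs past 40, here's the multiplication with some hypothetical but representative numbers (the specific counts are my assumption, not Mentor's):

```python
# Hypothetical, but representative, counts for a mid-range SoC:
modes = 5            # e.g. sleep, GSM, CDMA, Wi-Fi, full-on
process_corners = 3  # slow / typical / fast silicon
voltage_points = 3   # DVFS/AVS supply levels sampled per block

# Every combination must be analyzed, per operation, per node:
analyses = modes * process_corners * voltage_points  # 45
```

Add more radios, more voltage domains, or finer DVFS sampling and the product grows multiplicatively, not additively.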

So clearly this algorithm can spawn lots of threads, all of which can operate independently of the others. Almost. The key thing to remember is that simply having more threads doesn’t help anything by itself. Some threads will depend on other threads. If the threads are just thrown willy-nilly at cores, then inevitably some thread will end up waiting for some other thread to finish. You’ll see the inefficiency by looking at core utilization: if all the cores aren’t maxed out while the algorithm is running, then you’re wasting something.

A chunky approach

Mentor put together an algorithm that ensures that no thread will be submitted for execution until all dependencies are resolved. They do this by assigning threads to what they call “chunks” – essentially, the lists being made by the maitre d’. This is done with knowledge of the underlying work being done – it’s not a generic algorithm that can be simply used anywhere independent of the computing function, although presumably the concept can be extended to other algorithms. (In full mindfulness, of course, of that patent thingy that sometimes interferes…)

In the context of a path with numerous nets along it, each of which needs to be calculated, you start at one end and move along the path. The first net doesn’t depend on anything because it’s first; the second net depends on the first for some things (like time). So when the threads are spawned, the thread for the first net – and any other net that depends on nothing – is put in the first “chunk.” The thread for the second net – and any other thread that depends on a thread in the first chunk – is placed in the second chunk. Any thread that depends on a thread in the second chunk is put in a third chunk. And so forth.

These chunks are then submitted for execution in order. In this manner, even though the tasks in the first chunk may execute in an unpredictable order, they have no mutual dependencies, so that’s not an issue. None of the tasks in the second chunk will be submitted for execution until all tasks in the first chunk complete. By definition, this guarantees that no thread will be hanging around waiting for something else to finish. All cores should be operating full out. Operation is blockless.

This doesn’t do anything for locking, of course. But with knowledge of how shared data is accessed, they’ve resolved this simply by ensuring that, while threads are executing in parallel, shared data is only read. Writing happens only at known synchronization points. Operation is lockless.
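Mentor hasn't published the algorithm itself, but the chunking scheme described above amounts to a topological leveling of the dependency graph. Here's a minimal sketch under that assumption; the net names, the dependency map, and the `analyze` function are all made up for illustration:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def build_chunks(deps):
    """Group tasks into "chunks" by dependency depth: chunk 0 holds tasks
    with no prerequisites, and each later chunk holds tasks whose deepest
    prerequisite sits in the chunk before it (a topological leveling)."""
    level = {}
    def depth(task):
        if task not in level:
            level[task] = 1 + max((depth(d) for d in deps[task]), default=-1)
        return level[task]
    chunks = defaultdict(list)
    for t in deps:
        chunks[depth(t)].append(t)
    return [chunks[i] for i in sorted(chunks)]

# Hypothetical net-level dependencies: net4's timing needs net2 and net3,
# which both need net1 (the first net on the path).
deps = {"net1": [], "net2": ["net1"], "net3": ["net1"], "net4": ["net2", "net3"]}
results = {}

def analyze(net):
    # During the parallel phase, earlier results are only *read*,
    # so no locks are needed while a chunk is in flight.
    return 1 + max((results[d] for d in deps[net]), default=0)

with ThreadPoolExecutor() as pool:
    for chunk in build_chunks(deps):
        # map() returns only when every task in the chunk is done, so the
        # next chunk is never submitted with unresolved dependencies.
        chunk_results = list(pool.map(analyze, chunk))
        # Writes happen here, at a known synchronization point between chunks.
        results.update(zip(chunk, chunk_results))
```

Within a chunk the tasks may finish in any order, and it doesn't matter: nothing in a chunk depends on anything else in that chunk, and the barrier between chunks is the only synchronization required.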

The results they are reporting from this are such that running on eight cores provides up to a seven-fold reduction in timing analysis run time and up to an overall four-fold decrease in design closure time. The difference between those gains, of course, is due to the fact that not everything in the closure flow is parallelized (especially the engineer’s decisions on how to tweak things), and Amdahl’s Law says that the overall gain will be limited by any remaining serial code (and, presumably, serial engineers). Getting a 7X improvement with 8 cores is actually pretty good, suggesting that not only are the cores being highly utilized, but also that the overhead for the chunking of threads is nominal.
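Running the Amdahl's Law arithmetic backwards shows just how good 7x on 8 cores is — it implies roughly 98% of the timing-analysis work is parallelized (the solved-for fraction is my back-of-envelope calculation, not a Mentor figure):

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's Law: overall speedup given the fraction of work
    that parallelizes and the number of cores available."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# Solving 1 / ((1 - p) + p/8) = 7 for p gives p = 48/49, about 98%.
p = 48 / 49
speedup_on_8 = amdahl_speedup(p, 8)        # exactly 7x

# Even with unlimited cores, the 2% serial remainder caps the
# speedup near 1 / (1 - p) = 49x:
ceiling = amdahl_speedup(p, 10**9)
```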

So while some of our patrons may not be happy having to wait until another day to dine, at least they aren’t simply waiting around for two or three days. They can use much of that time to go get something else done. Like, maybe, finding a new babysitter, or perhaps finding a less picky friend.

Link: Olympus-SoC
