feature article
Subscribe Now

Working Towards Independence

CriticalBlue Tries to Assist Parallelization

Everyone’s trying to make their code run faster. Since we can’t count on processors simply doing the work for us by running faster, we have to resort to other means. For code that’s going to run on SoCs, there are a couple primary choices: add more cores and have them run in parallel or have some of the code execute in dedicated hardware.

The whole parallel thing has been bandied about for years now, and no one has found a way around the fact that it’s really hard to take a serial programming paradigm and create a parallel implementation. Meanwhile, there are a number of solutions for creating hardware accelerators, although a quick sample of users suggests that most such tools yield less than overwhelming results.

But the whole parallel concept relies on a critical concept: being able to do multiple things at once. And you can do that only if those things are independent of each other. Creating a hardware accelerator can certainly speed up whatever code it executes, but if something else is awaiting the result, then you can’t run the accelerator in parallel with the cores running software. Instead, the core sends a task to the accelerator, sits around and waits, and, when the accelerator is done, continues on. It may be faster than running the accelerated code in the core, but whenever a core is sitting around idle, it feels wasteful.

If execution of code has to stop while something else happens, execution is said to be blocked. An accelerator that works this way is referred to as a blocking accelerator because further core execution is blocked until the accelerator is done. And CriticalBlue was hearing from their customers that, too often, tools generate blocking accelerators, which wastes core cycles.

Assuming an acceleration tool is reasonably competent, there’s only one main obvious reason why an accelerator would be blocking: because the code that follows needs to use the result of the accelerator. In other words, there are dependencies between the accelerator and the subsequent code. The only way to make the accelerator non-blocking is to remove the dependencies; then the core can send the accelerator off to perform its task and continue on with other work.

Dependencies can be more subtle, however. There may not be an obvious execution reason why one operation has to wait for the other, but there may be a data access reason: potentially parallel operations that all need the same data have to access it one at a time. The order in which they get to the data may not even matter; the fact remains that they have to wait in line, meaning they are blocked while waiting.

And of course, this whole concept applies to any attempt to parallelize code – whether entirely in software by invoking multiple cores or when using accelerators. In this respect, an accelerator is simply another dedicated core performing a pre-defined set of code. So each parallel entity is subject to dependencies, either execution or data access, and eliminating those dependencies increases the opportunities to parallelize.

To date, the actual removal of dependencies is not a job that can be performed automatically. In fact, CriticalBlue’s David Stewart says that customers don’t want an automatic fix; they just want help knowing where the dependencies are so that they can fix them manually. To help out with this, CriticalBlue created their Prism tool: a workbench for playing with dependencies to see what could happen if they were removed.

The tool works by running a test set and creating a dynamic profile of execution. By monitoring activity, the tool can identify and list dependencies. The more complete the test set, the more dependencies can be found. In fact, they’ve had some data sets that are large enough that the simulation takes a couple days to run.

The idea is to present such dependencies so that you can figure out what would happen if the dependency were eliminated. You can then run what-if scenarios: you target some specific number of cores and then play with different parallelization options. You can move functions between threads to see what the effect would be. You can see which dependencies block which activity and then toggle them on and off to figure out what the implications are for making more of the program parallel.

Once you play with the alternatives, you can identify those dependencies that would be most effective in eliminating instances of blocking and work to remove them. If, for example, a loop is unrolled and parallelized with a single data structure being used for all instances, then creating multiple instances of the structure (assuming the instances are all mutually independent) allows for each loop to operate on its own copy so that there is no fighting to access one shared copy (although it should be noted that there may still be contention for access to memory, depending on the memory architecture).

The tool doesn’t specifically say how the dependencies should be eliminated; it tells you only what the effect of eliminating them would be if you were to do it. You then figure out how to implement the changes and recode your program the old-fashioned way. You also have to be sure to protect data structures and critical sections of code against any race conditions that might crop up once multiple chunks or instances of code are running at the same time.

Of course, there’s no guarantee that any given dependency can be removed. If A necessarily depends on B as a fundamental characteristic of an algorithm, then it won’t do any good to obsess on that dependency: move on to ones that can be removed. There may well be plenty of other areas to explore.

It’s also important to keep in mind that the number of dependencies identified will depend on the coverage of the test set. There may be dependencies that are never uncovered. In that case, no change would be proposed for them, which is a pessimistic approach. However, the effect of eliminating a dependency assumes that all instances of that dependency have been identified – if there is a missing case that the data set didn’t catch, then you would need to be careful to ensure that correct operation wasn’t compromised once that missing test case actually occurred in a deployed system.

CriticalBlue recommends approaching this problem in an incremental fashion. Start with whatever code you have – serial or partially parallelized, look at the main blocking problems, and eliminate them one at a time. Re-profiling to characterize the program after the change helps to ensure that you haven’t inadvertently introduced any new problems – race conditions, for example – or that you haven’t actually changed how the program works. Moving little by little helps to ensure that you don’t completely re-engineer things in one fell swoop, only to find that multiple changes introduced multiple problems, and now you have to unscramble the changes to figure out which change caused which problem.

Link: CriticalBlue Prism

Leave a Reply

featured blogs
May 18, 2021
'Virtuoso Meets Maxwell' is a blog series aimed at exploring the capabilities and potential of Virtuoso RF Solution and Virtuoso MultiTech. So, how does Virtuoso meets Maxwell? Now, the... [[ Click on the title to access the full blog on the Cadence Community site....
May 13, 2021
Samtec will attend the PCI-SIG Virtual Developers Conference on Tuesday, May 25th through Wednesday, May 26th, 2021. This is a free event for the 800+ member companies that develop and bring to market new products utilizing PCI Express technology. Attendee Registration is sti...
May 13, 2021
Our new IC design tool, PrimeSim Continuum, enables the next generation of hyper-convergent IC designs. Learn more from eeNews, Electronic Design & EE Times. The post Synopsys Makes Headlines with PrimeSim Continuum, an Innovative Circuit Simulation Solution appeared fi...
May 13, 2021
By Calibre Design Staff Prior to the availability of extreme ultraviolet (EUV) lithography, multi-patterning provided… The post A SAMPle of what you need to know about SAMP technology appeared first on Design with Calibre....

featured video

Introduction to EMI

Sponsored by Texas Instruments

Conducted versus radiated EMI. CISPR-25 and CISPR-32 standards. High-frequency or low-frequency emissions. Designing a system to reduce EMI can be overwhelming, but it doesn’t have to be. Watch this video to get an overview of EMI causes, standards, and mitigation techniques.

Click here for more information

featured paper

How to solve two screenless TV design challenges

Sponsored by Texas Instruments

The new 4K display chipsets from DLP Products help make screenless TV setup easier and save cost by reducing the number of components required while also adding more advanced image-processing capabilities. The DLP471TP DMD and DLPC6540 controller for small designs and the DLP471TE DMD and DLPC7540 controller for designs above 1,500 lumens help deliver stunning ultra-high resolution displays to the market and take advantage of the rapid expansion in the availability of 4K content.

Click here to read

Featured Chalk Talk

Keeping Your Linux Device Secure

Sponsored by Siemens Digital Industries Software

Embedded security is an ongoing process, not a one-time effort. Even after your design is shipped, security vulnerabilities are certain to be discovered - even in things like the operating system. In this episode of Chalk Talk, Amelia Dalton chats with Kathy Tufto from Mentor - a Siemens business, about how to make a plan to keep your Linux-based embedded design secure, and how to respond quickly when new vulnerabilities are discovered.

More information about Mentor Embedded Linux®