
Working Towards Independence

CriticalBlue Tries to Assist Parallelization

Everyone’s trying to make their code run faster. Since we can’t count on processors simply doing the work for us by running faster, we have to resort to other means. For code that’s going to run on SoCs, there are two primary choices: add more cores and have them run in parallel, or have some of the code execute in dedicated hardware.

The whole parallel thing has been bandied about for years now, and no one has found a way around the fact that it’s really hard to take a serial programming paradigm and create a parallel implementation. Meanwhile, there are a number of solutions for creating hardware accelerators, although a quick sample of users suggests that most such tools yield less than overwhelming results.

But the whole parallel concept relies on a critical concept: being able to do multiple things at once. And you can do that only if those things are independent of each other. Creating a hardware accelerator can certainly speed up whatever code it executes, but if something else is awaiting the result, then you can’t run the accelerator in parallel with the cores running software. Instead, the core sends a task to the accelerator, sits around and waits, and, when the accelerator is done, continues on. It may be faster than running the accelerated code in the core, but whenever a core is sitting around idle, it feels wasteful.

If execution of code has to stop while something else happens, execution is said to be blocked. An accelerator that works this way is referred to as a blocking accelerator because further core execution is blocked until the accelerator is done. And CriticalBlue was hearing from their customers that, too often, tools generate blocking accelerators, which wastes core cycles.

Assuming an acceleration tool is reasonably competent, there’s only one obvious reason why an accelerator would be blocking: the code that follows needs the accelerator’s result. In other words, there are dependencies between the accelerator and the subsequent code. The only way to make the accelerator non-blocking is to remove those dependencies; then the core can send the accelerator off to perform its task and continue on with other work.
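That distinction can be sketched in a few lines. This is an illustrative example only (not anything from CriticalBlue): a worker thread stands in for the accelerator, and the hypothetical `accelerate` and `independent_work` functions are made up for the sketch. In the blocking pattern, the core demands the result immediately and idles; in the non-blocking pattern, it overlaps independent work with the offload.

```python
# Sketch of blocking vs. non-blocking offload. A thread pool stands in
# for a hardware accelerator; the function names are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def accelerate(data):
    """Stand-in for a task offloaded to an accelerator."""
    return sum(x * x for x in data)

def independent_work():
    """Work with no dependency on the accelerator's result."""
    return len("bookkeeping")

with ThreadPoolExecutor(max_workers=1) as pool:
    # Blocking: the very next statement needs the result, so the core waits.
    blocking_result = pool.submit(accelerate, [1, 2, 3]).result()

    # Non-blocking: launch the offload, do unrelated work, then collect.
    future = pool.submit(accelerate, [1, 2, 3])
    other = independent_work()            # the core stays busy meanwhile
    nonblocking_result = future.result()  # collect only when actually needed
```

Both patterns compute the same answer; the difference is purely in whether the core has anything useful to do between submitting the task and consuming its result.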

Dependencies can be more subtle, however. There may not be an obvious execution reason why one operation has to wait for the other, but there may be a data access reason: potentially parallel operations that all need the same data have to access it one at a time. The order in which they get to the data may not even matter; the fact remains that they have to wait in line, meaning they are blocked while waiting.
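A data-access dependency of that kind looks something like the following sketch (an assumed example, not from the article’s tool): three workers that are logically parallel all update one shared structure guarded by a lock, so the lock serializes them even though the order in which they run doesn’t matter.

```python
# Hypothetical example: logically parallel workers forced to wait in line
# for one shared data structure. The lock keeps the data correct, but it
# also serializes the workers.
import threading

shared_log = []
log_lock = threading.Lock()

def worker(name):
    with log_lock:              # each worker waits in line for the data
        shared_log.append(name)

threads = [threading.Thread(target=worker, args=(n,)) for n in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All three entries arrive, but in whatever order the lock granted access.
```

The result is correct regardless of arrival order, which is exactly why this kind of dependency is a candidate for removal: nothing about the algorithm requires the workers to take turns.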

And of course, this whole concept applies to any attempt to parallelize code – whether entirely in software by invoking multiple cores or when using accelerators. In this respect, an accelerator is simply another dedicated core performing a pre-defined set of code. So each parallel entity is subject to dependencies, either execution or data access, and eliminating those dependencies increases the opportunities to parallelize.

To date, the actual removal of dependencies is not a job that can be performed automatically. In fact, CriticalBlue’s David Stewart says that customers don’t want an automatic fix; they just want help knowing where the dependencies are so that they can fix them manually. To help out with this, CriticalBlue created their Prism tool: a workbench for playing with dependencies to see what could happen if they were removed.

The tool works by running a test set and creating a dynamic profile of execution. By monitoring activity, the tool can identify and list dependencies. The more complete the test set, the more dependencies can be found. In fact, they’ve had some data sets that are large enough that the simulation takes a couple days to run.
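The general idea behind dynamic dependence detection can be illustrated with a toy version (this is not Prism’s actual mechanism, and the task names and access sets are invented): record which data each task reads and writes while running a test input, then flag any pair of tasks whose accesses conflict. It also shows why coverage matters: a dependency appears only if the test input exercises it.

```python
# Toy dynamic dependence detection: tasks conflict when one writes data
# that the other reads or writes. Access sets here are invented examples,
# as if recorded while running a test input.
reads = {"taskA": {"x"}, "taskB": {"y"}, "taskC": {"x", "y"}}
writes = {"taskA": {"y"}, "taskB": {"z"}, "taskC": set()}

def conflicts(t1, t2):
    """True if one task writes something the other task touches."""
    return bool(writes[t1] & (reads[t2] | writes[t2]) or
                writes[t2] & (reads[t1] | writes[t1]))

tasks = ["taskA", "taskB", "taskC"]
deps = [(a, b) for i, a in enumerate(tasks) for b in tasks[i + 1:]
        if conflicts(a, b)]
# taskA writes "y", which taskB and taskC both read, so taskA conflicts
# with each of them; taskB's write to "z" touches nothing taskC uses.
```

With these access sets, `deps` contains the taskA/taskB and taskA/taskC pairs but not taskB/taskC, so taskB and taskC could run in parallel with each other.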

The idea is to present such dependencies so that you can figure out what would happen if the dependency were eliminated. You can then run what-if scenarios: you target some specific number of cores and then play with different parallelization options. You can move functions between threads to see what the effect would be. You can see which dependencies block which activity and then toggle them on and off to figure out what the implications are for making more of the program parallel.

Once you play with the alternatives, you can identify those dependencies that would be most effective in eliminating instances of blocking and work to remove them. If, for example, a loop is unrolled and parallelized with a single data structure being used for all instances, then creating multiple instances of the structure (assuming the instances are all mutually independent) allows for each loop to operate on its own copy so that there is no fighting to access one shared copy (although it should be noted that there may still be contention for access to memory, depending on the memory architecture).
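The privatization described above might look like this sketch (an assumed example, not code from the tool): instead of all loop instances updating one shared histogram, each worker fills its own private copy, and the copies are merged sequentially afterwards, removing the data-access dependency from the parallel portion.

```python
# Privatization sketch: give each parallel loop instance its own copy of
# the data structure, then merge the copies once at the end.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

data_chunks = [[1, 2, 2], [2, 3], [1, 3, 3]]

def count_chunk(chunk):
    local = Counter()          # private copy: no lock, no waiting in line
    for item in chunk:
        local[item] += 1
    return local

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(count_chunk, data_chunks))

merged = Counter()
for partial in partials:
    merged += partial          # one cheap sequential merge at the end
```

The trade-off is extra memory for the copies plus a small merge cost, in exchange for eliminating contention during the parallel phase; as the article notes, contention for the memory system itself may still remain.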

The tool doesn’t specifically say how the dependencies should be eliminated; it tells you only what the effect of eliminating them would be if you were to do it. You then figure out how to implement the changes and recode your program the old-fashioned way. You also have to be sure to protect data structures and critical sections of code against any race conditions that might crop up once multiple chunks or instances of code are running at the same time.
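The race condition being warned about is worth making concrete. In this minimal sketch (an invented example), an unprotected read-modify-write on a shared counter can lose updates once multiple threads run it at the same time; wrapping the critical section in a lock restores correctness.

```python
# Minimal race-condition illustration: a shared counter updated by four
# threads. The lock around the read-modify-write is what prevents lost
# updates; without it the final count could come up short.
import threading

counter = 0
counter_lock = threading.Lock()

def bump(times):
    global counter
    for _ in range(times):
        with counter_lock:     # critical section: protect the update
            counter += 1

threads = [threading.Thread(target=bump, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock held around every increment, the final count is exactly 40,000, no matter how the threads interleave.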

Of course, there’s no guarantee that any given dependency can be removed. If A necessarily depends on B as a fundamental characteristic of an algorithm, then it won’t do any good to obsess on that dependency: move on to ones that can be removed. There may well be plenty of other areas to explore.

It’s also important to keep in mind that the number of dependencies identified will depend on the coverage of the test set. There may be dependencies that are never uncovered. In that case, no change would be proposed for them, which is a pessimistic approach. However, the effect of eliminating a dependency assumes that all instances of that dependency have been identified – if there is a missing case that the data set didn’t catch, then you would need to be careful to ensure that correct operation wasn’t compromised once that missing test case actually occurred in a deployed system.

CriticalBlue recommends approaching this problem incrementally. Start with whatever code you have, serial or partially parallelized; look at the main blocking problems and eliminate them one at a time. Re-profiling to characterize the program after each change helps to ensure that you haven’t inadvertently introduced any new problems – race conditions, for example – and that you haven’t actually changed how the program works. Moving little by little helps to ensure that you don’t completely re-engineer things in one fell swoop, only to find that multiple changes introduced multiple problems, and now you have to unscramble the changes to figure out which change caused which problem.

Link: CriticalBlue Prism

