Tackling the Hard Part

As multicore works its way into the embedded world, there are lots of things that used to be simple that become more complex, both in hardware and in software. But one element – often acknowledged by people knowledgeable in this space as “the hard part of multicore” – has remained mostly untouched: how to take a program and partition it over multiple cores.

Basically, it’s hard to do wrong. And it’s much harder to do right. Particularly when doing it manually, which was the only option for a long time. Automating elements of the process helps, and we looked at some movement in that direction in a previous article.

One of the companies working in this space, Vector Fabrics, just made their latest tool announcement, and it reflects a number of key changes:

They’ve changed their naming.
The technology itself has evolved.
They’ve left the cloud.

Vector Fabrics has gone through a few iterations of their tool as their capabilities unfolded. Originally, their vfAnalyst tool showed dependencies in the code in a way that had never been done before. They went through a couple more iterations up to their last release, vfEmbedded, which included partitioning and mapping options in a more rudimentary way.

Pareon fills in a lot of the holes, taking the tool from one that handled the most frequent scenarios to one that is more airtight. And, as you can see, they’ve changed their branding structure; they found that the underlying “ontology” that drove the previous scheme became a bit forced, so they’ve made a clean break.

While the key technology relates to identification of dependencies, which are visualized in a unique way, this has turned out to be secondary from a user methodology standpoint. Originally, the dependencies were a way for a user to figure out where to partition most effectively. Once they had that in place, they realized that they could propose different partitioning strategies based on those dependencies, and, if the user liked that, then he or she didn’t explicitly need to check the dependencies at all.

So what happens is that Pareon executes a program and presents a display showing how long various parts of the program take to run; similar in concept (although presented different graphically) to profiling. The user then goes in to partition the lengthy bits – more on that shortly.

The reason for executing the program is not just to profile speed, but also to detect certain dependencies. Some dependencies (which they call “compute” dependencies) involve only register values, and programs can be inspected statically to determine those. But others – typically related to pointer arithmetic and pointer aliasing – can’t be handled statically. Two pointers may end up pointing to the same memory location, and you will know that only by actually running the program and examining the memory address. So they instrument your code and then track the loads and stores. If one variable writes to a location, and then, sometime later, some other variable (or even the same one) reads from that location, a dependency is noted.

These dependencies can end up being pretty complex. So-called “loop-carried” dependencies, where values are produced in one iteration of a loop that are consumed in another (like a[i] = a[i-1] + a[i-2]) can get tricky to keep track of (they’re mostly mind-numbingly annoying if you try to figure them out manually).

In addition, there are special patterns that the tool can notice and handle. For example:

From the beginning, the tool has been able to recognize “streaming” patterns, where accesses progress through some regular address sequence on a repeated basis.
There are more complex versions of this that they referred to as “windowed,” where it’s not just a straight march through an address sequence, but a series of accesses within a window of addresses that moves in a regular way. An example of this would be a graphics algorithm that works on a pixel by examining pixels in a 3×3 window around it. Accesses within that window would make things not look like a simple walk through the array, but as the operation moves from pixel to pixel, the window moves along with it. This pattern has also been detectable for a while.
The induction expression in a loop – the code that increments i – by definition creates a loop-carried dependency: the value of i will be set to the previous value of i incremented by 1 (or some other increment). So you can’t figure out the n^th iteration until you’ve figured out the n-1^th iteration, which gets in the way of parallelizing things. And, really, the i isn’t part of the calculation – it’s just a mechanism for looping, and here it is gumming things up. The tool can identify such induction expressions and handle them differently to eliminate the dependency.
Reduction involves operations like summing the contents of an array. Again, the “sum = sum + a[i]” aspect of the calculation is serial in nature. It’s possible to restructure things to allow multiple sub-sums to be calculated in parallel, summing the sub-sums into a grand total at the end. Such scenarios can also be detected and handled by Pareon.

In the past, there have been various other patterns whose structure could not be discerned; at this point, the only dependencies they may have a hard time with are those resulting from certain complex instances of recursion.

Why does it matter what dependencies they can detect and handle? Because that’s what determines where the tool thinks you can partition your code. You may select a function that chews up a lot of execution time and decide to try to partition it so that it runs faster. Mechanically, you right-click and choose “Partition,” and Pareon will present a list of different partitioning options and their costs. It will choose only partitions that either break no dependencies or break dependencies that they know can be synchronized in the partitioned program. The more dependencies they can handle, the more partitioning options can be made available. If, in fact, it finds no way to partition a function, you can go to a schedule view where you’ll be able to see which dependencies get in the way (and the source code that creates the dependencies).

They will also present only partitioning options that are free of deadlock issues.

Each partitioning option provides some level of acceleration of the program at some cost. The cost has to do with the number of threads created and any overhead in synchronizing data between threads. The resulting speed-up is tricky, however. What used to be displayed was somewhat more theoretical, even architecture-agnostic in the early days. But architecture and other aspects matter. Once you start running a parallelized program, you may end up with bus contention between cores or cache implications.

Pareon allows mapping onto specific architectures so that the bus and cache (and other) characteristics can be more accurately calculated so that the resulting speed-up estimate includes such potential issues. They’ve also characterized OS call delays and worked with the processor vendors to measure silicon delays so that those can all be included in the performance estimate.

The cost of threads themselves is non-trivial. In fact, if you rely on classic thread creation and destruction, partitioning can well make a parallel program slower than its sequential original. So Pareon relies on “worker threads”: a series of threads are created early on, and functions are assigned to those threads (rather than having new threads created).

When idle, those threads can either sleep – meaning they need to be awoken and scheduled when assigned, which can take time – or be constructed as “busy-wait” threads, where they just loop – which wastes cycles, but which, in the case of many deeply embedded programs where there is only one process running and nothing to be scheduled in place of wasted cycles, can be tolerated and save time. Pareon can select between the two according to settings, timing, and program characteristics.

To this point, all we’ve discussed is the selection of a partitioning option, which tends to be an iterative process: you find a big time-user and try to partition it. Then you find others and partition them until you get to your desired performance (assuming that’s possible).

But the tool is merely recording your decisions; it’s not actually changing any code or producing an output. When you get to the performance you want, as estimated by the tool, then you need to implement it. This can theoretically be automated, at least to a point, but programmers generally don’t want a tool to write code for them or change their code. So instead, Pareon provides a series of “refactoring steps” (formerly referred to as a “recipe”) for making code changes. The intent is that the steps can be implemented one at a time, and each step results in a runnable program that can be validated before taking the next step.

These steps walk you through the process of breaking up subroutines, creating and using threads, and implementing semaphores for synchronization of non-streaming loop-carried dependencies. They provide libraries for simplifying the trickier parts of semaphore implementation.

Lest it sound like Pareon can do everything for everyone, there are lots of different ways of implementing a multicore system, and Pareon does not support them all (at least at present). Their language support is C and C++. They can handle ARM and x86 architectures, focusing on tablet, phone, and desktop applications. They’re open to other architectures opportunistically (and if funded). They generate only SMP implementations, and they implement only data parallelization (functional parallelism has been a “Labs” feature, for future formal release).

That said, as you can imagine, they’re trying to tackle a market area that has lots of use, and it would come as no surprise that C, C++, ARM, and SMP feature heavily in embedded systems (x86 kind of comes for free since the host tools running Pareon run on x86-based machines). The most common AMP systems these days are either things like cell phones, where different cores run different programs that have little interaction, or packet processing, which has its own specialized way of doing things, and has for years. A tool like Pareon is less likely to have traction there because those programs are written for parallelism from the beginning.

Pareon is targeted more towards compute-intensive algorithms like graphics or video processing, speeding up various coding and other algorithms. In these cases, many times the algorithms start as sequential C programs and must then be implemented in a multicore platform. A tool like this is intended to make that “hard part” a bit easier.

More info:

Vector Fabrics Pareon