Multicore Automation

I used to be a die-hard manual transmission driver. Even before I drove a car, I had learned how to double-clutch for smooth on-the-go gear changing on our Case tractor (which didn’t have synchronized gears because it wasn’t intended to be shifted on the go). My grandmother insisted on a manual transmission up until she stopped driving in her 80s (or later?): “I vant to do it myself!” (That’s a Swiss-German accent there…)

I finally had to rethink my principles when I got a cell phone. The moment of truth came when I was at a stoplight waiting to turn left while talking on the phone. The light turned green and I proceeded, but realized I didn’t have enough hands: I needed to turn, change gears, and hold the phone, all at the same time.

Recasting this in really geeky terms, I had done just fine when multitasking involved only steering and changing gears – two independent processes, each of which requires only one hand (but one of which benefits from two). I was able to share one resource, my right hand, with both the steering and the gear-changing processes.

Toss a cell phone in the mix, and now I had a resource-sharing problem. The phone (while in use) depended on my holding it up to my ear. (If the FBI or any other law enforcement agency is snooping in on this conversation, I’d like to state for the record that this was years before there were hands-free laws.) The gear-shifting process depended on the same resource, and I now had contention for that resource.

So… conceding that this might be a bit of a reach as a metaphor, it does lead to the question: in a less trivial situation, how do you know how to take tasks and implement them in a multicore system in a manner that doesn’t result in the phone being dropped in a lap or the car steering into an oncoming left-turning car while changing gears? (Moving through the entire turn in 1^st gear is too humiliating to be a real option.)

As multicore becomes more prevalent in the embedded world, this is an increasingly relevant question. Embedded multicore has, to date, been dominated first by packet processing, having its own little world of fast and slow paths lovingly hand-crafted, and now cell phones, with multiple processors doing different things.

In the cell phone case, it’s worked pretty well specifically because the processors do different things – they don’t generally interact, or, if they do, it’s simple enough to manage manually. But now multicore general processors of the ARM and MIPS variety are becoming more common. So we’re no longer talking about putting separate processes on separate processors: we’re talking about splitting One Big Process up into multiple tasks – they might be threads or processes – and having them work in parallel.

For the moment, let’s assume the One Big Process has already been implemented for a single core. That means that it’s a sequential (or serial) program with everything happening in a deterministic, orderly fashion. One foot always goes ahead of the other.

Call it a divorce

When you start splitting it up and having things run in parallel, you can get into deep trouble. First, instead of a single process requesting resources, numerous pieces of that process may now be demanding the attention of memory or a peripheral at the same time. A memory has only so many ports.

Second, those tasks all start at their beginnings. If one of the tasks reflects something that happens towards the end of the One Big Process, then, if you’re not careful, it will just start when everything else starts and generate nonsense. Somehow it has to be told that it can’t start until the things it depends on are ready.

The trivial case is,

a = x + y

c = a + z

When those happen in order, then the value of c at the end is x + y + z. But if they get put into parallel processes and just start at the same time, then c will be the result of z added to… whatever happens to be in a at the time (which could be, but is very likely not, equal to x + y).

Unfortunately, most situations aren’t this blatant. And it gets very tricky when pointers are used in C. The simple case just illustrated can be detected by static analysis of the code. But that’s not the case with pointers – you can’t be sure what they’re going to point to (other than 0, when you crash the system), and pointer arithmetic and aliasing make the problem intractable without dynamic analysis using a full-coverage test suite.

How you divide up a program when going parallel has an enormous impact on how the resulting parallel implementation will run. There are four possible outcomes:

It works perfectly; pat yourself on the back and take some well-earned time off. (Or wake up.)
It works, but very slowly – possibly more slowly than the original sequential version.
It doesn’t work at all. The good news is that you know immediately and can start debugging.
It seems to work perfectly. Until you ship the product and then customers start finding weird, elusive bugs here and there that are hard to replicate and even harder to debug. This is your nightmare scenario. (And you wish you could wake up.)

For the most part, this problem has been handled manually by experts who own a very specialized programming niche. But there aren’t enough of those guys to support the amount of multicore programming that will be needed. So automation is coming to the rescue. But the question is, what to automate?

Let’s split the problem up in order to wrap our heads (OK, my head) around it, continuing for now our assumption that we’re starting with a known-good sequential program. I’ll divide the process of creating a parallel version into three major chunks:

Determining where to partition the program
Implementing the partition
Verification and debug

Step 1 involves analysis and decisions – and possible changes to your original program. The second involves changes to code to transform the original program into a parallel version. The third involves… well, the usual doggedness and 64 ounces of TweakStar energy drink. We’re going to focus on the first two.

How to split up the assets

Zooming in closer on step 1, it involves several items, the order of which may vary:

Analyzing dependencies in the program
Making changes to the original code to provide more opportunity for concurrency
Evaluating different ways of partitioning the code
Mapping the partitions to cores (not needed in an SMP environment, where all the cores look alike and the OS handles real-time core assignment).

There are a couple of philosophies on how dependencies should be analyzed. Two different tools intended to help with this process embody these two philosophies.

One is CriticalBlue’s Prism. With this tool, you establish which partitions you want to evaluate first. For each possibility, Prism can analyze the dependencies and “stage” the different threads (we’ll use the term “thread” loosely here) on a timeline so you can see what the performance looks like. Specific core behavior can be modeled for some architectures so that the idiosyncrasies of any particular processor or cache can be included.

From this view, you can see where a particular task might be delayed because it’s waiting for something else to make some data available. So you can run “what if” analysis to see what would happen if that dependency were removed. By doing this, you can evaluate different ways of improving the performance of the program and, ultimately, pick the best partitioning scheme. It doesn’t tell you how to change the program, and it certainly doesn’t make that change for you (more on that shortly), but it does focus you on specific things to fix in order to eliminate a bottleneck.

So in this case, you pick the partition first and then do the dependency analysis on that partition. CriticalBlue’s David Stewart says you really need to do it in that order simply because the number of dependencies in the entire program is huge, and you really don’t care about most of them. Identifying the partition first focuses you on the dependencies that matter.

Vector Fabrics has a different approach. Their vfEmbedded tool analyzes the entire program up front, developing a database of the dependencies. They avoid the dependency clutter problem through the visualization of the program: they have a graphic display that filters out the dependencies that are irrelevant for a given view.

While these two tools look very different, up to this point, they’re similar in that they can point you to dependencies that you might want to eliminate in your original program. But the vfEmbedded approach puts the dependency analysis ahead of partitioning. One benefit of this is that the tool itself can recommend partitions based on dependencies and likely speed-up.

The trick here is in discriminating between different kinds of dependencies. It’s very unlikely that you can find ultra-clean breaks in your program that have no dependencies. No matter which way you break things up, you’re likely going to have to “cross” or “break” a dependency.

The concept of breaking a dependency is much like what happens when you reconfigure a building or modify a garment. You may have to cut through some timbers or remove a seam, but you then put things back together in a way that works.

Some dependencies – like the simple one we saw above – are easy to repair: you simply synchronize the c calculation so that it waits until a is ready before proceeding. Streaming data can be repaired with suitably configured FIFOs between the threads. Some dependencies can’t be repaired at all.

So the problem, whether done manually or via a tool, is finding a way to cut the program that

has no dependencies (unlikely),
has only dependencies that can be repaired (and as few as possible),
or has irreparable dependencies that don’t matter.

Those last two need some explanation. When repairing fixable dependencies, you have to add library calls that implement synchronization. It’s work to add those calls, and they have some overhead at run time. So, all things being equal, you’d like to have to repair as few dependencies as possible.

The last point is a bit more subtle. The whole thing about “repairing” is about making sure that you don’t “break the semantics” of the program. In other words, the new parallel version should work in every way like the original version. If you break a dependency without repairing it, it’s usually a bad thing (and often very hard to detect). But sometimes it doesn’t matter.

For example, if you’ve got a program that logs data to a file, then it will do so in a particular order. If that order matters, then, if a broken dependency re-orders the log entries, it’s a bad thing. But if the order doesn’t matter – if you just want everything logged, in whatever order – then the fact that you’ve changed the semantics (because the logging order has changed) doesn’t matter.

The vfEmbedded tool can characterize the different dependencies to determine which can and cannot be repaired. Based upon this, the tool will recommend partitions. By design, a recommended partition will break only reparable (or ignored irreparable) dependencies. These partitions will also be free of race conditions (another multicore nightmare we won’t dwell on here).

Whether done manually or with the aid of a tool, the partitioning problem must go hand-in-hand with the mapping step, if used, since different cores may have different performance, or some cores may become over-committed. So, having divided the program up, you assign the parallel chunks to different cores. Based on the result, you may decide to go and try a different partition. So if you do manual mapping, then you may likely iterate alternately portioning and mapping until you have a solution.

Both Prism and vfEmbedded use relatively detailed models of select cores – performance modeling for these is “hardware-aware.” CriticalBlue lists various ARM cores, Cavium, Freescale, Intel x86, MIPS, Netlogic, Renesas, and Toshiba, some under Linux, some under Android (plus a bare metal one and a V-Kernel one). Vector Fabrics lists x86 and ARM (independent of OS). Both say they’re got more in the pipe.

So, revisiting those first pieces of Step 1, we have:

Analyze dependencies: Prism and vfEmbedded automate this using static and dynamic analysis. Another company, Target Compiler, also does this, but uses only static analysis, being pessimistic about dependencies (in other words, if it might be a dependency, it’s flagged as one – you can ignore it if you wish). More on Target Compiler in a moment.
Making changes to original code: neither Prism nor vfEmbedded provides any automation here. And that’s how users prefer it. “Don’t touch my code.”
Evaluating partitions: that’s somewhat manual with Prism, with their strength being the “what-if” analysis capability. vfEmbedded automates the process of identifying legal (and preferred) partitions (although nothing would stop you from picking something else).
Mapping partitions to cores: this is done manually with both Prism and vfEmbedded, although there are graphic tools to make this easier. Both tools support mapping within a homogeneous group of cores , but not for heterogeneous multicore. Mr. Stewart notes that it’s unlikely for a single problem to be split up over dissimilar cores.

Putting it on paper

Once you get through these steps, you have a partitioning (and mapping) scheme that you need to implement. If you’re mapping code, then you work at the system level to assign code to cores, and that’s a manual task. How you do it depends on the type of system (SMP vs. AMP, for example). Suffice it to say, there’s no automation.

But we still need to implement the partition in our program. That involves breaking the code up (literally), adding threading calls (if threading) or creating separate run-to-completion programs, for example. But just as important is the adding of code to synchronize data and repair any broken dependencies.

While this is tedious, laborious, error-prone work, it is deterministic and well-bounded. The reason it’s error-prone is that there are lots of details, and missing any of them will cause problems.

Target Compiler automates this process. This comes out of a very different mindset, since Target Compiler is a company specializing in creating what they call application-specific processors (ASIPs). These ASIPs are created for an SoC to offload a general processor, and they’re essentially custom-crafted processors dedicated to a particular function. Because they implement only a portion of a program, they must communicate with the main processor so that all dependencies are accounted for.

In creating the ASIP, Target Compiler also creates the infrastructure for communication so that everything is automatically put into place.

While that may be acceptable in the specific context of designing accelerators in an SoC, programmers are less enamored of such changes being made automatically in the more general case of simply parallelizing a sequential program. Again, “don’t touch my code” tends to prevail. That’s partly because people don’t trust new tools, but it’s also because programmers want ownership of their code, and they have to be able to maintain it. Machine-generated code tends not to work well in that regard.

vfEmbedded does take one step towards automation of the code changes: it generates what they call a “recipe”: it’s a step-by-step set of instructions on the changes required to ensure a parallel program that’s semantically identical to the sequential original.

Can we start over?

You may recall that I tentatively qualified this process with the assumption that you’re starting with a known-good program. These tools might be nice for parallelizing old, moldy legacy programs, but what if you’re starting from scratch? There are methodologies for writing programs that are parallel from the start. But both CriticalBlue and Vector Fabrics feel it’s better to write a new sequential program first and then parallelize after.

This is for two reasons. First, people think sequentially, so it’s just easier. But, more practically, many programs are destined for many different platforms. This suggests having one golden sequential original that can be adapted as needed. So the process we’ve looked at can apply to new programs as well as legacy code.

So… winding way back, the process of generating a parallel program has been selectively automated. The analysis steps have been automated to help partitioning and mapping decisions. Actual implementation, whether improving the old code or generating the new program, is manual (perhaps guided by a recipe). The exception here is Target Compiler, but that’s a system design process, including creation of a custom processor, so the context is different.

So… winding back even further, had I had tools like this available when deciding on a car to use along with my cell phone, they would have made it clear that I needed an automatic transmission in order not to find myself semantically altered all over the pavement. But the tools would have stopped short of driving the car for me.

Yeah, I guess I’d be OK with limited automation.

Note: There is another company called Compaan that is also doing similar work in this space, but they weren’t able to get any information to me before this article went to print.

And… full disclosure, I was on the exec team at Vector Fabrics in the past… long enough ago to have required a briefing on the current state of the product.

More info:

Prism

Target Compiler

vfEmbedded