feature article

It’s the Parallelism, Stupid!

A couple years ago I was participating in a standards meeting of multicore insiders, and a discussion ensued as to how to use such phrases as “multicore,” “multiprocessor,” etc. The discussion went on for a bit, making it pretty clear that this was not a cut-and-dried issue even amongst the cognoscenti.

Shortly after that I was having another conversation and was using the term “multicore” rather loosely, and at some point was, with great certitude, corrected in my usage. Which gave me the sense that such phrases may have as much value as buzzword-of-the-month (no one cares about multiprocessor anymore) as technical term.

Fast forward a couple years, and at the recent Multicore Expo, VDC presented some survey data they had taken regarding multicore usage expectations. But their statistics were somewhat complicated by the fact that the users they had sampled weren’t unanimous in their understanding of what the term “multicore” meant – they were including multiprocessor designs as well.

Clearly, while many draw a clear line between multicore and multiprocessor, others see more of a cloud of similar solutions. A QNX presentation at the same conference suggested that multicore is just another twist on the multiprocessor support they’ve had for a while. Now they may intend that to position themselves ahead of everyone else in the multicore wave, and yet elements of it resonated with me.

For the record, the canonical difference between multicore and multiprocessor is simple: multiple processing units in a single chip constitute multicore; in separate chips they constitute multiprocessor. I guess a quad-core made up of two dual-core chips would be both. Taking things a couple steps further, multiple boxes constitute a “cluster.” And a large number of high-power computers can constitute a “grid.”

Now I’m not suggesting that multicore and multiprocessor, in the sense that specialists may understand the words, are really the same. There are important ways in which they differ in their details. The processor interconnectivity possibilities are very different. The ways in which memory can be divvied up into private and shared sections are much more flexible in a multicore chip. The means by which messages can be sent from processor to processor are very different. In fact, although intra-cluster inter-process communication (IPC) systems like TIPC exist, a new one is in the works for multicore, the MCAPI standard.

But here’s the rub: most focus seems to be placed on the differences between the different multi-whatever systems; very little attention is paid to the unifying element: parallel computing. And here’s why that’s important: it’s the one thing that will probably never be completely automated for users. Parallel data decomposition can often be handled systematically, but parallel algorithm decomposition is the beast that, no matter how good the tools are, programmers are going to have to wrestle with. At least that’s how it looks from here. You can watch presentation after presentation on nifty new multicore tools, with all the whiz-bang transformations and flexibility and time-saving productivity enhancers, but when it comes to how a user will write their code, the presenters uniformly admit that transforming serial code into parallel code is not something they can do or will even try to do. Very large companies have tried to come up with systems that, with a bit of annotation, can parallelize a serial program, and at the end of the day those systems have been scrapped.

So why is this an issue? Because it feels like the different multi-whatever systems are being treated as isolated silos, with multicore being completely different from multiprocessor. In their implementation details, they may be different. But why should they be different to a programmer? If we look over time at integration trends, a parallelized program might originally have operated on multiple systems in multiple boxes. Those boxes might then have been integrated into a single box with multiple chips. Those chips may then have been combined into a multicore chip. Why should the programmer have to make even one change to the program to keep it working efficiently? Even though the details of communication channels, for example, may have changed dramatically, why should a user have to go in and change the communication API calls for each different level of system?

A specialist would argue that optimizing a program for multicore is different from optimizing for multiprocessor, and that’s correct. But then again, optimizing a C program to run on two different Wintel boxes using two different versions of the processor may be different. And optimizing compilers take care of many of those differences. At Teja Technologies*, there was a level of abstraction built into their multicore programming model so that the programmer could target, for example, an abstract queue or mutex without knowing the implementation. At compile time, when the target hardware was specified, the compiler would resolve that into a form optimized for the target hardware. The specific nature of the queue or mutex that was ultimately realized was made opaque to the programmer.

If that kind of abstraction technology can help migrate programs between multicore platforms, why can’t a similar mindset transcend the differences between multicore, multiprocessor, and grid? Optimizing compilers could analyze a program and decide how to arrange variable storage in a given system, choosing between registers and various memories according to the locality and sharing needs of the variable. A higher-level communication API could be used, with the compiler translating that into an efficient set of commands in an API appropriate to the system being targeted, including mixing APIs if some communication is inter-core and some is inter-processor.
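To make the idea of a higher-level communication API concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the names `Channel`, `SharedMemoryChannel`, `SocketChannel`, `open_channel`, and `relay` are hypothetical and do not come from Teja, MCAPI, or TIPC. It also binds the backend at run time rather than at compile time, which is a simplification of what a compiler-driven approach would do. An in-process queue stands in for an inter-core link, and a local socket pair stands in for an inter-processor link.

```python
import pickle
import queue
import socket
import struct
from abc import ABC, abstractmethod


class Channel(ABC):
    """Hypothetical target-neutral message channel."""

    @abstractmethod
    def send(self, msg): ...

    @abstractmethod
    def recv(self): ...


class SharedMemoryChannel(Channel):
    """Stand-in for an inter-core link: an in-process queue."""

    def __init__(self):
        self._q = queue.Queue()

    def send(self, msg):
        self._q.put(msg)

    def recv(self):
        return self._q.get()


class SocketChannel(Channel):
    """Stand-in for an inter-processor link: a local socket pair."""

    def __init__(self):
        self._tx, self._rx = socket.socketpair()

    def send(self, msg):
        payload = pickle.dumps(msg)
        # Length-prefix each message so recv() knows where it ends.
        self._tx.sendall(struct.pack("!I", len(payload)) + payload)

    def _read_exact(self, n):
        buf = b""
        while len(buf) < n:
            chunk = self._rx.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("channel closed")
            buf += chunk
        return buf

    def recv(self):
        (size,) = struct.unpack("!I", self._read_exact(4))
        return pickle.loads(self._read_exact(size))


# Hypothetical target names mapped to the backend that suits them.
BACKENDS = {"multicore": SharedMemoryChannel, "multiprocessor": SocketChannel}


def open_channel(target):
    """Bind the abstract channel to a backend for the named target."""
    return BACKENDS[target]()


def relay(channel, values):
    """Application code: written once, unaware of the backend."""
    received = []
    for value in values:
        channel.send(value)
        received.append(channel.recv())
    return received
```

The point of the sketch is that `relay` never changes: calling `relay(open_channel("multicore"), data)` and `relay(open_channel("multiprocessor"), data)` exercises two very different transports through the same two calls.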

Some of this may be over-simplistic. A real-world program today has to be tweaked to run on Windows or Linux; there’s no universal OS API that can be resolved at compile time. But programmers can and do structure programs in a way that can isolate the OS-related items to minimize the amount of hunting through a program required to make modifications for an OS change. That’s much simpler than what may be required today for writing parallel programs, where, for instance, the memory architecture may need to be explicitly considered by the programmer for anything other than a plain-vanilla symmetric multiprocessing (SMP) configuration.

You might also argue that the needs of the embedded world are different, since each system tends to get optimized on its own. You don’t have the kind of desktop situation where a given program may run on any of a number of platforms. And yet, what is one of the biggest concerns that companies have about moving to multicore? Legacy code. The ability to port existing code to different platforms is one of the big roadblocks to multicore adoption today. And there are two big hurdles in the way of porting legacy code: restructuring for a multi-whatever architecture, and parallelizing serial code.

If a means can be provided to abstract as many of the architectural issues as possible for automated compile-time binding, then the focus comes back to the big elephant in the room: creating parallel programs. Parallelism is inherent in algorithms; it is a property of the job being done, not a property of the underlying system. A given task may be decomposable into 15 parallel… um… flows (it’s hard to find a word that’s generic: streams, threads, and programs all have connotations that are too specific). This should allow a system with 15 or more parallel units to execute all those flows in parallel. A system with fewer than 15 units should still be able to handle the program, with some parallel flows serialized either at compile or run time, whether or not the execution units are in the same chip or the same country. The focus of the programmer should be on finding and exploiting the natural parallelism opportunities in the high-level problem being solved. And yet it feels like this is the element of multicore that’s more or less being punted to universities or others to solve. Time after time the parallelism problem is acknowledged, with the admission that “we can’t solve that.”
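The 15-flow example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `flow` and `run_flows` are hypothetical names, the work inside each flow is a placeholder, and a thread pool stands in for the execution units. In standard CPython, threads show the scheduling behavior rather than a real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def flow(i):
    """One independent flow; a placeholder for real work."""
    return i * i


def run_flows(n_flows, n_units):
    """Run n_flows independent flows on at most n_units workers.

    With fewer workers than flows, the pool queues the excess flows and
    runs them as workers free up, so the result is unchanged.
    """
    with ThreadPoolExecutor(max_workers=n_units) as pool:
        return list(pool.map(flow, range(n_flows)))
```

Calling `run_flows(15, 15)`, `run_flows(15, 4)`, and `run_flows(15, 1)` gives the same answer each time; only the degree of serialization differs. The decomposition into 15 flows belongs to the problem, while the number of units belongs to the system.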

The low-level problems that are being solved today are important and should be solved. But we shouldn’t stop there: there also needs to be a focus on isolating programmers from those low-level solutions. I am a huge fan of automating tedious dirty-work and abstracting away complexity, and have occasionally been guilty of over-optimism. But even if complete system-agnostic nirvana cannot be attained, there’s still a big opportunity to narrow the chasm that a programmer must now cross in order to do effective multi-programming. Let’s solve the multicore problems, but then continue on by moving up a level. Let’s extend those solutions to cover multi-anything through another layer of abstraction. That will help close the gap for programmers, leaving them the not-insubstantial challenge of learning how to find and implement parallelism. Whether that’s through new languages, new programming models, or simply new tools that help a programmer organize his or her thoughts when decomposing a program, there is much to be done.

No matter how we arrange the underlying components in the system, it will always come back to this: It’s the parallelism that’s the real differentiator. It’s the parallelism that’s the big paradigm shift. It’s the parallelism that will separate the winners and losers in the end. Until we address this with gusto, we will be faced with a reluctant, kicking-and-screaming transformation of our world from a serial one to a parallel one.

*Full disclosure: I was the marketer at Teja at one point; it’s now part of ARC International.

Oh, and I’m not really suggesting any of you are stupid.
