There are two alternative realities out there in computation: the sequential universe – which is where our brains naturally conceptualize most algorithms, and the parallel universe – which is the way actual hardware works. These two realities have always co-existed, but the von Neumann architecture shielded us from them. Since our processor architecture was based around a program counter, we could conceive that our machine was doing one instruction at a time.
Therefore, when we wrote our software – even at a much higher level of abstraction than the instructions the machine was executing, the notion that one event would follow another in our algorithm was protected. Of course, as electronic engineers, we understand that there was a “man behind the curtain” in that scenario. We had entirely parallel hardware pretending to do one thing at a time in order to keep us sane and organized in our algorithm development.
When hardware description languages came along, however, that notion of code always being sequential had to be thrown out the window. The metaphor of instructions being executed one at a time simply did not work for the creation of arbitrary logic networks. As a result, VHDL and Verilog are almost impossibly difficult to understand and use, compared with normal high-level software development languages. And it’s too bad, really. Because if the general programming population could code as efficiently in VHDL and Verilog as they do in C, FPGAs (or something like them) would have probably long ago taken over all serious computing.
The advent of object-oriented programming (OOP) warped the wall between these sequential and parallel domains. The very notion of an object, with its own methods and private data space, was largely incongruous to the idea of a single processor thread executing instructions one at a time. Objects exist and work in parallel, and compiling object-oriented code down to something that simulated that parallelism on a sequential machine was a major challenge in the creation of object-oriented compilers. Still, OOP provided a wormhole through the dimensional barrier between the sequential universe – where we had historically conceived algorithms in procedural programming models – and the parallel universe where masses of logic gates flipped back and forth almost simultaneously.
Now, however, we face the opposite problem from OOP. Where OOP was trying to map parallel concepts onto sequential hardware, we need to map algorithms that were conceived as sequential into vastly parallel optimized hardware structures. If we can do this, we can take the sequential code of legacy software and move it automatically into the domain of custom hardware implementation. As we have discussed many times in these pages, the rewards of this are spectacular – orders of magnitude improvement in compute performance – and particularly performance-per-watt, completely independent of improvement in the underlying semiconductor technology.
When we conceive an algorithm as a set of instructions to be executed in sequence, we humans tend to ignore opportunities for parallelism and pipelining. We have one body and two hands, so we generally do real-world physical tasks in steps. If we were making cookies, for example, we would add the flour, then the sugar, and then the salt. Then we stir. Then we form the result into (pretty boring) blobs and arrange them on a pan and pop them in a hot oven. If we were thinking a little about opportunities for parallelism, we probably started the oven pre-heating before all that. But, if we wrote “code” for this process, and our code says “add flour, then sugar, then salt, then stir, then make blobs, then arrange on pan, then bake” some information missing from that code is required to determine, for example, whether the order of those steps can be changed, or whether they can all be done simultaneously. We know that flour, sugar, salt could be done in any order, or in parallel, and that “make blobs” must be done after those three. A parallelizing recipe compiler might not know that without additional information.
Thus is the first challenge of making tools that can automatically convert sequential algorithms into optimized parallel architectures. The tools need to be able to figure out dependencies in the data and in the order of operations in order to create a parallel model. There is also the issue of resources, and the utilization of those resources. If we had three cooks, one could be adding flour while a second adds sugar and a third adds salt. More or fewer resources would determine the number and types of things we could or could not parallelize.
In the world of hardware, resources are not infinite. If we have a multiplication operation inside a loop, and that loop is executed one million times, it would help to have a number of hardware multipliers on our chip working in parallel on those problems. But we probably don’t have a million multipliers. Our tool needs to understand the resources available in making decisions around parallelism. Furthermore, those resources could be arranged in pipelines in order to maximize their utilization and increase throughput. If three cooks are constantly mixing bowls of dough while three more are making dough balls and others are constantly carrying pans to the optimal number of ovens, we can see how our boring cookie production could be optimized for maximum throughput.
Of course, in addition to throughput, there is also the consideration of latency. Our goal might be to maximize the number of cookies coming off the end of the line, or it might be to minimize the time from startup until the first batch of cookies is ready to eat. Those would require completely different architectures, and they would require our tool to understand our latency vs throughput goals, which may not have been captured in our “code.”
The point of all this is that the notion of simply compiling legacy (sequential) software into optimized parallel hardware probably requires information that is not normally present in the original source code itself. It turns out that much of this information can be inferred by clever tools, but some of it cannot. And the space of potential implementations is exponentially larger than for a straight-ahead von Neumann implementation.
And this discussion only skims the surface of the problem. Most of what we’ve talked about here is a dataflow situation, but many algorithms are much heavier on conditional behavior and control. These types of algorithms can also be parallelized, but the approach to optimizing them is very different.
This wall between parallel and sequential worlds is incredibly difficult to breach. Humans conceptualize sequentially, and in doing so we omit critical information from our code that is required for optimal parallelism. Automated tools have to guess at the humans’ intent, and they often don’t have the resources to make those guesses intelligently. But requiring the humans to do the actual parallelizing isn’t the answer. Our brains simply don’t work that way, and human-created architectures will always fall short on the optimization front.
Today, optimizing hardware implementations of sequential software algorithms is the domain of human intelligence and high-level synthesis tools. The tool technology has come a long way in the past three decades, but it still has far to go before we conquer the challenge of automatically targeting large, legacy software applications to reconfigurable heterogeneous computing machines that include both von Neumann and alternative hardware architectures such as FPGA fabric. The creation of such a tool is an incredibly complex proposition, but one that has the potential to revolutionize computing itself. Let’s hope our peers who are tackling this challenge have the wind at their backs!