
Sequential and Parallel Universes

The Real Challenge of Next-Generation Computing

There are two alternative realities out there in computation: the sequential universe, where our brains naturally conceptualize most algorithms, and the parallel universe, which is the way actual hardware works. These two realities have always co-existed, but the von Neumann architecture shielded us from the conflict between them. Because our processor architecture was built around a program counter, we could pretend that our machine was doing one instruction at a time.

Therefore, when we wrote our software, even at a much higher level of abstraction than the instructions the machine was executing, the notion that one event would follow another in our algorithm was preserved. Of course, as electronic engineers, we understand that there was a “man behind the curtain” in that scenario: entirely parallel hardware pretending to do one thing at a time in order to keep us sane and organized in our algorithm development.

When hardware description languages came along, however, that notion of code always being sequential had to be thrown out the window. The metaphor of instructions being executed one at a time simply did not work for the creation of arbitrary logic networks. As a result, VHDL and Verilog are almost impossibly difficult to understand and use, compared with normal high-level software development languages. And it’s too bad, really. Because if the general programming population could code as efficiently in VHDL and Verilog as they do in C, FPGAs (or something like them) would have probably long ago taken over all serious computing.

The advent of object-oriented programming (OOP) warped the wall between these sequential and parallel domains. The very notion of an object, with its own methods and private data space, was largely incongruous with the idea of a single processor thread executing instructions one at a time. Objects exist and work in parallel, and compiling object-oriented code down to something that simulated that parallelism on a sequential machine was a major challenge in the creation of object-oriented compilers. Still, OOP provided a wormhole through the dimensional barrier between the sequential universe, where we had historically conceived algorithms in procedural programming models, and the parallel universe, where masses of logic gates flipped back and forth almost simultaneously.

Now, however, we face the opposite problem from OOP. Where OOP was trying to map parallel concepts onto sequential hardware, we need to map algorithms that were conceived as sequential into vastly parallel optimized hardware structures. If we can do this, we can take the sequential code of legacy software and move it automatically into the domain of custom hardware implementation. As we have discussed many times in these pages, the rewards of this are spectacular – orders of magnitude improvement in compute performance – and particularly performance-per-watt, completely independent of improvement in the underlying semiconductor technology.

When we conceive an algorithm as a set of instructions to be executed in sequence, we humans tend to ignore opportunities for parallelism and pipelining. We have one body and two hands, so we generally do real-world physical tasks in steps. If we were making cookies, for example, we would add the flour, then the sugar, and then the salt. Then we stir. Then we form the result into (pretty boring) blobs, arrange them on a pan, and pop them into a hot oven. If we were thinking a little about opportunities for parallelism, we probably started the oven pre-heating before all that. But if we wrote “code” for this process, and our code said “add flour, then sugar, then salt, then stir, then make blobs, then arrange on pan, then bake,” that code would be missing the information required to determine, for example, whether the order of those steps can be changed, or whether some of them can be done simultaneously. We know that the flour, sugar, and salt could be added in any order, or in parallel, and that “make blobs” must be done after those three. A parallelizing recipe compiler might not know that without additional information.
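
To make that missing information concrete, here is a minimal sketch (not something from the original article) of how the recipe’s dependencies could be stated explicitly, using OpenMP task dependencies in C. The step functions and ingredient variables are hypothetical placeholders:

    /* Hypothetical sketch: the recipe's dependency graph made explicit
       with OpenMP task dependencies.  Compile with:  cc -fopenmp recipe.c */
    #include <stdio.h>

    static double measure_flour(void) { return 250.0; } /* grams */
    static double measure_sugar(void) { return 100.0; }
    static double measure_salt(void)  { return 2.0;   }

    int main(void)
    {
        double flour, sugar, salt;
        #pragma omp parallel
        #pragma omp single
        {
            /* The three "add" steps are independent: each task only
               produces its own ingredient. */
            #pragma omp task depend(out: flour)
            flour = measure_flour();
            #pragma omp task depend(out: sugar)
            sugar = measure_sugar();
            #pragma omp task depend(out: salt)
            salt = measure_salt();

            /* "Stir" consumes all three, so it cannot start until they
               are done.  This is exactly the dependency information a
               parallelizing compiler has to infer from sequential code. */
            #pragma omp task depend(in: flour, sugar, salt)
            printf("stir: %.0f g in the bowl\n", flour + sugar + salt);
        }
        return 0;
    }

With the dependencies written down, a runtime (or, by analogy, a parallelizing hardware compiler) is free to run the three measuring steps in any order, or all at once, but must hold the stir until they have finished.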

This, then, is the first challenge of making tools that can automatically convert sequential algorithms into optimized parallel architectures: the tools need to be able to figure out the dependencies in the data and in the order of operations in order to create a parallel model. There is also the issue of resources, and the utilization of those resources. If we had three cooks, one could be adding flour while a second adds sugar and a third adds salt. Having more or fewer resources would determine the number and types of things we could or could not parallelize.

In the world of hardware, resources are not infinite. If we have a multiplication operation inside a loop, and that loop is executed one million times, it would help to have a number of hardware multipliers on our chip working on those multiplications in parallel. But we probably don’t have a million multipliers. Our tool needs to understand the available resources when making decisions about parallelism. Furthermore, those resources could be arranged in pipelines in order to maximize their utilization and increase throughput. If three cooks are constantly mixing bowls of dough while three more are making dough balls and others are constantly carrying pans to the optimal number of ovens, we can see how our boring cookie production could be optimized for maximum throughput.
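
As a rough software analogy (again, not from the article), the loop below spreads a million multiplications across a fixed pool of workers standing in for a fixed number of hardware multipliers. An HLS tool faces the analogous choice of how many physical multipliers to instantiate and how deeply to pipeline them; the array names and the worker count here are made up for illustration:

    /* Hypothetical analogy: a million multiplications shared among a
       fixed pool of workers, the way an HLS tool might share a loop
       body among a fixed number of hardware multipliers.
       Compile with:  cc -O2 -fopenmp mults.c */
    #include <stdio.h>

    #define N 1000000
    #define WORKERS 8   /* stand-in for "how many multipliers fit on the chip" */

    static float a[N], b[N], c[N];

    int main(void)
    {
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f; }

        /* Each worker handles roughly N/WORKERS multiplications; more
           "multipliers" means less wall-clock time, until memory
           bandwidth (or chip area) becomes the limit. */
        #pragma omp parallel for num_threads(WORKERS) schedule(static)
        for (int i = 0; i < N; i++)
            c[i] = a[i] * b[i];

        printf("c[%d] = %f\n", N - 1, c[N - 1]);
        return 0;
    }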

Of course, in addition to throughput, there is also the consideration of latency. Our goal might be to maximize the number of cookies coming off the end of the line, or it might be to minimize the time from startup until the first batch of cookies is ready to eat. Those would require completely different architectures, and they would require our tool to understand our latency vs throughput goals, which may not have been captured in our “code.”
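
A quick back-of-the-envelope sketch, with made-up stage times, shows how far apart the two goals can be: a pipeline’s latency is the sum of its stage times, while its steady-state throughput is set by its slowest stage.

    /* Illustration only, with hypothetical stage times: latency is the
       sum of the stages; once the pipeline is full, throughput is
       limited by the slowest stage. */
    #include <stdio.h>

    int main(void)
    {
        double stage_minutes[] = { 10.0 /* mix */, 5.0 /* form */, 12.0 /* bake */ };
        int nstages = sizeof stage_minutes / sizeof stage_minutes[0];

        double latency = 0.0, bottleneck = 0.0;
        for (int i = 0; i < nstages; i++) {
            latency += stage_minutes[i];
            if (stage_minutes[i] > bottleneck)
                bottleneck = stage_minutes[i];
        }

        printf("first batch ready after %.0f minutes\n", latency);            /* 27 */
        printf("then one batch every %.0f minutes thereafter\n", bottleneck); /* 12 */
        return 0;
    }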

The point of all this is that the notion of simply compiling legacy (sequential) software into optimized parallel hardware probably requires information that is not normally present in the original source code itself. It turns out that much of this information can be inferred by clever tools, but some of it cannot. And the space of potential implementations is exponentially larger than for a straight-ahead von Neumann implementation. 

And this discussion only skims the surface of the problem. Most of what we’ve talked about here is a dataflow situation, but many algorithms are much heavier on conditional behavior and control. These types of algorithms can also be parallelized, but the approach to optimizing them is very different.

This wall between parallel and sequential worlds is incredibly difficult to breach. Humans conceptualize sequentially, and in doing so we omit critical information from our code that is required for optimal parallelism. Automated tools have to guess at the humans’ intent, and they often don’t have the resources to make those guesses intelligently. But requiring the humans to do the actual parallelizing isn’t the answer. Our brains simply don’t work that way, and human-created architectures will always fall short on the optimization front.

Today, optimizing hardware implementations of sequential software algorithms is the domain of human intelligence and high-level synthesis tools. The tool technology has come a long way in the past three decades, but it still has far to go before we conquer the challenge of automatically targeting large, legacy software applications to reconfigurable heterogeneous computing machines that include both von Neumann and alternative hardware architectures such as FPGA fabric. The creation of such a tool is an incredibly complex proposition, but one that has the potential to revolutionize computing itself. Let’s hope our peers who are tackling this challenge have the wind at their backs!

5 thoughts on “Sequential and Parallel Universes”

  1. Hmm. Having been involved with parallel processing since the late 1980s, I don’t fully agree, even if I understand why the author thinks this way. Humans do not conceptualise sequentially, but have been “brainwashed” to apply sequential recipes at school. And yes, that is partly because of the von Neumann architecture (I call that the von Neumann syndrome).
    Software is modelling, and the real world is concurrent/parallel by nature. Hence, thinking in terms of parallelism is quite natural. We have the formalisms in the form of process algebras, with Hoare’s CSP being one of the pioneers that made it into a real processor (the transputer) and into programming languages and environments like occam, the Virtuoso RTOS, ….
    And of course, reverse-engineering sequential code is hard precisely because the parallel information that is inherent in the problem domain is more or less thrown away and replaced by a complex state machine.
    I had a personal experience years ago when, after 10 years of sequential programming (Algol, Pascal, Fortran), I attended a programming course in occam. At some point, my brain made the click, undoing ten years of “brainwashing,” and since then thinking in parallelism has been much more natural than thinking in state machines that no brain can ever keep in its head. Just imagine you have to write a program that models the conversation of people in a group. Is that easy with a state machine? Or does one timeslice every millisecond? Model each person as an event-driven process and the communication as messages, and things become clear. Divide and conquer.
    I have also often noticed that hardware people have far fewer issues with thinking in parallel. They are used to “connecting” blocks without having to know the state machine inside. It’s sufficient to know the interface, which shows that it really is a matter of education and training.
    “Design parallel, optimise sequentially,” and not only does one get much cleaner designs, but also better ones. Parallelism is not voodoo science; it is natural.

  2. @ericverhulst – I have to agree, mostly. It’s actually simpler to explain in real life, but from a systems-level view, in terms of successive decomposition.

    Junior/Unskilled engineers solve problems with a monolith of sequential logic/code that, for all practical purposes, can only be implemented in real life for the smallest of problems.

    Senior/Skilled engineers solve problems incrementally with smaller logic/code bodies that work together. And as the problem complexity increases, these small logic/code bodies become additional state machines, threads, tasks, processes, and pipelines in the finished architecture of more complex problem solutions.

    This is not new … this has been the state of the art for more than 40 years. We have been building single processor systems with interrupts and partitioned tasks for 60 years. We have been building SMP systems for more than 40 years. We have been building NUMA machines for more than 30 years. We have been building large clusters for more than 25 years.

    I introduced general lockf semaphores into the UNIX/POSIX architecture in 1982 (35 years ago) as a general solution for the deadlock and synchronization problems of cooperating processes. And we added both threads and lightweight threads to the UNIX/POSIX architecture shortly after. And with that we created compilers and libraries for re-entrant code bodies, to support shared data/code address spaces for threads, events, and call-backs.

    Tasks, interrupts, events, call-backs, processes, threads, messages, mail-boxes, semaphores, locking, client/server are all OLD well understood problems. This isn’t magic, this IS the required tool box of any trained software or computer engineer.

    Communication and sequencing are a critical part of designing all systems … systems of people in the workplace, and systems of control and data in automated designs. Each requires specific training, plus experience, to be effective.

    As you note, it’s a mistake not to introduce these naturally parallel concepts from the beginning.

    Nobody expects building designers to start with a mountain and carve out rooms and hallways … we start with stones/bricks/sticks, and build the walls for rooms and hallways — as complex as necessary, within the limits of the materials.

    We need to teach our hardware/software engineers these same skills from the beginning – KISS for naturally parallel designs.

  3. @kevin – For skilled engineers using OpenMP, OpenCL, CUDA and/or OpenMPI, most Single-Instruction-Multiple-Data (SIMD), Multiple-Instruction-Single-Data (MISD), and Multiple-Instruction-Multiple-Data (MIMD) parallel instruction/data flow problems have been very tractable for nearly a decade now in systems with SMP, SMP+GPU, and SMP+GPU+Clusters.

    Intel actively supports OpenMP, OpenCL, and OpenMPI in their SMP, SMP+GPU, and SMP+GPU+Cluster based systems.

    Altera (now part of Intel) also actively supports OpenCL for FPGAs.

    So for Intel-based systems, SMP+GPU+FPGA+Clusters pretty much covers SIMD, MISD, and MIMD problems well.

    Apple has supported SMP+GPU+Clusters with OpenCL and OpenMPI for nearly a decade now.

    Even Xilinx finally understood they could not avoid OpenCL, and they have been on board for a couple of years now.

    OpenCL on FPGAs is only nasty because the back-end place-and-route tools are not programmer-friendly.

  4. @TotallyLost – agreed, but I think it misses my point a bit.

    You say “For skilled engineers using OpenMP, OpenCL, CUDA and/or OpenMPI,”… I agree completely there.

    But, I’d argue that is a very small percentage of software engineers. And, none of those address the issue of legacy applications. Rewriting all the world’s legacy software to take advantage of heterogeneous computer architectures is a pretty daunting task. Unfortunately, coming up with tools that could automatically make legacy applications work efficiently on heterogeneous architectures is also a pretty daunting task. There does not appear to be a magic bullet here.

  5. @Kevin – You say “Rewriting all the world’s legacy software to take advantage of heterogeneous computer architectures is a pretty daunting task.”

    The reality is that very little sequential code will benefit the user significantly if ported to some parallel platform, and in particular to some mixed-architecture system with a mix of ISAs, GPUs, and FPGAs. Better than 99% of legacy code, or even new code, is low-resource, highly sequential code with run times that are insignificant on any modern platform. This is the code that most programmers write and support … it is highly unlikely to benefit from being optimized for some mixed-architecture system.

    The first step in performance optimization is nearly always optimizing memory accesses to be cache-efficient: improving locality of access in the algorithm, then packing density to optimize references per cache-line fill (a small illustration of this appears after these comments). With CPU-to-memory latency ratios typically higher than 10:1, this is the critical “low hanging fruit” for optimization. Without doing this first, most SMP parallel optimizations will remain serially blocked on memory resources (the primary sequential bottleneck, subject to Amdahl’s law). This is a skill we teach to sequential programmers before parallel skills.

    The few programs that are still heavily resource-driven, with long run times, can generally be parallelized effectively with some very minor rewriting/partitioning using OpenMP, and/or easily pushed into a mixed-architecture system with parallel optimizations, by a skilled parallel professional engineer.

    Letting a generic sequential software engineer who lacks both training and experience in parallel programming do the job is a mistake, unless it is specifically tasked as a training exercise without market lead-time concerns.

    Hardware engineers have similar tasking boundaries by skill set and experience … we let some engineers design at the board level … we let skilled VLSI engineers design at the chip level. Only an idiot manager would let board level designers take a critical revenue and market lead time project to VLSI … a good hardware manager would hire a seasoned VLSI chip level designer.

    A good software manager will hire a seasoned parallel programmer if the project is critical revenue and time to market.

    If either a hardware or software engineering manager is not concerned about critical revenue and time to market, it’s sometimes the best choice to train in-house team members, assuming that is the best way to allocate staff budgeting.

    Training software engineers in good OpenMP skills in-house, for deployment on modern multi-core SMP machines, is exceptionally low risk and low budget. It’s still a good idea to press staff to take a continuing-education class in parallel programming that includes threads, OpenMP, OpenMPI, and OpenCL, and then reward that professional commitment with cherry-picked in-house parallel projects.

    Likewise on the hardware side, we generally migrate board level engineers slowly into large FPGA projects, and then into VLSI projects after they have taken some graduate level VLSI classes as continuing education.

    It’s just as unlikely that significant amounts of legacy code will need to be parallelized as it is that significant amounts of legacy board-level designs will need to be pushed into VLSI.
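
The locality point raised in the last comment is easy to see in code. The sketch below (a hypothetical example, not supplied by any commenter) sums the same matrix twice: the row-major pass walks memory sequentially and reuses each cache line, while the column-major pass touches a new cache line on almost every access and typically runs several times slower.

    /* Hypothetical illustration of cache locality: same arithmetic,
       very different memory behaviour. */
    #include <stdio.h>

    #define N 4096
    static double m[N][N];   /* 128 MB, zero-initialized */

    int main(void)
    {
        double sum = 0.0;

        /* Cache-friendly: consecutive elements of a row share a cache line. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];

        /* Cache-hostile: a stride of N * sizeof(double) between accesses. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];

        printf("sum = %f\n", sum);
        return 0;
    }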

