Parallel Processing

You Say Heterogeneous, He Says Homogeneous

In the days when the American colonies were revolting, British soldiers were dressed in red, and all their actions were executed in unison. (At least, those actions concerned with military matters.) A rigid formation marched together, and, in combat, loaded, aimed, and fired their muskets as one. The musket had only a relatively short range, and a volley fired was aimed not at individuals but at the mass of the opposing forces. On the whole this worked well when facing another army. It was also an efficient way of deploying large numbers of men.

However, the revolting colonists didn’t work that way. By European standards, they cheated. They wore dark clothes, ran individually rather than marching as a fixed formation, hid behind trees and rocks, and used rifles that were aimed at a specific target and that threw a bullet for a long distance. The British Army eventually copied this approach and created Rifle Regiments, usually dressed in green, where the men were trained as skirmishers, operating independently. (If you have seen Sharpe on television or read Bernard Cornwell’s original novels, you know that Sharpe was a rifleman.) OK – you have got this far, and July 4th is not far away, but what on earth has this to do with embedded computing?

Well, with a bit of a stretch, the two models of infantry resemble two general models of multicore computing. The redcoats were classic SIMD – Single Instruction, Multiple Data. Bark an order and all the processors do the same thing at the same time, each on its own piece of data. The green-jacketed riflemen are the alternative, MIMD (Multiple Instruction, Multiple Data) – identical devices (at one level of abstraction) carrying out different instructions while solving a single problem.
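For readers who prefer code to muskets, the two infantry models can be caricatured in a few lines of Python. This is only a sketch of the semantics, not of real hardware: the function names `simd_step` and `mimd_step` are invented for illustration, the "SIMD" version is an ordinary loop that runs sequentially, and the "MIMD" version uses threads as stand-ins for independent processors.

```python
from concurrent.futures import ThreadPoolExecutor


def simd_step(instruction, data):
    """Redcoat style: one instruction, applied to every data element."""
    # Models the semantics only; a real SIMD unit does this in lockstep.
    return [instruction(x) for x in data]


def mimd_step(instructions, data):
    """Rifleman style: each worker runs its own instruction on its own data."""
    with ThreadPoolExecutor(max_workers=len(data)) as pool:
        futures = [pool.submit(op, x) for op, x in zip(instructions, data)]
        # Collect results in submission order, whatever order they finish in.
        return [f.result() for f in futures]


data = [1, 2, 3, 4]

# One order barked at the whole formation.
simd_result = simd_step(lambda x: x * 2, data)

# Four skirmishers, four different orders, one shared objective.
mimd_result = mimd_step(
    [lambda x: x + 1, lambda x: x * 2, lambda x: x ** 2, lambda x: -x],
    data,
)

print(simd_result)  # [2, 4, 6, 8]
print(mimd_result)  # [2, 4, 9, -4]
```

The only difference between the two calls is whether the workers share a single instruction or each carry their own.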

But armies, even then, had more than just infantry. There were cavalry and artillery as well. So now we have another model of multicore computing. The SIMD and MIMD systems are homogeneous parallel computing, while an army is heterogeneous, with distinctly different units cooperating to achieve the same end.

This flight of fancy was triggered by a day spent at a conference on heterogeneous versus homogeneous architectures. (The multicore challenge was organised by Test and Verification Solutions, and the presentations are all available online.) The venue was in Bristol, where thirty years ago Inmos was working on the transputer – a very early parallel computing element. In fact both of the keynote speakers, David May, of XMOS and Bristol University, and Tony King-Smith, of Imagination Technologies, were at Inmos, and several of the other speakers were either ex-Inmos or worked for one of the companies that were founded by ex-Inmos people.

This was one of the themes that came through the conference: multiple processors co-operating to solve a problem is not a new approach. David May’s keynote was originally titled “A History of Homogeneous vs. Heterogeneous” but he delivered it as “Heterogeneous Multicores? Why?” He traced the heterogeneous architecture back to the IBM 360 range, launched in 1964. This was a single, general-purpose architecture that had a wide range of different models, serving both commercial and scientific markets – the first time that this approach had been used. But the central processing units were fed with IO through channels, each managed by a specialist processor with a dedicated communications architecture. As time went on, different specialised processors were added to general purpose CPUs. These included floating point units, communications processors, and processors. How to build multicore devices was well understood twenty years ago, but, as the PC gained increasing processing power through the workings of Moore’s Law driven by exponential market growth, that knowledge was generally ignored. Listen to at least the historical part of May’s presentation – it is fascinating.

We are now at the point where simply scaling the process is no longer working, and so we are again looking at multiple processors. Graphics processors are being used, not just for graphics, but also for high performance computing (HPC). May’s view is that the drive for FLOPS (Floating Point Operations per Second) to be in the TOP500 list may have a distorting effect on the usefulness of the systems. But general purpose computing raises a different set of issues. Here performance is governed, not just by processing, but also by the interconnect. And the languages. And the tools and algorithms.

And here we are, back to where we were 20 years ago. There is a lack of standardisation – of graphics architectures, which makes it impossible to write portable, re-usable parallel software, and of programming languages, in which to write portable applications. The debate between SIMD and MIMD he dismissed as irrelevant, particularly as MIMD can always run SIMD, but not vice-versa. (And May’s XMOS is shipping large quantities of MIMD.)

Why is developing parallel systems still an issue? Partly because, driven by Moore’s Law, we have made things complex because we can; partly because a focus on cache-coherent shared memory has inhibited thinking about simpler models of parallelism; and partly because “design by aggregation” in hardware – assembling large devices from smaller ones – means that the larger devices inherit all the issues of the smaller ones, as it is too risky to alter proven designs.

May’s answer is to start with the interconnect – which is the physical manifestation of scalable message passing. By thinking of processes and communications, instead of today’s focus, in education and elsewhere, on algorithms and data structures; by providing low latency communications and processors that have to be only fast enough to keep up with the interconnect; by concentrating on an economical chip size for processors and memory and stacking the chips; and by using silicon photonics to join the stacks, it will be possible to create general purpose computing elements whose behaviour is defined by software.
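To give a flavour of what “processes and communications” means in practice, here is a minimal sketch in Python. It is not XMOS tooling or occam, and nothing in it comes from May’s talk: the process names (`producer`, `squarer`, `collector`) and the `DONE` sentinel are invented for illustration, threads stand in for processors, and a one-slot `queue.Queue` only approximates a true synchronous channel.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a stream (an assumption of this sketch)


def producer(out_ch, n):
    """Send the integers 0..n-1 down the channel, then signal completion."""
    for i in range(n):
        out_ch.put(i)
    out_ch.put(DONE)


def squarer(in_ch, out_ch):
    """Receive a value, square it, pass it on; forward the end marker."""
    while True:
        item = in_ch.get()
        if item is DONE:
            out_ch.put(DONE)
            return
        out_ch.put(item * item)


def collector(in_ch, results):
    """Gather everything that arrives until the end marker."""
    while True:
        item = in_ch.get()
        if item is DONE:
            return
        results.append(item)


def run_pipeline(n):
    # One-slot queues: a sender blocks until the receiver has nearly caught up.
    a, b = queue.Queue(maxsize=1), queue.Queue(maxsize=1)
    results = []
    procs = [
        threading.Thread(target=producer, args=(a, n)),
        threading.Thread(target=squarer, args=(a, b)),
        threading.Thread(target=collector, args=(b, results)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return results


pipeline_output = run_pipeline(5)
print(pipeline_output)  # [0, 1, 4, 9, 16]
```

No process touches another’s state; everything that is shared travels over a channel, so the same three processes could, in principle, sit on one core or on three.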

Heterogeneous architectures suffer from the complexity of multiple architectures, languages, and tool chains, and May argues that heterogeneity should be in the implementation, not in the architecture. In the model of processing that he described, heterogeneity may be needed at the interface to other domains and for specialist domains – perhaps quantum computing.

The other keynote speaker had a different approach, not surprisingly, as Imagination Technologies (the other British IP company) has an IP portfolio that includes a Graphics Processing Unit (GPU), Video (and vision) Processing Unit (VPU), and a Radio Processing Unit (RPU) – plus, since December 2012, the MIPS CPU core. Where May was looking at history and theory, Tony King-Smith kicked off by putting the focus on the consumer. He returned to evolution by claiming that the historic focus on a core plus “off-load” engines was wrong – today the “off-load” engines are processors equal in power and importance to the CPU. A heterogeneous SoC, with multiple processors on board, frequently replaces multiple chips and buses. Much of the capability of the processor cores is either programmable or configurable, so they can be highly tuned for their specific functions, and silicon designers will implement a different balance of functions for their particular system.

For most of the uses of these SoCs, typically mobile phones and tablets, there is pressure to produce new versions every year. Chip manufacturers have to simplify the transition, which they do by trying to maintain the pinout through successive generations, but there is still the issue of the middleware between the applications and the silicon, and the time this takes to be developed. Since it is the applications that are the reason that the phone, or whatever, is being bought, the development cycle bottlenecks include mapping the application onto the SoC. King-Smith suggests that future applications will be developed to be self-configuring – drawing information on the resources from the hardware and then allocating actions to appropriate processors: the application makes decisions on the optimal way it should be running itself.
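What a self-configuring application might look like can be sketched, very loosely, in Python. Everything specific here is an assumption made for illustration: `discover_resources`, the `PREFERENCES` table, the task names and the `phone_soc` description are hypothetical, and only the CPU count is genuinely read from the machine. A real system would query the platform through whatever interface the SoC vendor provides.

```python
import os


def discover_resources():
    """Hypothetical probe: only the CPU count is really queried here."""
    return {"cpu": os.cpu_count() or 1, "gpu": 0, "vpu": 0, "rpu": 0}


# Hypothetical preference lists: for each kind of work, best processor first.
PREFERENCES = {
    "render": ["gpu", "cpu"],
    "decode_video": ["vpu", "gpu", "cpu"],
    "modem": ["rpu", "cpu"],
    "ui_logic": ["cpu"],
}


def allocate(tasks, resources):
    """Map each task to the most preferred processor type that is present."""
    plan = {}
    for task in tasks:
        for unit in PREFERENCES[task]:
            if resources.get(unit, 0) > 0:
                plan[task] = unit
                break
        else:
            raise RuntimeError(f"no processor available for {task}")
    return plan


# An imagined SoC with a GPU and a VPU but no radio processor.
phone_soc = {"cpu": 4, "gpu": 1, "vpu": 1, "rpu": 0}
plan = allocate(["render", "decode_video", "modem", "ui_logic"], phone_soc)
print(plan)
# {'render': 'gpu', 'decode_video': 'vpu', 'modem': 'cpu', 'ui_logic': 'cpu'}

# On a machine where only CPUs are reported, everything falls back to the CPU.
fallback_plan = allocate(list(PREFERENCES), discover_resources())
```

The application code stays the same from one generation of silicon to the next; only the resource description it reads at start-up changes.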

He is positive about the growth of standard interfaces that will improve portability: OpenCL, for example. (However, once you start looking, there is an alphabet soup of APIs and similar tools.)

Outside the keynotes were quite a number of presentations covering low power, high performance, and related matters. There were poster sessions on academic research and on academic/industrial cooperative research. And there were demonstrations of tools. What was, I suppose, inevitable was that there was a certain amount of terminological confusion, particularly with the differences in the aims of someone building a new supercomputer or providing the computing elements to build it, and those aiming at things like the fast changing consumer market. There was little on industrial applications, which reflects the shift in the driving factor for electronics from industry to consumer electronics and particularly the smart phone and the internet. As David May pointed out, the world is throwing away a billion phones a year, and that number is still increasing. Chips are in production for a year or so and then replaced, as the hand-sets are replaced. (My 14-month-old Galaxy SIII has now been overtaken by the S4, and I still haven’t begun to use many of the features!)

I have touched on only a few elements of a long and interesting day, and there is a ton of material on the web site – so do follow it up. My impressions of the day are that parallel computing, whether homogeneous or heterogeneous, is inevitable – in fact it is already here. But much of what is being done to create systems is only just above the software equivalent of duct tape and cable ties. It is not that there is insufficient knowledge on how to rise above this level, but that this knowledge is not widely disseminated. A significant barrier to better adoption of parallel programming is the emphasis still placed on sequential processing and on the apparatus of data structures and algorithms that have grown up to support this approach. As someone in the lunch queue said, “My five-year-old thinks in parallel – she can dance and sing at the same time.” But, allowing for gross simplification, computer science students at the beginning of their course are fertile ground for ideas. By the end of their degree most are conditioned to thinking sequentially – in fact, almost without exception, they are already religious zealots for a specific programming language.

What is needed is work on standardisation for abstraction layers, so that applications can be portable. It is true that the most prominent example of an attempt to do this, with Java, is not generally regarded as a success, but we have moved on since then, haven’t we? With a nice, efficient interface between the application and the silicon, application developers can concentrate on their applications, insulated from the sordid realities of the platform that will be executing them. But still, I have a sneaking feeling that, in twenty years’ time, a similar conference will be addressing similar issues.

7 thoughts on “Parallel Processing”

  1. “In the days when the American colonies were revolting”

    You mean, before they installed showers and sewage treatment systems? I suspect many of the British camps were pretty revolting as well…

    [runs]

  2. While having the same roots as David May’s CSP-inspired transputer, OpenComRTOS also supports system-wide shared data structures.
    OpenComRTOS is the unique formally developed RTOS that can program seamlessly even heterogeneous systems from a single processor to 1000’s of processing nodes. It comes with a visual modeling environment whereby the developer independently specifies his parallel multi-processor target system and application architecture. Tasks and interaction entities can then be transparently mapped to any node in the system, even when the processors are of a different type. With prioritised scheduling and support for distributed priority inheritance, the system remains real time predictable with a typical code size of less than 10 KB per processing node. From v.1.6 of the OpenComRTOS Designer environment on, developers benefit from a streamlined kernel source code and new features. OpenComRTOS Designer is however a lot more than an RTOS.
    See more at http://www.altreonic.com/content/opencomrtos-designer-new-v16-release
    Topics: OpenComRTOS: Middleware as Extended OS / Converting legacy POSIX style RTOS applications / Blackboard-Hub / Improved distributed priority inheritance / Built-in workload monitor / Standardised interrupt latency measurement.

  3. Fascinating, (as a complete aside) perhaps there is a lesson and a timely warning from history in this article.

    The modern enemies of the dominant world power (now the USA, rather than England), use military tactics which are not conventional battlefield tactics. Instead of behaving like rifle regiments who wear uniforms and are visible on the battlefield, they use techniques which we think of as “cheating”, and so we call them cowards and terrorists.

    Perhaps the USA now needs to adapt its military tactics to this new way of fighting, just like England had to adapt its outmoded regimented battlefield fighting style, in order to avoid losing every battle to the nimbler “cheating” American colonists.

    Perhaps fighting like the terrorists do may seem like an unconscionable idea to modern Americans, but historically the stubborn pride of the English generals and their unwillingness to adopt these new “cheating” techniques cost the lives of thousands of British soldiers and many defeats.

    Surely the military concept of a visible army of soldiers in uniform fighting against an opposing visible army of soldiers in uniform, is now as antiquated as that of having neat lines of musket-men wearing bright red uniforms marching slowly towards the enemy?

    Just a passing thought anyway 🙂
