Get Happy with MTAPI!

Standards Group Offers up a Multicore Programming Library

by Jim Turley

“I believe robots should only have faces if they truly need them.” – Donald A. Norman, PhD.

Two heads are better than one. But too many cooks spoil the broth. Which aphorism applies to multicore programming?

We all know that multicore processors are here, but, like the impending robot apocalypse, we’re not sure what to do about it. It’s all very nice to ooh and ahh over the latest multicore processors – Twenty cores! Forty cores! Bring ’em on! – but when it comes time to actually code the things… we’re often left staring at our shoes.

Programming multicore is hard (who knew?), but help is on the way. From several quarters, in fact. One of the beneficent entities is the Multicore Association which, it must be said, is appropriately named. They’re all about multicore hardware and software, and they’ve come up with a cunning plan.

It’s called MTAPI, and it’s a new multicore task-management (that’s the MT- part of the name) application programming interface (the -API part) for multicore hardware with different (i.e., heterogeneous) computing resources. Got a DSP, a CPU, and a GPU all cohabitating in your system? Not sure how to allocate tasks to them all? MTAPI is here to help.

Like any API, MTAPI is just a set of rules, not code. It’s an agreed-upon interface layer between what you want and what the system can provide. It provides a framework for parallelization without your having to know exactly how parallel your hardware is. And, like any API, it accommodates updates, upgrades, and complete overhauls of the underlying hardware.

In MTAPI’s view of the world, multicore programming is broken up into Jobs, Actions, and Tasks. A Job is simply a high-level abstraction along the lines of, “I want to apply a filter to these JPG images.” The Job is then broken up into Actions, which will be dispatched to individual hardware units, like a DSP or a GPU. Finally, Tasks are callable routines that perform the actual work of the Action. In a dual-processor system, you might have one Job (filtering the images), two Actions (one for the GPU, one for the DSP), and several Tasks that perform the hardware-specific filtering.

The idea is that you can replace the underlying hardware with more, or fewer, computing resources but retain the structure of your Jobs and Actions. The Tasks are generally going to be hardware-specific.
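The Job/Action/Task split can be made a little less vague with a toy model. The sketch below follows only the vocabulary used in this article; it is not the MTAPI C API, and every name in it (Job, Action, build_job, the brighten routines) is hypothetical. It assumes a "filter" is just a function applied to each pixel value.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    """One slice of a Job, bound to a single compute unit (hypothetical model)."""
    unit: str                          # e.g. "DSP", "GPU", "CPU"
    tasks: List[Callable[[int], int]]  # Tasks: the routines that do the real work

    def run(self, value: int) -> int:
        # Apply each hardware-specific Task in order.
        for task in self.tasks:
            value = task(value)
        return value


@dataclass
class Job:
    """High-level intent, e.g. 'brighten these images'."""
    name: str
    actions: Dict[str, Action] = field(default_factory=dict)

    def add_action(self, action: Action) -> None:
        self.actions[action.unit] = action

    def run(self, pixels: List[int]) -> List[int]:
        if not self.actions:
            raise ValueError("a Job needs at least one Action")
        # Deal the work items round-robin across whatever Actions exist.
        units = sorted(self.actions)
        return [self.actions[units[i % len(units)]].run(p)
                for i, p in enumerate(pixels)]


# Hardware-specific Tasks: same filter, three (pretend) implementations.
def dsp_brighten(p: int) -> int:
    return min(p + 10, 255)


def gpu_brighten(p: int) -> int:
    return min(255, p + 10)


def cpu_brighten(p: int) -> int:
    return 255 if p > 245 else p + 10


def build_job(units: Dict[str, Callable[[int], int]]) -> Job:
    job = Job("brighten-images")
    for unit, task in units.items():
        job.add_action(Action(unit, [task]))
    return job
```

Calling build_job({"DSP": dsp_brighten, "GPU": gpu_brighten}) and build_job({"CPU": cpu_brighten}) yields two Jobs that are run the same way and give the same results; only the Tasks underneath differ, which is the portability argument in miniature.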

That may all sound pretty vague, but getting even that far was an achievement. There are lots of parallel-programming frameworks around (OpenCL and CUDA, for instance), but they’re intended either for heavyweight computer systems or for specific application areas like 3D graphics or protein folding. What the Multicore Association wanted was an API for embedded systems that wasn’t domain-specific.

To back up its interface, the group also created EMB2. It’s the meat in the MTAPI sandwich: a real, working library of code that implements many of the MTAPI APIs for resource-constrained embedded systems. The EMB2 library (which stands for Embedded Multicore Building Blocks, natch) was developed mostly at Siemens for use by its own developers in different business areas. Siemens has its fingers in a lot of pies, and the company’s far-flung developers were often trying to solve the same problems, on opposite sides of the world. Thus, the company decided to bite the bullet and take the lead in developing EMB2, partly to satisfy its own requirements and partly because it’s an active member of the Multicore Association and that’s the right thing to do.

EMB2 is free, open-source code with its own website. It comes with its own task scheduler, although you’re free to use an RTOS or other operating system if you prefer. Key among its design guidelines was determinism: EMB2 does not allocate memory on the fly. Once an object is allocated or a task instantiated, its memory is fixed. No runtime garbage collection here.
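The "allocate once, then never again" policy is easy to illustrate with a fixed-capacity pool. This is a minimal sketch of the general technique, not EMB2 code; the FixedPool class and its methods are hypothetical names. Python manages its own memory behind the scenes, so the example shows only the policy: every slot is created at start-up, and running out is reported rather than papered over by growing.

```python
class FixedPool:
    """Fixed-capacity buffer pool (hypothetical illustration, not EMB2)."""

    def __init__(self, capacity: int, slot_size: int) -> None:
        # All buffers are created here, once, before any work starts.
        self._slots = [bytearray(slot_size) for _ in range(capacity)]
        self._free = list(range(capacity))  # indices of unused slots

    @property
    def available(self) -> int:
        return len(self._free)

    def acquire(self):
        """Return (index, buffer), or None if every slot is in use."""
        if not self._free:
            return None  # fail predictably instead of allocating more
        index = self._free.pop()
        return index, self._slots[index]

    def release(self, index: int) -> None:
        """Hand a slot back so it can be reused."""
        if index in self._free or not 0 <= index < len(self._slots):
            raise ValueError(f"slot {index} is not currently in use")
        self._free.append(index)
```

The point of the design is that the worst case is known up front: a caller that gets None back has hit a limit chosen at build time, not a surprise from the heap.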

The current task scheduler is fairly… straightforward. It simply dispatches Actions (in the Job-Action-Task sense) to the next idle processor core, regardless of whether it’s a DSP, a CPU, a GPU, or some other compute engine. As such, some tasks may get assigned to processors for which they’re ill-suited, but that’s better than nothing, and better than hand-coding processor affinities. Future versions of EMB2 will be a lot smarter about pairing up software with the appropriate hardware.
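A scheduler of the kind described here fits in a few lines. The sketch below is not EMB2's scheduler; the dispatch function and its tuple layouts are hypothetical. It assumes every Action takes the same amount of time, so cores become idle again in the order they were handed work. It also records whether each Action happened to land on the kind of core it would have preferred.

```python
from collections import deque
from typing import List, Tuple


def dispatch(actions: List[Tuple[str, str]],
             cores: List[Tuple[str, str]]) -> List[Tuple[str, str, bool]]:
    """Assign each action to the next idle core, ignoring core type.

    actions: (action_name, preferred_core_type) pairs
    cores:   (core_id, core_type) pairs
    Returns (action_name, core_id, well_matched) triples.
    """
    if not cores:
        raise ValueError("no cores to dispatch to")
    idle = deque(cores)  # front of the queue = next idle core
    assignments = []
    for name, preferred in actions:
        core_id, core_type = idle.popleft()
        assignments.append((name, core_id, core_type == preferred))
        # Equal-duration assumption: this core goes to the back of the line.
        idle.append((core_id, core_type))
    return assignments
```

With three cores (a CPU, a DSP, and a GPU, in that order) and four Actions that prefer DSP, CPU, GPU, and DSP respectively, only the third Action lands on a matching core. That is the "ill-suited" trade-off in concrete form: simple and fair, but blind to what each core is good at.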

Although EMB2 is a giveaway, it’s also solid, production-ready code that’s been through the corporate software-quality wringer, according to Siemens. It’s also not the only way to implement MTAPI standards. Other developers can implement the same feature set in other ways. Indeed, a few Multicore Association members have already scurried off and created their own equivalents to EMB2 for their own internal use. Commercial alternatives might also be on the horizon. EMB2 is the first, but probably not the only, solution to the multicore problem.

The MTAPI standard and the EMB2 library bridge the gap between theory and existence proof. MTAPI tells us how it should be done; EMB2 shows how it can be done. Between the two, we’re a lot farther along in our march toward robot domination.

One thought on “Get Happy with MTAPI!”

Knives have been with us for a few thousand years, but every year hundreds (more likely thousands) of new ones are designed. And many people strive to design another one as another form of art or utility, or just because they are bored, want a challenge, or more simply just because they can. And that’s actually really cool, and doesn’t really cost anyone anything.

Designing new APIs, tools, and processes just because we can, and then putting them into production code, unfortunately comes with some significant real costs and risks. They have to be maintained. They take up space and cause bloat when they are duplicates. People actually have to learn them, and train their successors, and their successors. And sometimes that knowledge gets lost, or misunderstood, and really bad things unexpectedly happen.

The real questions to be asked are fairly simple. Does this something new greatly reduce the complexity of the vast code library we must support? Does it provide something substantially new that we cannot do without? Does it replace worse solutions? Does it provide significantly greater abstraction of the problem, and completely hide those difficult details? Etc.

And when it comes to parallel programming, some specific questions are important. Does it completely hide race conditions, object and flow locking/sequencing, and transparently handle communications and messaging? Or is it just another tool for implementing fine-grained complexity?

From a hardware perspective we know that we do not need another RTL language or tool chain. We know that we really need significantly higher-level tools that automatically hide that complexity and remove the timing risks of low-level design.

This applies at the software level too … we need high-level tools to describe process and flow; we do not want low-level tools and APIs that force us to stay at the software equivalent of RTL designs. We need tools that completely and transparently instantiate sequencing, concurrent use, locking, and safe communications as part of the implementation of a high-level description, hiding the technology it runs on.

The MTAPI web site claims “Compared to existing APIs that provide task management functionality (i.e. OpenMP, TBB, Cilk), MTAPI is designed for embedded systems.” resulting from a collection of designs starting back in 2005.

Back in 2005 the RTL guys were brutally stubborn, resisting any migration away from RTL to high-level design and synthesis tools. And embedded systems of that day were really low-level designs with real discrete tiny microprocessors and memories. Embedded SOC systems today are quite literally the desktop and server systems of 2005 on a chip.

A lot has changed. OpenMP was designed starting in 1997 to hide the complexity of multi-core, multi-processor, HPC cluster hardware designs that are not that different from today’s SOCs and embedded systems in 2017. From OpenMP.org: “The users of OpenMP are working in industry and academia, in fields varying from aeronautics, automotive, pharmaceutics to finance, and on devices varying from accelerators, embedded multicore systems to high-end supercomputing systems.” This is Intel, IBM, AMD, TI, Cray, NEC, Fujitsu, and nearly every serious stakeholder in HPC systems design. http://www.openmp.org/about/members/

MTAPI.org believes they can create an API for parallel processing, but it’s not clear from their membership list that this is about the increasingly complex HPC-like systems we will be designing in the next decade. http://www.multicore-association.org/member/memberlist.php

From the TBB.org web site Intel states “Intel® Threading Building Blocks (Intel® TBB) lets you easily write parallel C++ programs that take full advantage of multicore performance, that are portable and composable, and that have future-proof scalability.”

From the CilkPlus.org web site Intel states “Intel® Cilk™ Plus is the easiest, quickest way to harness the power of both multicore and vector processing.”

MTAPI.org wants to dismiss OpenMP.org, whose goals are fully inclusive of MTAPI.org … why do we need another supposedly smaller, lower-level standard, written by people who do not have a LONG and CLEAR history of solving complex parallel programming hardware and software system-level designs, like OpenMP already solves?

OpenMP started earlier, by people that really understand parallel programming, and is working to solve the design challenges for the next generation system level designs, including embedded.

From OpenMP.org:

The strength of the OpenMP ARB comes from the diverse representation from across its member companies, all working together to ensure that the OpenMP API continues to grow and provide the stable basis for computing that it has provided for more than 15 years. Any organization providing products or services which support or depend upon the OpenMP API should consider becoming a member of the OpenMP ARB. Everyone is invited to participate, regardless of means and experience.

The OpenMP ARB today has the following subcommittees:

Accelerator subcommittee. This subcommittee deals with the development of mechanisms to describe regions of code where data and/or computation should be moved to another computing device.

Error Model subcommittee. This subcommittee defines error handling capabilities to improve the resiliency and stability of OpenMP applications in the presence of system-level, runtime-level, and user-defined errors. Features to abort parallel OpenMP execution cleanly have been defined, based on conditional cancellation and user-defined cancellation points.

Task subcommittee. This subcommittee deals with the tasking model in the OpenMP API.

Tools subcommittee. This subcommittee deals with the tools used around OpenMP.

Affinity subcommittee. This committee deals with the control of OpenMP thread affinity.

Fortran 2003 subcommittee. This committee deals with the support of Fortran 2003 features.

See: http://www.openmp.org/about/openmp-faq/#Problems