Intel released more information last week on One API – a “unified programming model to simplify application development across diverse computing architectures.” This obviously ostentatious claim evokes a firewall between the functionality of a software application and the computing hardware that is executing it. The most extreme interpretation would be that, using One API, a developer or team could develop code once and have it execute on anything from a simple MCU to a complex distributed parallel heterogeneous computing system, including CPUs, GPUs, FPGAs, neural network processors, and other specialized accelerators.
If you’ve studied the challenges of cross-platform optimization for long, your instinctive response to this claim is to put on your anti-gravity boots, fire up your perpetual motion machine, take a big old gulp of your favorite snake-oil elixir, and burp up a resounding “Nope!” It’s hard to think of a more grandiose claim one could even make in the world of application development. The idea is simply not in the realm of the feasible.
But we engineers are big science fiction fans, aren’t we? We understand the concept of temporarily suspending disbelief for the sake of fleshing out an interesting conceptual scenario. We get that accepting a false postulate can allow one to prove absolutely anything, but sometimes asking yourself, “OK, what if time travel actually IS possible?” allows one to explore useful and interesting narratives.
Let’s go there, shall we?
What if… Intel’s One API actually works? What if the company is able to deliver a set of tools and languages that effectively separate software application development from the underlying compute architecture? After all, wasn’t this the promise of the first high-level language compilers – that we could write code with no consideration whatsoever of the processor instruction set? And, didn’t compilers actually deliver on exactly that promise? When was the last time you were writing C++ or Python and needed to know the opcode for a particular machine instruction?
But the world got a bit more complicated than single-thread von Neumann machines, and the absolute abstraction promised and delivered by conventional software compilers was unceremoniously and permanently taken away. Today, the development of any application for a modern, parallel, accelerated heterogeneous computing platform requires an intimate knowledge of the underlying hardware architecture and a vast range of developer expertise, depending on the elements that comprise that architecture.
If your platform includes FPGA acceleration, for example, you’ll need some FPGA experts in the house if you plan to extract even marginally optimized performance from your FPGA. If you’re counting on GPUs for some of your heavy lifting, there had better be some folks with solid experience in CUDA, OpenCL, or something similar hanging around, or you’ll be burning a bunch of GPU power with very little return. Bring in the brave new world of systems that include accelerators for convolutional neural networks and you have to add data scientists and some other rare talent to the roster. The whole idea of taking a complex code base and executing it well on any system with an arbitrary mixture of those elements is absolute Pollyanna, pie-in-the-sky, starry-eyed rubbish.
Intel? Just how gullible do you think we all are?
But – we promised to suspend disbelief here, didn’t we? Let’s get back on track and start with Intel’s actual claims on what they plan to deliver:
Intel’s breadth of architectures span scalar (CPU), vector (GPU), matrix (AI) and spatial (FPGA). These architectures, often referred to at Intel with the acronym SVMS, require an efficient software programming model to deliver performance. One API addresses this with ease-of-use and performance, while eliminating the need to maintain separate code bases, multiple programming languages, and different tools and workflows.
OK, saying the architectures “require” such a programming model is a bit like saying we “require” cold fusion. Sure, it would be awesome if we had it, but upgrading “really nice to have” to the status of “requirement” is likely to disappoint us in the long run when we start considering things like the laws of physics. But, we digress – suspend, suspend, suspend. If we accept that “requirement,” exactly what is Intel going to do about it?
One API supports direct programming and API programming, and will deliver a unified language and libraries that offer full native code performance across a range of hardware, including CPUs, GPUs, FPGAs, and AI accelerators.
Whoa! They actually said that, and without any “Hahaha just kidding” at the end, or April 1 dateline on the press release.
How does Intel say they will accomplish this?
One API contains a new direct programming language, Data Parallel C++ (DPC++), an open, cross-industry alternative to single architecture proprietary languages. DPC++ delivers parallel programming productivity and performance using a programming model familiar to developers. DPC++ is based on C++, incorporates SYCL* from The Khronos Group and includes language extensions developed in an open community process.
One API’s powerful libraries span several workload domains that benefit from acceleration. Library functions are custom-coded for each target architecture.
Building on leading analysis tools, Intel will deliver enhanced versions of analysis and debug tools to support DPC++ and the range of SVMS architectures.
What if we look more closely at just one sub-component of that plan: the FPGA part. The single biggest barrier to the adoption of FPGAs as compute accelerators, one that FPGAs have toiled against for decades, is that producing truly optimized (or even reasonably good) implementations of any algorithm in FPGA fabric requires engineers with an intimate understanding of logic design and fluency in some hardware description language (HDL) such as Verilog or VHDL. And, the resulting application will most likely be written at a fairly low level of abstraction such as register-transfer level, which requires a sequential algorithm to be elegantly partitioned, pipelined, unrolled, quantized, characterized, and tweaked into a collection of datapaths, controllers, and memory that creates the perfect compromise between hardware resource utilization, power consumption, throughput, latency, and accuracy.
This multi-dimensional NP-complete computing problem is just one of the obstacles standing in the path of taking some software code that includes a seemingly innocuous structure such as a nested “for” loop with some computation inside, and turning it into a months-long engineering optimization nightmare with heated debates held over late-night-delivered cold pizza slices. The industry has attempted numerous solutions to this challenge, but the two most viable seem to be: 1) encapsulation of hand-optimized hardware IP blocks for popular functions that can be accessed via API and 2) algorithmic synthesis approaches such as high-level synthesis (HLS) that attempt to automatically generate optimized parallel hardware microarchitectures from sequential software code based on a set of optimization targets and metrics.
Crazy as that all sounds, we haven’t yet crossed the line into data-science-fiction. HLS tools have been steadily improving since about 1995 and are now in widespread use in the design of algorithmic blocks for custom ASICs. Xilinx has mainstreamed the use of HLS for FPGAs by including a reasonably capable HLS tool in their default tool suite, and a large number of FPGA designs have successfully taken advantage of HLS. Intel themselves now have an HLS tool for FPGA design, although Intel’s tool currently has far less mileage than Xilinx’s.
The problem with HLS tools in the context of the One API vision is the expertise required. All of the successful HLS tools the industry has seen thus far are intended to be power tools for hardware engineers, rather than automated hardware design tools for software engineers. In the hands of digital hardware experts, HLS tools have proven to dramatically reduce design cycles and have frequently produced hardware accelerators that outperform those hand-optimized by expert RTL designers. The problem is that HLS in the hands of a software engineer is like handing a violin to a professional piano player. Regardless of the pianist’s level of musical expertise, you won’t get pleasing results.
Intel, however, has claimed from the beginning that their HLS tool was aimed at software developers, and that would separate it dramatically from the HLS tools that have actually achieved widespread adoption and technical success. It remains to be seen whether Intel’s HLS tool can deliver on that promise in a meaningful way, but numerous companies have clearly failed in similar attempts. Still, it will be interesting to watch market response to Intel’s HLS, and to see how Intel integrates HLS into One API.
While we have just looked at the plausibility of one sub-branch of the One API challenge tree, rest assured that all the other branches are soberingly similar. Getting conventional code to run smoothly in a hybrid CPU/GPU environment, integrating CNN accelerators into a system, dealing with arbitrary networked heterogeneous compute resources – all of these are discrete and daunting problems. Heck, even just getting conventional applications to run efficiently on multi-core homogeneous CPUs is crazy hard. Add the idea of optimized IP function blocks being smoothly swapped in depending on the target compute platform, and Intel has thrown down a gauntlet that will be mighty challenging to pick up.
Still… what if?
How long will we have to wait to find out? “Intel will release a developer beta and additional details on the One API project in 2019’s fourth quarter.”
OK, Intel. Let’s see what you’ve got.