feature article
Subscribe Now

Intel oneAPI and DPC++: One Programming Language to Rule Them All (CPUs, GPUs, FPGAs, etc)

Nvidia has something that Intel and AMD covet. No, it’s not GPUs. Intel and AMD both make GPUs. However, they don’t have Nvidia’s not-so-secret weapon that’s a close GPU companion: CUDA, the parallel programming language that allows developers to harness GPUs to accelerate general-purpose (non-graphics) algorithms. Since its introduction in 2006, CUDA has become a tremendous and so-far unrivaled competitive advantage for Nvidia because it works with Nvidia GPUs, and only with Nvidia GPUs. Understandably, neither Intel nor AMD plan to let that competitive advantage go unchallenged.

Nvidia’s first GPU, the GeForce 256, appeared in 1999. That’s the same year that Ian Buck started working on his PhD at Stanford University, where he developed Brook, a series of extensions for the C programming language that allowed software developers to harness programmable graphics hardware, specifically GPUs, to perform general-purpose computations. Buck’s Brook extensions transformed GPUs into streaming, highly parallel, vector coprocessors. Nvidia liked what it saw in Brook, hired Buck in 2004, and introduced a more generalized set of parallel programming language extensions called CUDA in 2006.

CUDA now works with a variety of programming languages including C, C++, Fortran, Python, and MATLAB. The CUDA ecosystem has grown over the past 15 years into a significant competitive advantage for Nvidia and its GPUs. CUDA is now being used to help GPUs accelerate workloads in multiple computational domains including:

  •         Bioinformatics
  •         Computational chemistry
  •         Computational fluid dynamics
  •         Data science
  •         Machine learning and artificial intelligence (ML and AI)
  •         Weather and climate modeling

AMD’s answer to CUDA is ROCm, which closely resembles CUDA in the goal of making AMD Radeon and AMD Instinct GPUs more accessible as highly parallel vector processors that software developers can use to accelerate general-purpose computing workloads. According to the ROCm Information Portal, “AMD ROCm is the first open-source software development platform for HPC/Hyperscale-class GPU computing.” In addition, AMD-Xilinx is involved with a related project called SYCL, as you’ll see below.

Intel’s answer to CUDA is more complex and more nuanced, possibly because Intel makes so many kinds of processors including CPUs, GPUs, FPGAs, and specialized accelerators such as the Gaudi and Gaudi 2 AI processors – which can all be pressed into service as hardware-programmable accelerators for many different workloads. Perhaps because of the collective complexity of these multiple processors, Intel has developed a four-pronged workload taxonomy that just happens to overlay the company’s silicon on a 1-to-1 basis:

  • Scalar: complex workloads that run best on a CPU
  • Vector: workloads that can be decomposed into vectors of instructions or vectors of data elements and accelerated by running the code on a vector processor like a GPU
  • Matrix: workloads, including AI and ML, that perform many matrix calculations and run best on specialized AI/ML chips such as tensor processors
  • Spatial: workloads that require specialized, one-of-a-kind processors best constructed on the fly using FPGAs to meet the computational needs of the specific workload.

Of course, there’s some blurring of the lines here. Intel’s x86 CPUs have had SIMD instruction extensions since the last century, starting with 64-bit vector registers and MMX (Matrix Math eXtensions) instructions in the Pentium MX CPU, introduced in 1997. (Not to be confused with the matrix type in the Intel processor taxonomy shown above.) The latest Intel AVX-512 instruction extensions give Intel’s CPUs 512-bit SIMD vector capabilities. Similarly, Intel Agilex SoC FPGAs contain multiple Arm Cortex-A53 processors that provide SIMD vector capability through the Arm Neon instruction extensions.

In addition, the “scalar” (CPU), “vector” (GPU), and “matrix” (AI/ML processor) taxonomy types all describe the main type of calculations that the processor performs, while the “spatial” taxonomy type describes physical characteristics of the FPGA. Configured appropriately, FPGAs can execute scalar, vector, and matrix workloads but perhaps not as fast or as efficiently as specialized processors. So, I’d say that Intel’s four-pronged processor taxonomy is asymmetric or non-orthogonal, but perhaps that’s just me.

Intel’s solution to a unified programming model for all these processor types is oneAPI. Where Nvidia’s CUDA and AMD’s ROCm focus on accelerating vector workloads using a GPU’s innate vector capabilities, the oneAPI initiative aims to define a unified programming environment, toolset, and library for a computing world that now encompasses all four workload types listed above. The goal is to deliver a unified, open programming experience that reduces, and ultimately eliminates, the complexity of separately managed code bases, programming languages, tools, and workflows. To that end, Intel established oneAPI.io as an open community working to bring oneAPI tools to as many architectures as possible, with the intent of making oneAPI the one true universal multi architecture programming environment. OneAPI to rule them all. Of course, Intel will offer its own version of oneAPI to support its own silicon.

Intel’s Raja Koduri announced the oneAPI initiative and its core programming language, Data Parallel C++ (DPC++), during Intel’s Architecture Day in 2018. DPC++ is based on ISO C++ and the Khronos Group’s SYCL standards. SYCL is a royalty-free, cross-platform abstraction layer that allows software developers to write code for a collection of heterogeneous processors. Intel’s DPC++ extends these standards by providing explicit parallel constructs and offload interfaces to support a broad range of heterogeneous computing architectures and processors, including CPUs, GPUs, FPGAs, and other hardware accelerators. Intel has focused on supporting its own silicon, of course, but other parties have expanded oneAPI and DPC++ to support silicon processing elements from other semiconductor vendors, including AMD and Nvidia.

In addition to Intel DPC++, there are other SYCL implementations including:

  • Codeplay Software’s ComputeCpp
  • hipSYCL from the University of Heidelberg
  • neoSYCL from Tohoku University
  • triSYCL from AMD-Xilinx

Codeplay Software’s Chief Business Officer, Charles Macfarlane, gave an hour-long presentation during the Intel Vision event held on May 10 in Dallas where he described his company’s work with SYCL, oneAPI, and DPC++. Macfarlane explained that both Nvidia CUDA and SYCL aim to accelerate workload execution by running kernels, which are specific portions of the overall workload code, on alternative execution engines. In CUDA’s case, the target accelerators are Nvidia GPUs. For SYCL and DPC++, the choices are substantially broader.

SYCL has built-in mechanisms to permit easy retargeting of code to a variety of execution engines including CPUs, GPUs, AI engines, and FPGAs. Because SYCL is a non-proprietary standard, you can target processors from multiple silicon vendors if you have the right compiler. Various SYCL compilers support CPU architectures from multiple vendors.

Codeplay currently offers SYCL compilers that can target either Nvidia or AMD GPUs. However, an Intel blog posted on June 1 announced that the company had signed an agreement to acquire Codeplay, so it’s likely that Intel’s GPUs will soon be supported as well.

During his Intel Vision talk, Macfarlane recounted two examples that highlighted the effectiveness of oneAPI and DPC++ relative to CUDA. In the first example, the Zuse Institute Berlin took code for easyWave, a tsunami simulation workload, and automatically converted code written in CUDA for Nvidia GPUs into DPC++ code using Intel’s DPC++ Compatibility Tool (DPCT). The converted code could then be retargeted to Intel CPUs, GPUs, and FPGAs using appropriate compilers and libraries. However, that same DPC++ code could also run on Nvidia GPUs by using a different compiler and libraries.

Codeplay compiled the easyWave DPC++ code for Nvidia GPUs, and the results were within 4% of the original CUDA performance results. That experiment used machine-converted code with no additional tuning. Now, just between us, I suspect that a 4% performance loss is unlikely to get many people excited enough to convert from CUDA to DPC++, even if they acknowledge that a little tuning might achieve even better performance.

However, Macfarlane’s second example was more convincing. Codeplay used DPCT to convert N-body kernel code written in CUDA into SYCL code. The N-body kernel uses multidimensional vector math to simulate the motion of multiple particles under the influence of various physical forces. Codeplay compiled the resulting SYCL version of the N-body kernel directly, with no additional optimization or tuning. The original CUDA version of the N-body code kernel ran in 10.2 milliseconds while the converted DPC++ version of the N-body kernel ran in 8.79 milliseconds. Both results are for the same Nvidia GPU target. That’s a 14% performance improvement for machine-translated code. Optimization and tuning could further improve that result. Given the price of time on the supercomputers that generally run this type of workload, a 14% performance improvement is notable.

During his talk, Macfarlane explained that developers have two optimization levels that can make DPC++ code run even faster:

  • Auto tuning, which automatically selects the “best” algorithm from available libraries
  • Hand tuning using platform-specific optimization guidelines

There’s another optimization tool available to developers when targeting Intel silicon: the VTune Profiler. That’s Intel’s performance analysis and power optimization tool. Originally, the VTune Profiler worked only with CPU code, but Intel has extended the tool to cover code targeting Intel GPUs and FPGAs and has now integrated VTune into its oneAPI Base Toolkit.

Currently, Intel oneAPI and DPC++ support Intel CPUs and GPUs. There’s also some support for Intel FPGAs through the “Intel FPGA Add-On for the oneAPI Base Toolkit,” but the add-on supports only offline (ahead-of-time) kernel compilation for FPGAs. That means you need to generate the target kernel acceleration hardware before compiling your SYCL/DPC++ code. Support for the Intel Gaudi and Gaudi 2 AI processors from Habana Labs is a work in progress. The company expects to support these processors in the future.

It’s just a guess, but I expect that Intel’s acquisition of Codeplay will accelerate the development of oneAPI and DPC++ coverage for all of Intel’s processor types. Less clear to me was how or if Codeplay will continue to support non-Intel silicon after Intel completes this acquisition. However, just after the acquisition was announced, Codeplay’s CEO Andrew Richards published a blog on Codeplay’s site and specifically addressed this question. He wrote:

“Intel is focused on open standards, and Codeplay has led and contributed to multiple open standards including SYCL…

“This puts Codeplay in a strong position to work across the industry to bring SYCL and other open standards to both processor vendors and teams of software developers supporting the stated strategy of both parties. Codeplay is therefore at the heart of Intel’s strategy to democratize oneAPI and SYCL, ensuring that all processors support open standards.

“Intel’s support will enable Codeplay’s continued strength in the community and ecosystem. Codeplay will operate as a subsidiary business of Intel and will continue to support the multi-architecture, multi-vendor accelerator market.”

In his June 1 blog post announcing the acquisition, Intel General Manager for Software Products & Ecosystem, Joe Curley, seemed to echo Richards on this topic. He wrote:

“Subject to the closing of the transaction, which we anticipate later this quarter, Codeplay will operate as a subsidiary business as part of Intel’s Software and Advanced Technology Group (SATG). Through the subsidiary structure, we plan to foster Codeplay’s unique entrepreneurial spirit and open ecosystem approach for which it is known and respected in the industry.”

The contract to implement a DPC++ compiler to support AMD GPU-based supercomputers that Argonne National Laboratory in collaboration with Oak Ridge National Laboratory awarded to Codeplay in mid June is an encouraging sign that Codeplay and DPC++ will continue to support processors and accelerators from multiple vendors. It all looks very hopeful, and it will be interesting to see how this all plays out in the future.

One thought on “Intel oneAPI and DPC++: One Programming Language to Rule Them All (CPUs, GPUs, FPGAs, etc)”

Leave a Reply

featured blogs
Dec 5, 2023
Introduction PCIe (Peripheral Component Interconnect Express) is a high-speed serial interconnect that is widely used in consumer and server applications. Over generations, PCIe has undergone diversified changes, spread across transaction, data link and physical layers. The l...
Nov 27, 2023
See how we're harnessing generative AI throughout our suite of EDA tools with Synopsys.AI Copilot, the world's first GenAI capability for chip design.The post Meet Synopsys.ai Copilot, Industry's First GenAI Capability for Chip Design appeared first on Chip Design....
Nov 6, 2023
Suffice it to say that everyone and everything in these images was shot in-camera underwater, and that the results truly are haunting....

featured video

Dramatically Improve PPA and Productivity with Generative AI

Sponsored by Cadence Design Systems

Discover how you can quickly optimize flows for many blocks concurrently and use that knowledge for your next design. The Cadence Cerebrus Intelligent Chip Explorer is a revolutionary, AI-driven, automated approach to chip design flow optimization. Block engineers specify the design goals, and generative AI features within Cadence Cerebrus Explorer will intelligently optimize the design to meet the power, performance, and area (PPA) goals in a completely automated way.

Click here for more information

featured paper

Power and Performance Analysis of FIR Filters and FFTs on Intel Agilex® 7 FPGAs

Sponsored by Intel

Learn about the Future of Intel Programmable Solutions Group at intel.com/leap. The power and performance efficiency of digital signal processing (DSP) workloads play a significant role in the evolution of modern-day technology. Compare benchmarks of finite impulse response (FIR) filters and fast Fourier transform (FFT) designs on Intel Agilex® 7 FPGAs to publicly available results from AMD’s Versal* FPGAs and artificial intelligence engines.

Read more

featured chalk talk

Power Gridlock
The power grid is struggling to meet the growing demands of our electrifying world. In this episode of Chalk Talk, Amelia Dalton and Jake Michels from YAGEO Group discuss the challenges affecting our power grids today, the solutions to help solve these issues and why passive components will be the heroes of grid modernization.
Nov 28, 2023