Nvidia has something that Intel and AMD covet. No, it’s not GPUs. Intel and AMD both make GPUs. However, they don’t have Nvidia’s not-so-secret weapon that’s a close GPU companion: CUDA, the parallel programming language that allows developers to harness GPUs to accelerate general-purpose (non-graphics) algorithms. Since its introduction in 2006, CUDA has become a tremendous and so-far unrivaled competitive advantage for Nvidia because it works with Nvidia GPUs, and only with Nvidia GPUs. Understandably, neither Intel nor AMD plan to let that competitive advantage go unchallenged.
Nvidia’s first GPU, the GeForce 256, appeared in 1999. That’s the same year that Ian Buck started working on his PhD at Stanford University, where he developed Brook, a series of extensions for the C programming language that allowed software developers to harness programmable graphics hardware, specifically GPUs, to perform general-purpose computations. Buck’s Brook extensions transformed GPUs into streaming, highly parallel, vector coprocessors. Nvidia liked what it saw in Brook, hired Buck in 2004, and introduced a more generalized set of parallel programming language extensions called CUDA in 2006.
CUDA now works with a variety of programming languages including C, C++, Fortran, Python, and MATLAB. The CUDA ecosystem has grown over the past 15 years into a significant competitive advantage for Nvidia and its GPUs. CUDA is now being used to help GPUs accelerate workloads in multiple computational domains including:
- Computational chemistry
- Computational fluid dynamics
- Data science
- Machine learning and artificial intelligence (ML and AI)
- Weather and climate modeling
AMD’s answer to CUDA is ROCm, which closely resembles CUDA in the goal of making AMD Radeon and AMD Instinct GPUs more accessible as highly parallel vector processors that software developers can use to accelerate general-purpose computing workloads. According to the ROCm Information Portal, “AMD ROCm is the first open-source software development platform for HPC/Hyperscale-class GPU computing.” In addition, AMD-Xilinx is involved with a related project called SYCL, as you’ll see below.
Intel’s answer to CUDA is more complex and more nuanced, possibly because Intel makes so many kinds of processors including CPUs, GPUs, FPGAs, and specialized accelerators such as the Gaudi and Gaudi 2 AI processors – which can all be pressed into service as hardware-programmable accelerators for many different workloads. Perhaps because of the collective complexity of these multiple processors, Intel has developed a four-pronged workload taxonomy that just happens to overlay the company’s silicon on a 1-to-1 basis:
- Scalar: complex workloads that run best on a CPU
- Vector: workloads that can be decomposed into vectors of instructions or vectors of data elements and accelerated by running the code on a vector processor like a GPU
- Matrix: workloads, including AI and ML, that perform many matrix calculations and run best on specialized AI/ML chips such as tensor processors
- Spatial: workloads that require specialized, one-of-a-kind processors best constructed on the fly using FPGAs to meet the computational needs of the specific workload.
Of course, there’s some blurring of the lines here. Intel’s x86 CPUs have had SIMD instruction extensions since the last century, starting with 64-bit vector registers and MMX (Matrix Math eXtensions) instructions in the Pentium MX CPU, introduced in 1997. (Not to be confused with the matrix type in the Intel processor taxonomy shown above.) The latest Intel AVX-512 instruction extensions give Intel’s CPUs 512-bit SIMD vector capabilities. Similarly, Intel Agilex SoC FPGAs contain multiple Arm Cortex-A53 processors that provide SIMD vector capability through the Arm Neon instruction extensions.
In addition, the “scalar” (CPU), “vector” (GPU), and “matrix” (AI/ML processor) taxonomy types all describe the main type of calculations that the processor performs, while the “spatial” taxonomy type describes physical characteristics of the FPGA. Configured appropriately, FPGAs can execute scalar, vector, and matrix workloads but perhaps not as fast or as efficiently as specialized processors. So, I’d say that Intel’s four-pronged processor taxonomy is asymmetric or non-orthogonal, but perhaps that’s just me.
Intel’s solution to a unified programming model for all these processor types is oneAPI. Where Nvidia’s CUDA and AMD’s ROCm focus on accelerating vector workloads using a GPU’s innate vector capabilities, the oneAPI initiative aims to define a unified programming environment, toolset, and library for a computing world that now encompasses all four workload types listed above. The goal is to deliver a unified, open programming experience that reduces, and ultimately eliminates, the complexity of separately managed code bases, programming languages, tools, and workflows. To that end, Intel established oneAPI.io as an open community working to bring oneAPI tools to as many architectures as possible, with the intent of making oneAPI the one true universal multi architecture programming environment. OneAPI to rule them all. Of course, Intel will offer its own version of oneAPI to support its own silicon.
Intel’s Raja Koduri announced the oneAPI initiative and its core programming language, Data Parallel C++ (DPC++), during Intel’s Architecture Day in 2018. DPC++ is based on ISO C++ and the Khronos Group’s SYCL standards. SYCL is a royalty-free, cross-platform abstraction layer that allows software developers to write code for a collection of heterogeneous processors. Intel’s DPC++ extends these standards by providing explicit parallel constructs and offload interfaces to support a broad range of heterogeneous computing architectures and processors, including CPUs, GPUs, FPGAs, and other hardware accelerators. Intel has focused on supporting its own silicon, of course, but other parties have expanded oneAPI and DPC++ to support silicon processing elements from other semiconductor vendors, including AMD and Nvidia.
In addition to Intel DPC++, there are other SYCL implementations including:
- Codeplay Software’s ComputeCpp
- hipSYCL from the University of Heidelberg
- neoSYCL from Tohoku University
- triSYCL from AMD-Xilinx
Codeplay Software’s Chief Business Officer, Charles Macfarlane, gave an hour-long presentation during the Intel Vision event held on May 10 in Dallas where he described his company’s work with SYCL, oneAPI, and DPC++. Macfarlane explained that both Nvidia CUDA and SYCL aim to accelerate workload execution by running kernels, which are specific portions of the overall workload code, on alternative execution engines. In CUDA’s case, the target accelerators are Nvidia GPUs. For SYCL and DPC++, the choices are substantially broader.
SYCL has built-in mechanisms to permit easy retargeting of code to a variety of execution engines including CPUs, GPUs, AI engines, and FPGAs. Because SYCL is a non-proprietary standard, you can target processors from multiple silicon vendors if you have the right compiler. Various SYCL compilers support CPU architectures from multiple vendors.
Codeplay currently offers SYCL compilers that can target either Nvidia or AMD GPUs. However, an Intel blog posted on June 1 announced that the company had signed an agreement to acquire Codeplay, so it’s likely that Intel’s GPUs will soon be supported as well.
During his Intel Vision talk, Macfarlane recounted two examples that highlighted the effectiveness of oneAPI and DPC++ relative to CUDA. In the first example, the Zuse Institute Berlin took code for easyWave, a tsunami simulation workload, and automatically converted code written in CUDA for Nvidia GPUs into DPC++ code using Intel’s DPC++ Compatibility Tool (DPCT). The converted code could then be retargeted to Intel CPUs, GPUs, and FPGAs using appropriate compilers and libraries. However, that same DPC++ code could also run on Nvidia GPUs by using a different compiler and libraries.
Codeplay compiled the easyWave DPC++ code for Nvidia GPUs, and the results were within 4% of the original CUDA performance results. That experiment used machine-converted code with no additional tuning. Now, just between us, I suspect that a 4% performance loss is unlikely to get many people excited enough to convert from CUDA to DPC++, even if they acknowledge that a little tuning might achieve even better performance.
However, Macfarlane’s second example was more convincing. Codeplay used DPCT to convert N-body kernel code written in CUDA into SYCL code. The N-body kernel uses multidimensional vector math to simulate the motion of multiple particles under the influence of various physical forces. Codeplay compiled the resulting SYCL version of the N-body kernel directly, with no additional optimization or tuning. The original CUDA version of the N-body code kernel ran in 10.2 milliseconds while the converted DPC++ version of the N-body kernel ran in 8.79 milliseconds. Both results are for the same Nvidia GPU target. That’s a 14% performance improvement for machine-translated code. Optimization and tuning could further improve that result. Given the price of time on the supercomputers that generally run this type of workload, a 14% performance improvement is notable.
During his talk, Macfarlane explained that developers have two optimization levels that can make DPC++ code run even faster:
- Auto tuning, which automatically selects the “best” algorithm from available libraries
- Hand tuning using platform-specific optimization guidelines
There’s another optimization tool available to developers when targeting Intel silicon: the VTune Profiler. That’s Intel’s performance analysis and power optimization tool. Originally, the VTune Profiler worked only with CPU code, but Intel has extended the tool to cover code targeting Intel GPUs and FPGAs and has now integrated VTune into its oneAPI Base Toolkit.
Currently, Intel oneAPI and DPC++ support Intel CPUs and GPUs. There’s also some support for Intel FPGAs through the “Intel FPGA Add-On for the oneAPI Base Toolkit,” but the add-on supports only offline (ahead-of-time) kernel compilation for FPGAs. That means you need to generate the target kernel acceleration hardware before compiling your SYCL/DPC++ code. Support for the Intel Gaudi and Gaudi 2 AI processors from Habana Labs is a work in progress. The company expects to support these processors in the future.
It’s just a guess, but I expect that Intel’s acquisition of Codeplay will accelerate the development of oneAPI and DPC++ coverage for all of Intel’s processor types. Less clear to me was how or if Codeplay will continue to support non-Intel silicon after Intel completes this acquisition. However, just after the acquisition was announced, Codeplay’s CEO Andrew Richards published a blog on Codeplay’s site and specifically addressed this question. He wrote:
“Intel is focused on open standards, and Codeplay has led and contributed to multiple open standards including SYCL…
“This puts Codeplay in a strong position to work across the industry to bring SYCL and other open standards to both processor vendors and teams of software developers supporting the stated strategy of both parties. Codeplay is therefore at the heart of Intel’s strategy to democratize oneAPI and SYCL, ensuring that all processors support open standards.
“Intel’s support will enable Codeplay’s continued strength in the community and ecosystem. Codeplay will operate as a subsidiary business of Intel and will continue to support the multi-architecture, multi-vendor accelerator market.”
In his June 1 blog post announcing the acquisition, Intel General Manager for Software Products & Ecosystem, Joe Curley, seemed to echo Richards on this topic. He wrote:
“Subject to the closing of the transaction, which we anticipate later this quarter, Codeplay will operate as a subsidiary business as part of Intel’s Software and Advanced Technology Group (SATG). Through the subsidiary structure, we plan to foster Codeplay’s unique entrepreneurial spirit and open ecosystem approach for which it is known and respected in the industry.”
The contract to implement a DPC++ compiler to support AMD GPU-based supercomputers that Argonne National Laboratory in collaboration with Oak Ridge National Laboratory awarded to Codeplay in mid June is an encouraging sign that Codeplay and DPC++ will continue to support processors and accelerators from multiple vendors. It all looks very hopeful, and it will be interesting to see how this all plays out in the future.