Intel’s Gamble on oneAPI and DPC++ for Parallel Processing and Heterogeneous Computing: An Interview with Intel’s James Reinders

Intel is placing many big bets: on semiconductor process improvements, on new fabs and manufacturing plants around the world, on new packaging technologies, and even on software. One of those bets, or perhaps a group of bets, is oneAPI and Data Parallel C++ (DPC++). OneAPI is an open, cross-architecture programming model that frees developers to use a single code base across multiple architectures, and DPC++ is a parallel-programming variant of C++ based on Khronos SYCL. These bets are designed to make it easier for software developers to create relatively portable code for systems based on heterogeneous computing architectures.

James Reinders recently returned to Intel after a four-year absence. He previously spent 27 years at Intel and has a ton of parallel-processing experience under his belt. He’s the author of the book titled “Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL,” which is available as a free download.

I recently spent an hour interviewing Reinders and he covered a wide range of topics. Here’s an edited version of his views on a selected set of those topics relating to parallel processing and heterogeneous computing.

On oneAPI and SYCL

“Both oneAPI and SYCL are foundational tools that share a vision of accelerated computing based on open specifications and open projects. Both oneAPI and SYCL must serve the needs of multiple vendors and multiple architectures. Not just the needs of one vendor. Not just GPUs, CPUs, or FPGAs. The tools need to be open to the maximum extent that we can figure out how to make them open, because these languages and programming environments give you a high-performance foundation for everything else that you do.”

On Python versus C/C++ or DPC++ under oneAPI

“Python is largely written in C. Key libraries are also written in C, so it’s not that you ignore Python when developing oneAPI. If you get the foundation right, other good things happen. OneAPI says, ‘Hey, the C or C++ languages aren’t the whole world. You need libraries. You need tools. You need other languages.’ So oneAPI is kind of a blanket name not just for the languages, but for all the rest of the things you need to develop software for heterogeneous computing.”
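
Reinders’ point that Python rests on a C foundation is easy to check from Python itself. The following is a minimal sketch, assuming the standard CPython interpreter; the helper names is_compiled and pure_python_square are hypothetical and exist only for this illustration. It asks the standard inspect module whether a function is implemented in compiled code or in Python source.

```python
import inspect
import math
import platform


def is_compiled(func):
    """Return True if func is a built-in, i.e. implemented in compiled code.

    On CPython, built-ins such as len and math.sqrt are written in C.
    Other interpreters may report different results.
    """
    return inspect.isbuiltin(func)


def pure_python_square(x):
    """Hypothetical function written in plain Python, for contrast."""
    return x * x


if __name__ == "__main__":
    print("Interpreter:", platform.python_implementation())
    for func in (len, math.sqrt, pure_python_square):
        print(f"{func.__name__:>20}: compiled = {is_compiled(func)}")
```

On CPython, the first two functions should report as compiled and the third should not. This is the sense in which improving the C and C++ foundation can also benefit Python code that calls into it.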

On John Hennessy’s and David Patterson’s “Golden Age of Computers”

“John Hennessy and David Patterson are legends in our industry. Any time they’ve spoken in public over the last four years, they’ve discussed a new Golden Age of computer architecture. The version I usually point to the most is the article they put in Communications of the ACM in early 2019, where they do a great job of discussing the progression of computer architectures over time, and they end up with the answer. They say we’re entering a new Golden Age of computer architecture where specialized, Domain Specific Architectures (DSAs) are increasingly used to accelerate workloads and to get better performance per watt, which is a driving concern for some problems. OneAPI is designed to handle those DSAs in a unified programming environment.”

On Chiplets and UCIe

“If you just look at Intel’s portfolio, we’ve got all sorts of acceleration capabilities. We put specialized hardware accelerators on the same die with our processors. We’ve got GPUs. We’ve got FPGAs. We’ve got Gaudi, which is optimized for deep learning. We’ve got blockchain ASICs and we’ve got research projects including work on neuromorphic computing and graphics, and this is just Intel. You go more broadly into the industry, and you see even more diversity.

“What really brings this all home to me is the imminent use of UCIe, the Universal Chiplet Interconnect Express. You know, in the old days we plugged in PCIe cards to put different functions in computers including sound cards and some early graphics accelerators. The idea was that if you wanted to have an accelerator or something that performed a specialized function, even a sound card, you put it in a motherboard slot.

“Now the question is when you’re building up chips, what do you do? There are no slots. Increasingly our designs are multichip devices made from chiplets or tiles. [Intel’s top-of-the-line GPU] Ponte Vecchio is made with an insane number of chiplets, with 47 active tiles. How do you get all of those tiles to talk to each other when they’re from different vendors?

“You can standardize how they talk to each other. There’s been a little of this done ad hoc. You know, Intel had a SKU a while back where we paired a processor with an AMD GPU. Obviously, somebody had to agree on how these devices would talk to each other. That’s a natural sort of reason to create a standard.

“Let’s say that Intel has a Xeon CPU that uses this standard. Some other company, perhaps a startup, can develop a little chiplet that does something very specific. If that chiplet also employs that standard, suddenly that startup company can just sort of ask Intel to glue their chiplet to the Xeon CPU in the same package. You can then drop that augmented Xeon CPU into a standard motherboard that you can get from Dell or another vendor. That’s what UCIe is for.

“There are immediate benefits to this capability. You don’t have to design a new system or motherboard. You just deploy the augmented CPU in an existing system.

“And then the question is, how hard is it to get the software into a system like this? If the software tools are ready for this sort of multi-vendor, multi-architecture system, and if those tools include the compiler, the libraries, and the performance-analysis tools, then it’s much easier to develop software for this sort of augmented architecture. The barrier to entry for software is reduced. The barrier to entry for hardware goes down because of the move to chiplets and the adoption of a standard chiplet interconnect. You get to market far more quickly.”

On Intel’s Acquisition of Codeplay

“The company Codeplay became available, and Intel decided to acquire them. I was thrilled. I’ve worked with the people at Codeplay and have loved working with them. They’ve been working on Nvidia and AMD GPUs for a while but, as a commercial company, they were always looking for someone to underwrite their work. Will a customer want it? Some of the labs sometimes gave them seed money, but not enough to fully productize their work. I hesitate a little to say, “blank check,” but they essentially now have a blank check from Intel to productize their work and they don’t need to worry about anyone else paying for it. You should see results from this acquisition later this year.

“You’ll see their tools integrate with Intel’s releases of SYCL so that SYCL/DPC++ ends up being able to target all GPUs from Intel, Nvidia, and AMD. People in the know could build this sort of software using open-source tools over the last year. But let’s face it, most of us want to be as lazy as we can be. I really like just being able to download a binary with a click, install it, and have it just work, instead of building it from open-source files and reading lots of instructions to turn the files into usable tools.

“We’re also turning over the management of the oneAPI community to Codeplay and they will transform it into something that’s industry-driven. We say it’s industry-driven, but Intel had to hold the pen very tightly to make the industry drive it. Now Codeplay will run the show to help transition to full industry control.”

On Intel’s Acquisition of ArrayFire

“You know, Codeplay employs close to 100 engineers. ArrayFire has four. So, the acquisition of these two companies is different in that respect. But the folks at ArrayFire are very talented and they obviously have a deep history with the companies, with the technology. They’re real pioneers. In fact, you may have seen I put a little blog out last week mentioning the ArrayFire acquisition. (See “ArrayFire Team joins Intel for oneAPI.”)

“When I met with John [Melonakos, CEO & Co-Founder of ArrayFire] the week before I published the blog, I asked him to write something about the acquisition and what he wrote was really unassuming. I said, ‘Oh my gosh! You guys are pioneers. We need something more than this!’ John agreed, so I added some words about ArrayFire’s pioneering work because I’m hugely in love with the things they’ve done. We’re super excited to have them on board.

“Just so you know, the guys at ArrayFire developed a lot of things that eventually became the parallel toolkits and related tools in MATLAB. They sold that off or licensed those tools and then created a library of portable GPU intrinsics that are really easy to use. These intrinsics just run on anybody’s GPUs. So, they were solving the problem of writing code for GPUs without writing that code in [Nvidia’s] CUDA, so that software developers could take advantage of anybody’s GPU. Some researchers at Facebook used ArrayFire’s intrinsics to develop code for machine learning and got fantastic speedups. Their code performed better than the CUDA implementation, which is a real testament to the guys at ArrayFire. They really understand how to optimize GPU performance. Any GPU.”

On the Future of oneAPI

“I see a couple of big steps happening for oneAPI in the next few years. First of all, we have to prove that oneAPI works for Intel. We’ve done a great job showing oneAPI does a great job on our CPUs and our FPGAs. Everyone’s waiting for [support for] the [Intel] GPU Ponte Vecchio and its successors. That’s going to happen. Birthing a new architecture is always painful, no matter how much we say it won’t be. I’ve been through this more than a few times, so I think this is going to be really exciting. I’m really excited about what Ponte Vecchio will do.

“But proving that oneAPI really satisfies Intel’s needs and Intel’s customers’ needs across the board is the first big challenge. The next challenge is to show that oneAPI works well for other architectures. So, the things that I mentioned about Codeplay, regarding Nvidia and AMD support… over the next couple of years, you’ll see some interesting results published. We will publish more results this year, but over the next couple of years, I think it’ll get to the point where it becomes a common understanding that oneAPI is viable for software developers that target multiple architectures from multiple vendors. Right now, there’s plenty of evidence for that anecdotally, with early adopters publishing lots of neat papers over the last few years that show positive results, but it’s not yet common knowledge. I think in about two years, it’ll become common knowledge. That is my expectation.

“So that’s the high level. What the heck is oneAPI? You’ll see it at the Intel Innovation event. Moving oneAPI development and support to Codeplay is the next step in the evolution of the standards. I think Intel did a great job giving birth to oneAPI, but now it needs additional help, so Intel needs to let go a bit. I’m helping Intel do that and encouraging the industry to tell us what’s most important to guide oneAPI forward from here.”

8 thoughts on “Intel’s Gamble on oneAPI and DPC++ for Parallel Processing and Heterogeneous Computing: An Interview with Intel’s James Reinders”

  1. It was just a few weeks ago that I went looking to find what One API really is and found nothing other than how great it is. There is no meat on the bones, just smoke and mirrors/wishful thinking, maybe.

    Since there is nothing there, I have found an API that works.

    This article focuses on accelerators and GPU which implies processing data.
    rule 1: First there must be data, then it can be processed.
    rule 2: If the data is missing, then find the data and make it available.

    And it goes on from here…

      One API has nothing to do with those things. It is simply to program the GPUs and accelerators that are already there. Sorry folks.
