
A Cloud-On-A-Chip

The Kind of Fun Stuff Intel Gets To Do In The Labs

The only reason people pay beaucoup bucks for EDA tools is to avoid creating more mask sets than necessary. Each design must result in a production chip as quickly and painlessly as possible. So it’s easy to lose sight of the fact that some of the larger companies can make a chip just for fun. It’s the real “R” in “R&D.”

That’s the kind of stuff you get to see at ISSCC. And, in particular, Intel had an interesting talk about a chip that takes the SoC concept one step further: it’s more than a system on a chip, it’s a cloud on a chip. 48 cores built to talk to each other.

The typical approach to multicore is to keep adding cores and then do things to facilitate multi-threading – context swaps and things like that. Most programs are self-contained multi-threaded entities, so that works fine for them. From a high-level standpoint, each core sees the same architecture and memory; this is “symmetric multi-processing.”
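To make the contrast concrete, here’s a minimal shared-memory sketch (plain pthreads, nothing to do with the Intel chip itself): two threads in one program share the same address space, so they “communicate” simply by touching the same variable.

```c
/* Minimal sketch of the symmetric, shared-memory model: threads of one
 * program see the same address space, so communication is just reading and
 * writing the same variables (with a lock for safety). */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                    /* both threads touch the same memory */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* 200000: shared state, no messages */
    return 0;
}
```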

But many specialized programs, especially in the embedded world, don’t work that way. Performance can be better if you have multiple independent processes, often with each assigned to a specific core, typically with each core running its own OS – if it has any OS at all. Each of those processes does some specific thing very efficiently, sharing data as needed. Even the architectures may look different for each core, earning the labels “asymmetric multi-processing” or “heterogeneous,” depending on how things are set up.

For example, when processing an incoming packet in a networking application, one core may take care of checksum validation (which may itself involve handoff to a hardware accelerator), while another parses the headers for a particular layer of the stack. The two operations are independent, but, when these are arranged in a pipeline, one core receives a pointer to the packet buffer, does its work, and (assuming it was successful) passes the pointer on to the next core.
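As a toy sketch of that pipeline idea (everything here is invented for illustration; it’s not anyone’s actual networking code), here are two stages where only the buffer pointer is handed from one to the other:

```c
/* Toy pipeline sketch: stage 1 validates a checksum and, on success, hands
 * the same packet-buffer pointer to stage 2. The payload is never copied;
 * only the pointer moves between stages. */
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>
#include <stdio.h>

struct packet {
    uint8_t data[64];
    size_t  len;
};

/* Stage 1: checksum validation (a trivial sum-to-zero check for the sketch). */
static bool checksum_ok(const struct packet *p)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < p->len; i++)
        sum += p->data[i];
    return sum == 0;
}

/* Stage 2: "parse" the header (here, just report the first byte). */
static void parse_stage(const struct packet *p)
{
    printf("parsing packet, first header byte = 0x%02x\n", p->data[0]);
}

int main(void)
{
    struct packet pkt = { .data = { 0x45, 0xBB }, .len = 2 };  /* 0x45 + 0xBB wraps to 0 */

    /* Stage 1 runs; only the pointer is handed to stage 2 on success. */
    if (checksum_ok(&pkt))
        parse_stage(&pkt);             /* pointer handoff, no payload copy */
    else
        printf("bad checksum, packet dropped\n");
    return 0;
}
```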

At a programmer’s level, the exchange of information – in this example, passing the packet pointer – is thought of as passing messages back and forth. If you want to send some data to another process, which has a separate address space that’s inaccessible to you, you send it a message with the data. There are a couple of established protocols for doing this; in particular, the Message Passing Interface (MPI, heavier-weight) and the Multicore Communications API (MCAPI, lighter-weight, for embedded applications with stringent memory and timing requirements), both of which abstract away the hardware details of message passing.
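For a flavor of the message-passing style, here’s a minimal example using standard MPI calls (it has nothing to do with the Intel chip; it just shows one process shipping data to another process that lives in a separate address space):

```c
/* Minimal MPI example: rank 0 sends a small array to rank 1, which cannot
 * see rank 0's memory and can only receive the data as a message.
 * Build with an MPI compiler wrapper and run with, e.g., mpirun -np 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    int payload[4] = { 1, 2, 3, 4 };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send 4 ints to rank 1 with message tag 0. */
        MPI_Send(payload, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int incoming[4];
        MPI_Recv(incoming, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d %d %d %d\n",
               incoming[0], incoming[1], incoming[2], incoming[3]);
    }

    MPI_Finalize();
    return 0;
}
```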

What was interesting about the Intel project was that the architecture specifically provided a means of passing messages in hardware – and of doing so in an expedited fashion. They did this by means of a Message Passing Buffer (MPB). Each two-core tile includes a 16KB buffer; across the 24 tiles, these add up to a 384KB distributed shared memory.

The obvious question is: how do you integrate such shared memory into the standard architecture? They did this by creating a new datatype (MPMT) and tweaking the TLB. They added a single bit to the TLB line to distinguish shared MPMT data from standard unshared data.
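Conceptually – and purely as an illustration, since this is an invented field layout rather than Intel’s actual TLB format – the change amounts to one extra bit per translation entry:

```c
/* Purely illustrative layout (not Intel's real TLB entry): the idea is simply
 * that each translation entry carries one extra bit marking a page as shared
 * message-passing data rather than normal, unshared cacheable data. */
#include <stdint.h>

struct tlb_entry {
    uint32_t physical_page : 20;  /* translated frame number                 */
    uint32_t writable      : 1;
    uint32_t present       : 1;
    uint32_t mp_type       : 1;   /* the new bit: 1 = shared MPMT data       */
    uint32_t reserved      : 9;
};
```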

The shared data is not kept coherent like the normal cached data and, in fact, is primarily intended to be used on a send-message-read-message-once basis. A message is valid only until read and then is considered stale. According to this model, when sending a message, the first thing you have to do is invalidate the current message buffer contents in the TLB. This invalidation is done using a new instruction that, in a single cycle, invalidates all the MPMT entries in the TLB. This forces a cache miss, ensuring that the message will be written to the MPB.

On the receiving side, when reading a message, again, the MPMT entries in its TLB are first invalidated to ensure that the read will bring in the fresh message from the MPB.
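Putting the send and receive sides together, here’s a pseudo-C sketch of that discipline. The names are invented stand-ins: invalidate_mp_lines() models the new single-cycle invalidation instruction, and mpb_slot models a mapped window into the MPB; on silicon these would be an instruction and a hardware buffer, not plain C.

```c
/* Pseudo-C sketch of the invalidate-then-access discipline described above.
 * All names here are hypothetical stand-ins so the sketch compiles and runs
 * as a single-process simulation of the protocol. */
#include <string.h>
#include <stddef.h>
#include <stdio.h>

static char mpb_slot[64];        /* stand-in for this tile's MPB region       */
static volatile int msg_ready;   /* stand-in for a flag the receiver polls    */

static void invalidate_mp_lines(void)
{
    /* On the research chip: one instruction invalidates every MPMT-tagged
     * entry, forcing the next access to miss and go to the MPB. Here: no-op. */
}

static void send_message(const char *msg, size_t len)
{
    invalidate_mp_lines();             /* 1: dump any stale message-buffer data  */
    memcpy(mpb_slot, msg, len);        /* 2: the forced miss writes into the MPB */
    msg_ready = 1;                     /* 3: signal the receiver                 */
}

static void receive_message(char *dst, size_t len)
{
    while (!msg_ready) { }             /* wait for the sender                    */
    invalidate_mp_lines();             /* 1: avoid reading a stale cached copy   */
    memcpy(dst, mpb_slot, len);        /* 2: the forced miss reads the fresh MPB */
    msg_ready = 0;                     /* 3: message read once, now stale        */
}

int main(void)
{
    char out[16];
    send_message("hello", 6);
    receive_message(out, 6);
    printf("got: %s\n", out);
    return 0;
}
```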

This, of course, raises the question: do you have to do it this way? What if you want to do something differently and not invalidate the cache all the time? What if you want, for instance, to keep the message cached and read it multiple times?

It sounds like this is doable; in fact, given that this is a research project, it sounds like many things are doable – at least for the moment. Whether those things survive into a production version remains to be seen. As it stands, at the lowest level, the programmer would have to remember to issue the invalidation instruction each time for things to work as described. The flip side of that is that, if the programmer wants to hijack the mechanism and use it in some other novel way, that might be possible.

Given that most people will want to adhere to orthodoxy, Intel has developed a moderately low-level message-passing library they call RCCE. It’s much more lightweight than MPI, and they expect that this is how people would generally program the chip, avoiding the risk of forgetting the invalidation step.
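Here’s a hedged sketch of what RCCE-style code might look like: core 0 ships a small buffer to core 1 while the library deals with the MPB and the invalidation behind the scenes. It follows the published RCCE interface, but the exact argument types and return conventions may differ from this sketch.

```c
/* Sketch of RCCE-style message passing between two cores ("units of
 * execution"); exact signatures are approximate. */
#include <stdio.h>
#include "RCCE.h"

int main(int argc, char **argv)
{
    char buf[32];

    RCCE_init(&argc, &argv);
    int me = RCCE_ue();                 /* this core's ID                        */

    if (me == 0) {
        snprintf(buf, sizeof(buf), "hello from core 0");
        RCCE_send(buf, sizeof(buf), 1); /* library stages the message in the MPB */
    } else if (me == 1) {
        RCCE_recv(buf, sizeof(buf), 0); /* library invalidates and reads the MPB */
        printf("core 1 got: %s\n", buf);
    }

    RCCE_finalize();
    return 0;
}
```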

The other nuance here is the role of the MPB itself. There’s nothing suggesting that users are required to use it; they can pass messages through off-chip memory as well. In fact, messages above 16K in size don’t fit in the MPB and must use DRAM. But the penalty is performance. Says Intel’s Tim Mattson, who co-wrote RCCE along with Rob van der Wijngaart, “[We] are old fashioned HPC programmers.  The latency hit of using DRAM for message passing was abhorrent to us.”

They did some performance comparisons as the message size was increased. For message sizes up to about 16K, message passing was 15 times faster using the MPB than using DRAM. That advantage went away for larger messages simply because larger messages overflowed the MPB and went to DRAM.

As a result, they designed RCCE to use only the MPB; if a message is too big, it will be broken into multiple messages that fit in the MPB. Another group of developers at Intel is modifying a shared-memory model they refer to as MYO that will handle interaction with off-chip DRAM. Eventually, the capabilities of MYO and RCCE will be combined into a single interface: the MYO model will abstract away the memory model for safe, simple computing, whereas, for those who like to live life on the edge, RCCE-like operation will require the programmer to understand the memory structure explicitly.
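As an illustration of that chunking idea (invented names, not RCCE’s internal code), a large message simply gets cut into pieces no bigger than the per-core MPB share – assumed here to be 8KB, half of a tile’s 16KB buffer – and each piece is sent on its own:

```c
/* Illustrative chunking sketch: a message larger than the per-core MPB slice
 * is split into slice-sized pieces, each sent as its own message. */
#include <stddef.h>
#include <string.h>
#include <stdio.h>

#define MPB_CHUNK 8192   /* assumed per-core share of the 16KB tile buffer */

/* Stand-in for a single MPB-sized send; on the real chip this is where the
 * invalidate-and-copy protocol sketched earlier would run. */
static void send_one_chunk(const char *chunk, size_t len, int dest)
{
    printf("sending %zu bytes to core %d\n", len, dest);
    (void)chunk;
}

static void send_large_message(const char *msg, size_t len, int dest)
{
    size_t offset = 0;
    while (offset < len) {
        size_t piece = len - offset;
        if (piece > MPB_CHUNK)
            piece = MPB_CHUNK;          /* never exceed what fits in the MPB */
        send_one_chunk(msg + offset, piece, dest);
        offset += piece;
    }
}

int main(void)
{
    static char big[20000];             /* bigger than one MPB chunk */
    memset(big, 'x', sizeof(big));
    send_large_message(big, sizeof(big), 1);   /* 8192 + 8192 + 3616 bytes */
    return 0;
}
```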

It’s not clear when this model will be available for the outside world to play with. For now it’s a research toy. We’ll stay tuned for any signs of life in the real world, where it might be of service in computers running the many tools required by the other semiconductor companies that don’t get to do chips just for fun.

