A Cloud-On-A-Chip

The Kind of Fun Stuff Intel Gets To Do In The Labs

The only reason people pay beaucoup bucks for EDA tools is to avoid creating more mask sets than necessary. Each design must result in a production chip as quickly and painlessly as possible. So it’s easy to lose sight of the fact that some of the larger companies can make a chip just for fun. It’s the real “R” in “R&D.”

That’s the kind of stuff you get to see at ISSCC. And, in particular, Intel had an interesting talk about a chip that takes the SoC concept one step further: it’s more than a system on a chip; it’s a cloud on a chip – 48 cores built to talk to each other.

The typical approach to multicore is to keep adding cores and then do things to facilitate multi-threading – context swaps and things like that. Most programs are self-contained multi-threaded entities, so that works fine for them. From a high-level standpoint, each core sees the same architecture and memory; this is “symmetric multi-processing.”

But many specialized programs, especially in the embedded world, don’t work that way. Performance can be better with multiple independent processes, often each assigned to a specific core, and typically with each core running its own OS – if it has any OS at all. Each of those processes does some specific thing very efficiently, sharing data as needed. Even the architectures may differ from core to core, earning the labels “asymmetric multi-processing” and “heterogeneous,” depending on how things are set up.

For example, when processing an incoming packet in a networking application, one core may take care of checksum validation (which may itself involve handoff to a hardware accelerator), while another parses the headers for a particular layer of the stack. The two operations are independent, but, when these are arranged in a pipeline, one core receives a pointer to the packet buffer, does its work, and (assuming it was successful) passes the pointer on to the next core.

At a programmer’s level, the exchange of information – in this example, passing the packet pointer – is thought of as passing messages back and forth. If you want to send some data to another process, which has a separate address space that’s inaccessible to you, you send it a message containing the data. There are a couple of established standards for doing this – in particular, the Message Passing Interface (MPI, heavier-weight) and the Multicore Communications API (MCAPI, lighter-weight, for embedded applications with stringent memory and timing requirements) – where the hardware details of message passing are abstracted away.
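
For readers who haven’t used these libraries, a minimal MPI sketch of the model looks something like this; the two-stage pipeline and the buffer contents are illustrative, not taken from the Intel talk:

```c
/* Minimal sketch of the message-passing model: rank 0 does its stage of the
 * work and hands the data to rank 1. Buffer contents and tag are illustrative. */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv) {
    int  rank;
    char packet[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* "Checksum" stage: do some work, then hand the data to the next stage. */
        strcpy(packet, "header+payload");
        MPI_Send(packet, (int)sizeof(packet), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* "Header parsing" stage: block until the previous stage hands us the data. */
        MPI_Recv(packet, (int)sizeof(packet), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```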

What was interesting about the Intel project was that the architecture specifically provided a means of passing messages in the hardware – and doing so in an expedited fashion. They did this by means of a Message Passing Buffer (MPB): each two-core tile includes a 16KB buffer, and the 24 tiles together constitute a 384KB distributed shared memory.

The obvious question is: how do you integrate such shared memory into the standard architecture? They did this by creating a new datatype (MPMT) and tweaking the TLB: a single bit added to each TLB entry distinguishes shared MPMT data from standard unshared data.

The shared data is not kept coherent the way normal cached data is; in fact, it’s primarily intended to be used on a send-once, read-once basis. A message is valid only until read and is then considered stale. Under this model, when sending a message, the first thing you have to do is invalidate the current message-buffer contents. This is done with a new instruction that, in a single cycle, invalidates all of the MPMT entries. That forces a miss on the next access, ensuring that the message actually gets written to the MPB rather than being satisfied by a stale cached copy.

On the receiving side, the same thing happens: the reader first invalidates its MPMT entries so that the read brings in the fresh message from the MPB rather than a stale cached copy.
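
In rough C-style pseudocode, the protocol looks something like the sketch below. The `invalidate_mpmt()` intrinsic is a placeholder for the new single-cycle instruction (its actual name isn’t given here), and `mpb_slot` is assumed to point into an MPB region mapped with the MPMT attribute:

```c
/* Hypothetical sketch of the low-level send/receive protocol described above. */
#include <stddef.h>

/* Placeholder for the new single-cycle invalidation instruction. */
static inline void invalidate_mpmt(void) { /* stand-in; not a real intrinsic */ }

void send_message(volatile char *mpb_slot, const char *msg, size_t len) {
    invalidate_mpmt();                    /* drop any stale MPMT data           */
    for (size_t i = 0; i < len; i++)      /* writes now go through to the MPB   */
        mpb_slot[i] = msg[i];
    /* ...then notify the receiver, e.g. via a flag or interrupt (not shown)    */
}

void recv_message(char *dst, const volatile char *mpb_slot, size_t len) {
    invalidate_mpmt();                    /* make sure we don't read stale data */
    for (size_t i = 0; i < len; i++)      /* reads now fetch fresh MPB contents */
        dst[i] = mpb_slot[i];
}
```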

This, of course, raises the question: do you have to do it this way? What if you want to do something different and not invalidate every time? What if you want, for instance, to keep the message cached and read it multiple times?

It sounds like this is doable; in fact, given that this is a research project, it sounds like many things are doable – at least for the moment. Whether those things survive into a production version remains to be seen. As it stands, at the lowest level, the programmer has to remember to issue the invalidation instruction each time for things to work as described. The flip side is that, if the programmer wants to hijack the mechanism and use it in some other novel way, that might be possible.

Given that most people will want to adhere to orthodoxy, Intel has developed a moderately low-level message-passing library called RCCE. It’s much more lightweight than MPI, and they expect it to be how people generally program the chip – so that nobody has to remember the invalidation step by hand.
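
A hedged sketch of what RCCE code looks like is below; the calls follow the basic send/receive interface as publicly documented, but exact signatures and behavior should be checked against the library itself:

```c
/* Sketch of the RCCE programming model: core 0 sends a small message to core 1. */
#include <stdio.h>
#include "RCCE.h"

int main(int argc, char **argv) {
    char msg[32] = "hello from core 0";

    RCCE_init(&argc, &argv);
    int me = RCCE_ue();                      /* this core's "unit of execution" id */

    if (me == 0)
        RCCE_send(msg, sizeof(msg), 1);      /* blocking send to core 1            */
    else if (me == 1) {
        RCCE_recv(msg, sizeof(msg), 0);      /* blocking receive from core 0       */
        printf("core 1 got: %s\n", msg);
    }

    RCCE_finalize();
    return 0;
}
```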

The other nuance here is the role of the MPB itself. There’s nothing requiring users to use it; they can pass messages through off-chip memory as well. In fact, messages over 16K don’t fit in the MPB and must use DRAM. But the penalty is performance. Says Intel’s Tim Mattson, who co-wrote RCCE along with Rob van der Wijngaart, “[We] are old fashioned HPC programmers. The latency hit of using DRAM for message passing was abhorrent to us.”

They did some performance comparisons as the message size was increased. For message sizes up to about 16K, message passing was 15 times faster using the MPB than using DRAM. That advantage went away for larger messages simply because larger messages overflowed the MPB and went to DRAM.

As a result, they designed RCCE to use only the MPB; if a message is too big, it’s broken into multiple pieces that do fit, as sketched below. Another group at Intel is adapting a shared-memory model they call MYO to handle interaction with off-chip DRAM. Eventually, the capabilities of MYO and RCCE will be combined into a single interface: the MYO model will abstract away the memory model for safe, simple computing, while, for those who like to live life on the edge, RCCE-like operation will require the programmer to understand the memory structure explicitly.
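
Conceptually, the chunking is nothing more than a loop like the following – a simplification of what RCCE actually does, with a hypothetical `send_chunk()` standing in for the per-piece MPB transfer:

```c
/* Conceptual sketch: split a message larger than the MPB into MPB-sized pieces. */
#include <stddef.h>
#include <stdio.h>

#define MPB_CHUNK (16 * 1024)   /* one tile's MPB size, per the article */

/* Hypothetical stand-in for the per-piece transfer that RCCE performs internally. */
static void send_chunk(const char *buf, size_t len, int dest) {
    printf("sending %zu bytes to core %d\n", len, dest);
    (void)buf;
}

void send_large(const char *buf, size_t len, int dest) {
    while (len > 0) {
        size_t piece = (len < MPB_CHUNK) ? len : MPB_CHUNK;
        send_chunk(buf, piece, dest);   /* each piece fits in the MPB */
        buf += piece;
        len -= piece;
    }
}
```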

It’s not clear when this model will be available for the outside world to play with. For now it’s a research toy. We’ll stay tuned for any signs of life in the real world, where it might be of service in computers running the many tools required by the other semiconductor companies that don’t get to do chips just for fun.
