
A Cloud-On-A-Chip

The Kind of Fun Stuff Intel Gets To Do In The Labs

The only reason people pay beaucoup bucks for EDA tools is to avoid creating more mask sets than necessary. Each design must result in a production chip as quickly and painlessly as possible. So it’s easy to lose sight of the fact that some of the larger companies can make a chip just for fun. It’s the real “R” in “R&D.”

That’s the kind of stuff you get to see at ISSCC. And, in particular, Intel had an interesting talk about a chip that takes the SoC concept one step further: it’s more than a system on a chip; it’s a cloud on a chip – 48 cores built to talk to each other.

The typical approach to multicore is to keep adding cores and then do things to facilitate multi-threading – context swaps and things like that. Most programs are self-contained multi-threaded entities, so that works fine for them. From a high-level standpoint, each core sees the same architecture and memory; this is “symmetric multi-processing.”

But many specialized programs, especially in the embedded world, don’t work that way. Performance can be better if you have multiple independent processes, often with each assigned to a specific core, typically with each core running its own OS – if it has any OS at all. Each of those processes does some specific thing very efficiently, sharing data as needed. Even the architectures may differ from core to core, earning the labels “asymmetric multi-processing” and “heterogeneous,” depending on how things are set up.

For example, when processing an incoming packet in a networking application, one core may take care of checksum validation (which may itself involve handoff to a hardware accelerator), while another parses the headers for a particular layer of the stack. The two operations are independent, but, when these are arranged in a pipeline, one core receives a pointer to the packet buffer, does its work, and (assuming it was successful) passes the pointer on to the next core.

At a programmer’s level, the exchange of information – in this example, passing the packet pointer – is thought of as passing messages back and forth. If you want to send some data to another process, one with a separate address space that’s inaccessible to you, you send it a message containing the data. There are a couple of established APIs for doing this – in particular, the Message Passing Interface (MPI, heavier-weight) and the Multicore Communications API (MCAPI, lighter-weight, intended for embedded applications with stringent memory and timing requirements) – both of which abstract away the hardware details of message-passing.
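To make that concrete, here’s a minimal sketch of the packet-pipeline example expressed in MPI terms, with rank 0 standing in for the checksum core and rank 1 for the header-parsing core. The packet contents, the toy checksum, and the buffer size are invented purely for illustration; only the MPI calls themselves are real.

```c
/* Two-stage packet pipeline as explicit message passing.
 * Run with: mpirun -np 2 ./pipeline                                  */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

#define PKT_LEN 64

int main(int argc, char **argv) {
    int rank;
    uint8_t pkt[PKT_LEN] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Stage 1: validate the (toy) checksum, then pass the packet on.
         * On a single chip a pointer might suffice; between separate
         * address spaces, the message carries the data itself.          */
        uint8_t sum = 0;
        for (int i = 0; i < PKT_LEN; i++) sum += pkt[i];
        if (sum == 0)  /* “valid” in this toy example */
            MPI_Send(pkt, PKT_LEN, MPI_UNSIGNED_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Stage 2: receive the packet and parse its headers. */
        MPI_Recv(pkt, PKT_LEN, MPI_UNSIGNED_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1: got %d-byte packet, parsing headers...\n", PKT_LEN);
    }

    MPI_Finalize();
    return 0;
}
```

An MCAPI or other lightweight layer would play the same role in an embedded design, but the shape of the program – explicit sends and receives between otherwise isolated processes – stays the same.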

What was interesting about the Intel project was that the architecture specifically provided a means of passing messages in the hardware – and doing so in an expedited fashion. They did this by means of a Message Passing Buffer (MPB). Each two-core tile includes a 16KB buffer; added together, they constitute a 384KB distributed shared memory.
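The arithmetic behind those figures is easy to check; the snippet below just spells it out, assuming (as the core and buffer counts imply) that the 48 cores are arranged as 24 two-core tiles.

```c
/* Sanity-checking the numbers quoted above (assumed layout: 48 cores as
 * 24 two-core tiles, each tile with a 16KB message-passing buffer). */
#define NUM_CORES        48
#define CORES_PER_TILE    2
#define MPB_PER_TILE_KB  16

#define NUM_TILES     (NUM_CORES / CORES_PER_TILE)     /* 24 tiles */
#define TOTAL_MPB_KB  (NUM_TILES * MPB_PER_TILE_KB)    /* 384KB    */

_Static_assert(TOTAL_MPB_KB == 384, "matches the figure from the talk");
```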

The obvious question is: how do you integrate such shared memory into the standard architecture? They did this by defining a new memory type (MPMT) and tweaking the TLB. They added a single bit to each TLB entry to distinguish shared MPMT data from standard unshared data.

The shared data is not kept coherent like the normal cached data and, in fact, is primarily intended to be used on a send-message-read-message-once basis. A message is valid only until read and then is considered stale. According to this model, when sending a message, the first thing you have to do is invalidate the current message buffer contents in the TLB. This invalidation is done using a new instruction that, in a single cycle, invalidates all the MPMT entries in the TLB. This forces a cache miss, ensuring that the message will be written to the MPB.

On the receiving side, when reading a message, again, the MPMT entries in its TLB are first invalidated to ensure that the read will bring in the fresh message from the MPB.
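In rough C terms, the discipline looks something like the sketch below. Everything in it – the mapped mpb_slot, the ready flag, and the invalidate_mpmt_lines() stand-in for the new single-cycle instruction – is hypothetical; the real hardware exposes MPMT-tagged memory and a dedicated invalidation instruction rather than anything with a portable C spelling.

```c
#include <stddef.h>
#include <string.h>

#define MSG_MAX 256

/* Pretend this region is mapped to the tile's MPB as MPMT-tagged memory. */
static volatile struct {
    volatile int ready;              /* receiver polls this flag            */
    char         payload[MSG_MAX];
} mpb_slot;

/* Stand-in for the new single-cycle "invalidate all MPMT entries" op;
 * a no-op placeholder here so the sketch compiles.                        */
static void invalidate_mpmt_lines(void) { }

void send_message(const char *msg, size_t len) {
    invalidate_mpmt_lines();                   /* force misses: writes go to the MPB */
    memcpy((void *)mpb_slot.payload, msg, len);
    mpb_slot.ready = 1;                        /* tell the receiver it's there       */
}

size_t recv_message(char *out, size_t max) {
    while (!mpb_slot.ready) { /* spin */ }
    invalidate_mpmt_lines();                   /* force misses: reads come from the MPB */
    memcpy(out, (const void *)mpb_slot.payload, max);
    mpb_slot.ready = 0;                        /* message is now stale                */
    return max;
}
```

The important part is simply that both sides invalidate before touching the buffer, so reads and writes hit the MPB rather than a stale cached copy.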

This, of course, raises the question: do you have to do it this way? What if you want to do things differently and not invalidate the cache all the time? What if you want, for instance, to keep the message cached and read it multiple times?

It sounds like this is doable; in fact, given that this is a research project, it sounds like many things are doable – at least for the moment. Whether those things survive into a production version remains to be seen. As it stands, at the lowest level, the programmer would have to remember to issue the invalidation instruction each time for things to work as described. The flip side of that is that, if the programmer wants to hijack the mechanism and use it in some other novel way, that might be possible.

Given that most people will want to adhere to orthodoxy, Intel has developed a moderately low-level message-passing library they call RCCE. It’s much more lightweight than MPI, and they expect that this is how people would generally program the chip – not least because it keeps anyone from forgetting the invalidation step.
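For flavor, here’s roughly what a program against such a library might look like. The calls below follow the general style RCCE is described as having, but the header name, function names, and signatures are written from memory and should be treated as assumptions rather than as the library’s actual API.

```c
#include <stdio.h>
#include <string.h>
#include "RCCE.h"   /* assumed header name */

int main(int argc, char **argv) {
    char buf[32];

    RCCE_init(&argc, &argv);      /* set up MPB-based communication        */
    int me = RCCE_ue();           /* which core ("unit of execution") am I? */

    if (me == 0) {
        strcpy(buf, "hello core 1");
        RCCE_send(buf, sizeof(buf), 1);   /* library handles MPB + invalidation */
    } else if (me == 1) {
        RCCE_recv(buf, sizeof(buf), 0);
        printf("core 1 received: %s\n", buf);
    }

    RCCE_finalize();
    return 0;
}
```

The point is that the send and receive calls hide the MPB bookkeeping and the invalidation discipline described above.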

The other nuance here is the role of the MPB itself. There’s nothing requiring users to use it; they can pass messages through off-chip memory as well. In fact, messages above 16KB don’t fit in the MPB and must use DRAM. But the penalty is performance. Says Intel’s Tim Mattson, who co-wrote RCCE along with Rob van der Wijngaart, “[We] are old fashioned HPC programmers. The latency hit of using DRAM for message passing was abhorrent to us.”

They did some performance comparisons as the message size was increased. For message sizes up to about 16KB, message passing was 15 times faster using the MPB than using DRAM. That advantage went away for larger messages, simply because they overflowed the MPB and spilled to DRAM.

As a result, they designed RCCE to use only the MPB; if a message is too big, it is broken into multiple messages that each fit in the MPB. Another group of developers at Intel is modifying a shared-memory model they refer to as MYO to handle interaction with off-chip DRAM. Eventually, the capabilities of MYO and RCCE will be combined into a single interface: the MYO model will abstract away the memory structure for safe, simple computing, whereas, for those who like to live life on the edge, RCCE-like operation will require the programmer to understand the memory structure explicitly.
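The chunking itself is simple enough to sketch. Here’s the general idea, with send_chunk() as a hypothetical stand-in for whatever primitive pushes one MPB-sized piece – it is not RCCE’s actual internal routine.

```c
#include <stdio.h>
#include <stddef.h>

#define MPB_CHUNK (16 * 1024)   /* one tile's 16KB message-passing buffer */

/* Hypothetical stand-in for the primitive that moves one MPB-sized piece. */
static void send_chunk(const char *buf, size_t len, int dest) {
    (void)buf;
    printf("sending %zu bytes to core %d via the MPB\n", len, dest);
}

/* Break an arbitrarily large message into pieces that each fit in the MPB. */
void send_large(const char *buf, size_t len, int dest) {
    size_t off = 0;
    while (off < len) {
        size_t piece = (len - off > MPB_CHUNK) ? MPB_CHUNK : (len - off);
        send_chunk(buf + off, piece, dest);
        off += piece;
    }
}
```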

It’s not clear when this model will be available for the outside world to play with. For now it’s a research toy. We’ll stay tuned for any signs of life in the real world, where it might be of service in computers running the many tools required by the other semiconductor companies that don’t get to do chips just for fun.
