
A Cloud-On-A-Chip

The Kind of Fun Stuff Intel Gets To Do In The Labs

The only reason people pay beaucoup bucks for EDA tools is to avoid creating more mask sets than necessary. Each design must result in a production chip as quickly and painlessly as possible. So it’s easy to lose sight of the fact that some of the larger companies can make a chip just for fun. It’s the real “R” in “R&D.”

That’s the kind of stuff you get to see at ISSCC. And, in particular, Intel had an interesting talk about a chip that takes the SoC concept one step further: it’s more than a system on a chip, it’s a cloud on a chip. 48 cores built to talk to each other.

The typical approach to multicore is to keep adding cores and then do things to facilitate multi-threading – context switches and the like. Most programs are self-contained multi-threaded entities, so that works fine for them. From a high-level standpoint, each core sees the same architecture and memory; this is “symmetric multi-processing.”

But many specialized programs, especially in the embedded world, don’t work that way. Performance can be faster if you have multiple independent processes, often with each assigned to a specific core, typically with each core having its own OS – if it has any OS at all. Each of those processes does some specific thing very efficiently, sharing data as needed. The architectures may even look different for each core, earning the labels “asymmetric multi-processing” and “heterogeneous,” depending on how things are set up.

For example, when processing an incoming packet in a networking application, one core may take care of checksum validation (which may itself involve handoff to a hardware accelerator), while another parses the headers for a particular layer of the stack. The two operations are independent, but, when these are arranged in a pipeline, one core receives a pointer to the packet buffer, does its work, and (assuming it was successful) passes the pointer on to the next core.
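To make the pipeline concrete, here’s a minimal sketch of the two stages; the queue_* calls, checksum_ok, parse_headers, and drop_packet are hypothetical placeholders for whatever inter-core mechanism and packet helpers a real system would provide.

```c
/* Sketch of the two-stage packet pipeline described above. All helper
 * functions and queues are hypothetical placeholders, not a real API. */
#include <stdbool.h>
#include <stddef.h>

typedef struct { unsigned char *data; size_t len; } packet_t;
typedef struct queue queue_t;

packet_t *queue_pop(queue_t *q);               /* blocks until a packet arrives  */
void      queue_push(queue_t *q, packet_t *p);
bool      checksum_ok(const packet_t *p);      /* may hand off to an accelerator */
void      parse_headers(packet_t *p);
void      drop_packet(packet_t *p);

extern queue_t rx_queue, parse_queue, next_queue;

void checksum_core(void) {                     /* runs on one core */
    for (;;) {
        packet_t *pkt = queue_pop(&rx_queue);
        if (checksum_ok(pkt))
            queue_push(&parse_queue, pkt);     /* pass only the pointer along */
        else
            drop_packet(pkt);
    }
}

void parse_core(void) {                        /* runs on another core */
    for (;;) {
        packet_t *pkt = queue_pop(&parse_queue);
        parse_headers(pkt);
        queue_push(&next_queue, pkt);          /* on to the next pipeline stage */
    }
}
```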

At a programmer’s level, the exchange of information – in this example, passing the packet pointer – is thought of as passing messages back and forth. If you want to send some data to another process, which has a separate address space that’s inaccessible to you, you send it a message containing the data. There are a couple of established protocols for doing this – in particular, the Message Passing Interface (MPI, heavier-weight) and the Multicore Communications API (MCAPI, lighter-weight, for embedded applications with stringent memory and timing requirements) – where the hardware details of message-passing are abstracted away.
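For a flavor of what that looks like to the programmer, here’s a generic two-process MPI sketch (plain MPI, not code for the Intel chip): process 0 sends a value, process 1 receives a private copy, and the runtime hides how the bytes actually travel.

```c
/* Generic MPI example: process 0 sends an int to process 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    int payload = 42;                 /* stand-in for real message data */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Sender: the receiver's address space is invisible to us. */
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receiver: gets its own copy of the data. */
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d\n", payload);
    }

    MPI_Finalize();
    return 0;
}
```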

What was interesting about the Intel project was that the architecture specifically provided a means of passing messages in the hardware – and doing so in an expedited fashion. They did this by means of a Message Passing Buffer (MPB). Each two-core tile includes a 16KB buffer; added across the chip’s 24 tiles, they constitute a 384KB distributed shared memory.

The obvious question is: how do you integrate such shared memory into the standard architecture? They did this by creating a new datatype (MPMT) and tweaking the TLB. They added a single bit to each TLB entry to distinguish shared MPMT data from standard unshared data.
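Conceptually, that extra bit just tags certain pages as message-passing memory. The sketch below is illustrative only – a made-up entry layout, since the talk didn’t spell out the hardware format – but it shows the idea of one flag separating MPMT pages from ordinary cached pages.

```c
/* Illustrative only: a made-up TLB-entry encoding in which a single
 * bit marks a page as shared MPMT (message-passing) data. */
#include <stdbool.h>
#include <stdint.h>

enum {
    TLB_PRESENT  = 1u << 0,
    TLB_WRITABLE = 1u << 1,
    TLB_MPMT     = 1u << 2,     /* 1 = shared message-passing page */
};

typedef struct {
    uint64_t virt_page;         /* virtual page number  */
    uint64_t phys_page;         /* physical page number */
    uint32_t flags;             /* TLB_* bits above     */
} tlb_entry_t;

bool is_message_memory(const tlb_entry_t *e) {
    return (e->flags & TLB_MPMT) != 0;   /* these lines bypass normal coherence */
}
```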

The shared data is not kept coherent like the normal cached data and, in fact, is primarily intended to be used on a send-message-read-message-once basis. A message is valid only until read and then is considered stale. According to this model, when sending a message, the first thing you have to do is invalidate the current message buffer contents in the TLB. This invalidation is done using a new instruction that, in a single cycle, invalidates all the MPMT entries in the TLB. This forces a cache miss, ensuring that the message will be written to the MPB.

On the receiving side, when reading a message, again, the MPMT entries in its TLB are first invalidated to ensure that the read will bring in the fresh message from the MPB.
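Pulled together, the send/receive discipline looks something like the sketch below, where mpmt_invalidate() stands in for the new single-cycle invalidation instruction and mpb_slot() for however a core locates a message slot in the MPB (all names hypothetical).

```c
/* Sketch of the send-once/read-once discipline described above.
 * mpmt_invalidate() and mpb_slot() are hypothetical stand-ins. */
#include <stddef.h>
#include <string.h>

void  mpmt_invalidate(void);      /* one cycle: invalidate all MPMT entries     */
void *mpb_slot(int core);         /* address of that core's slot in the MPB     */

void send_message(int dest, const void *msg, size_t len) {
    mpmt_invalidate();            /* force misses so the write lands in the MPB */
    memcpy(mpb_slot(dest), msg, len);
    /* ...then signal the receiver that a message is waiting                    */
}

void recv_message(int self, void *buf, size_t len) {
    mpmt_invalidate();            /* drop any stale copy lingering in the cache */
    memcpy(buf, mpb_slot(self), len);   /* read pulls the fresh message in      */
}
```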

This, of course, raises the question: do you have to do it this way? What if you want to do things differently and not invalidate the cache all the time? What if you want, for instance, to keep the message cached and read it multiple times?

It sounds like this is doable; in fact, given that this is a research project, it sounds like many things are doable – at least for the moment. Whether or not those things remain in a production version remains to be seen. As it stands, at the lowest level, the programmer would have to remember to issue the invalidation instruction each time for things to work as described. The flip side is that, if the programmer wants to hijack the mechanism and use it in some other novel way, that might be possible.

Given that most people will want to adhere to orthodoxy, Intel has developed a moderately low-level message-passing library they call RCCE. It’s much more lightweight than MPI, and they expect that this is how people will generally program the chip – which also keeps anyone from forgetting the invalidation step.
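In RCCE terms, a round trip between two cores looks roughly like the sketch below. The call names follow the published RCCE interface as I understand it, so treat the exact signatures and entry conventions as approximate.

```c
/* Rough RCCE-style sketch: core 0 sends a buffer to core 1.
 * Signatures are approximate; consult the RCCE documentation. */
#include <string.h>
#include "RCCE.h"

int main(int argc, char **argv) {
    char buf[32];

    RCCE_init(&argc, &argv);
    int me = RCCE_ue();                    /* "unit of execution" = this core's id   */

    if (me == 0) {
        strcpy(buf, "hello from core 0");
        RCCE_send(buf, sizeof(buf), 1);    /* library handles MPB use + invalidation */
    } else if (me == 1) {
        RCCE_recv(buf, sizeof(buf), 0);    /* blocks until core 0's message arrives  */
    }

    RCCE_finalize();
    return 0;
}
```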

The other nuance here is the role of the MPB itself. There’s nothing suggesting that users are required to use it; they can pass messages through off-chip memory as well. In fact, messages above 16K in size don’t fit in the MPB and must use DRAM. But the penalty is performance. Says Intel’s Tim Mattson, who co-wrote RCCE along with Rob van der Wijngaart, “[We] are old fashioned HPC programmers. The latency hit of using DRAM for message passing was abhorrent to us.”

They did some performance comparisons as the message size was increased. For message sizes up to about 16K, message passing was 15 times faster using the MPB than using DRAM. That advantage went away for larger messages simply because larger messages overflowed the MPB and went to DRAM.

As a result, they designed RCCE to use only the MPB; if a message is too big, it will be broken into multiple messages that fit in the MPB. Another group of developers at Intel is modifying a shared-memory model they refer to as MYO that will handle interaction with off-chip DRAM. Eventually, the capabilities of MYO and RCCE will be combined into a single interface: the MYO model will abstract away the memory model for safe, simple computing, whereas, for those who like to live life on the edge, RCCE-like operation will require the programmer to understand the memory structure explicitly.
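The chunking idea itself is simple enough to sketch; here, send_chunk() is a hypothetical primitive that moves at most one MPB-sized piece, and the 16K constant comes from the buffer size quoted above.

```c
/* Sketch of splitting a large message into MPB-sized pieces, in the
 * spirit of what RCCE is described as doing. send_chunk() is hypothetical. */
#include <stddef.h>

#define MPB_CHUNK (16 * 1024)     /* one tile's message-passing buffer */

void send_chunk(int dest, const char *p, size_t len);   /* hypothetical primitive */

void send_large(int dest, const char *msg, size_t len) {
    while (len > 0) {
        size_t piece = len < MPB_CHUNK ? len : MPB_CHUNK;
        send_chunk(dest, msg, piece);      /* each piece fits in the MPB */
        msg += piece;
        len -= piece;
    }
}
```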

It’s not clear when this model will be available for the outside world to play with. For now it’s a research toy. We’ll stay tuned for any signs of life in the real world, where it might be of service in computers running the many tools required by the other semiconductor companies that don’t get to do chips just for fun.
