The only reason people pay beaucoups bucks for EDA tools is to avoid creating more mask sets than necessary. Each design must result in a production chip as quickly and painlessly as possible. So it’s easy to lose sight of the fact that some of the larger companies can make a chip just for fun. It’s the real “R” in “R&D.”
That’s the kind of stuff you get to see at ISSCC. And, in particular, Intel had an interesting talk about a chip that takes the SoC concept one step further: it’s more than a system on a chip, it’s a cloud on a chip. 48 cores built to talk to each other.
The typical approach to multicore is to keep adding cores and then add features that make multi-threading easier, such as fast context switches. Most programs are self-contained multi-threaded entities, so that works fine for them. From a high-level standpoint, each core sees the same architecture and memory; this is “symmetric multi-processing.”
But many specialized programs, especially in the embedded world, don’t work that way. Performance can be better if you have multiple independent processes, often with each assigned to a specific core, typically with each core having its own OS – if it has any OS at all. Each of those processes does some specific thing very efficiently, sharing data as needed. Even the architectures may look different for each core, earning the labels “asymmetric multi-processing” and “heterogeneous” according to how things are set up.
For example, when processing an incoming packet in a networking application, one core may take care of checksum validation (which may itself involve handoff to a hardware accelerator), while another parses the headers for a particular layer of the stack. The two operations are independent, but, when these are arranged in a pipeline, one core receives a pointer to the packet buffer, does its work, and (assuming it was successful) passes the pointer on to the next core.
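To make the pipeline idea concrete, here is a minimal sketch in Python, with threads standing in for cores and queues standing in for whatever carries the pointer from one core to the next. Everything in it is an assumption made for illustration: the packet contents, the one-byte toy checksum, and the names `checksum_stage`, `parse_stage` and `run_pipeline` are all hypothetical and have nothing to do with any real networking stack.

```python
import queue
import threading

# Hypothetical packet store. The "pointer" handed between stages is an index.
# Toy format (an assumption): last byte = sum of the preceding bytes mod 256.
PACKET_BUFFERS = [
    bytes([0x45, 0x00, 0x01, 0x02, 0x48]),  # valid checksum
    bytes([0x45, 0x00, 0x01, 0x02, 0xFF]),  # deliberately corrupted
    bytes([0x60, 0x10, 0x20, 0x30, 0xC0]),  # valid checksum
]

STOP = None  # sentinel that tells a stage to shut down


def checksum_stage(inbox, outbox):
    """First 'core': validate the checksum, pass the pointer on only if it is good."""
    while True:
        ptr = inbox.get()
        if ptr is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        packet = PACKET_BUFFERS[ptr]
        if sum(packet[:-1]) % 256 == packet[-1]:
            outbox.put(ptr)   # success: hand the pointer to the next core


def parse_stage(inbox, results):
    """Second 'core': parse one header field (the high nibble of byte 0)."""
    while True:
        ptr = inbox.get()
        if ptr is STOP:
            return
        version = PACKET_BUFFERS[ptr][0] >> 4
        results.append((ptr, version))


def run_pipeline():
    to_checksum, to_parser = queue.Queue(), queue.Queue()
    results = []
    workers = [
        threading.Thread(target=checksum_stage, args=(to_checksum, to_parser)),
        threading.Thread(target=parse_stage, args=(to_parser, results)),
    ]
    for worker in workers:
        worker.start()
    for ptr in range(len(PACKET_BUFFERS)):
        to_checksum.put(ptr)
    to_checksum.put(STOP)
    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    print(run_pipeline())  # the corrupted packet never reaches the parser
```

Note that only the index travels between stages; the packet bytes stay where they are, which is the whole point of passing a pointer rather than the data.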
At the programmer’s level, the exchange of information – in this example, passing the packet pointer – is thought of as passing messages back and forth. If you want to send some data to another process, which has a separate address space that’s inaccessible to you, you send it a message with the data. There are a couple of established standards for doing this, in particular the Message Passing Interface (MPI, heavier-weight) and the Multicore Communications API (MCAPI, lighter-weight, for embedded applications with stringent memory and timing requirements); both abstract away the hardware details of message passing.
What was interesting about the Intel project was that the architecture specifically provided a means of passing messages in hardware, and of doing so quickly. They did this by means of a Message Passing Buffer (MPB). Each two-core tile includes a 16KB buffer; across the 24 tiles that hold the 48 cores, those buffers add up to 384KB of distributed shared memory.
The obvious question is: how do you integrate such shared memory into the standard architecture? They did this by creating a new datatype (MPMT) and tweaking the TLB. They added a single bit to each TLB entry to distinguish shared MPMT data from standard unshared data.
The shared data is not kept coherent like the normal cached data and, in fact, is primarily intended to be used on a send-message-read-message-once basis. A message is valid only until read and then is considered stale. According to this model, when sending a message, the first thing you have to do is invalidate the current message buffer contents in the TLB. This invalidation is done using a new instruction that, in a single cycle, invalidates all the MPMT entries in the TLB. This forces a cache miss, ensuring that the message will be written to the MPB.
On the receiving side, when reading a message, again, the MPMT entries in its TLB are first invalidated to ensure that the read will bring in the fresh message from the MPB.
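The invalidate-then-send and invalidate-then-read discipline described above can be mimicked with a toy software model. This is only a sketch of the behavior as this article describes it, not of Intel's hardware: the `Core` class, its method names, and the rule that a write which hits in the local cache stays local while a miss goes through to the MPB are all assumptions made for illustration.

```python
class Core:
    """Toy model of one core with a private cache in front of a shared MPB."""

    def __init__(self, mpb):
        self.mpb = mpb    # shared dict standing in for the Message Passing Buffer
        self.cache = {}   # addr -> value; every line here is treated as MPMT data

    def invalidate_mpmt(self):
        # Stand-in for the single-cycle "invalidate all MPMT entries" instruction.
        self.cache.clear()

    def raw_write(self, addr, value):
        if addr in self.cache:
            self.cache[addr] = value   # hit: the update stays local (assumption)
        else:
            self.mpb[addr] = value     # miss: the write goes out to the MPB

    def raw_read(self, addr):
        if addr not in self.cache:
            self.cache[addr] = self.mpb[addr]  # miss: fetch from the MPB
        return self.cache[addr]

    def send(self, addr, value):
        self.invalidate_mpmt()         # force a miss so the message reaches the MPB
        self.raw_write(addr, value)

    def recv(self, addr):
        self.invalidate_mpmt()         # force a miss so we read the fresh message
        return self.raw_read(addr)


def demo():
    mpb = {}
    sender, receiver = Core(mpb), Core(mpb)

    sender.send(0, "first")
    got_first = receiver.recv(0)       # fresh read; the line is now cached

    sender.send(0, "second")
    stale = receiver.raw_read(0)       # no invalidation: old cached copy comes back
    fresh = receiver.recv(0)           # invalidate first: new message comes back

    sender.raw_read(0)                 # sender now holds the line in its cache
    sender.raw_write(0, "third")       # forgot to invalidate: write never leaves
    left_in_mpb = mpb[0]

    return got_first, stale, fresh, left_in_mpb


if __name__ == "__main__":
    print(demo())
```

In this model, skipping the invalidation on either side fails silently: the receiver sees an old message, or the sender's new message never reaches the buffer.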
This, of course, raises the question: do you have to do it this way? What if you want to do something differently and not invalidate the cache all the time? What if you want, for instance, to keep the message cached and read it multiple times?
It sounds like this is doable; in fact, given that this is a research project, it sounds like many things are doable – at least for the moment. Whether those things survive into a production version remains to be seen. As it stands, at the lowest level, the programmer would have to remember to issue the invalidation instruction each time for things to work as described. The flip-side of that is that, if the programmer wants to hijack the mechanism and use it in some other novel way, that might be possible.
Given that most people will want to adhere to orthodoxy, Intel has developed a moderately low-level message-passing library they call RCCE. It’s much more lightweight than MPI, and they expect most people to program the chip through it, since the library takes care of the invalidation step that a programmer might otherwise forget.
The other nuance here is the role of the MPB itself. There’s nothing suggesting that users are required to use it; they can pass messages through off-chip memory as well. In fact, messages above 16K in size don’t fit in the MPB and must use DRAM. But the penalty is performance. Says Intel’s Tim Mattson, who co-wrote RCCE along with Rob van der Wijngaart, “[We] are old fashioned HPC programmers. The latency hit of using DRAM for message passing was abhorrent to us.”
They did some performance comparisons as the message size was increased. For message sizes up to about 16K, message passing was 15 times faster using the MPB than using DRAM. That advantage went away for larger messages simply because larger messages overflowed the MPB and went to DRAM.
As a result, they designed RCCE to use only the MPB; if a message is too big, it will be broken into multiple messages that will fit in the MPB. Another group of developers at Intel is modifying a shared-memory model they refer to as MYO that will handle interaction with off-chip DRAM. Eventually, the capabilities of MYO and RCCE will be combined into a single interface. The MYO model will abstract away the memory model for safe, simple computing; for those who like to live life on the edge, RCCE-like operation will require the programmer to understand the memory structure explicitly.
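The break-it-into-pieces approach is easy to picture in code. The sketch below is not RCCE; the function names and the `transmit` callback are hypothetical, and the 16KB chunk size is simply the buffer size quoted above (the space actually available to a single message may well be smaller).

```python
MPB_CHUNK_BYTES = 16 * 1024  # assumed limit, taken from the 16KB figure above


def split_message(payload, chunk_bytes=MPB_CHUNK_BYTES):
    """Cut a payload into consecutive pieces no larger than chunk_bytes."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    return [payload[i:i + chunk_bytes] for i in range(0, len(payload), chunk_bytes)]


def send_large(payload, transmit, chunk_bytes=MPB_CHUNK_BYTES):
    """Send a payload of any size as a series of buffer-sized messages.

    `transmit` is a hypothetical callback that sends one chunk.
    Returns the number of chunks sent.
    """
    chunks = split_message(payload, chunk_bytes)
    for chunk in chunks:
        transmit(chunk)
    return len(chunks)


def reassemble(chunks):
    """Receiver side: join the chunks back together in arrival order."""
    return b"".join(chunks)


if __name__ == "__main__":
    received = []
    payload = bytes(range(256)) * 160            # a 40KB message
    count = send_large(payload, received.append)
    print(count, [len(c) for c in received])
    print(reassemble(received) == payload)
```

A 40KB message here goes out as two full 16KB pieces and one 8KB remainder, and the receiver only has to concatenate them in order.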
It’s not clear when this model will be available for the outside world to play with. For now it’s a research toy. We’ll stay tuned for any signs of life in the real world, where it might be of service in computers running the many tools required by the other semiconductor companies that don’t get to do chips just for fun.