feature article

IPUs – A New Breed of Processor

Machine Learning Platforms for AI

Last year, Jim Turley wondered why we have ranges of different processors. Now I want to bring another TLA to the processor table – the IPU (Intelligent Processing Unit). This is the brainchild of Graphcore, a company that came out of stealth mode last November with the announcement of $30m Series A funding from investors including Bosch, Samsung, Amadeus Capital, C4 Ventures, Draper Esprit, Foundation Capital, and Pitango Capital.

Graphcore is based in Bristol, in the West of England, and, if its management team does not include “all the usual suspects” in the Bristol silicon and parallel-processing hot spot, it certainly includes many of them. CEO Nigel Toon was previously CEO at XMOS (where he is still Chairman) and Picochip, and, before that, he was a founder of Icera, a 3G cellular modem company. These last two were sold to larger companies. The CTO is Simon Knowles, another Icera founder, who was before that at Element 14 – a fabless semiconductor company created by people from Acorn Computer and the ex-Inmos team at STMicroelectronics. Others in the engineering team share similar backgrounds.

Graphcore is targeting the area of machine learning, which, the company argues, is not well served by the existing forms of processors but is going to be an essential tool for future developments in data analysis and artificial intelligence (AI) applications such as self-driving cars. (As an aside, another of the Bristol usual suspects, Stan Boland of Element 14, Icera and Neul, has recently announced FiveAI, a company working on artificial intelligence and computer vision for autonomous vehicles.) Toon says, “There is not an application that will not be improved by machine learning.”

Before we get into detail, let’s look at machine learning. A very simple example is something you see every day – predictive texting in a smart-phone. Your phone arrives with some predictive ability, based on how most people construct words and combine words into sentences. As you use it, it begins to recognise words and combinations of words that you use frequently, speeding up your texts, tweets and emails. A similar approach, only much more complex, is behind the voice recognition software driving Siri and Alexa.
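To make the predictive-texting idea concrete, here is a minimal sketch in Python. It assumes the simplest possible model, a count of which word follows which, and all names and sample sentences are invented for illustration. Real phone keyboards are far more sophisticated, but the principle of starting from general knowledge and adapting with use is the same.

```python
from collections import Counter, defaultdict


class BigramPredictor:
    """Toy next-word predictor: counts which word follows which."""

    def __init__(self):
        # follow_counts[prev][nxt] = how often `nxt` was seen right after `prev`
        self.follow_counts = defaultdict(Counter)

    def learn(self, text):
        """Update the counts from a piece of text."""
        words = text.lower().split()
        for prev, nxt in zip(words, words[1:]):
            self.follow_counts[prev][nxt] += 1

    def suggest(self, prev, k=3):
        """Return up to k of the most frequent followers of `prev`."""
        return [w for w, _ in self.follow_counts[prev.lower()].most_common(k)]


predictor = BigramPredictor()

# "Factory" knowledge the phone ships with (made-up sample text).
predictor.learn("see you soon . see you later . see you soon")
factory_suggestion = predictor.suggest("you", k=1)

# The user's own messages shift the counts over time.
for _ in range(3):
    predictor.learn("see you tomorrow")
personal_suggestion = predictor.suggest("you", k=1)

print(factory_suggestion, personal_suggestion)
```

After the user has typed "see you tomorrow" a few times, that phrase outweighs the factory default, and the top suggestion changes accordingly.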

More advanced machine learning is needed, for example, in remote monitoring of older people. As the number of older people in the population increases, they want to remain as independent as possible. This is also, to put it bluntly, cost efficient for society as a whole. Where we used to have their children, or even paid servants, living with them to keep watch, we are now, at least in advanced societies, moving to sensors and wearable devices that provide remote monitoring. Let us assume that the monitoring system shows an increase in pulse rate and body temperature. This could be a sign of distress, but if we know that the person has just cycled back from the local shop, then, as long as temperature and pulse return to normal within a reasonable time, there is no need to worry.

Graphcore argues that intelligence is a capacity for judgement, informed by knowledge and adapted with experience. The judgement is an approximate computation, delivering probabilistic answers where exact answers are not possible.

Knowledge can be expressed as a data model – a summary of all the data previously experienced – and can be expressed as a probability distribution.

In human learning terms, you construct a model of what happened in the past and use that knowledge to predict what is likely to happen next. Of course, as humans, we don’t abstract it like that, but, essentially, that is how it works. For machines, we need to create an abstraction.

The data model of knowledge can be constructed as graphs, with each vertex a measure of the probability of a particular feature and the edges representing correlation or causation between features. Typically each vertex links to only a few others so the graph is described as sparse.
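A sparse graph of this kind can be sketched in a few lines of Python. The features, probabilities and edge weights below are made up for illustration (they borrow from the monitoring example earlier) and say nothing about how Graphcore represents its models. The point is only that each vertex stores a probability and links to just a few others, so most possible edges are absent.

```python
# Hypothetical toy model: vertex -> probability that the feature is present.
vertex_prob = {
    "recent_exercise": 0.30,
    "distress": 0.02,
    "raised_pulse": 0.20,
    "raised_temperature": 0.10,
    "at_home": 0.80,
}

# Adjacency list: only the edges that exist are stored, each with a
# made-up weight standing in for the strength of correlation or causation.
edges = {
    "recent_exercise": {"raised_pulse": 0.9, "raised_temperature": 0.6},
    "distress": {"raised_pulse": 0.7, "raised_temperature": 0.4},
    "raised_pulse": {},
    "raised_temperature": {},
    "at_home": {},
}


def graph_density(vertices, adjacency):
    """Fraction of all possible directed edges that are actually present."""
    n = len(vertices)
    edge_count = sum(len(targets) for targets in adjacency.values())
    return edge_count / (n * (n - 1))


edge_count = sum(len(targets) for targets in edges.values())
density = graph_density(vertex_prob, edges)
print(f"{edge_count} edges, density {density:.2f}")
```

Storing only the edges that exist, rather than a full vertex-by-vertex matrix, is what keeps a sparse graph cheap to hold in memory as the number of vertices grows.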

Massively parallel processing is commonly used in applications with graphs, since it allows work on multiple edges and vertices at the same time, and it is a clear choice here. What is unusual is that the resolution of the individual probabilities is very low – only when these are aggregated is there a higher-resolution output. Calculations are carried out in small words – often half-precision floating-point – so we are looking at low-precision data in a high-performance computing environment, which is very unlike traditional high-performance computing.
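The trade-off with half-precision values can be illustrated with Python's standard `struct` module, whose `"e"` format is IEEE 754 half precision. This is a sketch of low-precision arithmetic in general, not of how the IPU works internally. The helper `to_half` and the sample values are assumptions made for the example.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# A single half-precision number carries only about three decimal digits.
single_error = abs(to_half(0.1) - 0.1)

# Ten thousand small values whose true sum is about 10.
values = [0.001] * 10000

# Accumulating entirely in half precision: once the running total is large
# enough, adding another small value is lost to rounding.
half_total = 0.0
for v in values:
    half_total = to_half(half_total + to_half(v))

# Keeping each input in half precision but aggregating in a wider type
# recovers a result close to the true sum.
wide_total = sum(to_half(v) for v in values)

print(single_error, half_total, wide_total)
```

Each input is coarse, yet the aggregate in the wider type lands close to 10, while the all-half-precision running total stalls well short of it. Low-precision data is workable as long as the aggregation step is handled with care.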

As machine learning is still an early-stage technology, the detailed models and the algorithms for processing them are still evolving. Today, people are turning from CPUs to GPUs when developing new approaches to machine learning, but these are still expensive, and, compared to the speeds needed for machine learning to take place in real time, are at least two orders of magnitude too slow. Microsoft is using FPGAs in its exploration of machine intelligence, with the argument that they need the flexibility to change as they gain greater understanding of the issues. (They are reported to be a major user of Altera FPGAs, and this was one of the drivers behind Intel’s acquisition of Altera last year.)

Google has taken the route of developing a custom ASIC, the Tensor Processing Unit (TPU – yet another TLA) for its machine-learning applications using the TensorFlow software library. The TPU is an accelerator that is used in machine-learning applications alongside CPUs and GPUs. (And it was used in the system that beat the Go master Lee Se-dol.)

Graphcore calls itself a chip company, based around its IPU. But it is offering more than that – it is offering the Poplar development framework that exploits the IPUs. Within Poplar are tools, drivers and application libraries. It has C++ and Python interfaces, and there will be seamless interfaces to MXNet – an open-source deep-learning framework, which has been adopted by Amazon – and TensorFlow, the Google software that is also available as open source.

The IPU itself has been optimised for massively parallel, low-precision floating-point compute, and so it provides much higher compute density than other solutions.

Like a human brain, the IPU holds the complete machine-learning model inside the processor and has over 100x more memory bandwidth than other solutions. This results in both lower power consumption and much higher performance. 

During 2017, Graphcore will be releasing the IPU-Appliance, which the company is aiming at data centres, both corporate and in the cloud. It aims to provide an increase in the performance of machine-learning activities by between 10x and 100x compared to today’s fastest systems. The roadmap, discussed only in very broad terms, looks at downward scaling through the IPU-Accelerator, a PCIe card to improve server-based learning applications, and eventually to moving into edge devices to carry out learning at the edge of the IoT.

While Graphcore came out of stealth only last year, CTO Simon Knowles has been working on approaches to machine learning for over five years, and the team began to be assembled over two years ago. In that time, they have had conversations with a lot of AI players, and there are strong hints that serious engagements will follow as soon as the IPU-Appliance is ready for shipping.

Artificial intelligence can be compared to a gold rush, like that in California in the late 1840s. There is considerable investment and much hard work to be done. Some players will be successful and get a great return; others will be left by the wayside. However, many of those who did well out of the gold rush were not the miners, but those who supplied the tools to carry out the prospecting – picks and shovels, in the main. Graphcore’s mission is to supply the picks and shovels for the artificial intelligence gold rush.
