
Birth of the AI Machine

Digital Design is New Again

The dawn of artificial intelligence has ironically coincided with the dusk of Moore’s Law. Just as our collective engineering genius begins to wane on the task of delivering exponential improvements in the performance and efficiency of von Neumann processors, those processors reach the point where they can do a rudimentary job on artificial intelligence applications. Apparently, either something in the cosmos has a sinister sense of irony, or the universe is warning us not to cause the rise of the machines.

Chip designers, however, welcome our new silicon overlords.

Let’s back up a bit. It’s been years since we hit the wall on processor-frequency scaling and went to multi-core architectures to get our speed. Now, with silicon process technology slowing down, power consumption (and therefore thermal issues) reaching a knee, and software compilation struggling to take advantage of ever-increasing core count, our progress in processor performance has slowed considerably. At the same time, AI technology – and convolutional neural networks in particular – burst onto the scene, demanding more than our processors can deliver and offering enormous rewards in terms of applications that were simply not possible with conventional software approaches. 

The energy of the industry has been thrown behind solving the AI challenge, and the contenders are almost too numerous, and their progress too fast, to track. At this point, there is widespread belief that accelerating feed-forward convolutional neural networks (CNNs) is a high-value bet. There are other approaches to AI, of course (and we’ll go into detail on some of those in a few weeks), but the bulk of the industry’s effort right now is going into novel hardware architectures for training and inferencing on CNNs.

This gives rise to an opportunity for an entirely new category of compute machine. And, unfortunately, nobody yet knows what it should look like. It appears we actually need two different things – machines that can do rapid training of CNNs, and machines that can do blazingly fast, energy-efficient, low-latency inferencing with those trained networks. Logic designers, let the De Morgans commence!

CNN training is generally done once, on enormous data sets, and is most often accomplished in a data center environment where compute resources are plentiful and performance metrics such as latency and throughput are not critical. Sure, faster training is a good thing, but there are no life-or-death situations hanging in the balance at training time. Today, the go-to solution seems to be massive clusters of GPUs crunching floating-point data at breakneck speed. Damn the power consumption. Full speed ahead!
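For a sense of why training ends up as floating-point work on big iron, here is a toy sketch (plain Python/NumPy, hypothetical sizes, no particular framework): every pass nudges a huge pile of weights by tiny, high-dynamic-range gradient values, and that gets repeated over enormous data sets.

    # Toy illustration of why training is floating-point work: each pass
    # nudges many weights by tiny, high-dynamic-range gradient values,
    # repeated over enormous data sets. Hypothetical sizes, no framework.
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((1000, 1000)).astype(np.float32)   # one layer's weights

    def sgd_step(W, x, target, lr=1e-3):
        y = W @ x                                # forward pass (MACs again)
        err = y - target
        grad = np.outer(err, x)                  # gradient of the squared error
        return W - lr * grad                     # tiny fp32 update to every weight

    for _ in range(100):                         # in reality: millions of steps
        x = rng.standard_normal(1000).astype(np.float32)
        target = rng.standard_normal(1000).astype(np.float32)
        W = sgd_step(W, x, target)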

Inferencing uses that trained machine in the field to do real work. If you trained your system to recognize pedestrians stepping into the street, your inferencing system must be able to perform that task with a high degree of accuracy, in real time, with minimal latency, and probably with minimal power consumption. Nobody wants a system to tell them that the object they struck 200ms ago was actually a former pedestrian. Inferencing is currently done via a wide variety of approaches and hardware architectures, and it is clearly where the most market payoff will occur.

On the training side, we land squarely in the new battle for data center dominance. Sitting right in the center of that melee is Intel, who has long held a commanding share of the data-center hardware market, driven by their Xeon processor line. Intel is clearly taking a system-level approach to retaining that dominance, and leaving no flank uncovered, by attacking with a range of technologies. On the processing front, Intel is defending their perch with everything from AI-enhanced versions of Xeon processors to FPGAs to specialized chips for inferencing – both in the data center and at the edge. But Intel is also fortifying their offerings in memory, connectivity, storage, software tools, and (perhaps most strategically) something they call “Select Solutions,” which bundles lots of Intel stuff together into a proven, working combination that OEMs can easily slap a label on and sell. This creates a formidable barrier to entry for competitors trying to sell a single-chip architecture or technology into the data center. Nobody wants to buy a Volvo and then stick a Toyota engine in it. The weaker, less dominant technologies in Intel’s portfolio can cruise into the data center on the coattails of the Xeon.

Challenging Intel’s data-center position are the usual suspects like AMD, of course, but, in the AI domain in particular, NVidia has carved out a very strong business in the AI/neural-network-training space with their GPUs (and their programming framework). Xilinx, the long-time dominant player in the FPGA market, has declared themselves to be “data center first” and is posting some impressive results in winning data center business, including key wins among the “Super 7” cloud companies. Xilinx is also touting an ambitious strategy to produce extremely powerful and efficient accelerator chips, in an attempt both to cut into NVidia’s niche and to continue their longstanding FPGA feud with Intel PSG (formerly archrival Altera).

Look for Intel to continue bundling their solutions, in order to block socket wins for the likes of NVidia and Xilinx, and to bet on a wide array of technologies and approaches to inoculate themselves from other potentially disruptive technologies from new challengers.

Moving to the more interesting inferencing side of the AI equation, it pays to take a look at the actual computational challenges of neural networks – and they are numerous. CNNs are graphs arranged in layers – often dozens or even hundreds of layers deep. The layers consist of nodes (neurons) interconnected by weighted edges (synapses). During training, the weights, or coefficients, on those edges are computed. During inference, the resulting graph with its weighted edges is used to process and classify new inputs. Computationally, inferencing stresses every aspect of system resources, and the relative requirements for memory, arithmetic, and connectivity vary with each network’s unique topology. What CNNs always are, however, is massively parallel. That makes building a “standard” processor for CNNs a difficult challenge.
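To make that arithmetic concrete, here is a deliberately naive sketch (Python/NumPy, with hypothetical layer shapes, and not any vendor’s actual library) of the multiply-accumulate (MAC) work buried inside a single convolutional layer at inference time. Every iteration of every loop is independent, which is exactly the massive parallelism that custom hardware is built to exploit.

    # Deliberately naive sketch of the MAC work inside one CNN layer during
    # inference. Shapes and names are hypothetical examples.
    import numpy as np

    def conv2d_infer(feature_map, kernels, bias):
        """Naive 2-D convolution: every output pixel is a pile of MACs."""
        in_c, h, w = feature_map.shape
        out_c, _, kh, kw = kernels.shape
        out = np.zeros((out_c, h - kh + 1, w - kw + 1), dtype=np.float32)
        for oc in range(out_c):                      # each output channel
            for y in range(out.shape[1]):            # each output row
                for x in range(out.shape[2]):        # each output column
                    patch = feature_map[:, y:y+kh, x:x+kw]
                    # trained weights ("synapses") multiply the inputs, and
                    # the products accumulate into one neuron's output
                    out[oc, y, x] = np.sum(patch * kernels[oc]) + bias[oc]
        return np.maximum(out, 0.0)                  # ReLU activation

    # e.g. a 64-channel 56x56 feature map hitting 128 3x3 kernels:
    # 128 * 54 * 54 * 64 * 3 * 3 is roughly 215 million MACs for this layer alone.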

The inference problem can be further divided between applications where latency is not critical and inferencing can be offloaded to the cloud, and those where latency is critical and inferencing must be done at the edge, largely on power-, cost-, and space-constrained hardware. For the data-center side, GPUs and CPUs still compete for the bulk of the business. NVidia has been stepping up their game in inferencing performance of late, and Intel just announced AI-specific extensions to the Xeon, called “Intel DL Boost,” which the company claims deliver roughly an 11x improvement in inferencing performance via special instructions. Other vendors are delivering GPU or FPGA acceleration solutions that run in PCIe slots.
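Those “special instructions” generally collapse a batch of low-precision multiply-accumulates into a single operation. As a rough, purely illustrative emulation (plain Python/NumPy, not the actual intrinsic or its exact semantics): several 8-bit products are summed into a 32-bit accumulator per lane, so one instruction does the work of many.

    # Rough emulation (NumPy, not the real intrinsic) of a fused 8-bit
    # dot-product-accumulate: four int8 products summed into a 32-bit
    # accumulator per output lane, in what hardware treats as one operation.
    import numpy as np

    def dp4a_like(acc_i32, a_u8, b_i8):
        """Per lane: acc += a0*b0 + a1*b1 + a2*b2 + a3*b3, computed in int32."""
        prods = a_u8.astype(np.int32) * b_i8.astype(np.int32)
        return acc_i32 + prods.reshape(-1, 4).sum(axis=1)

    acc = np.zeros(2, dtype=np.int32)
    activations = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=np.uint8)
    weights     = np.array([ 1, -1,  2, -2,  3, -3,  4, -4], dtype=np.int8)
    print(dp4a_like(acc, activations, weights))   # -> [-30 -70]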

Heavily configurable architectures, such as FPGAs, bring compelling value to the table with their massively parallel connectivity, enormous amounts of optimized arithmetic (particularly DSP-oriented multiply-accumulate (MAC) units), and large amounts of memory located very close to the computing resources. Further reinforcing the custom/configurable hardware idea, calculations can often be done at very low integer precision during inferencing. If the network is quantized to find the least precision that still delivers accurate results, massive amounts of hardware and power can be saved, and performance can be dramatically increased. Here again, FPGAs shine.
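As a toy illustration of that quantization step (Python/NumPy, random tensors, not any particular vendor’s toolchain): map the fp32 weights and activations onto int8, do the MACs in integer arithmetic, and rescale the 32-bit result at the end. Production flows search per layer, or per channel, for the lowest precision that preserves accuracy.

    # Minimal sketch of post-training quantization, with made-up tensors:
    # scale fp32 values onto int8, MAC in integer arithmetic, rescale at the end.
    import numpy as np

    def quantize(x, num_bits=8):
        qmax = 2 ** (num_bits - 1) - 1                  # 127 for int8
        scale = np.abs(x).max() / qmax
        return np.round(x / scale).astype(np.int8), scale

    w = np.random.randn(256, 256).astype(np.float32)    # trained fp32 weights
    a = np.random.randn(256).astype(np.float32)          # incoming activations
    wq, w_scale = quantize(w)
    aq, a_scale = quantize(a)

    # integer MACs: int8 * int8 accumulated in int32 -- far cheaper in
    # silicon (and in FPGA DSP blocks) than fp32 multiply-accumulate
    acc = wq.astype(np.int32) @ aq.astype(np.int32)
    y = acc.astype(np.float32) * (w_scale * a_scale)     # dequantize

    print(np.abs(y - w @ a).max())   # error stays small relative to the fp32 result

An int8 MAC is far smaller and cheaper than an fp32 one, which is why a single FPGA DSP block (or a sliver of ASIC area) can often pack more than one of them.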

Of course, at the edge, even FPGAs may be too slow, too expensive, and too power hungry. In those situations, there are a number of purpose-built neural-network accelerator devices, or a trained network can be implemented as an ASIC or an ASSP. This delivers the highest performance, lowest power, and lowest unit cost, at the expense of flexibility and development cost. Here, Intel’s recent acquisition of eASIC comes into focus, as it facilitates the creation of an ASIC solution directly from a working FPGA design.

It is clear that the surface has barely been scratched in developing hardware platforms to support artificial intelligence. AI is such a novel and demanding workload, with such astronomical potential in terms of applications and economics, that it should drive a massive wave of engineering innovation. Today, we are in the most nascent stages of that wave. It will be interesting to watch.

