We now live in a data-driven world. With the proliferation of IoT, we are rapidly ramping toward a trillion connected sensors on the planet, all gathering data at an ever-accelerating pace. In just the last year or so, it is likely that we have gathered more data than in all of human history before that. But data is useless unless we do something with it, and this deluge makes the challenge of extracting information even more difficult.
The science of big data has exploded in recent years, particularly in areas such as AI and neural networks. Convolutional neural networks (CNNs) bring a whole new world of capability to the emerging art of data analysis, and our industry is scrambling to come up with an entirely new computational infrastructure to harness that power. The IoT is deploying compute acceleration at every stage of the hierarchy, from edge-based computing adjacent to the sensors themselves, all the way back to giant cloud data centers where specialized hardware such as GPUs, FPGAs, and custom accelerators team up with conventional processors to make sense of – and extract value from – this exploding mass of data.
Much of the attention these days is focused on the processing aspect of that compute problem. From giant players like Intel to a growing ecosystem of startups, talented teams are taking up the challenge of designing new types of processors and accelerators that can drive data-centric techniques such as CNNs orders of magnitude faster and more efficiently than traditional von Neumann processors. The field is still wide open, and it’s anybody’s guess what the go-to processing engine will look like even five years from now.
But what about memory and storage?
Everyone trying to revolutionize processor architecture runs into the same challenge – our traditional storage and memory architecture, developed over decades of evolutionary progress in von Neumann machines, is woefully inadequate for this new era. The simplistic model of some on-chip cache, DDR for frequently accessed information, and persistent storage for long-term retention has gaps that create massive bottlenecks for these new processing architectures. Yet many of the industry standards that form the backbone of computing system design are predicated on that architecture. What we need is a renewed focus on storing and moving data that is comparable to the current obsession with processing.
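To make that bottleneck concrete, here is a minimal back-of-the-envelope sketch of the three-tier model just described. The tier capacities and latencies are illustrative assumptions, not measurements; the point is only how sharply average access time degrades once a working set spills out of each tier.

```python
# A rough sketch of the traditional cache / DRAM / persistent-storage model.
# All capacities and latencies below are illustrative assumptions, chosen only
# to show the "gaps" in the hierarchy, not to describe any real part.

TIERS = [
    # (name, capacity in bytes, access latency in nanoseconds) -- assumed values
    ("on-chip cache", 32 * 2**20,      10),
    ("DDR DRAM",      256 * 2**30,    100),
    ("NVMe flash",    4 * 2**40,  100_000),
]

def average_latency_ns(working_set_bytes):
    """Average access latency if accesses are spread uniformly over the working
    set, with each byte served by the cheapest tier that still has room for it."""
    remaining, total_ns = working_set_bytes, 0.0
    for _, capacity, latency in TIERS:
        held = min(remaining, capacity)
        total_ns += held * latency
        remaining -= held
        if remaining <= 0:
            break
    return total_ns / working_set_bytes

for ws in (16 * 2**20, 64 * 2**30, 1 * 2**40):
    print(f"working set {ws / 2**30:8.1f} GiB -> "
          f"avg latency ~{average_latency_ns(ws):,.0f} ns")
```

Even in this toy model, the jump from a DRAM-resident working set to one that spills into flash costs roughly three orders of magnitude in average latency – exactly the kind of cliff that starves a fast accelerator.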
FPGAs (and devices that are FPGAs but are not marketed as such) have long grappled with this data-architecture problem. Modern high-end FPGAs pack massive amounts of memory on chip, with a hyper-flexible architecture that lets data sit as close as the same cell as low-level (LUT) logic functions, pass through several levels of on-chip aggregation up to large blocks of on-chip RAM, and now reside in large-capacity, ultra-high-bandwidth HBM stacks in the same package as the FPGA die. Used cleverly, that memory flexibility can keep huge caches of data very close to where it's needed – in the middle of the massively parallel logic and arithmetic functions involved in acceleration.
FPGAs are likely to be prototyping vehicles for those designing alternative processor architectures, and those doing so would be wise to learn the lessons FPGAs teach regarding on-chip and in-package memory structure. Those lessons include keeping the computation as close to the source of the data as possible, keeping the data as close to the computation as possible, and moving the data as few times as possible. All these concepts sound simple, but putting them into practice requires new and novel architectures, as well as fundamentally new memory technologies.
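The "move the data as few times as possible" lesson can be made concrete with a rough sketch. The model below compares a three-stage processing pipeline that bounces every intermediate result off external DRAM against one that keeps intermediates in on-chip buffers, the way an FPGA design would. The data size and bandwidth figures are assumptions chosen only for illustration.

```python
# A minimal sketch of the "move the data as few times as possible" lesson:
# the same three-stage pipeline, with intermediates either bounced off external
# DRAM or kept in on-chip buffers. All figures are illustrative assumptions.

INPUT_BYTES          = 8 * 2**30   # 8 GiB of input data (assumed)
STAGES               = 3           # pipeline stages, each reads and writes its data
OFFCHIP_BW_BYTES_S   = 25e9        # assumed off-chip DRAM bandwidth, bytes/s
ONCHIP_BW_BYTES_S    = 400e9       # assumed aggregate on-chip/HBM bandwidth, bytes/s

def offchip_bounce_seconds():
    # Every stage reads its input from DRAM and writes its output back.
    bytes_moved_offchip = INPUT_BYTES * 2 * STAGES
    return bytes_moved_offchip / OFFCHIP_BW_BYTES_S

def onchip_fused_seconds():
    # Input crosses the off-chip bus once; intermediates stay on chip.
    offchip_time = INPUT_BYTES / OFFCHIP_BW_BYTES_S
    onchip_time = (INPUT_BYTES * 2 * (STAGES - 1)) / ONCHIP_BW_BYTES_S
    # A streaming design overlaps the two, so the slower path dominates.
    return max(offchip_time, onchip_time)

print(f"DRAM-bounced pipeline: ~{offchip_bounce_seconds():.2f} s of pure data movement")
print(f"On-chip fused pipeline: ~{onchip_fused_seconds():.2f} s of pure data movement")
```

Under these assumptions, simply not re-crossing the off-chip bus between stages cuts data-movement time by roughly 6x before the compute itself is even considered.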
One such example is 3D XPoint (3D crosspoint) persistent memory technology such as Intel's Optane and Micron's QuantX. Compared with NAND flash, 3D XPoint is much faster, with latencies much closer to those of RAM, and it offers greater endurance. Cost-wise, it sits between NAND flash and RAM, so it truly represents a new cost-performance point in the memory hierarchy, extending non-volatile storage up into higher-bandwidth territory.
As soon as we move out of the white-hot realm of in-package memory and onto off-chip memory buses, we begin to be constrained by the assumptions of de facto standard architectures. That makes it difficult to take advantage of new technologies such as 3D XPoint. Intel's initial answer is to bundle Optane as part of a storage system, blending a smaller portion of Optane with a large amount of NAND flash or other non-volatile storage, as a kind of cache. In the right balance, that lets systems enjoy near-RAM latencies from their non-volatile storage.
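The arithmetic behind that caching arrangement is simple enough to sketch. The model below estimates the effective read latency of a small 3D XPoint pool fronting a much larger NAND flash pool; the latency figures are order-of-magnitude assumptions, not vendor specifications.

```python
# A rough sketch of a small 3D XPoint (Optane) cache in front of a large NAND
# flash pool. Latencies are order-of-magnitude assumptions for illustration.

OPTANE_LATENCY_US = 10.0    # assumed read latency of the 3D XPoint tier
NAND_LATENCY_US   = 100.0   # assumed read latency of the NAND flash tier

def effective_latency_us(hit_rate):
    """Average read latency when a fraction `hit_rate` of accesses is served
    from the 3D XPoint cache and the rest fall through to NAND flash."""
    return hit_rate * OPTANE_LATENCY_US + (1.0 - hit_rate) * NAND_LATENCY_US

for hit_rate in (0.5, 0.8, 0.95):
    print(f"hit rate {hit_rate:.0%}: ~{effective_latency_us(hit_rate):.0f} us "
          f"effective read latency")
```

The "right balance" the article mentions is visible here: the benefit hinges almost entirely on keeping the hit rate in the cheap, fast tier high.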
However, to truly exploit the advantages of new technologies such as 3D XPoint, we probably need to break down the traditional demarcation between what we consider "memory" and what we consider "storage." Latency, throughput, persistence, density, and cost are likely to become more of a continuum that can be exploited at the system level, at the expense of traditional plug-and-play interfaces. Or, more versatile interfaces may emerge that can take advantage of this new, finer-grained memory/storage hierarchy.
Orthogonal to the challenge of a continuous hierarchy is the question of access. In a homogeneous compute system, such as one with conventional multi-core processors, the problem of memory access for the compute elements is much more straightforward than in emerging heterogeneous computing architectures. When different types of processing elements – larger and smaller conventional processor cores, say, or processors working together with accelerators such as FPGAs – need to share access to a memory/storage subsystem, the design and allocation of data pipes within that system can be one of the biggest challenges.
Each option one examines for putting accelerators in the data center, for example, presents challenges and compromises in efficiently getting the data where it needs to be. If the processor and accelerator are too far apart (such as with pools of FPGAs in some systems), overall system performance can be constrained by memory and storage access rather than by the performance of the accelerators themselves. When one begins to explore the issue of data location – from the hierarchy inside a logic device such as an FPGA, to package-level resources like HBM, to board-level and enterprise-level memory/storage from DRAM to HDDs – the optimization problem can be overwhelming. Shifting away from the general case, however, and focusing on a specific application domain – such as neural-network inferencing – can transform this from a black art into a manageable engineering problem.
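As a sketch of what that manageable engineering problem might look like for inferencing, the model below bounds throughput by the slower of compute and data delivery for several places a model's weights could live. Every figure here (model size, ops per inference, peak throughput, link bandwidths) is a hypothetical placeholder rather than a measured system.

```python
# A minimal sketch of the data-location question for neural-network inferencing.
# All figures below are hypothetical placeholders chosen only for illustration.

MODEL_WEIGHT_BYTES = 50e6    # assumed weights that must reach the accelerator
OPS_PER_INFERENCE  = 10e9    # assumed operations per inference
ACCEL_PEAK_OPS     = 100e12  # assumed accelerator peak throughput, ops/s

# Candidate places the weights could live, with assumed delivery bandwidths.
DATA_PATHS = {
    "in-package HBM":        400e9,   # bytes/s
    "board-level DRAM":       50e9,
    "remote pooled storage":   5e9,
}

def inferences_per_second(link_bw_bytes_s, weights_resident):
    """Crude bound: the slower of compute and data delivery wins. If the
    weights stay resident on the accelerator, the link cost is amortized to
    ~zero here; otherwise the weights cross the link on every inference."""
    compute_bound = ACCEL_PEAK_OPS / OPS_PER_INFERENCE
    bytes_per_inf = 0.0 if weights_resident else MODEL_WEIGHT_BYTES
    data_bound = float("inf") if bytes_per_inf == 0 else link_bw_bytes_s / bytes_per_inf
    return min(compute_bound, data_bound)

compute_only = ACCEL_PEAK_OPS / OPS_PER_INFERENCE
for path, bw in DATA_PATHS.items():
    rate = inferences_per_second(bw, weights_resident=False)
    print(f"{path:22s}: ~{rate:,.0f} inf/s (compute bound is ~{compute_only:,.0f} inf/s)")
```

Under these assumptions, the same accelerator delivers anywhere from a few percent to nearly all of its compute-bound throughput depending purely on where the data sits – which is the point of the paragraph above.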
Behind the scenes in the entire memory/storage question is the issue of power consumption. At every level of the IoT – from edge to data center and back – power consumption is becoming the key driver. At the edge, we often need ultra-low-power solutions that can run on battery or even harvested energy. In the data center, the key constraint is often the amount of energy the power company can deliver to the building. In each of those cases, the amount of energy consumed by memory and storage can be as important as (or more important than) the energy consumed in computation. Every time we move data, we consume power. The earlier in the data path we can reduce, filter, and organize data for subsequent use, the more power we can save.
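A rough sketch shows why reducing data early matters so much for power. The comparison below pits shipping raw sensor data upstream against filtering it at the edge first; the energy-per-byte figures, data volumes, and reduction ratio are loose, illustrative assumptions.

```python
# A minimal sketch of "the earlier we reduce the data, the more power we save."
# Energy-per-byte figures and data volumes are loose, illustrative assumptions.

RAW_BYTES_PER_DAY = 10 * 2**30     # assumed raw sensor output per day
REDUCTION_FACTOR  = 100            # assumed filtering/compression ratio at the edge

E_LOCAL_PROCESS_J_PER_B = 1e-9     # assumed energy to process a byte locally
E_TRANSMIT_J_PER_B      = 1e-6     # assumed energy to move a byte off the node

def ship_raw_joules():
    # Send everything upstream untouched.
    return RAW_BYTES_PER_DAY * E_TRANSMIT_J_PER_B

def filter_then_ship_joules():
    # Touch every byte locally, then transmit only the reduced result.
    local  = RAW_BYTES_PER_DAY * E_LOCAL_PROCESS_J_PER_B
    upload = (RAW_BYTES_PER_DAY / REDUCTION_FACTOR) * E_TRANSMIT_J_PER_B
    return local + upload

print(f"ship raw data:      ~{ship_raw_joules():10.1f} J/day")
print(f"filter at the edge: ~{filter_then_ship_joules():10.1f} J/day")
```

Even with generous allowance for the local processing energy, moving two orders of magnitude less data off the node dominates the daily energy budget in this toy model.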
The challenge of memory architecture optimization is a system-level one. Looking at the problem from any given subsystem or element can give a misleading and incomplete picture and can lead to mis-optimization. As our computing infrastructure morphs into the distributed heterogeneous system that is taking shape today, we need to consider shifting our focus from processing to data. With data-centric thinking, we are likely to come up with new ways of computing that will dramatically outperform the aging architectures of today. It will be interesting to watch.
A bunch of problems turn out to be much the same: (digital) simulation is very similar to evaluating neural networks, and making that go fast on regular processors has been a hard problem for decades. But there is a fix (I've been working on that one a lot longer than the current AI craze) –
Wandering Threads – https://youtu.be/Bh5axlxIUvM – http://parallel.cc
The memory/CPU architecture we currently use is geared to using slow hard disks and DRAM for virtual memory; a lot of that isn't necessary if you die-stack your CPUs with solid-state memory. Die-stacking is difficult with Intel/AMD CPUs, but these guys worked out how to make asynchronous logic work –
https://etacompute.com/
Sub/near-threshold might work too. At 3 GHz, CPUs generally max out at about 2 MB of cache, so you probably just want arrays of cores around that size. The IP for WT includes ways to avoid needing the branch-prediction hardware too. Whether you need DRAM at all when using something like 3D XPoint is open for debate.
So the pieces are there; someone just needs to put them together 😉
People (doing it the hard way) –
https://www.emutechnology.com/technology/ (who programs in Cilk?)
https://www.tidalscale.com/technology (coarse grain, cache-coherency?)
https://wavecomp.ai/technology/ (processor arrays, unprogrammable?, too hot to stack)