feature article

Achieving Exa-Scale

Intel’s Shekhar Borkar Breaks It Down

Computers use a lot of power. Big, fast computers use a ton of power. And rooms full of big, fast computers use… well… lots of tons of power.

At Semicon West, keynote speaker Shekhar Borkar, from Intel, envisaged an age of “exa-scale” computing – meaning computing at the rate of one quintillion floating-point operations per second. For reference, today, tera-scale is accessible; peta-scale (1000 times faster) has been achieved, but is out there; and exa-scale is 1000 times that.

The biggest hurdle between here and there should be no surprise: power. An exa-scale computer would need its own nuclear power plant if we simply scaled what we do now. Today’s tera-scale machine needs about 3.5 kW; for exa-scale to be practical, that same tera-scale machine would need to drop its power requirement to 20 W, or about 20 pJ/operation.
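The arithmetic behind those figures can be checked with a quick back-of-the-envelope sketch. It uses only the 3.5-kW and 20-W numbers quoted above; the function and variable names are illustrative, and "operation" is treated as one floating-point operation.

```python
# Back-of-the-envelope check of the power figures quoted in the article.
TERA_OPS = 1e12  # operations per second at tera-scale
EXA_OPS = 1e18   # operations per second at exa-scale


def energy_per_op_pj(power_watts, ops_per_second):
    """Energy per operation, in picojoules."""
    return power_watts / ops_per_second * 1e12


def power_at_scale_watts(energy_pj, ops_per_second):
    """Total power, in watts, at a given energy per operation and rate."""
    return energy_pj * 1e-12 * ops_per_second


today_pj = energy_per_op_pj(3.5e3, TERA_OPS)   # today's 3.5-kW tera-scale machine
target_pj = energy_per_op_pj(20.0, TERA_OPS)   # the 20-W target

exa_today_w = power_at_scale_watts(today_pj, EXA_OPS)
exa_target_w = power_at_scale_watts(target_pj, EXA_OPS)

print(f"today:  {today_pj:.0f} pJ/op -> {exa_today_w / 1e9:.1f} GW at exa-scale")
print(f"target: {target_pj:.0f} pJ/op -> {exa_target_w / 1e6:.0f} MW at exa-scale")
```

At today's efficiency, exa-scale works out to about 3.5 GW, which is where the nuclear-plant comparison comes from; at 20 pJ/operation it is about 20 MW.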

To put that in some perspective, Mr. Borkar likened one pJ to the amount of energy picked up by a cell phone in one minute’s time. Not much.

He then went on to break the internals of a computer into three obvious chunks: computing, memory, and interconnect. (Computing includes not just an ALU, but other required support logic like address decoders as well.) He described the technology developments that he thought would have the most impact on all of them, and each takes a completely different approach.

More miserly computing

The computing side of things points to a technology that, at least as far as a quick Google search suggests, is primarily being driven by Intel. (No surprise… an Intel speaker… and, let’s face it: how many companies can afford this kind of research?) It involves transistors operating near the threshold voltage – dubbed “near threshold voltage” or “NTV” transistors.

This is motivated by their studies of energy efficiency. Obviously, more energy means more performance – you do get something for your efforts. But it’s less efficient: increases in speed are exceeded by increases in power. Said another way, as you reduce VDD from the 1-V or greater region, speed does come down, but not as fast as power does.
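A rough sketch of why power falls faster than speed, using textbook first-order approximations rather than anything from Mr. Borkar's data: assume dynamic power goes as C·V²·f and that clock frequency scales roughly linearly with VDD well above threshold. Both are assumptions, and the function name is made up for illustration.

```python
def relative_scaling(vdd, vdd_ref=1.0):
    """First-order, above-threshold model (an assumption, not Intel data).

    frequency     ~ V
    dynamic power ~ C * V^2 * f ~ V^3
    energy per op ~ power / frequency ~ V^2
    """
    s = vdd / vdd_ref
    return {"speed": s, "power": s ** 3, "energy_per_op": s ** 2}


for vdd in (1.0, 0.8, 0.6, 0.5):
    r = relative_scaling(vdd)
    print(f"VDD={vdd:.1f} V  speed x{r['speed']:.2f}  "
          f"power x{r['power']:.3f}  energy/op x{r['energy_per_op']:.2f}")
```

In this model, halving VDD halves the speed but cuts power by a factor of eight. The model ignores leakage and stops being valid as VDD approaches VT, so it says nothing about behavior at or below threshold.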

That works only up to a point, however. Once you get below threshold, efficiency rolls off, and now your performance starts going down faster than your power. So there’s this region right around the threshold voltage where you get the most performance bang for your power buck. That’s what they’re looking at exploiting.

They took an old Pentium processor design and replaced the circuits with NTV technology. And no, it’s not a matter of simply swapping transistors. They had to redesign many of the circuits, and memory presents its own issues. But they ended up with an actual chip having a wide dynamic range, where they can still operate it with VDD as low as VT. That is, it runs in the 400–500-mV range.

They did this at the 32-nm node. Exa-scale computing is supposed to come of age in the 2018 timeframe, and 32 nm will feel rather outdated by then. Newer technologies will give NTV circuits a harder time than traditional circuits because of the narrow logic swing. That miserly blip that will constitute a switching event will be exceedingly hard to distinguish from noise. But, worse yet, atomic-level variations will be significant at this point, meaning that different transistors will have different VTs. Hence, new circuits and techniques.

While design will have to accommodate this to a large degree, overall reliability will have to be tackled at the system level. He refers to this as “resiliency” – the ability of a system to shrug off an error or two here and there and go with what the CPU actually meant rather than what it said. Something we’ve wanted for years. (OK, some applications companies have tried to give it to us, but mostly they get it wrong…)

This means new developments in system design and figuring out what the CPU really meant. In theory, it can be done today, but it typically requires expensive brute-force approaches like triple redundancy.
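For a sense of what the brute-force approach looks like, here is a minimal sketch of triple redundancy with majority voting. It is a generic software illustration, not a description of anything Intel presented; `majority_vote` and `run_triplicated` are hypothetical names.

```python
from collections import Counter


def majority_vote(results):
    """Return the value a strict majority of replicas agree on."""
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among replicas")
    return value


def run_triplicated(compute, *args):
    """Run the same computation three times and vote on the answer."""
    return majority_vote([compute(*args) for _ in range(3)])


# One replica returns a corrupted value; the vote masks it.
print(majority_vote([42, 41, 42]))
```

The expense is plain from the structure: every result costs three computations plus a vote, which is exactly the kind of overhead a power-constrained system can't afford everywhere.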

More memory pins

New circuits can affect memory as well, but Mr. Borkar pointed to another area of inefficiency: the way RAS and CAS are used on DRAMs. As non-volatile and alternative memory technologies evolve, it’s always tempting to think that DRAMs will someday be supplanted. After all, they seem kind of clunky with their refresh requirements and inability to remember the simplest things after the power goes off.

But Mr. Borkar compared the memory efficiency of the four main categories of memory – SRAM, DRAM, NVM/PCM, and disk – and concluded that DRAM’s compromise of reasonable capacity, reasonable cost, and reasonable energy efficiency (it’s not the best at any of those, but, then again, nothing is) means that DRAMs aren’t going to disappear any time soon unless something revolutionary happens. And so we’re stuck with RAS and CAS.

Here’s his point: the way memories work now, you select a row, energizing all the columns. You then select one column to read out. OK, with more than one bit at a time, you’re energizing a number of pages, but reading out only one. So the energy that went into figuring out the contents of all the other pages is wasted.

He suggests that the reason for doing things this way boils down to one thing: not enough pins. Maybe you could read out all of those pages (if what you needed was on more than one page, which, in many cases, it could be), except that you don’t have enough pins. So you set it all up and then cycle through the pages to multiplex them through pins via the CAS signal.

It seems to me that, if you could set it up so that you get all the pages at once (more on that in a sec), then you haven’t wasted the power spent figuring out their contents – you needed the data anyway. So what you get is a faster read. And, presumably, if the circuit doesn’t have to be stable for as long, then sense amps can shut down faster and perhaps less charge is needed, and so you could reduce power that way.

If you don’t need all the pages, then charging up only the one you need would save power simply by not wasting charge on the unused part of the memory. You would need to present both row and column addresses at the same time to do that. Again, more pins.
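The three cases discussed above (energize everything and read one page, energize everything and read it all, energize only what you need) can be compared with a toy model. The page counts and the unit energy are hypothetical placeholders chosen only to show the ratio; they are not figures from the talk.

```python
def activation_energy_per_useful_page(pages_activated, pages_used,
                                      energy_per_page=1.0):
    """Activation energy spent per page that is actually read out.

    energy_per_page is an arbitrary unit, not a measured value.
    """
    if not 0 < pages_used <= pages_activated:
        raise ValueError("need 0 < pages_used <= pages_activated")
    return pages_activated * energy_per_page / pages_used


# Hypothetical row holding 8 pages.
conventional = activation_energy_per_useful_page(8, 1)  # energize 8, read 1
wide_io = activation_energy_per_useful_page(8, 8)       # energize 8, read all 8
selective = activation_energy_per_useful_page(1, 1)     # energize only the 1 needed

print(conventional, wide_io, selective)
```

In this sketch the conventional access pays eight units of activation energy for one useful page, while the other two pay one unit per useful page. Both of the cheaper options need more pins, either to move all the pages at once or to present row and column addresses together.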

The way we get more pins is through die stacking, or so-called 3D ICs. Wide-I/O standards allow lots more pins to talk through the power-saving intra-stack interconnect (through-silicon vias and redistribution layers on the backside of the die).

He contrasted the power consumed by a DDR3 memory versus the new Hybrid Memory Cube. The latter transferred data more than 10 times faster, and yet the energy per bit (in pJ/b) was reduced from the 50–75 range to 8. Combine the two and that’s roughly an efficiency improvement of 100 (doing optimistic rounding).
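One way to arrive at that rounded figure, assuming it comes from multiplying the speedup by the energy-per-bit reduction (the article doesn't spell this out), is sketched here with the numbers quoted above. The 10x speedup is taken as a lower bound.

```python
DDR3_PJ_PER_BIT = (50.0, 75.0)  # quoted range for DDR3
CUBE_PJ_PER_BIT = 8.0           # quoted figure for the stacked memory cube
SPEEDUP = 10.0                  # "more than 10 times faster": lower bound

# How many times less energy each bit costs.
energy_gain = tuple(d / CUBE_PJ_PER_BIT for d in DDR3_PJ_PER_BIT)

# Assumed combination: speedup multiplied by energy reduction.
combined_gain = tuple(g * SPEEDUP for g in energy_gain)

print(f"energy per bit: {energy_gain[0]:.2f}x to {energy_gain[1]:.2f}x lower")
print(f"combined with speedup: {combined_gain[0]:.1f}x to {combined_gain[1]:.1f}x")
```

On energy per bit alone the gain is roughly 6x to 9x; folding in the speedup gives roughly 60x to 90x, which rounds optimistically to 100.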

More locality

Finally, there’s interconnect. And the real message here is, keep things as close as possible. Distance equals power. The problem is, die sizes aren’t changing; we’re simply putting more in the same area. So on-chip distances aren’t going down. That means off-chip is where we need to pay attention, and he pointed to a few non-traditional interconnect schemes for achieving this. They included adding top-of-package connectors, low-loss flexible interconnect, and low-loss “twinax” (two-conductor co-ax, good for increasingly-prevalent differential signals) connectors.

But there was one more far-flung – literally – conclusion that he came to. If distance makes a difference, then there’s no more inefficient way to compute than by using the cloud. Instead of shifting your bits locally a few inches or so, you may be sending them hundreds or thousands of miles. His numbers showed that shipping data over 3G uses a thousand times more power than keeping it in your machine. “Think local” seems to apply here as much as it does on Main St.

Of course, it’s not like we will all have exa-scale computers in our offices or phones. But the technology that makes exa-scale computing possible will be available at a smaller scale for us. And the power savings that make the huge computing monsters available will likewise accrue to us and to everything in between. That’s that many fewer nuclear plants that we’ll need.

2 thoughts on “Achieving Exa-Scale”

  1. I remember experiments in the 70’s about MOS-like devices controlled by light on the gate.
    What about a sub-threshold tension boosted by light? Light would not open the gate by itself; only the combination of tension and light could.
    I think a Raman laser could send impulses at up to 100 GHz…

    Yvan Bozzonetti
