
Achieving Exa-Scale

Intel’s Shekhar Borkar Breaks It Down

Computers use a lot of power. Big, fast computers use a ton of power. And rooms full of big, fast computers use… well… lots of tons of power.

At Semicon West, keynote speaker Shekhar Borkar, from Intel, envisaged an age of “exa-scale” computing – meaning computing at the rate of one quintillion floating-point operations per second. For reference, today, tera-scale is accessible; peta-scale (1000 times faster) has been achieved, but is out there; and exa-scale is 1000 times that.

The biggest hurdle between here and there should be no surprise: power. An exa-scale computer would need its own nuclear power plant if we simply scaled what we do now. Today’s tera-scale machine needs about 3.5 kW; for scaling up to exa-scale to be practical, that tera-scale machine would need to drop its power requirement to 20 W, or about 20 pJ per operation.
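
The arithmetic behind those figures is easy to check. The sketch below is only a back-of-the-envelope calculation from the numbers quoted above (3.5 kW and 20 W at tera-scale). The helper names are made up for illustration, and it assumes the energy per operation stays constant as the machine grows, which is the "simply scaled" case.

```python
TERA_OPS = 1e12  # operations per second at tera-scale
EXA_OPS = 1e18   # operations per second at exa-scale

def energy_per_op_pj(power_watts, ops_per_second):
    """Energy per operation in picojoules (hypothetical helper)."""
    return power_watts / ops_per_second * 1e12

def power_watts_at(energy_pj, ops_per_second):
    """Power needed to sustain a given rate at a given energy per operation."""
    return energy_pj * 1e-12 * ops_per_second

today_pj = energy_per_op_pj(3.5e3, TERA_OPS)   # today's 3.5 kW tera-scale machine
target_pj = energy_per_op_pj(20.0, TERA_OPS)   # the 20 W target

print(f"today:  {today_pj:.0f} pJ/op, {power_watts_at(today_pj, EXA_OPS) / 1e9:.1f} GW at exa-scale")
print(f"target: {target_pj:.0f} pJ/op, {power_watts_at(target_pj, EXA_OPS) / 1e6:.0f} MW at exa-scale")
```

Scaled naively, today’s 3,500 pJ per operation works out to about 3.5 GW at exa-scale, which is where the nuclear plant comes in. At 20 pJ per operation the same machine would draw about 20 MW.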

To put that in some perspective, Mr. Borkar likened one pJ to the amount of energy picked up by a cell phone in one minute’s time. Not much.

He then went on to break the internals of a computer into three obvious chunks: computing, memory, and interconnect. (Computing includes not just an ALU, but other required support logic like address decoders as well.) He described the technology developments that he thought would have the most impact on all of them, and each takes a completely different approach.

More miserly computing

The computing side of things points to a technology that, at least as far as a quick Google search suggests, is primarily being driven by Intel. (No surprise… an Intel speaker… and, let’s face it: how many companies can afford this kind of research?) It involves transistors operating near the threshold voltage – dubbed “near threshold voltage” or “NTV” transistors.

This is motivated by their studies of energy efficiency. Obviously, more energy means more performance – you do get something for your efforts. But it’s less efficient: increases in speed are exceeded by increases in power. Said another way, as you reduce VDD from the 1-V or greater region, speed does come down, but not as fast as power does.

That works only up to a point, however. Once you get below threshold, efficiency rolls off, and now your performance starts going down faster than your power. So there’s this region right around the threshold voltage where you get the most performance bang for your power buck. That’s what they’re looking at exploiting.

They took an old Pentium processor design and replaced the circuits with NTV technology. And no, it’s not a matter of simply swapping transistors. They had to redesign many of the circuits, and memory presents its own issues. But they ended up with an actual chip having a wide dynamic range, where they can still operate it with VDD as low as VT. That is, it runs in the 400–500-mV range.

They did this at the 32-nm node. Exa-scale computing is supposed to come of age in the 2018 timeframe, and 32 nm will feel rather outdated by then. Newer technologies will give NTV circuits a harder time than traditional circuits because of the narrow logic swing. That miserly blip that will constitute a switching event will be exceedingly hard to distinguish from noise. Worse yet, atomic-level variations will be significant at this point, meaning that different transistors will have different VTs. Hence, new circuits and techniques.

While design will have to accommodate this to a large degree, overall reliability will have to be tackled at the system level. He refers to this as “resiliency” – the ability of a system to shrug off an error or two here and there and go with what the CPU actually meant rather than what it said. Something we’ve wanted for years. (OK, some applications companies have tried to give it to us, but mostly they get it wrong…)

This means new developments in system design and figuring out what the CPU really meant. In theory, it can be done today, but it typically requires expensive brute-force approaches like triple redundancy.
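
To make the brute-force option concrete, here is a minimal sketch of triple redundancy with majority voting. It is an illustration only, not anything Mr. Borkar presented: the `majority_vote` and `run_redundant` helpers and the two toy "units" are hypothetical, and the fault is injected deterministically so the example is repeatable.

```python
from collections import Counter

def majority_vote(results):
    """Return the value that more than half of the results agree on."""
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(results):
        raise ValueError("no majority: cannot tell what the CPU meant")
    return value

def run_redundant(units, x):
    """Run the same computation on every unit and vote on the answer."""
    return majority_vote([unit(x) for unit in units])

def good_unit(x):
    return x * x

def faulty_unit(x):
    # Hypothetical soft error: the low bit of the result gets flipped.
    return (x * x) ^ 1

# Two healthy units outvote the faulty one.
print(run_redundant([good_unit, faulty_unit, good_unit], 7))  # prints 49
```

The cost is plain: every operation is done three times, so the energy bill triples before the voter is even counted. That is why cheaper forms of resiliency matter on a tight power budget.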

More memory pins

New circuits can affect memory as well, but Mr. Borkar pointed to another area of inefficiency: the way RAS and CAS are used on DRAMs. As non-volatile and alternative memory technologies evolve, it’s always tempting to think that DRAMs will someday be supplanted. After all, they seem kind of clunky with their refresh requirements and inability to remember the simplest things after the power goes off.

But Mr. Borkar compared the efficiency of the four main categories of memory – SRAM, DRAM, NVM/PCM, and disk – and concluded that DRAM’s compromise of reasonable capacity at reasonable cost with reasonable energy efficiency (it’s not the best at any of those, but, then again, nothing is) means it isn’t going to disappear any time soon unless something revolutionary happens. And so we’re stuck with RAS and CAS.

Here’s his point: the way memories work now, you select a row, energizing all the columns. You then select one column to read out. OK, with more than one bit at a time, you’re energizing a number of pages, but reading out only one. So the energy that went into figuring out the contents of all the other pages is wasted.

He suggests that the reason for doing things this way boils down to one thing: not enough pins. Maybe you could read out all of those pages (if what you needed was on more than one page, which, in many cases, it could be), except that you don’t have enough pins. So you set it all up and then cycle through the pages to multiplex them through pins via the CAS signal.

It seems to me that, if you could set it up so that you get all the pages at once (more on that in a sec), then you haven’t wasted the power spent figuring out their contents – you needed the data anyway. So what you get is a faster read. And, presumably, if the circuit doesn’t have to be stable for as long, then sense amps can shut down faster and perhaps less charge is needed, so you could reduce power that way.

If you don’t need all the pages, then charging up only the one you need would save power simply by not wasting charge on the unused part of the memory. You would need to present both row and column addresses at the same time to do that. Again, more pins.
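
A rough way to see the waste is to count pages. The sketch below assumes, purely for illustration, that the energy of a row access is proportional to the number of pages energized. The page counts and the `wasted_fraction` name are hypothetical, not figures from the talk.

```python
def wasted_fraction(pages_energized, pages_read):
    """Fraction of row-access energy spent on pages that are never read out."""
    if not 0 < pages_read <= pages_energized:
        raise ValueError("need 0 < pages_read <= pages_energized")
    return 1 - pages_read / pages_energized

# Today's scheme: energize the whole row, read one page out of (say) eight.
print(wasted_fraction(8, 1))  # 0.875
# Enough pins to read out everything that was energized.
print(wasted_fraction(8, 8))  # 0.0
# Row and column addresses presented together: energize only the page you need.
print(wasted_fraction(1, 1))  # 0.0
```

Either fix drives the wasted share to zero in this toy model. The difference is that the second one delivers more data per access while the third one spends less energy per access.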

The way we get more pins is through die stacking, or so-called 3D ICs. Wide-I/O standards allow lots more pins to talk through the power-saving intra-stack interconnect (through-silicon vias and redistribution layers on the backside of the die).

He contrasted the power consumed by a DDR3 memory versus the new Hybrid Memory Cube. The latter transferred data more than 10 times faster, and yet the energy per bit (in pJ/b) was reduced from the 50–75 range to 8. So that’s roughly an efficiency improvement of 100 (doing optimistic rounding).
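
The "roughly 100" is not spelled out, but one plausible reading is that it multiplies the speed-up by the drop in energy per bit. The sketch below checks that reading against the quoted figures. Taking "more than 10 times faster" as exactly 10 is an assumption, as is the idea that the two factors should be multiplied at all; `combined_gain` is a hypothetical helper.

```python
DDR3_PJ_PER_BIT = (50.0, 75.0)  # quoted range for DDR3
CUBE_PJ_PER_BIT = 8.0           # quoted figure for the stacked memory
SPEEDUP = 10.0                  # assumed: "more than 10 times faster" taken as 10

def combined_gain(ddr3_pj, cube_pj=CUBE_PJ_PER_BIT, speedup=SPEEDUP):
    """Speed-up multiplied by the reduction in energy per bit."""
    return speedup * ddr3_pj / cube_pj

for ddr3 in DDR3_PJ_PER_BIT:
    energy_gain = ddr3 / CUBE_PJ_PER_BIT
    print(f"{ddr3:.0f} pJ/b: {energy_gain:.2f}x less energy per bit, "
          f"{combined_gain(ddr3):.1f}x combined")
```

On that reading the combined factor lands between about 60 and 95, so calling it 100 is indeed optimistic rounding.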

More locality

Finally, there’s interconnect. And the real message here is, keep things as close as possible. Distance equals power. The problem is, die sizes aren’t changing; we’re simply putting more in the same area. So on-chip distances aren’t going down. That means off-chip is where we need to pay attention, and he pointed to a few non-traditional interconnect schemes for achieving this. They included adding top-of-package connectors, low-loss flexible interconnect, and low-loss “twinax” (two-conductor co-ax, good for increasingly-prevalent differential signals) connectors.

But there was one more far-flung – literally – conclusion that he came to. If distance makes a difference, then there’s no more inefficient way to compute than by using the cloud. Instead of shifting your bits locally a few inches or so, you may be sending them hundreds or thousands of miles. His numbers showed that shipping data over 3G uses a thousand times more power than keeping it in your machine. “Think local” seems to apply here as much as it does on Main St.

Of course, it’s not like we will all have exa-scale computers in our offices or phones. But the technology that makes exa-scale computing possible will be available at a smaller scale for us. And the power savings that make the huge computing monsters available will likewise accrue to us and to everything in between. That’s that many fewer nuclear plants that we’ll need.

2 thoughts on “Achieving Exa-Scale”

  1. I remember experiments in the ’70s on MOS-like devices controlled by light on the gate.
    What about a sub-threshold voltage boosted by light? Light would not open the gate by itself; only the combination of voltage and light could.
    I think a Raman laser could send pulses at up to 100 GHz…

    Yvan Bozzonetti
