40nm Altera Stratix IV

New process nodes have a predictable rhythm. Until about 90nm, we knew before anybody announced anything that we’d get double the density, half the power (dynamic, of course), and 50% more speed than we had in the previous generation. Of course, that made waiting for the announcements from semiconductor companies a little less than suspenseful. Our Moore’s Law alarm clock would beep on its two-year cycle. We’d check to see if anybody had announced the thing we were expecting yet, and then we’d hit the one-month snooze button and fade back off into our dazed delirium.

This week, Altera became the first to announce an FPGA family on the 40nm process node. (Editor’s note: FPGA Journal was actually the first to announce 45nm; see “45nm Chicken.” Altera outfoxed us by chipping off 5 more nanometers and turning their amp down to “40.”) The result is a future family that surprised us a bit, and it challenges classical definitions of the boundaries of programmable logic.

We didn’t know exactly what to expect at this process node. Our predict-o-meter lost its punch at about 90nm, where at least a modicum of drama crept into the scenario. We’d watched the supply voltages step down from the 5V range to the 1V range. The voltage swings got smaller with each node, and because dynamic power scales with the square of the supply voltage, the obligatory dynamic power savings came along pretty much for free. Even though we were clocking more gates faster, the total power stayed the same or even dropped a bit due to the process technology gains. While we weren’t paying attention, however, those transistors got leakier as they got smaller.
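To put a rough number on those "free" dynamic savings, here is a minimal sketch using the standard first-order switching-power relation (activity times capacitance times supply voltage squared times frequency). The capacitance and clock values are made up for illustration; only the 5V and 1V endpoints come from the story above.

```python
def dynamic_power(c_eff, vdd, freq, activity=1.0):
    """First-order switching power in watts: activity * C * Vdd^2 * f."""
    return activity * c_eff * vdd ** 2 * freq


# Hypothetical values: same switched capacitance and clock, only the supply changes.
p_5v = dynamic_power(c_eff=1e-9, vdd=5.0, freq=100e6)
p_1v = dynamic_power(c_eff=1e-9, vdd=1.0, freq=100e6)
print(f"5V draws about {p_5v / p_1v:.0f}x the dynamic power of 1V")
```

Real parts also change capacitance and clock rate from node to node, so treat this as the voltage term in isolation, not a prediction for any particular device.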

It was no big deal at first, but over time we began to see static power consumption due to leakage account for a measurable part of the total power. At 90nm, this effect officially hit the map. For programmable logic, it hit hard. All those configuration transistors that complete the routing and define the LUT functions were not-so-quietly sucking up beaucoup current, raining on our power parade in a big way. Other types of devices with metal-based fixed interconnect didn’t have this bloat, and therefore they could wait a generation or two before the static power problem hit like a tidal wave.

FPGA companies were working hard to keep static power under control. They moved to the lowest-power processes offered by their respective fabs, started designing leakage-reducing features into their architectures, and began to compromise on other axes like performance to keep the static dragon at bay. Through 90nm and 65nm, the results were impressive. FPGA companies managed not only to hold the line on static power, but actually to make some gains compared with previous nodes. They had to. If they doubled the density and the static power per gate stayed static, that component of total power would double right along with the gate count.
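The arithmetic behind "they had to" is simple enough to sketch. The gate counts and per-gate leakage below are invented round numbers, not vendor data; the point is only that total static power is gate count times leakage per gate.

```python
def static_power(gate_count, leakage_per_gate_w):
    """Total static power in watts for a device with uniform per-gate leakage."""
    return gate_count * leakage_per_gate_w


def leakage_budget_per_gate(old_gates, old_leakage_w, new_gates):
    """Per-gate leakage the new node must hit to keep total static power flat."""
    return static_power(old_gates, old_leakage_w) / new_gates


# Hypothetical old node: 1M gates at 1 microwatt each.
old_total = static_power(1_000_000, 1e-6)
# Double the density with no leakage improvement and the total doubles.
new_total = static_power(2_000_000, 1e-6)
# To break even, each gate on the new node may leak only half as much.
budget = leakage_budget_per_gate(1_000_000, 1e-6, 2_000_000)
print(old_total, new_total, budget)
```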

Now, fast forward to the current announcement.

Stratix IV is Big. With up to 680,000 logic elements (what exactly are these? We’ll get to that in a bit), 22.4 Mbits of memory, 1360 18x18 multipliers, and 48 multi-gigabit SerDes transceivers, Altera can safely claim the title of “World’s largest FPGA not yet in production.” OK, there really isn’t a title like that, but the point is that once we get these buggers, they’ll be huge.

What about the power consumption? Yeah, we knew you were gonna ask that. Altera claims to have compromised on speed at the transistor level in order to reduce leakage. These compromises include increasing Vt, increasing channel lengths, thickening gate oxide, and decreasing Vcc. Next, they worked to gain back enough of that speed via other means to ensure that Stratix IV is still faster than its predecessor (65nm Stratix III). Altera says the net result is that Stratix IV has an average of 30% lower total power consumption compared with similar designs on Stratix III.

Altera rolled out an innovative architecture with Stratix III that’s still around in this generation, which gives design tools the flexibility to trade off performance for power at the individual logic-cell level. Each cell can be programmed to be high-speed or low-power by programmable back-biasing. Cells on the critical path can be cranked up to the needed performance, and those off the freeway can throttle back and sip current at a leisurely pace. The result is a big savings in overall power without a loss of critical path performance.
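As a toy illustration of that trade-off (and emphatically not Altera's actual algorithm, which the announcement doesn't describe), a tool could start every cell in low-power mode and flip cells to high-speed only on paths that miss timing. The cell names, delays, and clock period here are all hypothetical.

```python
def assign_modes(paths, period_ns, fast_ns=1.0, slow_ns=1.5):
    """Greedy sketch: return {cell: "high_speed" | "low_power"}.

    paths     -- list of timing paths, each a list of cell names
    period_ns -- clock period every path must fit within
    fast_ns   -- assumed delay of a cell in high-speed mode
    slow_ns   -- assumed delay of a cell in low-power mode
    """
    # Everybody starts out sipping current.
    mode = {cell: "low_power" for path in paths for cell in path}

    def path_delay(path):
        return sum(fast_ns if mode[c] == "high_speed" else slow_ns for c in path)

    for path in paths:
        for cell in path:
            if path_delay(path) <= period_ns:
                break  # this path already meets timing
            mode[cell] = "high_speed"  # speed up one more cell and re-check
    return mode


# Hypothetical design: one long (critical) path and one short one.
modes = assign_modes([["a", "b", "c", "d"], ["e", "f"]], period_ns=5.0)
print(modes)
```

Speeding up a cell can only help any other path it sits on, so the greedy pass never breaks a path it has already fixed. A production tool would work from full static timing analysis and would choose which cells to flip far more carefully.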

With power under control, what happens to speed? Altera claims “over 600MHz logic performance” – faster than Stratix III, but not the full performance gain we saw back in the “good old days” of Moore’s Law. However, most FPGA users no longer want a frequency doubling with every node. FPGAs have long since passed the point where performance is “good enough” for most applications, and other factors like functionality, density, I/O capacity and power consumption have taken center stage.

Functionality-wise, Altera has dumped in a boatload of memory, more multipliers than most of us could conceive of using (unless you’re doing high-performance signal processing like video or radar, in which case you’ll be jumping for joy at the unprecedented 1360 multipliers), and 680,000 of... something.

OK, here we go. In the old days, FPGA companies described the density of their devices in “system gates.” These had absolutely no basis in anything measurable. It took about a zillion system gates to equal 500K ASIC gates. As a result, we made fun of them – a lot. Then, they went to a more realistic measure – the number of 4-input lookup tables (LUTs). That works, right? Nope. Marketing came in and started inflating the LUT counts based on perceived architectural advantages of one LUT structure over another. Pretty soon, we were talking about “effective logic elements” which was the number of 4-input LUTs times a marketing fudge factor.

Now, at least, we could settle in, right? Wrong again, Roger. FPGA companies went to wider logic elements: 6-ish input lookup tables. They couldn’t just suddenly change their units, and there was nothing left on the device to count. They semi-settled on what we have today, which is a “logic elements” number that’s equal to the number of 6-input LUTs multiplied by a factor deemed appropriate for the conversion from 4-input LUTs, then multiplied by another “our marketing is better than your marketing” factor in order to make it bigger than the other guy’s.

With Stratix IV, all of this fudge-factoring is not really an issue, because 680K is so much bigger than anything else on the market (a little more than double the 330K Xilinx claims for its current largest Virtex-5 LX device) that all the marketing factors in the world won’t bridge the gap. These devices are big enough to handle the demands of a great many high-end ASIC users, which brings us to another important topic: HardCopy. In the same announcement with Stratix IV, Altera is announcing the matching HardCopy family. HardCopy takes your FPGA design and converts it directly to an ASIC, saving significantly on unit cost at high volume.

HardCopy does have an NRE, but it’s an order of magnitude lower than that of a similar-complexity standard-cell ASIC. It also takes a few weeks to spin your design, but that spin is far faster than a “normal” ASIC’s. Because HardCopy is built on 40nm technology, much of the inefficiency that comes from mirroring the FPGA architecture 1:1 disappears when you compare it with a 65nm or, particularly, a 90nm ASIC. In short, Stratix IV to HardCopy is an extremely attractive strategy for getting a high-performance, high-density ASIC design at 40nm. Unit costs are still higher than a full-boat ASIC, but by the time you amortize the NRE savings and get your device to market faster, much of that difference is also erased.
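The amortization argument is easy to sanity-check with a back-of-the-envelope sketch. Every dollar figure below is invented for illustration (Altera quoted no prices here); the only thing borrowed from the announcement is the shape of the trade: roughly 10x lower NRE in exchange for a higher unit cost.

```python
def total_cost(nre, unit_cost, volume):
    """Lifetime cost: one-time NRE plus per-unit cost times units shipped."""
    return nre + unit_cost * volume


def break_even_volume(hardcopy_nre, hardcopy_unit, asic_nre, asic_unit):
    """Volume at which the two options cost the same.

    Assumes the HardCopy-style option has lower NRE and higher unit cost.
    Below this volume it is the cheaper route; above it, the full ASIC wins.
    """
    if hardcopy_unit <= asic_unit:
        raise ValueError("expected the lower-NRE option to have the higher unit cost")
    return (asic_nre - hardcopy_nre) / (hardcopy_unit - asic_unit)


# Hypothetical numbers: 10x NRE gap, $20 per-unit premium.
volume = break_even_volume(
    hardcopy_nre=500_000, hardcopy_unit=60.0,
    asic_nre=5_000_000, asic_unit=40.0,
)
print(f"break-even at {volume:,.0f} units")
```

This leaves out the time-to-market advantage entirely, which would only push the break-even point further out in HardCopy's favor.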

Altera says that customers can start designing with the 8.0 release of Quartus II (also announced this week) and can expect engineering samples of the first devices in the fourth quarter of this year. Volume production will likely commence in phases beginning in 2009, and customer tapeouts for HardCopy IV ASICs will start in Q3 2009. That gives you just about enough time to get your product up and working with the Stratix IV FPGA version, win some market share, and then move to profit by cost-reducing with HardCopy IV. It’s a nice picture.