The World Cup happens once every four years. The Summer and Winter Olympics each run on four-year cycles as well. American football, baseball, and basketball hold their championships annually. The America’s Cup gets a new owner about every 36 months.
In FPGA, however, the championship of the world is held approximately once every two years – a pace that was apparently set, unwittingly, by Gordon Moore’s famous 1965 observation (revised to a two-year doubling in 1975), long before the first FPGA existed. Every two years, we get a new semiconductor process node. For each node, FPGA vendors scramble to be first and best to take advantage of the bounty that comes with the newly downscaled geometry – lower power, lower cost, higher density, and higher speed. The lower cost should have a big asterisk next to it, because even though the unit cost of transistors keeps dropping, the non-recurring engineering (NRE) cost of making a new chip keeps increasing exponentially. Eventually, we will probably see silicon economics that look more like software – where the development and distribution costs are everything, and the incremental unit cost is near zero.
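The asterisk on “lower cost” is an amortization problem: the effective cost of each chip is roughly the one-time NRE spread across every unit shipped, plus the incremental cost of making one more. The Python sketch below makes that concrete. The function name and all the dollar figures are hypothetical, chosen only to show the shape of the curve; they are not actual Xilinx or Altera numbers.

```python
def cost_per_unit(nre_dollars: float, units: int, unit_cost: float) -> float:
    """Effective cost per chip: one-time NRE amortized over volume, plus marginal unit cost."""
    if units <= 0:
        raise ValueError("units must be positive")
    return nre_dollars / units + unit_cost


# Hypothetical figures for illustration only.
NRE = 100_000_000   # assumed one-time cost to develop a new chip, in dollars
UNIT_COST = 5.0     # assumed incremental cost to manufacture one more chip, in dollars

for volume in (10_000, 1_000_000, 100_000_000):
    print(f"{volume:>11,} units -> ${cost_per_unit(NRE, volume, UNIT_COST):,.2f} per unit")
```

As NRE grows, only very high volumes can dilute it; as the unit cost shrinks toward zero, the NRE term is all that is left, which is the software-like endpoint described above.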
As with most of the other competitions above, the announcement of each new FPGA product line is only the public face of the battle. The companies themselves never stop competing; each spends more than the two years that separate process nodes preparing to bring a new line to market. Nonetheless, we always eventually reach the defining moment – when the public eye turns to the competition, the lights go up, the scoreboards flash on, and the competitors show their hands.
For the winner, the next two years will bring the benefits of a leg up on their rivals: a favorable sales environment that will boost their market share and pile on the revenue, deepening their war chest as they prepare for the next round. For the losers, the next two years will be a struggle while they rely on marketing and sales to salvage what engineering left on the table, knowing that they’ll have to work twice as hard to get the upper hand next time.
This year, we get to see the entries in the 28nm FPGA championship. Altera has already laid out the general specifications of their upcoming Stratix V family and is busy positioning themselves for the sales derby. Xilinx has also teased us with the gist of their next major FPGA family, based on TSMC’s high-performance/low-power (HPL) 28nm CMOS process. We gave their preview a review in an earlier article.
Now, Xilinx, the world’s largest FPGA vendor, is back with the dirt – providing specifics on the new families, and giving us a more solid framework on which to hang our speculations, prognostications, and rants.
As we discussed before, Xilinx is building this new family on a foundation with a couple of new twists. First, they have jumped over to TSMC as their primary foundry partner. In the past, Xilinx and Altera had sparred over bragging rights for their fab strategy, with Altera pouring all their energy into TSMC and Xilinx maintaining a multi-fab effort with a number of suppliers including UMC, IBM, Toshiba, and Samsung. Both sides of the argument sounded good: Altera claimed that they could get farther by focusing their energy on TSMC (which has maintained a leadership position for several years), and Xilinx claimed that a multi-fab strategy kept them nimble, flexible, and better isolated from the frightening dynamics of the semiconductor fab wars.
For their new family, Xilinx has jumped on the TSMC bandwagon, thus putting the “who’s got the best fab” debate on hold for now. That doesn’t stop sparring over the choice of process, however. TSMC offers three processes at 28nm: high-performance (HP), low-power (LP), and high-performance/low-power (HPL). Xilinx has chosen the “high-performance/low-power” process, while Altera is using the “high-performance” process for their 28nm efforts.
According to Xilinx, all roads lead through power. Want more speed? You have to get the power down. Want more density? Same problem. By focusing their optimization efforts on power, and by choosing the HPL process, Xilinx is able to pack a whopping 2 million logic cells into the new high-end (still called Virtex) devices without hitting the thermal wall and having to leave most of those cells unusable most of the time.
The second major change with Xilinx’s new line is a top-to-bottom unified architecture. For the previous several generations, both Xilinx and Altera have used different architectures for their high-end and low-cost families, with the high-end based on a wider, six-input logic cell and the low-cost based on the venerable 4-input LUT. Now, Xilinx is basing all their 28nm FPGAs on the wider cell, effectively giving us a single architecture across all Xilinx products.
This grand unification will help Xilinx a lot behind the scenes, but it has tangible benefits for the casual user as well. Back in the ’90s, Jason Cong, a leading FPGA researcher at UCLA, showed that the “optimal” width for an FPGA LUT cell was 4 inputs. For years, the sanctity of the LUT4 was beyond reproach, and every major FPGA family was based on the 4-input LUT. As geometries got smaller and designs got bigger, however, the balance shifted between logic cells and interconnect, with interconnect taking over as the dominant factor in area and delay. Both Xilinx and Altera switched to wider, 6-input LUTs (or something akin to a LUT6) as the primary cells for their larger devices, realizing that the savings in interconnect more than made up for a few wasted inputs here and there on large designs. Now, even the small, low-cost devices have reached the size where they can benefit from the wider basic logic cell, and Xilinx is the first to move their low-end family to the wider LUT.
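To see what the LUT4-versus-LUT6 trade looks like, recall that a K-input LUT is just a small memory: its K inputs form an address into 2^K configuration bits. The Python sketch below is a behavioral model only (the `LUT` class and `and6_from_lut4` helper are our own illustration, not a vendor primitive). It shows that a LUT6 costs 64 configuration bits versus 16 for a LUT4, but absorbs a 6-input function in one cell, whereas the LUT4 version needs two cells, a routed net between them, and one wasted input.

```python
from typing import Callable, List, Sequence


class LUT:
    """Behavioral model of a K-input LUT: a 2**K-entry truth table addressed by its inputs."""

    def __init__(self, k: int, func: Callable[[List[int]], int]) -> None:
        self.k = k
        # "Configure" the LUT by evaluating func for every input combination;
        # bit i of the address drives input i.
        self.bits = [func([(addr >> i) & 1 for i in range(k)]) & 1 for addr in range(2 ** k)]

    def __call__(self, inputs: Sequence[int]) -> int:
        if len(inputs) != self.k:
            raise ValueError(f"expected {self.k} inputs, got {len(inputs)}")
        # Evaluation is just a memory read: the inputs form the address.
        addr = sum((bit & 1) << i for i, bit in enumerate(inputs))
        return self.bits[addr]


# One LUT6 absorbs a 6-input AND in a single cell.
and6 = LUT(6, lambda x: int(all(x)))

# With LUT4s the same function takes two cells plus a routed net between them;
# the second cell leaves one of its four inputs unused.
and4 = LUT(4, lambda x: int(all(x)))
and3 = LUT(4, lambda x: x[0] & x[1] & x[2])


def and6_from_lut4(x: Sequence[int]) -> int:
    return and3([and4(x[:4]), x[4], x[5], 0])


print(len(and4.bits), len(and6.bits))  # configuration bits: 16 vs 64
```

Multiply that extra net by the millions of functions in a large design and the interconnect argument for the wider cell becomes clear.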
In jumping to LUT-6 for the low end, Xilinx has also taken the opportunity to base all their families on the same logic cell. For us, this means that our IP blocks will synthesize the same, regardless of the device family we choose. Inter-family portability of designs should be noticeably improved. For Xilinx, this means that design and verification costs for new families or new family members are reduced, as they will have a simpler, more consistent fabric and tool framework for all their development.
The new offering, which Xilinx is calling their “7 Series,” comprises three families: Virtex-7, Kintex-7, and Artix-7. The Spartan name has been officially retired, but the new Kintex and Artix families more than make up for it with a broad range of low-cost and low-power devices.
Starting from the top, Virtex-7 sets new records for just about everything. At present, the family has only two flavors: one that favors logic density, and another (XT) with extended capability in multi-gigabit transceivers. In the normal Virtex-7 family, the largest device has 152,700 configurable logic blocks (CLBs). Each CLB has two slices, giving us a total of 305,400 slices. Each slice contains four 6-input LUTs and eight flip-flops, giving us a total of about 1.2 million real 6-input LUTs (1,221,600, to be exact). Multiply that number by a magic factor (which we believe is 1.6), and Xilinx tells us we have a total of 1,954,560 equivalent LUT-4 cells. Fuzzing our eyes a bit, that equals Xilinx’s claim of about 2 million old-school LUT-4s.
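The arithmetic behind the headline number is easy to check. This sketch assumes the structure described above (two slices per CLB, four 6-input LUTs per slice) and the 1.6 LUT4-equivalence ratio we believe Xilinx uses; the function names are ours. Working backward from the 1,954,560 figure gives the CLB, slice, and LUT counts it implies.

```python
from fractions import Fraction

SLICES_PER_CLB = 2
LUT6_PER_SLICE = 4
# Assumed "logic cell" multiplier of 1.6, kept exact as 8/5 to avoid float rounding.
LUT4_EQUIV_RATIO = Fraction(8, 5)


def logic_cells(clbs: int) -> int:
    """LUT4-equivalent 'logic cell' count implied by a CLB count."""
    lut6 = clbs * SLICES_PER_CLB * LUT6_PER_SLICE
    return int(lut6 * LUT4_EQUIV_RATIO)


def clbs_for(cells: int) -> Fraction:
    """Invert logic_cells(): the CLB count a quoted logic-cell figure implies."""
    return Fraction(cells) / LUT4_EQUIV_RATIO / (SLICES_PER_CLB * LUT6_PER_SLICE)


clbs = clbs_for(1_954_560)
slices = clbs * SLICES_PER_CLB
lut6 = slices * LUT6_PER_SLICE
print(clbs, slices, lut6)  # 152700 305400 1221600
```

Using exact fractions means any mismatch between a quoted CLB count and a quoted logic-cell count shows up as a non-integer result rather than hiding in floating-point noise.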
Of course, it’s not all about LUTs. The biggest device will pack about 55Mb of block RAM, 2160 DSP slices, 1200 single-ended IOs, 576 differential IO pairs, 24 mixed-mode clock managers, four Gen2 PCI Express blocks, and 36 GTX 10.3 Gbps SerDes transceivers. Think you can use all that at once? Me neither.
The XT family is a bit less extreme on the LUT side of things, but it brings up to 72 of the 13.1Gbps GTH transceivers to the table. That’s a lot of IO bandwidth. Wanna do 400 gig? Each transceiver has both a transmitter and a receiver, so you could gang up 8 GTHs for each 100G of bandwidth in each direction (8 × 13.1 = 104.8 Gbps). That makes 32 transceivers for 400 gig in and 400 gig out, with 40 left over for frosting. Of course, there’s a lot more to implementing a 400-gig application than just totaling the transceivers, but the basic arithmetic looks favorable.
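The per-direction lane count is a simple ceiling division. This is a back-of-the-envelope sketch only: the function name is ours, and it uses the raw 13.1Gbps line rate while ignoring line-coding overhead, framing, and all the protocol logic that sits behind the SerDes.

```python
import math


def lanes_needed(target_gbps: float, lane_gbps: float) -> int:
    """Minimum number of serial lanes whose combined raw line rate covers target_gbps."""
    return math.ceil(target_gbps / lane_gbps)


GTH_GBPS = 13.1  # maximum GTH line rate quoted for Virtex-7 XT

per_100g = lanes_needed(100, GTH_GBPS)  # 8 lanes (8 x 13.1 = 104.8 Gbps)
per_400g = 4 * per_100g                 # four 100G groups -> 32 lanes per direction
print(per_100g, per_400g)
```

Sizing 400G as one undivided pool would nominally need only 31 lanes; keeping each 100G channel in its own group of 8 is what brings the count to 32.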
Maybe you’re thinking that these huge devices are impressive, but what you’d really like is the performance of the current 40nm Virtex-6 at a price you can afford for your applications. That’s where Kintex-7 comes in. If you read across the product table, Kintex-7 looks a lot like a 28nm version of Virtex-6. The advantage? Lower cost and lower power. Think of a factor of two on both counts, and you won’t be far off. Many applications that would have loved Virtex-6, but couldn’t afford the premium on price or power consumption, will now be able to jump on the FPGA bandwagon.
For those of you who loved Spartan before, and even for many of you who didn’t, the new Artix-7 family cuts power in half and cost by an estimated 35% compared with Spartan-6. Xilinx is pushing the Artix-7 family for high-volume, cost- and power-sensitive applications as diverse as consumer devices, avionics, and portable ultrasound.
We chatted with Moshe Gavrielov, Xilinx President and CEO, about the 7 Series launch. Gavrielov has been at the helm of Xilinx for a little over two years, and this is his first major new product family announcement. His excitement about the new family is evident. “In the past, the FPGA industry was guilty of ‘crying wolf’ about FPGAs being suitable replacements for ASIC and ASSP,” Gavrielov recalls. “Now, we’re actually able to deliver on that in a meaningful way.” Replacing high-end ASICs and ASSPs with FPGAs requires densities that we are really only seeing now for the first time, and performance-to-power ratios that have only recently become achievable in programmable logic.
“We put a significant amount of effort with this family in lowering power consumption,” Gavrielov continues. “Power is the key.” Looking at the choices Xilinx made with this family and the recent enhancements to their tool suite, including new clock-gating technology and other power optimization techniques, it is clear that the company believes power is the gating item in delivering more density and performance to FPGA users. Gavrielov points out that processors hit the power wall first, forcing a switch to multi-core processing rather than chasing larger, faster monolithic devices. ASICs and ASSPs have hit their own power limitations as well. For FPGAs, packing more capability into the same size package has to be matched by a corresponding reduction in power per function; otherwise total power climbs with every new generation. Without that, devices will fail, chassis will not support the thermal loads, and the whole house of cards will come crumbling down.
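Why clock gating pays off follows from the standard first-order model of CMOS dynamic power, P = α·C·V²·f, where α is the switching activity. Keeping the clock away from registers that aren’t doing useful work lowers α without touching the clock frequency. The Python sketch below applies that textbook formula to made-up capacitance, voltage, frequency, and activity numbers; it is not a model of any Xilinx device or of how the ISE clock-gating optimization actually works.

```python
def dynamic_power_w(activity: float, cap_farads: float, vdd: float, freq_hz: float) -> float:
    """First-order CMOS dynamic power: P = alpha * C * V^2 * f."""
    return activity * cap_farads * vdd ** 2 * freq_hz


# Hypothetical design: every number here is an illustrative assumption.
C_SWITCHED = 2e-9   # total switchable capacitance, farads
VDD = 1.0           # core supply, volts
F_CLK = 300e6       # clock frequency, hertz

baseline = dynamic_power_w(0.20, C_SWITCHED, VDD, F_CLK)
# Suppose gating idle registers removes a quarter of the toggling (activity 0.20 -> 0.15).
gated = dynamic_power_w(0.15, C_SWITCHED, VDD, F_CLK)

# At a fixed power budget, the savings could instead fund more logic,
# assuming the added logic has the same gated activity profile.
budget_ratio = baseline / gated

print(f"baseline {baseline:.3f} W, gated {gated:.3f} W, "
      f"{budget_ratio:.2f}x more logic in the same budget")
```

The last line is the density argument in miniature: every watt saved per function is headroom that can be spent on more functions within the same thermal envelope.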
The new 28nm families from Xilinx will not see production until early 2011, with samples available in Q1, and early-access ISE Design Suite software has already been shipped to a limited number of early adopter customers and partners. Given the lead times for decision-making and design-in for today’s systems, we need to be thinking about the implications right away.