Exquisite Chipse

We’d all like to be able to buy arbitrarily big chips.

Wouldn’t it be cool if FPGAs were not big, monolithic devices, but instead, little Lego-like mini-die that we could clip together to make form-fit Franken-FPGAs that were just the size we need for a particular application?

Well, we can’t.

But Xilinx is moving us one step closer to that, and they are using the technology to build some mighty big FPGAs at 28nm. The company is announcing that some of their upcoming 28nm Virtex-7 FPGAs will be made using “stacked silicon interconnect” – a new packaging technique that lets them build extra-large FPGAs from multiple slices. By “extra-large,” we mean the equivalent of 2 million four-input look-up tables (LUT4s).

The first thing you should know about stacked silicon interconnect is – it isn’t stacked. If you were picturing a nice club sandwich of stacked die – with FPGAs, bacon, lettuce, micro-bumps, tomatoes, and through-silicon vias on rye bread – you need to flatten out that picture a bit. Picture instead a big base die that has only metal interconnect. This die – called the “silicon interposer” – connects several FPGA slices to each other and to the package. The FPGA slices are arranged side-by-side on top of the interposer (there’s your “stacking”) and connected to it by micro-bumps. This scheme allows approximately 10,000 connections between adjacent FPGA slices, routed through the metal layers of the interposer. Compared with putting two FPGAs side-by-side in conventional packages, that is at least 10x more connections between the FPGAs and, just as importantly, these connections behave like on-chip interconnect instead of chip-to-chip interconnect (which has huge overhead in silicon area, power, and delay).

(Image courtesy of Xilinx)

The first question you might ask is – why?

On any given silicon process, as you make the individual die bigger and bigger, yield drops. Unfortunately, it drops exponentially. So, as you pack more logic onto a single chip, you hit a point – or more like a wall – where the chip is no longer economical to produce. By fabricating smaller monolithic “slices” and combining them into bigger FPGAs, the effective yield goes up. Higher yield means lower cost at the same density, or higher density at the same cost. Of course, interposers, through-silicon vias, and micro-bumps aren’t free, so we have to balance out the additional cost of the advanced packaging against the reduced cost due to increased yield. Xilinx says there are some additional benefits from the interposer layer – like solving some tricky problems with packaging high-k metal gate silicon.
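To put rough numbers on that argument, here is a minimal sketch using the textbook Poisson yield model, Y = exp(-A × D). Nothing in it comes from Xilinx: the defect density, the die area, and the function names are all hypothetical, and the model assumes slices are tested before assembly and ignores the cost and yield of the interposer, micro-bumps, and packaging.

```python
import math

def poisson_yield(area_cm2: float, defects_per_cm2: float) -> float:
    """Fraction of good die under the simple Poisson model Y = exp(-A * D)."""
    return math.exp(-area_cm2 * defects_per_cm2)

def silicon_cost_per_good_device(total_area_cm2: float, n_slices: int,
                                 defects_per_cm2: float,
                                 cost_per_cm2: float = 1.0) -> float:
    """Silicon cost of one working device built from n_slices good die.

    Assumes each slice is tested before assembly, so a bad slice wastes
    only its own area. Interposer and assembly costs are NOT included.
    """
    slice_area = total_area_cm2 / n_slices
    slice_yield = poisson_yield(slice_area, defects_per_cm2)
    # Each good slice effectively costs (area * cost) / yield.
    return n_slices * slice_area * cost_per_cm2 / slice_yield

# Hypothetical numbers chosen only to illustrate the shape of the curve.
DEFECT_DENSITY = 0.5   # defects per cm^2 (assumed)
TOTAL_AREA = 6.0       # cm^2 of FPGA silicon per device (assumed)

monolithic = silicon_cost_per_good_device(TOTAL_AREA, 1, DEFECT_DENSITY)
four_slices = silicon_cost_per_good_device(TOTAL_AREA, 4, DEFECT_DENSITY)

print(f"monolithic : {monolithic:.1f} cost units")
print(f"4 slices   : {four_slices:.1f} cost units")
print(f"ratio      : {monolithic / four_slices:.1f}x")
```

Under these assumed numbers the silicon for the sliced device comes out far cheaper, which is the headroom the extra packaging cost has to fit inside. Real defect densities and die sizes would shift the result considerably.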

What does it mean to the average FPGA user?

Xilinx claims that the design flow for these devices will be the same as the design flow for monolithic FPGAs. If we fuzz our eyes, we can see some basis for this claim. The company’s FPGAs have used a column-based architecture (called ASMBL) for the past couple of generations. With the ASMBL architecture, the design tools already had to deal with resources arranged in columns with varying types of interconnect required for various routes – depending on whether the connection was in the same CLB, in the same column, or elsewhere on the chip. The introduction of the interposer layer may add some additional cost functions to the placer and router, but it shouldn’t change the fundamental design approach.

However, even though 10,000 connections between slices is a lot, it’s probably still a lot less than the number of routes that would have existed between those columns on a monolithic device of the same density. Therefore, some additional intelligence is required in the placer, the router, and potentially from the user (via floorplanning) to be sure that the interposer isn’t overtaxed. Behind the scenes, life is likely to be more difficult for design tools trying to sort out issues like placement, global routing, and clock distribution. We expect there is the potential for lower utilization for some categories of designs that might not partition cleanly into the number of slices in a particular device. To help out with that problem, Xilinx is boosting their PlanAhead design planning tool – which should help with allocating resources to slices as well as the larger problem of managing the design of 2 million LUTs – a process that will generally require teams of designers instead of one super-engineer.
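The kind of check a placer or floorplanner would need can be sketched in a few lines. This is purely illustrative and not how the Xilinx tools work: it assumes slices sit in a single row, that every net spanning a boundary consumes one interposer connection there, and it takes the roughly 10,000-connection figure quoted above as the limit. All names are hypothetical.

```python
from collections import Counter

def crossings_per_boundary(nets, slice_of):
    """Count nets crossing each boundary between adjacent slices.

    nets: iterable of lists of cell names, one list per net.
    slice_of: dict mapping cell name -> slice index (0, 1, 2, ...).
    Slices are assumed to sit in a row, so a net spanning slices
    lo..hi uses every boundary between lo and hi.
    """
    usage = Counter()
    for cells in nets:
        indices = [slice_of[c] for c in cells]
        lo, hi = min(indices), max(indices)
        for b in range(lo, hi):
            usage[(b, b + 1)] += 1
    return usage

def overtaxed(usage, limit=10_000):
    """Return the boundaries whose crossing count exceeds the limit."""
    return {b: n for b, n in usage.items() if n > limit}

# Tiny hypothetical design: four cells placed across three slices.
placement = {"a": 0, "b": 0, "c": 1, "d": 2}
netlist = [["a", "b"], ["b", "c"], ["a", "d"]]

usage = crossings_per_boundary(netlist, placement)
print(dict(usage))
print(overtaxed(usage, limit=1))  # artificially low limit for the demo
```

A real tool would fold a count like this into its placement cost function rather than checking it after the fact, but the quantity being protected is the same.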

If all goes well, then, designing with one of these new devices may feel exactly like designing with a current-generation monolithic FPGA, only bigger. The benefits should be larger devices at lower prices – available sooner in the production cycle of the 28nm process. That is, assuming that Xilinx did the math correctly and that the effective yield increase more than balances the additional cost, complexity, and risk associated with 3D (or perhaps more accurately 2.5D) packaging techniques.

This brings up the question, then, of what is the new limit? If the economics of yield were the previous barrier in determining the biggest FPGA we could buy, and this technique takes yield off the table, what drives the new limit? Xilinx says that power/heat dissipation is the likely new villain. The number of LUTs you can put in one package is capped by the heat you can dissipate. Of course, there are other limitations lurking out there too – like the number of pins you can put on a package.

Xilinx says this project is the result of five years of work – with test vehicles going back to their 90nm process. The company points out that – even though this is the first time all of these techniques have been used together in something like an FPGA – each step of the process is already mainstream in some application. Therefore, they say, the risk is minimized. Certainly putting together the supply chain was a huge challenge for Xilinx. Xilinx did the design work, of course, and TSMC is the foundry for both the 28nm FPGA slices and the silicon interposer. The package substrate comes from Ibiden, and the micro-bumps, die separation, chip-on-chip (CoC) attach, and assembly are completed by Amkor. Xilinx then tests the final, packaged part.

Customers with applications like ASIC prototyping and emulation should love the density increase and the density-per-cost improvement. In large prototyping boards and emulators, partitioning a design across multiple FPGAs is a major challenge. Making that partitioning transparent within a single device, or increasing the size of those partitions for multi-device platforms, will simplify their lives considerably. Customers in other domains like high-performance computing and the traditional high-bandwidth networking space also stand to gain from the new technology, although it’s less clear whether or not the stacked die approach will run into limitations with their complex, highly interconnected designs.

As intriguing as the technology seems today, there are future possibilities that are even more interesting. Clearly, Xilinx is taking a “walk before running” approach by using the interposer to connect identical or very similar slices. However, once the packaging techniques are mastered, the possibility of stitching together different types of devices fabricated on different types of processes is even more exciting. It is easy to imagine FPGA slices mixed with memory, flash, and even other FPGA slices fabricated with different design trade-offs (high-speed versus low-power, for example). Also, by fabricating slices with varying mixtures of hard IP, one could create a broad range of FPGAs tailored for different applications – made from just a few mask sets. Finally, we could see dedicating slices to ASSP-like functions or high-performance processing subsystems, with FPGA fabric adding flexibility and hardware-programmability to the SoC. As with any discontinuous innovation, the ideas that come to mind far outpace what the industry will be able to deliver any time soon.

Of course, by doing this, Xilinx adds risk and complexity to an already complicated and high-risk endeavor. Already locked in a perennial race with arch-rival Altera in getting each new process node to market, Xilinx has clearly put more obstacles in their own path by tackling the stacked silicon packaging challenge. The reward, they hope, is higher yields, bigger devices, and lower prices. Altera already has years of yield-improving experience with their own memory-like redundancy – which the company has long claimed gives them a competitive advantage in yield, particularly early in the run of a new process. Even though Xilinx and Altera are now using the same foundry (TSMC), the differences are somehow bigger than ever – with the two companies using different architectures, different yield enhancement techniques, and different packaging strategies.

With this announcement, the high end of the FPGA market is seriously heating up. Xilinx and Altera are engaged in their usual war of words and wafers, and now with Achronix bringing a clearly uninvited Intel to the party, there is bound to be an unprecedented amount of interesting news in the FPGA space in the coming months. We can hardly wait.
