FP?A

In olden times, when digital dinosaurs roamed the vast plains of our circuit boards and 22V10s walked the earth in vast herds, programmable logic devices were essentially nothing more than routing. By creating an array of programmable interconnect, you could essentially hard-wire any complex combinational logic function. PLDs quickly evolved into FPGAs, however, as it became clear that more structural variation was required than an and/or matrix could offer, and that sequential behavior was highly desired from programmable logic.

FPGA, as we all know, stands for “Field Programmable Gate Array.” However, the elements arrayed in a typical FPGA are not “Gates” at all. Most FPGAs use some variant of a look-up table (LUT) as their basic building block. A LUT is a simple PROM (much like a tiny version of early PLDs) that maps every combination of input values to an output value based on a truth table. By loading various data into the truth table, a LUT can be made to emulate any combinational logic gate. Most FPGA LUT-cells also contain a register or flip-flop to store or latch the output value, and often some additional logic for structures like carry chains.
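To make the truth-table idea concrete, here is a minimal sketch in Python of a K-input LUT. The names `LUT` and `program_lut` are invented for this illustration and do not correspond to any vendor's tools; the sketch models only the combinational look-up, not the optional register or carry logic.

```python
from itertools import product


class LUT:
    """A K-input look-up table: 2**K stored bits addressed by the inputs."""

    def __init__(self, k, truth_table):
        if len(truth_table) != 2 ** k:
            raise ValueError("truth table must have exactly 2**k entries")
        self.k = k
        self.table = list(truth_table)

    def __call__(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError("expected %d inputs" % self.k)
        # Treat the inputs as a binary address (first input is the MSB).
        address = 0
        for bit in inputs:
            address = (address << 1) | (1 if bit else 0)
        return self.table[address]


def program_lut(k, func):
    """Hypothetical helper: fill the table by evaluating func on every input combination."""
    bits = [int(bool(func(*combo))) for combo in product((0, 1), repeat=k)]
    return LUT(k, bits)


# The same structure "becomes" different gates depending on the data loaded.
and4 = program_lut(4, lambda a, b, c, d: a and b and c and d)
xor4 = program_lut(4, lambda a, b, c, d: a ^ b ^ c ^ d)
```

Note that a LUT4 stores 16 bits, and each additional input doubles the table size, which is one reason the choice of LUT width is a genuine tradeoff rather than a case of "more is better."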

To build an FPGA, you plopped an array of LUTs down on your chip and stitched them together with a bunch of programmable switch-box style interconnect. You’d then surround the array with a flock of versatile I/O cells and stretch some clocking scheme across the whole thing. Once that architecture was in place, the big question became “What kind of LUT works best?” There was considerable research and debate on this topic in the case of SRAM-based FPGAs. The industry concluded early on that the 4-input LUT was the optimal structure for efficient implementation of most types of logic, with a reasonable tradeoff between resources consumed by logic and routing.

The choice of LUT4 was based on several things, however, including a homogeneous architecture, a particular silicon process node, and the performance of synthesis and placement-and-routing software of the day. Exploring to find the perfect building block, it turns out, is a complicated, somewhat empirical process. Most researchers start with a selection of netlists that represent “typical” designs, and then they modify FPGA synthesis and place-and-route tools to implement those designs in various (fictitious) architectures. For each architecture, the timing performance, routing completion rate, number of levels of logic, and resource utilization efficiency can be estimated. This process allows the merits of various designs to be tested without the tremendous expense of fabricating silicon. Still, it requires either access to the “internals” of commercial synthesis and routing software or the development of comparable in-house software for use in the test mule.

It is an interesting side note that the need for this “research” version of synthesis software is probably what motivated FPGA vendors to develop their own synthesis tools in the first place – the same synthesis tools that now compete directly with commercial EDA tools. Because EDA vendors were not willing or able to give access to the source-code level of internal mapping algorithms to FPGA vendors for research purposes, the vendors had to come up with another option. That “other option” was to build their own full-fledged synthesis system. Now life is tough for EDA companies selling FPGA synthesis tools because they must maintain a clear differentiation between their offering and the “free” tools included in the FPGA vendors’ design kits.

The ideal scenario for best performance and area efficiency requires close cooperation between the array architecture, the logic synthesis technology that maps generic logic to that architecture, and the placement and routing software that completes the physical design tasks. Both synthesis and place-and-route are NP-complete, so the algorithms and heuristics involved in them vary widely in performance and are heavily affected by the target architecture and choice of basic building blocks. Mapping algorithms that give fabulous results for a LUT3-based architecture (one with 3-input look-up tables as the basic cells) might give terrible results targeting LUT4- or LUT5-based fabrics.

Also, unlike in the ASIC world, the placement and routing landscape for FPGAs is highly non-linear. In ASIC, a given Manhattan distance (the rectilinear distance between two points on a chip – that approximates ideal routing) would have a fairly constant routing capacitance and corresponding net delay. In FPGA, however, the wide variety of types of paths in the interconnect fouled classical placement and routing algorithms, requiring design tool developers to create algorithms much more tailored to particular on-chip interconnect structures. Again, this more tightly wedded the trinity of basic logic cell, synthesis technology, and place-and-route.
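A toy model can illustrate the difference. The Python sketch below assumes, purely for illustration, an FPGA whose interconnect offers only wire segments of length 4 and length 1, with a fixed delay per segment hop. The segment lengths, the delay values, and the greedy routing are all hypothetical simplifications; real devices have far richer interconnect and real routers must also deal with congestion.

```python
def manhattan(a, b):
    """Rectilinear distance between two (x, y) grid points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def asic_delay(a, b, delay_per_unit=1.0):
    """ASIC-style estimate: delay grows linearly with Manhattan distance."""
    return delay_per_unit * manhattan(a, b)


def fpga_delay(a, b, segment_lengths=(4, 1), delay_per_hop=1.0):
    """Toy FPGA estimate: delay depends on how many segments are chained.

    Each axis is covered greedily with the longest segments first.
    The segment lengths and per-hop delay are assumed values, not
    taken from any real device.
    """
    hops = 0
    for axis in (0, 1):
        remaining = abs(a[axis] - b[axis])
        for length in sorted(segment_lengths, reverse=True):
            hops += remaining // length
            remaining %= length
    return delay_per_hop * hops


near = ((0, 0), (3, 0))  # 3 units apart
far = ((0, 0), (4, 0))   # 4 units apart

asic_near, asic_far = asic_delay(*near), asic_delay(*far)
fpga_near, fpga_far = fpga_delay(*near), fpga_delay(*far)
```

In this model the ASIC estimate rises from 3.0 to 4.0 as the distance grows, while the FPGA estimate falls from 3.0 to 1.0, because the longer connection happens to fit a single length-4 segment. A placer that assumes "closer is always faster" would be misled, which is the kind of effect that pushed tool developers toward architecture-specific algorithms.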

Meanwhile, FPGA vendors such as Actel, who were producing FPGAs not based on SRAM technology, faced their own challenges. For their fabric, a tile something like a LUT3 cell (which could also be used as a register) turned out to be optimal. Getting high-quality mapping for this architecture from tools already optimized for the SRAM vendors’ LUT4 turned out to be a major challenge. These vendors had to fight for engineering attention from design tool suppliers to get mapping algorithms tuned for the vagaries of their architecture, so that they could see results of the quality that SRAM vendors were enjoying with their LUT4 architectures.

In more recent process geometries, SRAM vendors have had to revise their previous assumptions about the tradeoff between LUT width and efficiency. Both Altera and Xilinx now use something like a 6-input LUT as the basic building block for their high-end FPGA fabric. Factors like faster gate delays, longer routing delays, higher leakage current in the configuration logic, and much larger arrays of objects have altered the tradeoff space enough that more inputs now seem to work significantly better.

As geometries got smaller, another part of the equation changed dramatically. Now, although most FPGA devices have a single LUT structure for their basic fabric, the number and diversity of “objects” in the array have gone far beyond “gates” or even “LUTs.” Today’s FPGAs are arrays of LUTs, multipliers or DSP cells, memory objects, processors, and even analog components. While this heterogeneous mix of objects poses significant new challenges for synthesis and place-and-route software, the potential gains offered by hard-wired, optimized IP are too much to pass up. Most complex FPGA designs today utilize vast amounts of various hardened IP in addition to the LUT fabric. This mix makes quantifying the performance and capacity of modern FPGAs, and particularly comparing various arrays, almost impossible. Of course, that’s just the way the marketing departments like for the world to work…

All of this questioning about the ideal block to array and the move toward arrays of heterogeneous objects has even brought a number of startups to challenge the fundamental assumption of LUT as building block. What if a different, higher-level object were used as the basic building block of a programmable array? Companies like Ambric, MathStar, and Stretch (just to name a few) have pursued that concept with arrays of various types of high-level elements ranging from highly specialized ALUs to full-blown von Neumann processors with customizable instruction sets.

For certain classes of applications, each of these approaches can demonstrate spectacular performance and efficiency. However, the challenge faced by each is not the question of the perfect building block, but the marriage between that hardware architecture and the design software that goes with it. Getting an application or algorithm from the engineer’s mind and desk into an efficient implementation in any hardware configuration requires a great deal of sophisticated software that must be honed over the course of hundreds to thousands of designs. For these architectures, the development of that software and of the ecosystem that surrounds it has just begun.

This is where traditional LUT-based FPGAs have an almost insurmountable advantage. The long history of engineering optimization of that architecture, and particularly of the design software that accompanies it, gives it a huge advantage over hardware designs that might appear to be better, and that might even show better results in benchmarks. Much like the legacy internal-combustion engine and its triumph over other, seemingly superior approaches, the advantages of a new design are negated by the long history of improvement of a less-optimal one.
