feature article

FP?A

The Quest for the Best Building Blocks

In olden times, when digital dinosaurs roamed the vast plains of our circuit boards – when 22V10s walked the earth in vast herds – programmable logic devices were essentially nothing more than routing.  By creating an array of programmable interconnect, you could essentially hard-wire any complex combinational logic function.  PLDs quickly evolved into FPGAs, however, as it became clear that more structural variation was required than an AND/OR matrix could provide, and that sequential behavior was highly desirable in programmable logic.

FPGA, as we all know, stands for “Field Programmable Gate Array.”  However, the elements arrayed in a typical FPGA are not “Gates” at all.  Most FPGAs use some variant of a look-up table (LUT) as their basic building block.  A LUT is a simple PROM (much like a tiny version of early PLDs) that maps every combination of input values to an output value based on a truth table.  By loading various data into the truth table, a LUT can be made to emulate any combinational logic gate.  Most FPGA LUT cells also contain a register or flip-flop to store or latch the output value, and often some additional logic for structures like carry chains.
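The truth-table idea above is easy to demonstrate in a few lines. Below is a minimal sketch of a 4-input LUT modeled as a tiny PROM – the names and the 16-bit configuration values are illustrative assumptions, not any vendor's actual cell or bitstream format:

```python
# Hypothetical model of a LUT4 as a truth-table PROM. The 16-bit init value
# plays the role of the configuration bits loaded into an SRAM-based LUT.

def make_lut4(init_bits):
    """Return a function emulating a 4-input LUT configured with init_bits.

    init_bits: 16-bit integer; bit i holds the output for input combination i.
    """
    def lut(a, b, c, d):
        index = (d << 3) | (c << 2) | (b << 1) | a  # inputs form the PROM address
        return (init_bits >> index) & 1
    return lut

# The same hardware becomes two different "gates" just by changing the data:
and4 = make_lut4(0b1000_0000_0000_0000)  # output 1 only when all inputs are 1
xor4 = make_lut4(0b0110_1001_1001_0110)  # odd parity of the four inputs

print(and4(1, 1, 1, 1))  # 1
print(xor4(1, 0, 1, 1))  # 1
```

Any combinational function of four inputs – all 65,536 of them – is reachable this way, which is exactly why the LUT makes such a versatile building block.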

To build an FPGA, you plopped an array of LUTs down on your chip and stitched them together with a bunch of programmable switch-box-style interconnect.  You’d then surround the array with a flock of versatile I/O cells and stretch some clocking scheme across the whole thing.  Once that architecture was in place, the big question became “What kind of LUT works best?”  There was considerable research and debate on this topic in the case of SRAM-based FPGAs. The industry concluded early on that the 4-input LUT (LUT4) was the optimal structure for efficient implementation of most types of logic, with a reasonable tradeoff between resources consumed by logic and routing.

The choice of LUT4 was based on several things, however, including a homogeneous architecture, a particular silicon process node, and the performance of synthesis and placement-and-routing software of the day.  Exploring to find the perfect building block, it turns out, is a complicated, somewhat empirical process.  Most researchers start with a selection of netlists that represent “typical” designs, and then they modify FPGA synthesis and place-and-route tools to implement those designs in various (fictitious) architectures.  For each architecture, the timing performance, routing completion rate, number of levels of logic, and resource utilization efficiency can be estimated.  This process allows the merits of various designs to be tested without the tremendous expense of fabricating silicon.  Still, it requires either access to the “internals” of commercial synthesis and routing software or the development of comparable in-house software for use in the test mule.
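The real studies described above instrumented full synthesis and place-and-route tools, but the flavor of the tradeoff can be sketched with closed-form estimates. The functions and numbers below are back-of-the-envelope assumptions for illustration only – depth and LUT count for a simple tree covering of an n-input logic cone, plus the configuration SRAM each candidate consumes:

```python
import math

# Illustrative architecture-exploration sketch: compare candidate LUT widths
# using crude tree-covering estimates (assumptions, not measured tool results).

def levels(n_inputs, k):
    """Logic depth: a tree of k-input LUTs covering an n-input function."""
    return max(1, math.ceil(math.log(n_inputs, k)))

def lut_count(n_inputs, k):
    """LUTs needed for that tree: each LUT replaces k signals with one."""
    return max(1, math.ceil((n_inputs - 1) / (k - 1)))

def config_bits(n_luts, k):
    """Configuration SRAM consumed: 2**k bits per LUT."""
    return n_luts * (2 ** k)

# Compare candidate LUT widths on a "typical" 32-input cone of logic:
for k in (3, 4, 5, 6):
    n = lut_count(32, k)
    print(f"LUT{k}: {levels(32, k)} levels, {n} LUTs, {config_bits(n, k)} config bits")
```

Even this toy model shows the shape of the problem: wider LUTs cut logic depth (fewer levels, less routing delay) while configuration cost per cell grows exponentially – the balance point depends on the relative cost of logic, routing, and configuration in a given process.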

It is an interesting side note that the need for this “research” version of synthesis software is probably what motivated FPGA vendors to develop their own synthesis tools in the first place – the same synthesis tools that now compete directly with commercial EDA tools.  Because EDA vendors were unwilling or unable to give FPGA vendors source-level access to their internal mapping algorithms for research purposes, the FPGA vendors had to come up with another option.  That “other option” was to build their own full-fledged synthesis systems.  Now life is tough for EDA companies selling FPGA synthesis tools because they must maintain a clear differentiation between their offerings and the “free” tools included in the FPGA vendors’ design kits.

The ideal scenario for best performance and area efficiency requires close cooperation between the array architecture, the logic synthesis technology that maps generic logic to that architecture, and the placement and routing software that completes the physical design tasks.  Both synthesis and place-and-route involve NP-hard optimization problems, so the algorithms and heuristics involved in them vary widely in performance and are heavily affected by the target architecture and choice of basic building blocks.  Mapping algorithms that give fabulous results for a LUT3-based architecture (one with 3-input look-up tables as the basic cells) might give terrible results targeting LUT4- or LUT5-based fabrics.

Also, unlike in the ASIC world, the placement and routing landscape for FPGAs is highly non-linear.  In ASIC, a given Manhattan distance (the rectilinear distance between two points on a chip, which approximates ideal routing) would have a fairly constant routing capacitance and corresponding net delay.  In FPGA, however, the wide variety of path types in the interconnect fouled classical placement and routing algorithms, requiring design tool developers to create algorithms much more tailored to particular on-chip interconnect structures.  Again, this more tightly wedded the trinity of basic logic cell, synthesis technology, and place-and-route.
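The contrast between the two delay landscapes can be sketched numerically. In the toy model below, the segment lengths and picosecond figures are invented for illustration (no real device is being modeled): the ASIC estimate grows linearly with Manhattan distance, while the FPGA estimate jumps in steps as the router switches between fixed-length wire segments:

```python
# Sketch contrasting an ASIC-style linear delay model with the "lumpy" delays
# of FPGA segmented interconnect. All delay numbers are invented assumptions.

def manhattan(p, q):
    """Rectilinear distance between two points on the chip."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def asic_delay(p, q, ps_per_unit=10):
    """ASIC approximation: delay grows roughly linearly with distance."""
    return manhattan(p, q) * ps_per_unit

# Hypothetical FPGA wire segments: (reach in tiles, delay in ps per hop)
SEGMENTS = [(6, 120), (2, 60), (1, 40)]  # long lines first

def fpga_delay(p, q):
    """Greedy hop model: cover the distance with fixed-length segments."""
    dist, total = manhattan(p, q), 0
    for reach, ps in SEGMENTS:
        hops, dist = divmod(dist, reach)
        total += hops * ps
    return total

# Nearby vs. distant nets: the FPGA model is non-linear in distance.
print(asic_delay((0, 0), (1, 0)), fpga_delay((0, 0), (1, 0)))   # 10 40
print(asic_delay((0, 0), (7, 0)), fpga_delay((0, 0), (7, 0)))   # 70 160
```

A classical placer that assumes delay scales smoothly with distance makes systematically wrong decisions against a delay surface like this – which is why FPGA tools had to model the actual interconnect structure.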

Meanwhile, FPGA vendors such as Actel, who were producing FPGAs not based on SRAM technology, faced their own challenges.  For their fabric, a tile something like a LUT3 cell (which could also be used as a register) turned out to be optimal.  Getting high-quality mapping for this architecture from tools already optimized for SRAM’s LUT4 turned out to be a major challenge.  These vendors had to fight for engineering attention from design tool suppliers to get mapping algorithms tuned for the vagaries of their architecture that would yield results of the quality that SRAM vendors were enjoying with their LUT4 architectures.

In more recent process geometries, SRAM vendors have had to revise their previous assumptions about the tradeoff between LUT width and efficiency.  Both Altera and Xilinx now use something like a 6-input LUT as the basic building block for their high-end FPGA fabrics.  Factors like faster gate delays, longer routing delays, higher leakage current in the configuration logic, and much larger arrays of objects have altered the trade-off space enough that more inputs now seem to work significantly better.

As geometries got smaller, another part of the equation changed dramatically.  Now, although most FPGA devices have a single LUT structure for their basic fabric, the number and diversity of “objects” in the array has gone far beyond “gates” or even “LUTs.”  Today’s FPGAs are arrays of LUTs, multipliers or DSP cells, memory objects, processors, and even analog components.  While this heterogeneous mix of objects poses significant new challenges for synthesis and place-and-route software, the potential gains offered by hard-wired, optimized IP are too great to pass up.  Most complex FPGA designs today utilize vast amounts of hardened IP in addition to the LUT fabric.  This makes quantification of the performance and capacity of modern FPGAs – and particularly comparison of various arrays – almost impossible.  Of course, that’s just the way the marketing departments like for the world to work…

All of this questioning about the ideal block to array and the move toward arrays of heterogeneous objects has even brought a number of startups to challenge the fundamental assumption of LUT as building block.  What if a different, higher-level object were used as the basic building block of a programmable array?  Companies like Ambric, MathStar, and Stretch (just to name a few) have pursued that concept with arrays of various types of high-level elements ranging from highly specialized ALUs to full-blown von Neumann processors with customizable instruction sets. 

For certain classes of applications, each of these approaches can demonstrate spectacular performance and efficiency.  However, the challenge faced by each is not the question of the perfect building block, but the marriage between that hardware architecture and the design software that goes with it.  Getting an application or algorithm from the engineer’s mind and desk into an efficient implementation in any hardware configuration requires a great deal of sophisticated software that must be honed over the course of hundreds to thousands of designs.  For these architectures, the development of that software and of the ecosystem that surrounds it has just begun.

This is where traditional LUT-based FPGAs have an almost insurmountable advantage.  The long history of engineering optimization of that architecture – and particularly of the design software that accompanies it – gives it a huge advantage over hardware architectures that might appear, and might even benchmark, to be superior.  Much like the legacy internal-combustion engine and its triumph over other, seemingly superior approaches, the advantages of a new design are negated by the long history of improvement of a less-optimal one.
