Let’s say you are looking for a new house for your family. You’ve got a couple of contenders. One has four bedrooms, three baths, a two-car garage, and 3,000 square feet of living area. The other has three bedrooms, three baths, a three-car garage, and 3,200 square feet of living area.
Lining the two data sheets up, the houses are comparable. One shows a bit more living area, the other has an additional bedroom (which you would just use for a guest room anyway), and the additional garage isn’t much of a factor, since your family owns only two cars.
Weighing the two choices based on the data sheet makes sense – until you start reading the fine print. House #1, it turns out, doesn’t actually have 3,000 square feet. To get that number, they included a section of the yard that is covered by a roof, and the square footage number is “effective square feet.” Another footnote says that they have estimated the effective square feet based on a “livability factor,” since they deem the living space to be extra-efficient.
Reading further on house #2, there is a footnote saying that the heating system will support only two of the bedrooms being occupied at any one time. And – one of the bathrooms actually contains a bed, so it is counted as both a bedroom and a bathroom in the info sheet.
Welcome to the wonderful world of FPGA product tables.
When you shop for an FPGA for your project, you’ll see that the FPGA companies generously provide product selectors that tell you what resources are available on their chips. The problem is the details that are hidden in the fine print – and the ones that are not in print at all. Let’s start with the capacity of the FPGA itself. One family boasts “up to 480,000 logic cells.” OK, cool. Drill down to the fine print and the answer changes to “up to 478,000 logic cells.” Drilling down yet another level, we are told the number of logic cells is actually 477,760. Well, that’s just rounding up, right? And it’s less than a 1% difference, so why be picky?
But those 478,000 cells absolutely do not exist. Looking over one column, we see that the device physically contains 74,650 “slices.” Dropping to the footnotes, we see that a slice is made up of four LUTs and eight flip-flops. Multiplying 74,650 slices by four LUTs per slice, we get 298,600 actual LUTs. Whoa! OK, that’s not just rounding. How do we turn 298,600 LUTs into 480,000? Well, back in the (very) old days, FPGAs used four-input LUTs. Newer ones use something like six-input LUTs. So, if we (generously) scale the number of six-input LUTs to an equivalent number of legacy four-input ones, we’d still get only about 450,000, and that’s assuming that we get perfect utilization of the extra inputs. The plot thickens…
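The arithmetic above is easy to check for yourself. Here is a minimal Python sketch that uses only the figures quoted in the text. The function names are made up for illustration, and the 6/4 scaling from six-input to four-input LUTs is an assumption about how such an equivalence might be computed, not any vendor's published formula.

```python
# Figures quoted in the text for one device.
SLICES = 74_650
LUTS_PER_SLICE = 4
MARKETED_LOGIC_CELLS = 477_760


def physical_luts(slices: int, luts_per_slice: int) -> int:
    """Count the LUTs that physically exist on the die."""
    return slices * luts_per_slice


def scaled_equivalent(luts: int, inputs_new: int = 6, inputs_old: int = 4) -> float:
    """Scale a LUT count by the ratio of input counts (an assumed, generous rule)."""
    return luts * inputs_new / inputs_old


luts = physical_luts(SLICES, LUTS_PER_SLICE)
print(f"physical LUTs:            {luts:,}")
print(f"generous 4-input equiv.:  {scaled_equivalent(luts):,.0f}")
print(f"marketed cells per LUT:   {MARKETED_LOGIC_CELLS / luts:.2f}")
```

Run against these numbers, the marketed figure works out to about 1.6 logic cells per physical LUT, which is more than even the generous input-count scaling would give.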
Now let’s say you want to try to use those LUTs. This may come as a shock, but you can never use 100% of the LUTs on your FPGA. Typically, the routing resources can’t fully route a design that occupies anywhere near that many LUTs. If you’re clocking them very fast, you’ll also bump into power limitations. In fact, many designers tell us that they don’t get more than 60%-70% utilization in practice. So if we take the favorable 70% number, we’re looking at around 210K actual usable physical LUTs – on a device marketed as 480K.
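To see how far the usable number falls from the headline, here is a rough Python sketch of the derating described above. The 60% and 70% ceilings are the anecdotal figures from the text, not guarantees, and the helper name is hypothetical. Your own design may land somewhere else entirely.

```python
PHYSICAL_LUTS = 74_650 * 4   # slices x LUTs per slice, from the data sheet
MARKETED_CELLS = 480_000     # headline "logic cells" figure


def usable_luts(physical: int, utilization: float) -> int:
    """Estimate the LUTs a real design can occupy at a given utilization ceiling."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return round(physical * utilization)


# Anecdotal utilization ceilings reported by designers (assumed, not measured).
for ceiling in (0.60, 0.70):
    usable = usable_luts(PHYSICAL_LUTS, ceiling)
    share = usable / MARKETED_CELLS
    print(f"{ceiling:.0%} utilization: {usable:,} LUTs ({share:.0%} of the marketed figure)")
```

Even at the favorable ceiling, the usable LUT count is well under half of the marketed logic-cell number.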
That’s 210K unless, of course, you want to use some for “distributed RAM.” You see, when they’re trying to pump up the memory stats, they allow that you might want to use some of the LUT fabric to make memory instead of LUTs. You can have the LUTs or the RAM, but not both at the same time.
Life is more than LUTs and RAM, though. Today’s FPGAs have a wealth of other resources included. Take DSP blocks, for example. You’ll see some pretty impressive GMAC numbers given for FPGAs used in digital signal processing. Unfortunately, most of those numbers are idealized figures that you’d never see in real life. For example, if an FPGA boasts 1,000 DSP blocks (where each DSP block contains one or more hard-wired multipliers and some accumulator/arithmetic and carry circuitry), vendors typically calculate the published GMAC number by multiplying the number of multipliers by the maximum operating frequency of those multipliers. If you manage to craft a real, useful design that comes even close to that situation, a lot of people would love to talk to you about engineering employment opportunities.
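The gap between the published figure and a real design can be sketched with a little arithmetic. In the Python below, the peak formula is the one described above (multipliers times maximum frequency). Everything else is a hypothetical assumption chosen purely for illustration: the 500 MHz maximum, the 300 MHz achieved clock, the 70% of blocks in use, and the 80% of cycles doing useful work.

```python
def peak_gmacs(multipliers: int, fmax_mhz: float) -> float:
    """Data-sheet style peak: every multiplier busy on every cycle at fmax."""
    return multipliers * fmax_mhz / 1_000.0


def sustained_gmacs(multipliers: int, clock_mhz: float,
                    fraction_used: float, duty_cycle: float) -> float:
    """Rough real-world estimate: fewer blocks, a slower clock, some idle cycles."""
    return multipliers * fraction_used * clock_mhz * duty_cycle / 1_000.0


# Hypothetical device: 1,000 DSP blocks with one multiplier each.
MULTIPLIERS = 1_000
peak = peak_gmacs(MULTIPLIERS, fmax_mhz=500.0)            # assumed fmax
real = sustained_gmacs(MULTIPLIERS, clock_mhz=300.0,      # assumed achieved clock
                       fraction_used=0.70,                # assumed block usage
                       duty_cycle=0.80)                   # assumed busy cycles

print(f"published-style peak: {peak:.0f} GMAC/s")
print(f"illustrative design:  {real:.0f} GMAC/s ({real / peak:.0%} of peak)")
```

The specific percentages do not matter much. The point is that each realistic derating multiplies against the others, so the sustained figure drops quickly.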
How about IO, though? The vendors are always bragging about their huge SerDes bandwidth. You’ll see large numbers of transceivers capable of blistering-fast speeds (up to 28Gbps each on the current 28nm generation of devices). The thing is, with all that data coming into the chip, you need to be able to do something useful with it. That means you need lots of fast internal resources like LUTs, memory, and DSP blocks. In many of today’s devices, the SerDes bandwidth exceeds what the rest of the FPGA is capable of, for anything but the most well-behaved, straightforward designs. All that SerDes looks good on paper, but if you can’t use all of it, it’s just taking up expensive silicon area, increasing your cost, and leaking power.
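One way to get a feel for the mismatch is to work out how wide the internal datapath must be to keep up with a transceiver. This Python sketch assumes a hypothetical 250 MHz fabric clock and a hypothetical 16 lanes. It ignores line-coding and protocol overhead, so treat it as a back-of-the-envelope estimate only.

```python
import math


def aggregate_gbps(lanes: int, gbps_per_lane: float) -> float:
    """Total raw serial bandwidth across all transceivers."""
    return lanes * gbps_per_lane


def datapath_bits_per_lane(gbps_per_lane: float, fabric_clock_mhz: float) -> int:
    """Parallel bus width needed so the fabric keeps pace with one lane."""
    return math.ceil(gbps_per_lane * 1_000.0 / fabric_clock_mhz)


LANES = 16                 # hypothetical transceiver count
LANE_RATE_GBPS = 28.0      # per-lane rate quoted in the text
FABRIC_CLOCK_MHZ = 250.0   # assumed achievable fabric clock

width = datapath_bits_per_lane(LANE_RATE_GBPS, FABRIC_CLOCK_MHZ)
print(f"aggregate bandwidth: {aggregate_gbps(LANES, LANE_RATE_GBPS):.0f} Gbps")
print(f"bus width per lane:  {width} bits at {FABRIC_CLOCK_MHZ:.0f} MHz")
print(f"total bus width:     {width * LANES} bits")
```

Every one of those bits has to be registered, routed, and processed on every fabric clock, which is where the LUTs, memory, and DSP blocks run out long before the transceivers do.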
Chatting with a number of designers, we hear that under-utilizing FPGAs is pretty much an industry norm. If you’ve been using FPGAs for a while, you tend to mostly ignore the datasheet numbers and plan your design based on experience and preliminary output from the tools. If the tools say you can route your design, take advantage of the resources you need, and hit your power budget, then you can feel pretty comfortable with your selection of devices.
But, why have all those resources there in the first place if you can’t use them?
Well, first there are bragging rights and the reality of competition between the vendors. If one vendor has a million-cell FPGA, the other one needs to have 1.1 million. Specsmanship is an important part of marketing. Also, because of the wide variation in designs, each design may leave a different set of resources on the table. One design may max out the DSP blocks but not need all the LUT fabric. Another may be limited by the amount of RAM. Many are at the mercy of total IO pins or bandwidth. FPGA companies spend an amazing amount of engineering just trying to find the right balance of resources that will best serve the widest possible audience.
One area that has long been an architectural Achilles’ heel, however, is that of routing resources. Putting more routing on the chip is expensive. If you design an FPGA with so much routing resource that you can always route 100%, you’ve wasted a tremendous amount of space. Balancing the available routing with the other resources requires exhaustive trial-and-error with a large number and variety of designs. FPGA companies typically iterate with their proposed architecture through a huge test suite, adjusting the balance of resources each time until they hit a point where they get acceptable utilization on a diverse set of realistic designs. Xilinx has announced that their upcoming family includes a major rework of routing resources – aimed at letting us hit much higher utilization numbers than with previous families.
Certainly, the language and norms on FPGA specifications have become distorted over the years. Simply having the capacity defined in terms of an anachronistic architecture as a pseudo industry standard is confusing enough. Add to that the reality that almost no design will be able to come close to a perfect, balanced utilization of the resources on any given FPGA, and the situation can be downright bewildering. There seems to be hope, however, in the direction the FPGA companies are taking, both with their design and with their marketing messages. It would be wonderful to be out of the era of marketing-driven specsmanship and into a new age of useful metrics for choosing the best part for our design work.
Until that day, the only strategy is to use the tools to get realistic fit estimates. Your designs, with your constraints, are the best (and only) sure-fire models that will tell you whether you can succeed with a particular device.
The use of bloated marketing numbers to define the FPGA size is nothing new. And in fact, the “logic cell” number is a better yardstick for most designs than the old “system gates” number. At least I can find a multiplier that gets me from logic cells to LUTs. Still, you’re right about needing to run the design through the tools to get the final picture. Often it’s gotchas like clock routing (yeah, you get 32 global clock buffers, but only 16 can reach any section of a chip) or other shared routing resources (Oh, you wanted to attach two adjacent clock pins to two PLLs, or have two adjacent I/Os running DDR on different clocks?). It’s been a long time since I was able to rely on a data sheet to tell me everything I needed to know about programmable logic.