A few years ago, one FPGA vendor, Actel, was quietly shouting in the corner. “Hey! Single event upsets (SEUs) are a big problem for FPGAs!”
The other FPGA companies replied with a thoughtful technical analysis of the situation: “Hey, Actel – SHUT UP!”
OK, maybe that’s not exactly the way it went down, but the idea is basically right. You see, Actel’s history is in super-high-reliability FPGAs for use in space. Up in space, there are lots of tiny particles flying around with a lot of energy. When one of those particles hits a vulnerable part of an IC (like a storage element of some kind), it can flip the bit from one to zero or zero to one. As your razor-sharp digital design mind might be telling you right now, this is really bad.
Now, all digital design technologies contain some form of memory element – registers, flip-flops, RAM… so you might wonder why this problem is particularly bad for FPGAs. In addition to all those “normal” uses of storage elements, FPGAs also use memory-like cells to store basic configuration information like ROUTING.
So, in addition to possibly modifying your data, an SEU in an FPGA can randomly alter your design itself. This is very, very bad news. For the regular memory elements like registers, there were already techniques in use that could mitigate the errors. Triple modular redundancy (TMR), for example, uses three memory elements to store one bit of information, and it has voting circuitry that detects and corrects errors when they occur. If one copy gets hit with an SEU, the other two out-vote it and your design continues without issue. Regular memory, of course, can be protected with established error-correcting code (ECC) techniques. State machines can be protected from bit-flipping in state registers by choosing fault-tolerant encoding schemes.
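To make the voting idea concrete, here is a minimal software sketch of a TMR register. It is only a behavioral model in Python, not HDL and not any vendor's implementation; the names `majority`, `TmrRegister`, and `upset` are made up for illustration.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise two-out-of-three vote: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


class TmrRegister:
    """Hypothetical model: one logical value stored in three physical copies."""

    def __init__(self, value: int = 0) -> None:
        self.copies = [value, value, value]

    def write(self, value: int) -> None:
        self.copies = [value, value, value]

    def read(self) -> int:
        voted = majority(*self.copies)
        # Rewrite all three copies with the voted value so a single upset
        # does not linger and pair up with a later one.
        self.copies = [voted, voted, voted]
        return voted

    def upset(self, copy_index: int, bit: int) -> None:
        """Simulate an SEU by flipping one bit in one copy."""
        self.copies[copy_index] ^= 1 << bit


reg = TmrRegister(0b1011)
reg.upset(copy_index=1, bit=2)  # one copy now holds 0b1111
print(bin(reg.read()))          # the other two copies out-vote it
```

Note the limit of the scheme: the vote only wins if two copies are still correct. If the same bit flips in two copies before the next read, the wrong value wins.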
However, for FPGAs, the vast majority of vulnerable memory cells are used in the configuration logic for things like routing and look-up table programming. The reason Actel was shouting loudly about this is that their two FPGA technologies, antifuse and flash, have configuration elements that are basically not vulnerable to SEUs. When asked about this, the larger FPGA companies – whose FPGAs use SRAM-like cells for configuration – quickly replied “Hey, did you know that our FPGAs have 11.2% more equivalent look-up tables than our competitors’?”
The big FPGA companies didn’t ignore the problem completely, however. Xilinx, in particular, has product offerings for space, mil/aero, and other high-reliability applications. They even had FPGAs in the Mars rovers. Without much public noise, they were taking their FPGAs to the neutron firing range in New Mexico and blasting them with all kinds of particles to see what shook loose. They were then developing techniques to help mitigate these radiation effects, including a “readback” technique for finding and correcting configuration errors. In this technique, the device’s configuration is regularly read back and compared against a reference and the device is re-programmed any time a discrepancy is found.
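The readback idea is simple enough to sketch in a few lines. The model below assumes the configuration can be treated as a list of words and that a known-good reference copy is kept somewhere safe; `find_upsets` and `readback_pass` are hypothetical names, and a real device would go through its configuration port rather than a Python list.

```python
from typing import List


def find_upsets(live: List[int], golden: List[int]) -> List[int]:
    """Return the indices of configuration words that differ from the reference."""
    return [i for i, (a, b) in enumerate(zip(live, golden)) if a != b]


def readback_pass(live: List[int], golden: List[int]) -> List[int]:
    """One readback cycle: compare against the reference, reprogram on any mismatch."""
    upsets = find_upsets(live, golden)
    if upsets:
        # Model of "re-program the device": reload the whole configuration.
        live[:] = golden
    return upsets


golden = [0xA5, 0x3C, 0xFF, 0x00]
live = list(golden)
live[2] ^= 0x08                     # simulate an SEU in configuration word 2
print(readback_pass(live, golden))  # reports which words were upset
print(live == golden)               # configuration matches the reference again
```

The catch, in this sketch as in real life, is the window between passes: the design runs with a corrupted configuration until the next readback notices.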
While far from perfect, the radiation mitigation in conventional SRAM FPGAs has proven good enough for space use in many cases. There is a large number of SRAM FPGAs in the radiation-intensive environment of space at this very moment. Xilinx followed with special FPGAs designed specifically for radiation tolerance, so their confidence in their ability to curtail radiation effects in space seems high.
But what about down here on the ground?
On the ground, of course, there is much less radiation. This is a good thing, because if the atmosphere didn’t protect us, we’d be in all kinds of trouble – and not just with our electronic devices. However, even a small change in elevation can dramatically increase susceptibility to radiation effects. It turns out that an elevation of just a few thousand feet – like, say, that of Denver, Colorado – can 4x your chances of radiation-induced errors. Furthermore, it turns out that materials used in chip packaging can emit particles that can cause errors. Yep, your own solder blob can irradiate your device and cause a logic error. Best be careful with that lead-based solder!
Xilinx actually quietly studied this phenomenon in detail. They did a series of experiments starting somewhere around 2005 called “Rosetta” which consisted of putting a bunch of FPGAs on a big board and letting it run continuously for a long time – watching for radiation-induced errors. Based on their results, it looks like the worst generation for firm-error susceptibility in Xilinx FPGAs was the 130nm generation, and that progress has been made since then in mitigating these effects. No data is available yet for the 28nm products, as these tests take a lot of time.
Without hard work in layout and IC design tricks, each generation of ICs would be more susceptible to SEUs than the last. Lower voltages mean that less energy is required to flip a bit, and increased density means there are more targets to hit in configuration logic. It’s a constant battle between IC design and technology progress to keep SEUs at ground level under control. With each generation, if the fabs and the FPGA designers can’t come up with a breakthrough, your SEU susceptibility could go through the roof without warning. My guess is that you don’t check for SEU tolerance before you design in an FPGA, and if you’re using the latest process generation, the FPGA company may not have any meaningful data to share with you anyway. You’re on your own.
Ahem. Hey Kevin, you buried the headline.
Right! I did. Sorry. If you’re concerned about reliability of systems at ground level, new hope is here from Synopsys. The most recent version (2012.03) of their Synplify Premier FPGA synthesis tool includes robust support for high-reliability design, including SEU mitigation. Up until now, even though there were established techniques for SEU mitigation, it was VERY hard to get them into your FPGA design. If you wanted TMR for a register, for example, you had to design it explicitly in your HDL. Then an overly helpful synthesis tool might come along and say “Hey, look at all these extra gates that don’t do anything for the logic. Let’s optimize them away.” Oops.
Synopsys high-rel support includes three things: First, it does automatic generation of TMR (when you want it) and suppression of the tool’s desire to automatically optimize the TMR away. Second, it will infer error-correcting (ECC) RAM. Finally, it can generate fault-tolerant finite state machines (FSMs) so your design won’t wander off down a dusty path if an SEU hits a state register at an inopportune time. The tool generates Hamming-3 encoded FSMs, which will provide a much safer landing if a state bit happens to flip.
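Here is a rough sketch of why a Hamming-distance-3 state encoding helps. It assumes a made-up four-state machine with hand-picked 5-bit codes; the state names, the codes, and the `recover_state` helper are all illustrative and say nothing about the encodings the tool actually generates. Because every pair of valid codes differs in at least three bits, a register value that is one bit away from a valid code can only have come from that code.

```python
from itertools import combinations
from typing import Iterable, Optional

# Hypothetical 5-bit state codes; every pair differs in at least 3 bits.
STATE_CODES = {
    "IDLE": 0b00000,
    "LOAD": 0b00111,
    "RUN": 0b11001,
    "DONE": 0b11110,
}


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def min_distance(codes: Iterable[int]) -> int:
    """Smallest pairwise Hamming distance in a set of codes."""
    return min(hamming_distance(a, b) for a, b in combinations(codes, 2))


def recover_state(register_value: int) -> Optional[str]:
    """Return the state whose code is at most one bit away, or None if there is none."""
    for name, code in STATE_CODES.items():
        if hamming_distance(register_value, code) <= 1:
            return name
    return None


print(min_distance(STATE_CODES.values()))     # 3 for the codes above
corrupted = STATE_CODES["RUN"] ^ 0b00001      # SEU flips bit 0 of the state register
print(recover_state(corrupted))               # still resolves to RUN
```

The price is extra state bits (five instead of two here), and the protection only covers a single flipped bit: two flips can land within one bit of a different valid state.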
Synopsys is not the first FPGA tool vendor to supply this sort of capability. Mentor Graphics rolled out similar features a couple of years ago in Precision Hi-Rel, a product targeted at the high-rel Mil/Aero market. Synopsys is now bringing these capabilities to the broader market with Synplify Premier.
There is a cost to all this safety, of course. TMR, for example, is a very gate-intensive solution. It more than triples the amount of logic for a register, so you don’t want to go using it unless you have a serious concern about SEUs in your design. For the record, if you’re designing anything that goes in an airplane in which I’m flying, I vote for you using TMR, and the other high-rel features too. Thanks.
8 thoughts on “Solving the Big Secret”
FPGA companies don’t talk too much about SEUs, and with good reason. They don’t happen very often, but when they do they’re kinda scary. Luckily, there are now design tools that can help us mitigate the effects of radiation in our FPGA designs.
What do you think?
“SEUs don’t happen very often”? Yes, maybe, but most space companies in Russia still use Virtex-4 and sometimes Virtex-2(!!!).
I have no chance of bringing in the new Zynq family with ARM in place of Virtex-4 and PPC.