An EDA Foil Hat

We are all under attack. Don’t bother hiding the kids; there is no escape. Well, not much, anyway. A foil hat won’t be enough to protect them, and they’d be totally abused at school in a full-body foil outfit.

This constant bombardment isn’t news; it’s the familiar neutron (amongst other particles) assault that comes from space or the materials around us. And it’s just waiting to mess up the system you designed.

There’s a proliferation of “Df’s,” and, joining DfT and DfM (referring to design for test and manufacturing, respectively), we have DfR – Design for Reliability: making a system that has less of a chance of failing. There are no guarantees – it’s virtually (even literally?) impossible to create something that has exactly (not asymptotically) a zero failures-in-time (FIT) rate. It’s a matter of designing to a target FIT rate.
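To put a target FIT rate in perspective: one FIT is one failure per billion device-hours. The sketch below converts a FIT rate into a mean time between failures and into an expected failure count across a fleet. The chip FIT rate and fleet size are made-up numbers for illustration, not figures from iROC.

```python
# 1 FIT = 1 failure per 10^9 device-hours.
HOURS_PER_BILLION = 1e9
HOURS_PER_YEAR = 24 * 365


def fit_to_mtbf_hours(fit):
    """Mean time between failures, in hours, for a single device."""
    return HOURS_PER_BILLION / fit


def expected_failures(fit, devices, hours):
    """Expected number of failures across a fleet over a period of time."""
    return fit * devices * hours / HOURS_PER_BILLION


# Hypothetical values, chosen only to make the arithmetic visible.
chip_fit = 100.0
fleet_size = 100_000

print(fit_to_mtbf_hours(chip_fit))
print(expected_failures(chip_fit, fleet_size, HOURS_PER_YEAR))
```

A single chip at that rate looks nearly immortal, but a large fleet of them sees failures regularly, which is why the target matters more as the installed base grows.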

This is the process that iROC Technologies is trying to help, with tools to figure out how vulnerable a circuit is and then assist with addressing any issues.

We’ve all heard that there’s a problem; iROC provided a little more background on the nuclear physics behind the problem. They list three major effects that cause issues:

A neutron whacking Si can result in Mg or Al plus some leftover goodies:
- n + Si → ²⁵Mg + α
- n + Si → ²⁸Al + p⁺
- n + Si → ²⁴Mg + n + α
Packaging impurities emit α particles
Boron, a common dopant, can turn into Li:
- n_th + ¹⁰B → ⁷Li + α

The products of these reactions then knock out electrons, which, given enough of them, can cause mischief. Scaling of circuits has had a mixed effect on this. The energy required to perturb something has gone down, so smaller particle packets can cause more mischief. On the other hand, transistors have gotten smaller, reducing the target size along with the probability that any one will get hit. But there are also a lot more transistors on each chip, so the net risk is on the rise.
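The scaling tug-of-war comes down to a product: the chip-level upset rate is roughly the per-transistor rate times the number of transistors. A minimal sketch, with entirely hypothetical numbers for two process generations:

```python
def chip_upset_rate(per_transistor_fit, transistor_count):
    """Chip-level rate, assuming every transistor contributes equally."""
    return per_transistor_fit * transistor_count


# Hypothetical: the newer node halves the per-transistor rate
# but packs in ten times as many transistors.
old_node = chip_upset_rate(per_transistor_fit=1e-4, transistor_count=10_000_000)
new_node = chip_upset_rate(per_transistor_fit=5e-5, transistor_count=100_000_000)

print(old_node, new_node, new_node / old_node)
```

Even with each transistor half as likely to be hit, the chip as a whole comes out about five times worse under these assumptions.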

The result of all of this is lumped under the general category of “single event effects,” or SEEs. These have different specific names depending on what’s affected:

Flip-flops flipping: single-event upset (SEU)
Combinational-logic glitch: single-event transient (SET)
Memory corruption: soft error rate (SER)

iROC is focusing on cloud farms as their primary target, since they’re predicted to have the highest silicon growth rate. (Mobile has the second highest growth rate, and mobile apps are driving cloud demand as well.) Their point is that, if Facebook fails due to a glitch in the underlying hardware, it affects millions of people. (Although the chance that a single failure would completely take Facebook down seems pretty remote, even by these standards, given the redundancy available in server farms…)

Their solution consists of analysis first, followed by mitigation on any problem circuits. And this is the gist of their recent announcement: they’ve released a couple of tools for analyzing circuits and systems. The first is called TFIT; the second SoCFIT. You can think of them as analogous to TCAD and EDA, respectively. They’ve actually had these tools for a while in beta form, so, technically, this is TFIT 2 and SoCFIT 3.

TFIT is intended to work on smaller circuit blocks, with up to 30 or 40 transistors. It works with an encrypted model that you get from your (yes, you have to be a customer) foundry once you demonstrate that you have a TFIT license – they apparently don’t give this stuff out to just any customer.

The tool analyzes the circuit in around 15 minutes (or 30-50 minutes for a multi-cell memory) using SPICE. It delivers a cell FIT rate and identifies which transistors are sensitive or may have caused an upset. Clearly the first goal is to harden these low-level cells, but, once that’s done, the accumulated output of this tool creates a database that can then be used with SoCFIT.
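As a rough picture of how a per-cell database could feed a chip-level number, the sketch below sums each cell's FIT rate times the number of times it is instantiated. The `CellFit` record, the cell names, and all values are hypothetical; the actual TFIT output format is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellFit:
    """Hypothetical database record: one characterized cell."""
    name: str
    fit_per_instance: float


def raw_design_fit(cell_db, instance_counts):
    """Sum cell FIT x instance count; no masking or derating applied."""
    return sum(
        cell_db[name].fit_per_instance * count
        for name, count in instance_counts.items()
    )


# Made-up characterization results and instance counts.
cell_db = {
    "dff": CellFit("dff", 2e-4),
    "sram_bit": CellFit("sram_bit", 5e-5),
}
instance_counts = {"dff": 1_000_000, "sram_bit": 64_000_000}

print(raw_design_fit(cell_db, instance_counts))
```

This is only the raw, worst-case total: it treats every upset as a failure.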

SoCFIT is – as you might have guessed – a larger-scale tool for SoCs. Its input is the design itself and test vectors. The trick here is that most such tools want to run at the gate level, but it’s really hard for designers to work at the gate level. It’s like asking software engineers to analyze assembly code that their compiler created. The ideal is to work at the RTL level.

The problem is that, when you have only RTL, you don’t have transistors yet since you haven’t gone through synthesis. So iROC has a way of “estimating” a netlist from RTL; they’ve gotten good correlation between results using this estimated netlist and the actual netlist, although the uncertainties on the FIT rates are higher.

The result of SoCFIT is a Pareto chart of problem areas in the circuit. They also provide a colored map (kind of like a heat map) that shows where the problems are. The tool also does “derating analysis.” Derating is the cumulative effect of masking; masking is a situation where something in the circuit burps, but downstream logic prevents that burp from propagating and causing any damage. Perhaps a mux was choosing a different input, for instance, so the anomaly stopped at the mux. All of these masking effects result in an overall probabilistic derating on the FIT rate to account for the fact that not all upsets matter.
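Derating and the ranking of problem areas can be sketched in a few lines. This is not iROC's algorithm: it assumes the masking mechanisms are independent, so their pass-through probabilities simply multiply, and the block names and numbers are invented.

```python
def derated_fit(raw_fit, masking_probabilities):
    """Scale a raw FIT rate by the chance an upset survives every mask.

    Assumes the masking mechanisms are independent of one another.
    """
    surviving = 1.0
    for p in masking_probabilities:
        surviving *= 1.0 - p
    return raw_fit * surviving


def pareto(blocks):
    """Rank blocks by derated FIT, with cumulative share of the total."""
    effective = {
        name: derated_fit(raw, masks) for name, (raw, masks) in blocks.items()
    }
    total = sum(effective.values())
    ranked = sorted(effective.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0.0
    for name, fit in ranked:
        running += fit
        rows.append((name, fit, running / total))
    return rows


# Hypothetical blocks: (raw FIT, list of masking probabilities).
blocks = {
    "alu": (400.0, [0.5]),
    "regfile": (300.0, [0.2, 0.5]),
    "decoder": (100.0, [0.9]),
}

for name, fit, share in pareto(blocks):
    print(f"{name:8s} {fit:7.1f} FIT  cumulative {share:.0%}")
```

The cumulative-share column is what makes the list a Pareto: it shows how few blocks account for most of the remaining risk, which is where mitigation effort pays off first.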

The first SoCFIT run is typically performed on the circuit in a plain-vanilla manner to get a general result. This is then followed up by further runs during which faults are injected to refine the original results.
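To see what fault injection buys you, here is a toy version using the mux example: flip one input for every possible input vector and count how often the output is unchanged. The exhaustive loop and the `masking_fraction` helper are illustrative assumptions; a real flow would inject faults while running the design's own test vectors.

```python
from itertools import product


def mux(a, b, sel):
    """2:1 multiplexer: passes a when sel is 0, b when sel is 1."""
    return b if sel else a


def masking_fraction(circuit, n_inputs, fault_index):
    """Fraction of input vectors for which flipping one input is masked."""
    masked = 0
    total = 0
    for vector in product((0, 1), repeat=n_inputs):
        golden = circuit(*vector)
        faulty = list(vector)
        faulty[fault_index] ^= 1  # inject a single bit flip
        if circuit(*faulty) == golden:
            masked += 1
        total += 1
    return masked / total


# Flip input a (index 0): it matters only when sel selects a.
print(masking_fraction(mux, n_inputs=3, fault_index=0))
```

With all input vectors equally likely, half the flips on input a never reach the output, because the mux is looking at b at the time. Measured fractions like this are what turn a plain-vanilla estimate into a refined derating figure.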

These tools then give you an understanding of where the weaknesses in your design lie. The obvious follow-on question is, “Well, then what?” This is where mitigation comes in, and, at this time, iROC doesn’t have specific IP that they have shrink-wrapped for sale as hardened cells. That may happen at some point, but, for now, they consult to help designers take those problem areas and improve them.

Because the tools start with a target FIT rate, you end up reducing the risk of over-design, since the tools will point out only issues with respect to achieving that FIT rate. It’s not a one-size-fits-all, design-everything-for-deep-space approach. That’s important both from the obvious standpoint of not wanting to spend more time designing than necessary and from a cost reduction standpoint: many mitigation techniques increase the area of the circuit, raising the cost. Such additions must be made judiciously.

So, while you can’t stop the particle attack, you can toughen up your hide and make it harder for the little buggers to have their way with you or your circuit. And iROC is betting that there’s lots of toughening up to be done.

It sure beats looking like a nutcase in a foil hat.

More info:

iROC