
Gearing Up for Rain

New Soft-Error-Tolerant Circuits

We don’t seem to mind the rain much. I mean, yeah, we’ll carry an umbrella if we think it might rain (except for places like Seattle where an umbrella is the signature of the tourist). But, to that point, really… what’s the harm of a little water?

A lot of water can be a problem. We’re not fish, after all. But a few drops in the hair is a long way from a long, cool drink in Lake Superior. Heck, we’ll even survive a dousing with a few gallons of Gatorade; pour it on!

Imagine, however, that you’re an ant. (If you have trouble visualizing this, Pixar’s A Bug’s Life gives a good depiction.) Getting hit with a drop of rain would be more or less like an entire dump truck full of Gatorade landing on one of us at once. In fact, if the ants are small enough and packed together, each drop is going to wipe out more than one ant.

And, at that level, bigger isn’t necessarily better. A bigger ant is simply a bigger target, more likely to get hit. It gets better only once the ant is big enough to be able to shrug off a direct hit.

Now picture each of those ants as tiny transistors packed together cheek by jowl in a leading-edge SoC. And, instead of April showers, picture a cosmic hailstorm: neutrons from afar hurtling through space, smashing headlong into one or more of those transistors. A particle collision creates a charge packet, and, if that packet is greater than a certain critical level (called Qcrit), then, well, bad things can happen. And because of the lower voltages we have today, there’s less margin for error.

Such a cataclysm is given the rather unsensational name “single event upset” (SEU). And these upsets can be caused by more than just peregrine neutrons. In fact, until packaging chemistry was improved, the biggest source was alpha particles originating in the encapsulating plastic. These days, it’s the neutron that’s having its moment in the spotlight.

Particles having origins in space become even more of an issue the farther you get from Earth’s surface, since, the closer to space, the less atmosphere there is to temper the onslaught. So, while this is a consideration for earth-bound systems, it’s critical (and has been for decades) for aerospace systems. Not only are those systems more vulnerable, but the consequences of failure are also greater in terms of life (if an airplane goes awry) and/or cost (if an expensive satellite fails).

The ability to market SEU-tolerance is, you can imagine, important for anyone trying to participate in this market. And there are lots of tried-and-true techniques that have carried us this far. Examples are triple-module redundancy (translation: do everything three times, and, if one screws up, the other two can out-vote it) and the extensive use of error-correcting codes (ECC). Notably, these techniques don’t prevent SEUs; they just let a system recover. Actel (now Microsemi) has long had success marketing their anti-fuse technology as inherently more robust than SRAM technology for FPGAs for users that really need that robustness.
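The triple-module redundancy idea can be sketched in a few lines. This is a behavioral illustration only (the module and the injected fault are hypothetical stand-ins, not anything from the papers discussed here): run the same computation three times, and let a bitwise 2-of-3 vote outvote a single upset copy.

```python
# Illustrative sketch of triple-module redundancy (TMR).

def majority(a, b, c):
    """Bitwise 2-of-3 vote: a flipped bit in any one copy is outvoted."""
    return (a & b) | (a & c) | (b & c)

def tmr(compute, x, faults=(0, 0, 0)):
    """Run `compute` three times; `faults` are XOR masks modeling SEUs."""
    results = [compute(x) ^ f for f in faults]
    return majority(*results)

# One copy suffers a single-event upset (bit 3 flips); the vote corrects it.
square = lambda x: x * x
assert tmr(square, 7) == 49
assert tmr(square, 7, faults=(0, 0b1000, 0)) == 49
```

Note that, true to the article’s point, the voter corrects the result downstream; it does nothing to prevent the upset itself.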

But nowadays we’re at the point where, to paraphrase Heimlich, all our transistors look like little ants. And we can’t always count on correcting after the error happens: that can be expensive at the least, and it has to be done right there in hardware; it can’t be bailed out by software later on.

So there was a session at this year’s ISQED devoted to developments in SEU-tolerant circuits. And most of the discussion dealt with storage elements, since that’s been viewed as the biggest problem. If something is supposed to store a 1 and, when called on, serves up a 0 because it got goosed while sitting there, well, that just won’t do.

The conventional wisdom has been that combinatorial circuitry is less of an issue because SEUs can be “masked.” Logical masking involves adding redundant logic (like masking glitches caused by logic hazards) so that one errant node is backed up by some other node. Electrical masking relies on the fact that, as a glitchy signal goes through several buffer stages, it is gradually dampened and may never pose a problem.
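Electrical masking can be pictured with a crude first-order model: a glitch too narrow to fully switch a buffer stage loses amplitude there, and after enough stages it may drop below the switching threshold and never reach a flip-flop. The attenuation rule and the numbers below are invented purely for illustration; a real analysis would come from circuit simulation.

```python
# A toy model of electrical masking: narrow SEU glitches are attenuated
# at each buffer stage whose propagation delay exceeds the pulse width.

def propagate(pulse_width_ps, amplitude_v, stage_delays_ps, vdd=1.0):
    for delay in stage_delays_ps:
        if pulse_width_ps < delay:
            # Pulse too narrow to fully switch this stage: attenuate it.
            amplitude_v *= pulse_width_ps / delay
        if amplitude_v < 0.5 * vdd:       # below the switching threshold
            return 0.0                    # glitch electrically masked
    return amplitude_v

# A 20 ps glitch dies in a chain of 30 ps buffers...
assert propagate(20, 1.0, [30, 30, 30]) == 0.0
# ...but a 60 ps glitch sails through the same chain untouched.
assert propagate(60, 1.0, [30, 30, 30]) == 1.0
```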

One team, however, felt that it’s getting harder to count on these techniques for mainstream high-performance logic. Warin Sootkaneung and Kewal K. Saluja of the University of Wisconsin, Madison pointed out that, in highly pipelined architectures, the layer of combinatorial logic in each stage is intentionally very thin. Redundant logic adds cost and slows things down, and there is less chance for a glitch to disappear before it hits the input of the next flip-flop. So masking is less effective.

Their approach wasn’t so much one of avoiding the problems outright, but of minimizing the impact of transistor sizing. At this level, bigger transistors are bigger targets, so their philosophy is: upsize only what really needs to be bigger, and don’t upsize more than absolutely necessary. They presented two algorithms, one for each of those considerations.
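The flavor of that philosophy can be caricatured as a greedy loop: upsize only the nodes whose critical charge falls short of the target, and only by the minimum step that gets them over it. This is entirely illustrative, with invented numbers; the paper’s two algorithms are more sophisticated than this sketch.

```python
# A toy version of "upsize only what needs it, and no more than necessary."

def minimal_upsizing(nodes, qcrit_target, qcrit_per_step, area_per_step):
    """nodes: {name: baseline Qcrit}. Returns upsizing steps per node
    and the total silicon area added (bigger transistors = bigger targets,
    so we want this as small as possible)."""
    plan, extra_area = {}, 0.0
    for name, qcrit in nodes.items():
        steps = 0
        while qcrit + steps * qcrit_per_step < qcrit_target:
            steps += 1
        if steps:
            plan[name] = steps
            extra_area += steps * area_per_step
    return plan, extra_area

nodes = {"n1": 1.2, "n2": 0.6, "n3": 0.9}   # baseline Qcrit in fC (invented)
plan, area = minimal_upsizing(nodes, qcrit_target=1.0,
                              qcrit_per_step=0.25, area_per_step=0.1)
assert plan == {"n2": 2, "n3": 1}   # n1 is already safe and stays small
```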

The other papers focused on storage elements of one kind or another. A consistent theme is the fact that much of the problem is with static circuits just holding data. In that case, you’re dealing with high-impedance nodes, which can be easily bounced around. During reading and writing, things are more aggressively driven, potentially overwhelming any bogus charge packets.

Sudipta Sarkar et al of the Indian Institute of Science, Bangalore, the University of Wisconsin, Madison, and the University of Tokyo presented an SRAM cell that has four times the Qcrit of a standard circuit.

They focus on the observation that a transistor is susceptible when its drain and body are at different voltages. That observation reduces the number of circuit states that must be protected: if the drain and body are at the same voltage (say, when a PMOS pull-up is on), then that node isn’t vulnerable.

They designed a 10-T cell that places one extra PMOS transistor with a grounded gate between the pull-up and pull-down on each side. This added transistor forms a resistor divider with the pull-down NMOS, and the point between them is used to drive the mirror pull-down. By carefully sizing the transistors, they guarantee that that node can never rise high enough to turn on the NMOS.
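The sizing argument amounts to a back-of-the-envelope divider check: the grounded-gate PMOS (on-resistance R_top) in series with the pull-down NMOS (R_bottom) forms a resistive divider, and the tap between them must stay below the mirror NMOS’s threshold voltage. The resistance and threshold numbers here are invented for illustration, not taken from the paper.

```python
# Divider check for the 10-T cell's protected node (illustrative numbers).

def divider_tap(vdd, r_top, r_bottom):
    """Voltage at the node between two series on-resistances."""
    return vdd * r_bottom / (r_top + r_bottom)

VDD, VTH = 1.0, 0.4   # supply and mirror-NMOS threshold, invented
# Size the pull-down NMOS strong (low resistance) relative to the
# grounded-gate PMOS so the tap can never reach VTH while holding state:
v_tap = divider_tap(VDD, r_top=30e3, r_bottom=10e3)
assert v_tap < VTH    # 0.25 V: the mirror pull-down stays safely off
```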

Of course, you might think that would be a problem if you actually want to flip the state. But the divider protects that node only while the cell is sitting around holding state; a different pull-up provides sufficient voltage when writing. They also do careful sizing elsewhere to ensure that all nodes are protected when holding data and when reading and writing.

Sandeep Sriram et al of the Illinois Institute of Technology focused on a basic latch that operates safely near the sub-threshold region. Central to their solution was what they called an alternative-feedback stacked CMOS (AFSC) circuit. Picture your basic CMOS inverter. Then add another P and another N to the stack, and drive those with inverted feedback from the output. Now the stack can’t turn on unless the input and output are in opposite states. If a glitch occurs that puts the input and output in the same state, the stack won’t turn on.
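The AFSC behavior can be captured in a little truth-table model. This is a behavioral sketch, not the transistor-level circuit: the stack drives the output (like a normal inverter) only when input and output disagree, and goes high-impedance, simply holding the node, when a glitch makes them agree.

```python
# Behavioral model of the alternative-feedback stacked CMOS (AFSC) idea.

def afsc(inp, out):
    """Return the newly driven output value, or None if the stack is
    off and the node just holds its charge."""
    pull_up   = (inp == 0) and (out == 1)   # both PMOS in the stack on
    pull_down = (inp == 1) and (out == 0)   # both NMOS in the stack on
    if pull_up:
        return 1
    if pull_down:
        return 0
    return None  # high impedance: the glitch cannot propagate

assert afsc(0, 1) == 1     # input and output differ: inverter drives
assert afsc(1, 0) == 0
assert afsc(0, 0) is None  # a glitch made them agree: stack stays off
assert afsc(1, 1) is None
```

In the normal held state the input and output of an inverter are opposite anyway, so the stack conducts and keeps the node driven; only a glitch puts the circuit into the high-impedance cases.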

This circuit was inserted into the feedback paths that are normally used to maintain the state of the latch; one was also added between the “storage” and the output. Conceptually, they break these paths, since glitches on one side can’t propagate to the other. But they still hold those nodes in place (rather than letting them float, which would happen if you literally broke the feedback path).

Here again, this helps while the circuit is just sitting around holding state. The AFSCs prevent any SEU from propagating through the feedback to change the state of the cell. Instead, the original state will be restored once the glitch dissipates. When writing a new value, however, the desired nodes are driven by signals coming through transmission gates, bypassing the AFSC blockages.

Finally, David Li et al from the University of Waterloo addressed both SEUs and metastability in a new flip-flop design. From an SEU standpoint, they showed two existing circuits, called a DICE cell and a Quatro cell, respectively, that help with SEU tolerance. While configured differently, they basically consist of four transistors, and, if one of them is perturbed, the other three help to restore the original state.

It was combining this with the metastability-hardening, in the context of a low-power, high-performance flip-flop, that was the challenge. Fundamentally, they made the master deal with metastability and had the slave deal with SEU tolerance. Because the Quatro cell does better with differential circuits, it was used in the slave rather than the DICE cell.

They had two versions of their proposed flip-flop. One was called a “pre-discharge flip-flop with soft-error protection” (PDFF-SE), and the other was called a “sense-amp transmission gate [flip-flop] with soft-error protection” (SATG-SE). Both make use of a cross-coupled master, which they say allows them to size the transistors up for both high performance and high transconductance, the latter being critical to reducing susceptibility to metastability.

The reason for two different configurations is that the PDFF-SE version does better when performance is critical; the SATG-SE version does better when power is the primary concern. The balancing required between metastability, speed, and power led them to propose two new figures of merit: the metastability-delay product (MDP) and the metastability-power-delay product (MPDP). The metastability term in these formulas is τ, the exponential time constant that governs how quickly a metastable flip-flop resolves.
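Taking the names at face value, the two figures of merit are simple products; I’m assuming the straightforward reading here (τ times delay, and τ times power times delay), and the flip-flop numbers below are invented purely to show the arithmetic.

```python
# The two proposed figures of merit, sketched as formulas.

def mdp(tau_ps, clk_to_q_ps):
    """Metastability-delay product: smaller is better."""
    return tau_ps * clk_to_q_ps

def mpdp(tau_ps, power_uw, clk_to_q_ps):
    """Metastability-power-delay product: folds power into the trade-off."""
    return tau_ps * power_uw * clk_to_q_ps

# Two hypothetical flip-flops: the lower-tau, faster design wins on MDP
# even though it burns a bit more power.
ff_a = dict(tau_ps=10, power_uw=5.0, clk_to_q_ps=80)
ff_b = dict(tau_ps=15, power_uw=4.0, clk_to_q_ps=95)
assert mdp(ff_a["tau_ps"], ff_a["clk_to_q_ps"]) < \
       mdp(ff_b["tau_ps"], ff_b["clk_to_q_ps"])
```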

The team analyzed their circuits in comparison to several other variants (including some sporting the venerable C2MOS technology) and found that theirs beat all of them on both MDP and MPDP. In fact, they did very respectably even with respect to plain-old power-delay product.

Together, these proposals provide our transistor ants either with better ways of avoiding the rain or with better waterproof helmets that let them survive the occasional drop.

I refer you to the actual papers for the gory details… (unfortunately, they’re not available for free if you didn’t attend).
