
To Err Is Universal

Listening to Alexander Pope (and even Seneca and Cicero long before), you would think that erring is a particularly human trait. Taken out of the original context (which places human fallibility in contrast to the divinity of forgiveness or the diabolicalness of persisting in erring), you might feel singled out by nature as a particularly unreliable entity.

In fact, we are surrounded by errors. They happen all the time. We are but a wee part of a huge, interconnected, error-prone system. Our very existence is a testament to the ability of living forms to survive any number of errors that might occur. The more we study genetics and the geological history of our planet, the more we learn about the range of redundant systems and adaptations that have evolved to bring us this far. (Or, looked at from an anthropic perspective, that had to evolve to get us this far.)

With a few exceptions, however, the machines we humans create have no such adaptability. Whereas the natural world abides variation and randomness, we create deterministic systems that assume a relatively narrow range of operating conditions such that a given input will guarantee a correct output.

And, so far, this has worked pretty well. But, in reality, what we’re doing is approximating the real world and, basically, rounding off the error as insignificant. The problem is, this error is becoming more significant.

Reliability engineers will talk in terms of Failures In Time (FITs); one FIT equals one failure in 10⁹ (one billion) hours – roughly 115,000 years. Long enough to where a lawyer will have a hard time pinning a failure on you and penalizing your progeny.

So how many failures can we tolerate? As described by Synopsys’s Yervant Zorian, for most of the world, this has been about 1000 FITs – especially for networking. But for certain critical areas, it’s much lower: medical systems require 50-FIT levels, and automotive demands a minuscule 0.1 FITs. If you wonder why medical isn’t the lowest, I suppose we can interpret the lower automotive rate as the proverbial ounce of prevention to medical’s pound of cure.

Note that these are system failure rates. The system fails when any one (or more) of its components fails. And the one component that has gotten more attention than any other has been the memory.
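To put those thresholds in perspective, here’s a quick back-of-the-envelope sketch (the component names and budget numbers are invented for illustration): because a failure in any one component fails the system, the FIT rates of independent components simply add, and a FIT rate converts directly into a mean time between failures.

```python
# Illustrative FIT arithmetic. 1 FIT = 1 failure per 10**9 device-hours, and the
# FIT rates of independent components add up to the system's FIT rate.

HOURS_PER_YEAR = 8766  # 365.25 days

def fit_to_mtbf_years(fit):
    """Mean time between failures, in years, for a given FIT rate."""
    return 1e9 / fit / HOURS_PER_YEAR

# Hypothetical budget for a system that must stay under 1000 FITs.
component_fits = {"memory": 600, "logic": 250, "I/O": 100}
system_fit = sum(component_fits.values())  # 950 FITs

print(f"System: {system_fit} FITs, MTBF ~ {fit_to_mtbf_years(system_fit):.0f} years per unit")
print(f"Automotive target of 0.1 FIT: MTBF ~ {fit_to_mtbf_years(0.1):.1e} years per unit")
```

Across a fleet of a million units, a 950-FIT system still fails roughly once an hour somewhere in the field – which is why the budgets are as tight as they are.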

I suppose you could debate whether memory is more important than any other component, although if we think about it in our own terms, losing our own memories can eventually be more debilitating than the loss of some other functions (with the exception of the ability to forget that particular night during that particular pub crawl; such loss can perhaps be considered yet another of nature’s error-coping mechanisms: in this case, one of mercy).

The use of an error-correcting code (ECC) in memory is nothing new, and there’s a fair bit of IP available. But much of the focus of ECC has been for handling manufacturing errors and, typically, single-bit errors. What’s changing is that soft errors in the field, while always possible, are becoming more of an issue, and they affect more than one bit at a time. There are three contributors to this:

  • voltages are coming down, reducing the level of margin;
  • the amount of charge stored in a memory cell is lower, meaning less energy is required to upset the cell;
  • and the size of cells is smaller, meaning that an alpha particle can disturb more than one cell at a time.

Couple this with the huge increases in memory going into systems, and it starts to get very hard to keep FIT rates within manageable levels.

Synopsys’s approach to this takes the form of their newly released DesignWare STAR ECC product (courtesy of their acquisition of Virage). This is essentially a memory compiler that allows engineers to build in a level of correction through additional bits and proper coding and then evaluate the FIT rates that would result. The goal is to achieve the desired FIT rate using the smallest area possible.
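To make “additional bits and proper coding” concrete, here’s a minimal, textbook-style sketch of a Hamming(7,4) single-error-correcting code – purely illustrative, and not how the STAR ECC compiler (which targets much wider words and stronger codes) actually works. Three parity bits cover four data bits, and the syndrome computed at read time points directly at any single flipped cell:

```python
# Textbook Hamming(7,4) single-error correction -- an illustration of ECC's
# basic idea, not the STAR ECC implementation.

def encode(d):                       # d: 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]   # codeword positions 1..7

def decode(c):                       # c: 7 stored bits, at most one flipped
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the bad bit, 0 if clean
    if syndrome:
        c[syndrome - 1] ^= 1         # correct the single-bit error
    return [c[2], c[4], c[5], c[6]]  # recover the data bits

word = [1, 0, 1, 1]
stored = encode(word)
stored[5] ^= 1                       # a soft error flips one cell...
assert decode(stored) == word        # ...and the read-side ECC repairs it
```

Production memory ECC generalizes the same idea – a classic SECDED code, for example, adds 8 check bits to a 64-bit word to correct any single-bit error and detect double-bit errors.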

It’s not just a matter of adding bits to the bytes, however. Scrambling – that is, mixing up which physical cells belong to which logical bytes – becomes more important so that, when multiple cells are disturbed by a single event, those cells will be less likely to cohabitate within a single byte or word. Multiple bytes with single-bit errors are easier to correct than a single byte with multiple-bit errors. The STAR ECC compiler allows you to enter information about the physical cell adjacencies using a language called MASIS; this lets the tool figure out how to scramble the bits.
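Here’s a toy illustration of the interleaving idea (the word count and layout below are invented; the real compiler derives its scrambling from the MASIS adjacency description). Bits from several logical words are spread across a physical row, so a single event that upsets a run of adjacent cells leaves each word with at most one bad bit:

```python
# Illustrative bit interleaving: adjacent physical cells belong to different
# logical words, so a multi-cell upset becomes several single-bit errors.

WORDS = 4          # logical words interleaved per physical row (hypothetical)
WIDTH = 8          # bits per logical word

def interleave(words):
    """Lay out bit i of every word side by side: w0[i], w1[i], w2[i], w3[i], ..."""
    return [words[k][i] for i in range(WIDTH) for k in range(WORDS)]

def deinterleave(row):
    """Recover the logical words from a physical row."""
    return [row[k::WORDS] for k in range(WORDS)]

originals = [[(k + i) % 2 for i in range(WIDTH)] for k in range(WORDS)]
row = interleave(originals)

for cell in (9, 10, 11):             # one event upsets three adjacent cells...
    row[cell] ^= 1

errors = [sum(a != b for a, b in zip(w, o))
          for w, o in zip(deinterleave(row), originals)]
print(errors)                        # [0, 1, 1, 1] -- no word has a multi-bit error
```

Each affected word now carries only a single-bit error, which the per-word correction code can fix; without the interleaving, all three flipped cells would have landed in one word and overwhelmed it.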

Using this methodology, you start by designing your memory; then you use the STAR ECC compiler to create the field-correction redundancy. You can then insert the STAR wrapper for handling manufacturing and testing issues.

Meanwhile, in a completely different corner of the continent, another company is taking a completely different approach to ECC – and to logic. As startup Lyric Semiconductor’s Dave Reynolds says, “2+2=4 is a solved problem.” There are numerous other problems where there isn’t necessarily a right or wrong answer – there’s a “most likely” answer (with, of course, a possibility of error).

Lyric refers to this as “probability processing.” They point out a number of applications – gene sequencing, search, and ECC to name some – that are essentially probability problems to be solved.

And, while they have a general-purpose probability processor in the works for the future, ECC is the first problem they are attacking: a purpose-built piece of IP for high-end error correction (for things like solid-state disks), intended to compete with traditional digital approaches.

And, while they are not disclosing details about their circuits at this point, they are decidedly not digital. In fact, in the general case, they take an analog input and provide a digital output. They don’t use “bits”; they use “pbits” (probability bits). They don’t use Boolean gates; they use Bayesian gates. They describe the pbits as flowing multi-directionally through the Bayesian gates such that every variable talks to every other variable. The effect is that of highly parallel computing.

Each cell resolves its input to a 1 or 0 in an iterative fashion; each iteration takes 3-8 cycles. While this introduces some latency, as long as the latency is less than the time it takes to fetch the next word, it is hidden.
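Lyric isn’t saying how those circuits actually work, so the following is emphatically not their design – just a generic sketch of the iterative, parity-check-driven style of decoding that the “every variable talks to every other variable” description brings to mind. This toy “bit-flipping” decoder repeatedly flips whichever bit is implicated in the most failing parity checks (here, the checks of the Hamming code sketched earlier) until the word settles:

```python
# A generic iterative "bit-flipping" decoder over a Hamming(7,4) parity-check
# matrix -- an illustration of iterative decoding, not Lyric's (undisclosed) circuit.

H = [                                    # each row lists the bit indices it checks
    [0, 2, 4, 6],
    [1, 2, 5, 6],
    [3, 4, 5, 6],
]

def decode(bits, iterations=8):
    bits = list(bits)
    for _ in range(iterations):
        failing = [chk for chk in H if sum(bits[i] for i in chk) % 2]
        if not failing:
            return bits                  # every parity check is satisfied
        # Flip the bit that participates in the most failing checks.
        votes = [sum(i in chk for chk in failing) for i in range(len(bits))]
        bits[votes.index(max(votes))] ^= 1
    return bits

received = [0, 1, 1, 0, 0, 0, 1]         # the codeword [0,1,1,0,0,1,1] with bit 5 flipped
print(decode(received))                  # -> [0, 1, 1, 0, 0, 1, 1]
```

Real soft-decision decoders pass probabilities rather than hard bits between the checks and the variables – which is closer in spirit to what pbits and Bayesian gates suggest – but the iterate-until-it-settles flavor is the same.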

To write programs for this kind of processor, they have developed a language they call PSBL (Probability Synthesis to Bayesian Logic). Judging from some snippets of code, it appears rather opaque, but they acknowledge this initial impression and say that, once you’ve learned the syntax, it’s actually easy to program in.

They provide some comparisons to the standard digital approach (for an ECC of similar quality) to drive home the benefits they see. From a circuit size standpoint, they claim to be 30X smaller at 1 Gbps and 70X smaller at 6 Gbps. They claim that power is reduced by 12X. (And yes, for any pedants, I’m aware of the theoretically confusing concept of “nX smaller”… file it under “Y” for “You know what I mean.”)

The other comparison they provide is that their circuit at 180 nm is smaller and lower-power than the equivalent digital circuit at 45 nm.

And so we have two very different ways of approaching ECC; it will presumably be some time before a winner is decided. But, based on these efforts, we may have to provide a variation on our opening theme (with apologies to Pope and the Romans), as “To err is too much of a pain in the butt; to correct is essential.”

Yeah, you’re right… not sure they’ll be quoting that one centuries from now…

 

More info:

Synopsys DesignWare STAR ECC

Lyric Semiconductor
