
To Err Is Universal

Listening to Alexander Pope (and even Seneca and Cicero long before), you would think that erring is a particularly human trait. Taken out of the original context (which places human fallibility in contrast to the divinity of forgiveness or the diabolicalness of persisting in erring), you might feel singled out by nature as a particularly unreliable entity.

In fact, we are surrounded by errors. They happen all the time. We are but a wee part of a huge, interconnected, error-prone system. Our very existence is a testament to the ability of living forms to survive any number of errors that might occur. The more we study genetics and the geological history of our planet, the more we learn about the range of redundant systems and adaptations that have evolved to bring us this far. (Or, looked at from an anthropic perspective, that had to evolve to get us this far.)

With a few exceptions, however, the machines we humans create have no such adaptability. Whereas the natural world abides variation and randomness, we create deterministic systems that assume a relatively narrow range of operating conditions such that a given input will guarantee a correct output.

And, so far, this has worked pretty well. But, in reality, what we’re doing is approximating the real world and, basically, rounding off the error as insignificant. The problem is, this error is becoming more significant.

Reliability engineers talk in terms of Failures In Time (FITs); one FIT equals one failure in 10⁹ hours – roughly 115,000 years. Long enough that a lawyer will have a hard time pinning a failure on you and penalizing your progeny.

So how many failures can we tolerate? As described by Synopsys’s Yervant Zorian, for most of the world, this has been about 1000 FITs – especially for networking. But for certain critical areas, it’s much lower: medical systems require 50-FIT levels, and automotive demands a minuscule 0.1 FITs. If you wonder why medical isn’t the lowest, I suppose we can interpret the lower automotive rate as the proverbial ounce of prevention to medical’s pound of cure.

Note that these are system failure rates. The system fails when any one (or more) of its components fails. And the one component that has gotten more attention than any other has been the memory.
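
For the arithmetically inclined, here’s a back-of-the-envelope sketch of how that roll-up works – the component list and FIT values are invented for illustration, not taken from anyone’s datasheet:

```python
# Back-of-the-envelope FIT arithmetic.
# 1 FIT = 1 failure per 10**9 device-hours.
HOURS_PER_YEAR = 24 * 365.25

# Hypothetical per-component FIT contributions for one system.
component_fits = {
    "SoC logic":  120.0,
    "DRAM":       600.0,
    "flash":      150.0,
    "power/misc": 130.0,
}

# For a series system (any component failure fails the system),
# the component failure rates simply add.
system_fit = sum(component_fits.values())   # failures per 10**9 hours
mtbf_hours = 1e9 / system_fit               # mean time between failures
mtbf_years = mtbf_hours / HOURS_PER_YEAR

print(f"System FIT: {system_fit:.0f}")
print(f"MTBF: {mtbf_hours:,.0f} hours (about {mtbf_years:,.0f} years)")
# 1000 FITs sounds tiny (an MTBF of roughly 114 years per unit), but ship a
# million units and you can expect on the order of 8,800 failures per year.
```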

I suppose you could debate whether memory is more important than any other component, although if we think about it in our own terms, losing our own memories can eventually be more debilitating than the loss of some other functions (with the exception of the ability to forget that particular night during that particular pub crawl; such loss can perhaps be considered yet another of nature’s error-coping mechanisms: in this case, one of mercy).

The use of an error-correcting code (ECC) in memory is nothing new, and there’s a fair bit of IP available. But much of the focus of ECC has been for handling manufacturing errors and, typically, single-bit errors. What’s changing is that soft errors in the field, while always possible, are becoming more of an issue, and they affect more than one bit at a time. There are three contributors to this:

  • voltages are coming down, reducing the level of margin;
  • the amount of charge stored in a memory cell is lower, meaning less energy is required to upset the cell;
  • and the size of cells is smaller, meaning that an alpha particle can disturb more than one cell at a time.

Couple this with the huge increases in memory going into systems, and it starts to get very hard to keep FIT rates within manageable levels.
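
A rough sketch of that scaling problem, using a purely hypothetical per-megabit soft-error rate (real numbers depend heavily on process, voltage, altitude, and cell design), shows how quickly raw memory blows through a FIT budget:

```python
# Total soft-error FIT grows roughly linearly with the amount of memory.
# The per-Mbit rate below is a placeholder for illustration only.
FIT_PER_MBIT_RAW = 500.0   # hypothetical uncorrected soft-error FIT per Mbit

def raw_memory_fit(total_mbits: float) -> float:
    """Uncorrected soft-error FIT for a given amount of embedded memory."""
    return FIT_PER_MBIT_RAW * total_mbits

for mbits in (1, 16, 64, 256):
    print(f"{mbits:4d} Mbit -> {raw_memory_fit(mbits):9,.0f} FIT "
          f"(budgets: 1000 networking, 50 medical, 0.1 automotive)")

# Even a modest 64 Mbit of on-chip SRAM lands at ~32,000 FIT uncorrected,
# orders of magnitude above an automotive budget of 0.1 FIT, which is why
# ECC (and how efficiently it is implemented) matters so much.
```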

Synopsys’s approach to this is manifest in the form of their newly released DesignWare STAR ECC product (courtesy of their acquisition of Virage). This is essentially a memory compiler that allows engineers to build in a level of correction through additional bits and proper coding and then evaluate the FIT rates that would result. The goal is to achieve the desired FIT rate using the smallest area possible.
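
To get a feel for the area side of that trade-off, here’s the textbook Hamming/SECDED relationship between word width and check bits – generic coding math, not a description of what the STAR ECC compiler actually generates:

```python
# Check-bit overhead for a SECDED (single-error-correct, double-error-detect)
# Hamming code: r check bits are needed such that 2**r >= k + r + 1, plus one
# extra overall parity bit for double-error detection.

def secded_check_bits(data_bits: int) -> int:
    r = 1
    while 2**r < data_bits + r + 1:
        r += 1
    return r + 1   # +1 overall parity bit for double-error detection

for k in (8, 16, 32, 64, 128):
    c = secded_check_bits(k)
    print(f"{k:3d} data bits -> {c:2d} check bits ({100 * c / k:5.1f}% overhead)")

# Wider words amortize the check bits better (64 data bits need only 8 check
# bits, or 12.5% overhead), which is one reason word organization is a key
# knob when chasing a FIT target in the smallest area.
```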

It’s not just a matter of adding bits to the bytes, however. Scrambling – that is, mixing up which physical cells belong to which logical bytes – becomes more important so that, when multiple cells are disturbed by a single event, those cells will be less likely to cohabitate within a single byte or word. Multiple bytes with single-bit errors are easier to correct than a single byte with multiple-bit errors. The STAR ECC compiler allows you to enter information about the physical cell adjacencies using a language called MASIS; this lets the tool figure out how to scramble the bits.
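
Here’s a minimal illustration of the idea – a simple round-robin column interleave, not the adjacency-aware scrambling the tool derives from a MASIS description:

```python
# Toy bit-scrambling example: interleave logical words across physical columns
# so that physically adjacent cells belong to different logical words.

WORDS = 4        # logical words sharing one physical row
WORD_BITS = 8    # bits per logical word (32 physical columns total)

def physical_to_logical(col: int) -> tuple[int, int]:
    """Map a physical column to (logical word index, bit position)."""
    return col % WORDS, col // WORDS

# A single particle strike upsets a run of adjacent physical columns.
upset_columns = [12, 13, 14]

hits: dict[int, list[int]] = {}
for col in upset_columns:
    word, bit = physical_to_logical(col)
    hits.setdefault(word, []).append(bit)

for word, bits in sorted(hits.items()):
    print(f"word {word}: {len(bits)} flipped bit(s) at position(s) {bits}")

# With interleaving, three adjacent upsets land in three different words, one
# bit each, which single-bit-correcting ECC handles easily. Without it, all
# three would pile into one word and overwhelm a SECDED code.
```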

Using this methodology, you start by designing your memory; then you use the STAR ECC compiler to create the field-correction redundancy. You can then insert the STAR wrapper for handling manufacturing and testing issues.

Meanwhile, in a completely different corner of the continent, another company is taking a completely different approach to ECC – and to logic. As startup Lyric Semiconductor’s Dave Reynolds says, “2+2=4 is a solved problem.” There are numerous other problems where there isn’t necessarily a right or wrong answer – there’s a “most likely” answer (with, of course, a possibility of error).

Lyric refers to this as “probability processing.” They point out a number of applications – gene sequencing, search, and ECC to name some – that are essentially probability problems to be solved.

And, while they have a general-purpose probability processor in the works for the future, the first problem they are attacking is ECC, with a purpose-built piece of IP for high-end applications (things like solid-state disks) intended to compete with traditional digital approaches.

And, while they are not disclosing details about their circuits at this point, those circuits are decidedly not digital. In fact, in the general case, they take an analog input and provide a digital output. They don’t use “bits”; they use “pbits” (probability bits). They don’t use Boolean gates; they use Bayesian gates. They describe the pbits as flowing multi-directionally through the Bayesian gates such that every variable talks to every other variable. The effect is that of highly parallel computing.

Each cell resolves its input to a 1 or 0 in an iterative fashion; each iteration takes 3-8 cycles. While this introduces some latency, as long as the latency is less than the time it takes to fetch the next word, it is hidden.
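
For a flavor of what computing with probabilities instead of hard bits looks like, here’s a tiny generic soft-decision example – standard belief-propagation arithmetic, and emphatically not a peek at Lyric’s gates or their PSBL language: three noisy bits known to satisfy even parity, each re-estimated from what the other two “believe”:

```python
# Toy soft-decision decoding: three bits constrained to even parity
# (x0 XOR x1 XOR x2 == 0), each observed through a noisy channel as a
# probability of being 1. Generic probability math, purely illustrative.

channel_p1 = [0.1, 0.55, 0.2]   # P(bit == 1) from the channel alone

def extrinsic(a: float, b: float) -> float:
    """P(this bit == 1) implied by the parity check, given the other two."""
    return a * (1 - b) + (1 - a) * b   # P(the other two XOR to 1)

posterior = []
for i, prior in enumerate(channel_p1):
    a, b = [p for j, p in enumerate(channel_p1) if j != i]
    ext = extrinsic(a, b)
    # Combine the channel evidence with the check's "opinion", then normalize.
    num = prior * ext
    posterior.append(num / (num + (1 - prior) * (1 - ext)))

decoded = [1 if p > 0.5 else 0 for p in posterior]
print("posterior P(bit == 1):", [round(p, 3) for p in posterior])
print("decoded word:         ", decoded)

# The ambiguous 0.55 estimate for the middle bit gets pulled down to about
# 0.30 by the other two bits' stronger evidence: every variable effectively
# talks to every other variable through the constraint.
```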

To write programs for this kind of processor, they have developed a language they call PSBL (Probability Synthesis to Bayesian Logic). The snippets of code I’ve seen appear rather opaque, but they acknowledge that initial impression and say that, once you’ve learned the syntax, it’s actually easy to program in.

They provide some comparisons to the standard digital approach (for an ECC of similar quality) to drive home the benefits they see. From a circuit size standpoint, they claim to be 30X smaller at 1 Gbps and 70X smaller at 6 Gbps. They claim that power is reduced by 12X. (And yes, for any pedants, I’m aware of the theoretically confusing concept of “nX smaller”… file it under “Y” for “You know what I mean.”)

The other comparison they provide is that their circuit at 180 nm is smaller and lower-power than the equivalent digital circuit at 45 nm.

And so we have two very different ways of approaching ECC; it will presumably be some time before a winner is decided. But, based on these efforts, we may have to provide a variation on our opening theme (with apologies to Pope and the Romans), as “To err is too much of a pain in the butt; to correct is essential.”

Yeah, you’re right… not sure they’ll be quoting that one centuries from now…

 

More info:

Synopsys DesignWare STAR ECC

Lyric Semiconductor

