
To Err Is Universal

Listening to Alexander Pope (and, long before him, Seneca and Cicero), you would think that erring is a peculiarly human trait. Taken out of its original context (which sets human fallibility against the divinity of forgiveness or the devilishness of persisting in error), you might feel singled out by nature as a particularly unreliable entity.

In fact, we are surrounded by errors. They happen all the time. We are but a wee part of a huge, interconnected, error-prone system. Our very existence is a testament to the ability of living forms to survive any number of errors that might occur. The more we study genetics and the geological history of our planet, the more we learn about the range of redundant systems and adaptations that have evolved to bring us this far. (Or, looked at from an anthropic perspective, that had to evolve to get us this far.)

With a few exceptions, however, the machines we humans create have no such adaptability. Whereas the natural world abides variation and randomness, we create deterministic systems that assume a relatively narrow range of operating conditions such that a given input will guarantee a correct output.

And, so far, this has worked pretty well. But, in reality, what we’re doing is approximating the real world and, basically, rounding off the error as insignificant. The problem is, this error is becoming more significant.

Reliability engineers talk in terms of Failures In Time (FITs); one FIT equals one failure in 10⁹ device-hours – roughly 115,000 years. Long enough that a lawyer will have a hard time pinning a failure on you and penalizing your progeny.

So how many failures can we tolerate? As described by Synopsys’s Yervant Zorian, for most of the world, this has been about 1000 FITs – especially for networking. But for certain critical areas, it’s much lower: medical systems require 50-FIT levels, and automotive demands a minuscule 0.1 FITs. If you wonder why medical isn’t the lowest, I suppose we can interpret the lower automotive rate as the proverbial ounce of prevention to medical’s pound of cure.
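To put those budgets in perspective, converting a FIT rate into a mean time between failures is just arithmetic; here’s a quick back-of-the-envelope sketch using only the numbers quoted above:

# Back-of-the-envelope conversion from FIT budgets to mean time between failures.
# 1 FIT = 1 failure per 10**9 device-hours, so MTBF = 10**9 / FIT hours.

HOURS_PER_YEAR = 8766  # average year, leap years included

fit_budgets = {
    "networking (typical)": 1000,
    "medical": 50,
    "automotive": 0.1,
}

for domain, fit in fit_budgets.items():
    mtbf_years = (1e9 / fit) / HOURS_PER_YEAR
    print(f"{domain}: {fit} FIT -> MTBF ~ {mtbf_years:,.0f} years")

A 1000-FIT part averages a failure roughly once a century per unit; the 0.1-FIT automotive budget pushes that past a million years.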

Note that these are system failure rates. The system fails when any one (or more) of its components fails. And the one component that has gotten more attention than any other has been the memory.
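Since any single component failure takes the whole system down, component FIT rates effectively add up (the usual series-reliability assumption of independent failures). A tiny sketch – the component numbers here are made up purely for illustration:

# Series-reliability view: if any component failing fails the system,
# the system FIT rate is roughly the sum of the component FIT rates.
# These component values are invented for illustration only.

components_fit = {
    "logic": 80,
    "embedded memory (unprotected)": 600,
    "I/O and analog": 40,
}

system_fit = sum(components_fit.values())
print(f"system FIT ~ {system_fit}")  # 720 -- and the memory dominates the total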

I suppose you could debate whether memory is more important than any other component, although if we think about it in our own terms, losing our own memories can eventually be more debilitating than the loss of some other functions (with the exception of the ability to forget that particular night during that particular pub crawl; such loss can perhaps be considered yet another of nature’s error-coping mechanisms: in this case, one of mercy).

The use of an error-correcting code (ECC) in memory is nothing new, and there’s a fair bit of IP available. But much of the focus of ECC has been for handling manufacturing errors and, typically, single-bit errors. What’s changing is that soft errors in the field, while always possible, are becoming more of an issue, and they affect more than one bit at a time. There are three contributors to this:

  • voltages are coming down, reducing margins;
  • the charge stored in a memory cell is smaller, so less energy is needed to upset the cell;
  • and cells themselves are smaller, so a single alpha particle can disturb more than one cell at a time.

Couple this with the huge increases in memory going into systems, and it starts to get very hard to keep FIT rates within manageable levels.
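To get a feel for the scale of the problem, multiply a raw per-megabit soft-error rate by the amount of memory on a chip and compare it with the budgets above. The per-megabit figure below is an assumed placeholder for illustration, not a number from Synopsys or anyone else:

# Illustrative only: how raw (unprotected) soft-error FIT scales with memory size.
# The per-Mbit rate is an assumed placeholder, not a measured or vendor-quoted figure.

RAW_SER_FIT_PER_MBIT = 500   # assumed raw soft-error rate, in FIT per megabit
FIT_BUDGET = 1000            # the "most of the world" system budget quoted above

for mbits in (1, 64, 512, 4096):            # total on-chip memory in megabits
    raw_fit = RAW_SER_FIT_PER_MBIT * mbits  # unprotected memory contribution
    print(f"{mbits:5d} Mbit -> {raw_fit:>9,} FIT "
          f"({raw_fit / FIT_BUDGET:.1f}x the 1000-FIT budget)")

Even with a modest assumed rate, a few hundred megabits of unprotected memory blow past the budget by orders of magnitude – hence the push for field-level correction.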

Synopsys’s approach to this takes the form of their newly released DesignWare STAR ECC product (courtesy of their acquisition of Virage). It’s essentially a memory compiler that lets engineers build in a level of correction through additional bits and proper coding and then evaluate the resulting FIT rates. The goal is to achieve the desired FIT rate in the smallest area possible.

It’s not just a matter of adding bits to the bytes, however. Scrambling – that is, mixing up which physical cells belong to which logical bytes – becomes more important so that, when multiple cells are disturbed by a single event, those cells will be less likely to cohabitate within a single byte or word. Multiple bytes with single-bit errors are easier to correct than a single byte with multiple-bit errors. The STAR ECC compiler allows you to enter information about the physical cell adjacencies using a language called MASIS; this lets the tool figure out how to scramble the bits.
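The scrambling idea is easy to see with a toy model. The sketch below is not the STAR ECC algorithm or the MASIS format – just a minimal illustration of interleaving, assuming one physical row shared by four logical words, showing how a burst of adjacent upsets lands as single-bit errors in separate words (each of which a per-word single-error-correcting code could then repair):

# Toy illustration of bit scrambling / interleaving -- not the STAR ECC algorithm.
# Physically adjacent cells are assigned to different logical words, so a
# multi-cell upset becomes several single-bit errors rather than one multi-bit error.

WORD_BITS = 8
INTERLEAVE = 4                     # number of logical words sharing one physical row
ROW_BITS = WORD_BITS * INTERLEAVE

def physical_to_logical(cell: int) -> tuple[int, int]:
    """Map a physical cell index to (logical word, bit position within that word)."""
    return cell % INTERLEAVE, cell // INTERLEAVE

# Simulate an alpha-particle strike flipping four adjacent physical cells.
upset_cells = [10, 11, 12, 13]
assert all(c < ROW_BITS for c in upset_cells)

errors_per_word: dict[int, int] = {}
for cell in upset_cells:
    word, bit = physical_to_logical(cell)
    errors_per_word[word] = errors_per_word.get(word, 0) + 1
    print(f"physical cell {cell:2d} -> logical word {word}, bit {bit}")

# Each logical word sees at most one flipped bit, which per-word
# single-error correction can repair.
assert max(errors_per_word.values()) == 1

With cells 10 through 13 hit, the errors land in words 2, 3, 0, and 1 – one apiece – which is exactly the kind of outcome the physical-adjacency information in MASIS lets the tool engineer for.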

Using this methodology, you start by designing your memory; then you use the STAR ECC compiler to create the field-correction redundancy. You can then insert the STAR wrapper for handling manufacturing and testing issues.

Meanwhile, in a completely different corner of the continent, another company is taking a completely different approach to ECC – and to logic. As startup Lyric Semiconductor’s Dave Reynolds says, “2+2=4 is a solved problem.” There are numerous other problems where there isn’t necessarily a right or wrong answer – there’s a “most likely” answer (with, of course, a possibility of error).

Lyric refers to this as “probability processing.” They point out a number of applications – gene sequencing, search, and ECC to name some – that are essentially probability problems to be solved.

And, while they have a general-purpose probability processor in the works for the future, ECC is the first problem they are attacking with a purpose-built piece of IP for handling high-end ECC (for things like solid-state disks) to compete with traditional digital approaches.

And, while they are not disclosing details about their circuits at this point, they are decidedly not digital. In fact, in the general case, they take an analog input and provide a digital output. They don’t use “bits”; they use “pbits” (probability bits). They don’t use Boolean gates; they use Bayesian gates. They describe the pbits as flowing multi-directionally through the Bayesian gates such that every variable talks to every other variable. The effect is that of highly parallel computing.

Each cell resolves its input to a 1 or 0 in an iterative fashion; each iteration takes 3-8 cycles. While this introduces some latency, as long as the latency is less than the time it takes to fetch the next word, it is hidden.
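Lyric isn’t saying how the Bayesian gates are built, but the flavor of the computation can be sketched in ordinary software. Below is a minimal, conventional soft-decision update for a single even-parity constraint – the kind of step an iterative ECC decoder repeats – offered as a conceptual stand-in, not as Lyric’s circuit or their PSBL language:

# Conceptual sketch of probability-based ECC decoding -- not Lyric's implementation.
# Each bit carries a probability of being 1 (a "pbit" in spirit). An even-parity
# constraint (XOR of all bits must be 0) feeds belief back to each bit, which is
# then combined with the bit's own channel evidence. Iterative decoders repeat
# updates like this until the bits resolve to 0 or 1.

from math import prod

def parity_extrinsic(p_one: list[float], i: int) -> float:
    """P(bit i = 1) implied by the even-parity constraint over the other bits."""
    others = [p for j, p in enumerate(p_one) if j != i]
    # Standard identity: P(XOR of the others = 1) = (1 - prod(1 - 2p)) / 2
    return (1 - prod(1 - 2 * p for p in others)) / 2

def combine(intrinsic: float, extrinsic: float) -> float:
    """Bayes-combine the bit's own evidence with the constraint's message."""
    one = intrinsic * extrinsic
    zero = (1 - intrinsic) * (1 - extrinsic)
    return one / (one + zero)

# Noisy channel estimates for four bits that should satisfy even parity.
p_one = [0.9, 0.2, 0.55, 0.1]  # bit 2 is on the fence; parity nudges it toward 1

for i, p in enumerate(p_one):
    ext = parity_extrinsic(p_one, i)
    print(f"bit {i}: channel {p:.2f} -> posterior {combine(p, ext):.2f}")

Run once, the on-the-fence bit moves from 0.55 to roughly 0.73; repeat the exchange and the values keep moving toward a consistent set of hard 0s and 1s, which is essentially what the iterative resolution described above is doing.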

To write programs for this kind of processor, they have developed a language they call PSBL (Probability Synthesis to Bayesian Logic). Looking at some snippets of code, it appears rather opaque, but they acknowledge this initial impression and say that, once you’ve learned the syntax, it’s actually easy to program in.

They provide some comparisons to the standard digital approach (for an ECC of similar quality) to drive home the benefits they see. From a circuit size standpoint, they claim to be 30X smaller at 1 Gbps and 70X smaller at 6 Gbps. They claim that power is reduced by 12X. (And yes, for any pedants, I’m aware of the theoretically confusing concept of “nX smaller”… file it under “Y” for “You know what I mean.”)

The other comparison they provide is that their circuit at 180 nm is smaller and lower-power than the equivalent digital circuit at 45 nm.

And so we have two very different ways of approaching ECC; it will presumably be some time before a winner is decided. But, based on these efforts, we may have to provide a variation on our opening theme (with apologies to Pope and the Romans), as “To err is too much of a pain in the butt; to correct is essential.”

Yeah, you’re right… not sure they’ll be quoting that one centuries from now…

 

More info:

Synopsys DesignWare STAR ECC

Lyric Semiconductor
