
To Err Is Universal

Listening to Alexander Pope (and even Seneca and Cicero long before), you would think that erring is a particularly human trait. Taken out of the original context (which places human fallibility in contrast to the divinity of forgiveness or the diabolicalness of persisting in erring), you might feel singled out by nature as a particularly unreliable entity.

In fact, we are surrounded by errors. They happen all the time. We are but a wee part of a huge, interconnected, error-prone system. Our very existence is a testament to the ability of living forms to survive any number of errors that might occur. The more we study genetics and the geological history of our planet, the more we learn about the range of redundant systems and adaptations that have evolved to bring us this far. (Or, looked at from an anthropic perspective, that had to evolve to get us this far.)

With a few exceptions, however, the machines we humans create have no such adaptability. Whereas the natural world abides variation and randomness, we create deterministic systems that assume a relatively narrow range of operating conditions such that a given input will guarantee a correct output.

And, so far, this has worked pretty well. But, in reality, what we’re doing is approximating the real world and, basically, rounding off the error as insignificant. The problem is, this error is becoming more significant.

Reliability engineers will talk in terms of Failures In Time (FITs); one FIT equals one failure in 10^9 hours – roughly 114,000 years. Long enough that a lawyer will have a hard time pinning a failure on you and penalizing your progeny.
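For anyone who wants to check the arithmetic, here is a minimal sketch of the FIT-to-years conversion. The function name and the 365.25-day year are my own choices for illustration; the only thing taken from the definition is that one FIT is one failure per billion device-hours.

```python
HOURS_PER_YEAR = 24 * 365.25  # assumption: average year length including leap days

def fit_to_mtbf_years(fit):
    """Mean time between failures, in years, for a given FIT rate.

    One FIT is one failure per 1e9 hours, so the mean time between
    failures in hours is 1e9 / fit.
    """
    if fit <= 0:
        raise ValueError("FIT rate must be positive")
    return (1e9 / fit) / HOURS_PER_YEAR

if __name__ == "__main__":
    for fit in (1, 1000):
        print(f"{fit} FIT -> about {fit_to_mtbf_years(fit):,.0f} years between failures")
```

One FIT works out to about 114,000 years between failures; at 1000 FITs, that shrinks to a little over a century.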

So how many failures can we tolerate? As described by Synopsys’s Yervant Zorian, for most of the world, this has been about 1000 FITs – especially for networking. But for certain critical areas, it’s much lower: medical systems require 50-FIT levels, and automotive demands a minuscule 0.1 FITs. If you wonder why medical isn’t the lowest, I suppose we can interpret the lower automotive rate as the proverbial ounce of prevention to medical’s pound of cure.

Note that these are system failure rates. The system fails when any one (or more) of its components fails. And the one component that has gotten more attention than any other has been the memory.
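Because the system fails when any component fails, component FIT rates simply add up (assuming independent failures at constant rates). The sketch below illustrates that bookkeeping. The component names and their FIT values are made up for illustration; only the three budget figures come from the numbers quoted above.

```python
def system_fit(component_fits):
    """FIT rate of a system that fails when any one component fails.

    Assumes independent components with constant failure rates,
    in which case the rates add.
    """
    return sum(component_fits.values())

# Hypothetical component FIT rates, for illustration only.
components = {"cpu": 20.0, "memory": 60.0, "io": 10.0, "power": 5.0}

# System-level budgets quoted in the article.
budgets = {"networking": 1000.0, "medical": 50.0, "automotive": 0.1}

def budgets_met(total_fit, budget_table):
    """Return the names of the budgets that a given system FIT rate satisfies."""
    return [name for name, limit in budget_table.items() if total_fit <= limit]

if __name__ == "__main__":
    total = system_fit(components)
    print(f"system FIT: {total}")
    print(f"budgets met: {budgets_met(total, budgets)}")
```

In this invented example the memory alone would blow the medical budget, which is the kind of situation that pushes designers toward stronger correction.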

I suppose you could debate whether memory is more important than any other component, although if we think about it in our own terms, losing our own memories can eventually be more debilitating than the loss of some other functions (with the exception of the ability to forget that particular night during that particular pub crawl; such loss can perhaps be considered yet another of nature’s error-coping mechanisms: in this case, one of mercy).

The use of an error-correcting code (ECC) in memory is nothing new, and there’s a fair bit of IP available. But much of the focus of ECC has been on handling manufacturing errors and, typically, single-bit errors. What’s changing is that soft errors in the field, while always possible, are becoming more of an issue, and they affect more than one bit at a time. There are three contributors to this:

  • voltages are coming down, reducing the level of margin;
  • the amount of charge stored in a memory cell is lower, meaning less energy is required to upset the cell;
  • and the size of cells is smaller, meaning that an alpha particle can disturb more than one cell at a time.

Couple this with the huge increases in memory going into systems, and it starts to get very hard to keep FIT rates within manageable levels.
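To see why multi-bit upsets are a different animal from single-bit ones, consider the textbook Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. This is a minimal sketch for illustration, not the code used by any product mentioned here; the function names are my own.

```python
from itertools import combinations

def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def decode(codeword):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the suspected error
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

def flip(codeword, positions):
    """Return a copy of the codeword with the given 0-based positions inverted."""
    c = list(codeword)
    for p in positions:
        c[p] ^= 1
    return c

if __name__ == "__main__":
    data = [1, 0, 1, 1]
    cw = encode(data)
    single_ok = all(decode(flip(cw, [i])) == data for i in range(7))
    double_ok = any(decode(flip(cw, pair)) == data
                    for pair in combinations(range(7), 2))
    print("every single-bit error corrected:", single_ok)
    print("any double-bit error corrected:", double_ok)
```

Every single-bit error is repaired, but every double-bit error is "corrected" into the wrong data, silently. That is why one particle disturbing two cells of the same word is so much worse than two particles disturbing one cell each in different words.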

Synopsys’s approach to this is manifest in the form of their newly released DesignWare STAR ECC product (courtesy of their acquisition of Virage). This is essentially a memory compiler that allows engineers to build in a level of correction through additional bits and proper coding and then evaluate the FIT rates that would result. The goal is to achieve the desired FIT rate using the smallest area possible.

It’s not just a matter of adding bits to the bytes, however. Scrambling – that is, mixing up which physical cells belong to which logical bytes – becomes more important so that, when multiple cells are disturbed by a single event, those cells will be less likely to cohabitate within a single byte or word. Multiple bytes with single-bit errors are easier to correct than a single byte with multiple-bit errors. The STAR ECC compiler allows you to enter information about the physical cell adjacencies using a language called MASIS; this lets the tool figure out how to scramble the bits.
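The effect of scrambling can be sketched with a toy model: one physical row of cells, shared by several words, with a single event upsetting a few adjacent cells. Everything here is an assumption made for illustration (the simple modulo interleave, the row geometry, the function names); it is not MASIS syntax and not the algorithm the STAR ECC compiler actually uses.

```python
from collections import Counter

def word_of(col, bits_per_word, n_words, interleaved):
    """Map a physical column in the row to the logical word that owns it."""
    if interleaved:
        return col % n_words      # neighbouring cells belong to different words
    return col // bits_per_word   # neighbouring cells belong to the same word

def errors_per_word(start, width, bits_per_word, n_words, interleaved):
    """Count flipped bits per word when `width` adjacent cells are upset."""
    hits = Counter(
        word_of(col, bits_per_word, n_words, interleaved)
        for col in range(start, start + width)
    )
    return dict(hits)

if __name__ == "__main__":
    # A row of 32 cells holding 4 words of 8 bits; one event upsets cells 9, 10, 11.
    for interleaved in (False, True):
        result = errors_per_word(9, 3, bits_per_word=8, n_words=4,
                                 interleaved=interleaved)
        label = "interleaved" if interleaved else "contiguous"
        print(f"{label}: errors per word = {result}")
```

With contiguous storage, all three upsets land in one word; with the interleave, three different words each take a single-bit hit, which a single-error-correcting code can handle.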

Using this methodology, you start by designing your memory; then you use the STAR ECC compiler to create the field-correction redundancy. You can then insert the STAR wrapper for handling manufacturing and testing issues.

Meanwhile, in a completely different corner of the continent, another company is taking a completely different approach to ECC – and to logic. As startup Lyric Semiconductor’s Dave Reynolds says, “2+2=4 is a solved problem.” There are numerous other problems where there isn’t necessarily a right or wrong answer – there’s a “most likely” answer (with, of course, a possibility of error).

Lyric refers to this as “probability processing.” They point out a number of applications – gene sequencing, search, and ECC to name some – that are essentially probability problems to be solved.

And, while they have a general-purpose probability processor in the works for the future, ECC is the first problem they are attacking, with a purpose-built piece of IP for high-end ECC (for things like solid-state disks) that is meant to compete with traditional digital approaches.

And, while they are not disclosing details about their circuits at this point, those circuits are decidedly not digital. In fact, in the general case, they take an analog input and provide a digital output. They don’t use “bits”; they use “pbits” (probability bits). They don’t use Boolean gates; they use Bayesian gates. They describe the pbits as flowing multi-directionally through the Bayesian gates such that every variable talks to every other variable. The effect is that of highly parallel computing.

Each cell resolves its input to a 1 or 0 in an iterative fashion; each iteration takes 3-8 cycles. While this introduces some latency, as long as the latency is less than the time it takes to fetch the next word, it is hidden.
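Since Lyric isn’t disclosing its circuits, the best I can offer is a textbook illustration of what a gate operating on probabilities, rather than on bits, might compute. The sketch below is the standard "soft XOR" used in probabilistic decoding, where each input is the probability that a bit is 1 and the inputs are assumed independent. It is my own stand-in for the idea, not Lyric’s design, and the function name is hypothetical.

```python
def soft_xor(p_a, p_b):
    """Probability that (a XOR b) is 1, given P(a=1)=p_a and P(b=1)=p_b.

    Assumes a and b are independent. With inputs of exactly 0.0 or 1.0
    this reduces to the ordinary Boolean XOR.
    """
    return p_a * (1.0 - p_b) + (1.0 - p_a) * p_b

if __name__ == "__main__":
    print(soft_xor(1.0, 0.0))  # certain inputs behave like a Boolean gate
    print(soft_xor(0.9, 0.9))  # two likely-1 inputs: output is probably 0
    print(soft_xor(0.5, 0.9))  # one completely unknown input: output is unknown
```

Note the last case: if one input carries no information, neither does the output, no matter how confident the other input is. A Boolean gate has no way to express that.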

To write programs for this kind of processor, they have developed a language they call PSBL (Probability Synthesis to Bayesian Logic). Some snippets of the code look rather opaque, but they acknowledge this initial impression and say that, once you’ve learned the syntax, it’s actually easy to program in.

They provide some comparisons to the standard digital approach (for an ECC of similar quality) to drive home the benefits they see. From a circuit size standpoint, they claim to be 30X smaller at 1 Gbps and 70X smaller at 6 Gbps. They claim that power is reduced by 12X. (And yes, for any pedants, I’m aware of the theoretically-confusing concept of “nX smaller”… file it under “Y” for “You know what I mean.”)

The other comparison they provide is that their circuit at 180 nm is smaller and lower-power than the equivalent digital circuit at 45 nm.

And so we have two very different ways of approaching ECC; it will presumably be some time before a winner is decided. But, based on these efforts, we may have to provide a variation on our opening theme (with apologies to Pope and the Romans), as “To err is too much of a pain in the butt; to correct is essential.”

Yeah, you’re right… not sure they’ll be quoting that one centuries from now…

 

More info:

Synopsys DesignWare STAR ECC

Lyric Semiconductor
