
Avoiding Failure Analysis Paralysis

Cadence Describes the DFM-Diagnostics Link

Back when I was a product engineer working on bipolar PALs (oops – I mean, PAL® devices), one of my main activities was figuring out what was wrong. That was most of the job of a product engineer: fix what’s broken. You don’t spend any time on the stuff that’s working; you work on what isn’t. Assuming it’s a chip that’s wrong, the process would typically start with a trip into the testing area to put a part on the tester and datalog it for evidence of things going awry. Armed with that, the next job was to spread a big schematic out on a table and start looking at the circuits, figuring out what could be causing the problem. You’d come up with a couple of scenarios, and next you’d have to look inside the actual chip.

Of course, in order to look at the chip, we had to spread a big layout sheet on a table to trace out where the circuits were physically located. Then we’d know where to look. The chip would have to be decapped – I could do that myself if it was a CERDIP (ceramic packaging, where you could pop off the top); otherwise you needed to go to one of those scary guys that knew just a bit too much about chemistry (and whom you wanted to keep happy with occasional gifts of jerky or sunflower seeds) to have a hole etched in the plastic. Hopefully that was enough, and then you could go into the lab and use microscopes and microprobes and oscilloscopes and such to poke through dielectric layers, perhaps cut a metal line to get to something below, and with any luck you’d identify a problem that could be fixed. In the worst case you had to go back to Scary Guy for more delayering, or perhaps a SEM session. Or – yikes – chemical analysis. It was all seat-of-the-pants, using forensic techniques worthy of CSI – Jurassic Edition, and you let your data and observations tell you what the next step should be.

Unfortunately, a few things have changed to complicate this serene pastoral picture of the past. Start with, oh, about a thousand more pins on the chip. Shrink the features way down, and multiply the number of transistors by, oh, say, a lot. Throw on a few extra layers of metal for good measure, and, well, you gotcherself a problem.

Diagnosing failing dice and then turning the information into useful measures for improving yields on current and future circuits is no trivial matter anymore. Not only have technical issues become more thorny, but even business issues have intruded into the picture. The urgency has also grown with the focus on Design for Manufacturing (DFM), an admittedly somewhat ill-defined series of technologies for improving the manufacturability of sub-wavelength chips (and whose real benefit is still subject to debate).

Following up on a presentation at the ISQED conference, I was able to sit down with some of the folks from Cadence to get their view of what life looks like now. The process boils down to something that sounds rather straightforward and familiar: develop hypotheses about possible failure modes; gather lots of manufacturing data to support or weaken those hypotheses; and then narrow down the range of options for physical failure analysis (done by the modern-day scary guys – in the gender-neutral sense – who actually tear the stuff apart).

The challenges are partly those of scale. It’s no longer an easy matter to unroll a paper schematic onto the boardroom table. We’ve now gone paperless, and, even so, there are just too many things going on in a circuit to try to trace them by hand. That’s where tools can come in and identify, through simulation, all the logical scenarios that could contribute to the observed failure. The kinds of issues to be reviewed include not only the traditional stuck-at faults, but also timing problems. An observed behavior could originate in any of a number of logic nodes, and having the candidates automatically identified gives you a solid shortlist of suspect problems in much less time. Those candidates are Pareto-ranked by level of confidence.
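As a rough sketch of the idea – using a toy three-gate circuit and a scoring rule of my own invention, not Cadence’s actual Encounter Diagnostics algorithm – you can score each stuck-at fault hypothesis by how many of the failing tester patterns it reproduces, then rank the candidates by that confidence:

```python
# Illustrative sketch (hypothetical circuit and scoring, not a vendor flow):
# score stuck-at fault candidates by how well each reproduces the failures
# observed on the tester, then Pareto-rank them by confidence.
from itertools import product

def simulate(netlist, inputs, fault=None):
    """Evaluate a tiny combinational netlist; 'fault' pins one net to 0 or 1."""
    values = dict(inputs)
    for net, (op, srcs) in netlist.items():       # nets in topological order
        values[net] = int({"AND": all, "OR": any}[op](values[s] for s in srcs))
        if fault and fault[0] == net:
            values[net] = fault[1]                # stuck-at overrides the gate
    return values

netlist = {                                       # hypothetical 3-gate circuit
    "n1":  ("AND", ["a", "b"]),
    "n2":  ("OR",  ["b", "c"]),
    "out": ("AND", ["n1", "n2"]),
}

# Tester datalog: input pattern (a, b, c) -> observed output
observed = {(1, 1, 0): 0, (1, 1, 1): 0, (0, 1, 1): 0}

candidates = []
for net, stuck in product(netlist, (0, 1)):
    explained = sum(
        simulate(netlist, dict(zip("abc", pat)), fault=(net, stuck))["out"] == out
        for pat, out in observed.items())
    candidates.append(((net, stuck), explained / len(observed)))

# Highest-confidence hypotheses first; ties show why PFA still gets a list,
# not a single answer.
for fault, conf in sorted(candidates, key=lambda c: -c[1]):
    print(fault, f"{conf:.0%}")
```

Note that several faults (n1, n2, or out stuck at 0) explain the data equally well – which is exactly why the output is a ranked candidate list handed off for physical confirmation, not a single verdict.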

For a while, the next step was something of a problem. This kind of yield analysis is most useful in the early days of a process. But let’s face it: with masks costing what they do, you need the prospect of a huge yield increase to warrant new masks on a current product. As a result, while testing and inspection procedures for current products may benefit, many times the design improvements you learn about will apply only to future products. This kind of learning makes far more sense early in the lifetime of a given process.

But making sense out of the failures requires manufacturing data. Lots of it. And early in the life of a process, manufacturing data doesn’t look so good. And historically, fabs have been reluctant to let the data out. This wasn’t a problem before foundries were routine; you owned your own fab (as “real men” did back in the day), and you went to talk to your colleagues there. With the fab now owned by a different company, and that company not wanting to look bad compared to other foundries, there was much resistance to being open with data.

This issue is now more or less behind us; there really is no way to do solid engineering without access to manufacturing data, so that business hurdle has been cleared. Resulting in the availability of data. Lots and lots of data. Tons of data. File it under “B” for, “Be careful what you ask for.” The next challenge then becomes making sense of all of that data as it relates to the particular failure scenarios under consideration. You can now, for example, look at a wafer, or a series of wafers, to figure out where possible yield hot spots are. You can zero in on some dice of interest and look at the test and manufacturing data from those parts. The idea is to correlate the possible failure modes with actual observations to further refine the list of promising hypotheses. The once-daunting roster of things that might be wrong can be narrowed down, and the physical failure analysis folks get a more focused list of things to look at for final evidence of what the issue is.
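The wafer-level hot-spot step can be sketched in miniature – with an invented die map and coarse quadrant binning standing in for real spatial analysis – by ranking wafer zones by failure rate so the worst region points at which dice to pull for analysis:

```python
# Illustrative sketch (hypothetical wafer map, not a vendor flow): find
# spatial yield hot spots so failing dice can be pulled for diagnosis.
from collections import Counter

# Hypothetical 6x6 wafer map: (x, y) die coordinates -> pass (1) / fail (0),
# with a fail cluster planted near one corner.
wafer = {(x, y): 0 if (x, y) in {(0, 0), (0, 1), (1, 0), (1, 1)} else 1
         for x in range(6) for y in range(6)}

def zone(x, y):
    """Bucket dice into coarse quadrants; a stand-in for real spatial binning."""
    return ("left" if x < 3 else "right", "bottom" if y < 3 else "top")

fails = Counter(zone(x, y) for (x, y), ok in wafer.items() if not ok)
total = Counter(zone(x, y) for (x, y) in wafer)

# Rank zones by failure rate; a zone that stands out suggests a spatially
# correlated defect mechanism rather than random yield loss.
for z in sorted(total, key=lambda z: -fails[z] / total[z]):
    print(z, f"{fails[z] / total[z]:.0%} fail")
```

A clustered pattern like this is what lets you connect a failure hypothesis to a physical cause – an edge or corner effect, say – instead of chasing randomly scattered defects.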

Many of the tools for this flow have been around for a while. For example, Cadence has had their Encounter Diagnostics tool since 2004. One of the missing links has been a means of viewing all of the manufacturing data in a coordinated manner; right now you more or less have to look at the data in an ad hoc fashion. Cadence has been working on a tool that they’ve used in some select situations to help bridge the analysis of manufacturing data back to the original design; they’re still in the productization stage, but intend that this be a key piece of automation in a feedback loop that can refine the design and manufacturing rules.

So while the concepts driving yield enhancement haven’t changed, the motivations have gotten stronger, and tools have become critical for managing the complexity and the amount of data required for thorough analysis. On the one hand, it kinda makes you pine for a simpler time, when you just kind of rolled up your sleeves and sleuthed around. On the other hand, you can now focus more energy on those parts of the process where the human brain is the key tool, and let the EDA tools take care of some of the more mundane work.

