Doing More with Less

If you’re Kool-Aid or Coca-Cola, you’re likely going to want to make your drinkage available to lots of people in lots of places. So you need to manufacture it somewhere and then truck it somewhere else. Problem is, most of what’s in the drink is water – and water is heavy and takes a lot of space. Lugging it all over the country is an onerous task. But even though your product is mostly water, the part where you add value is the flavor. And water can be had almost anywhere. So life can be a lot more efficient if you ship only the part that you uniquely make, the flavor, by itself – say, in powder or syrup form – and then add water locally where the drink will be consumed. After all, you care about the flavor; you really don’t care (much) about the water.

So what does this have to do with technology? As we’ll see, erstwhile traditional testing schemes shipped a lot of water from the tester into the device being tested, and ways have been found – and will need improving – to stop shipping that water and ship only the flavoring.

Older approaches to testing digital logic on ICs used plain-ol’ scan chains. A scan chain taps important circuit nodes so you can set up test conditions, execute a test, capture the result, and scan it back out. But let’s say you’re trying to test out a block of combinatorial logic that has seven inputs and three outputs. If your scan chain has 5000 cells, then you’ll be scanning in 5000 bits – seven of which you care about (“care” bits), the rest of which are irrelevant (“don’t-care” bits). And once you capture the result, you’ll scan out 5000 bits, three of which you care about. That’s an efficiency on the order of 0.1% – which equates to 99.9% overhead. Think of the care bits as flavoring and the don’t-care bits as water: you’re shipping a lot of water.

Statistics show that the vast bulk of bits in a test pattern – over 90% – are don’t-care bits. To address this, compression techniques in place today allow as high as 200x compression. But the International Technology Roadmap for Semiconductors (ITRS) is forecasting the need for an almost 2000x compression ratio in the number of bits by 2015 – how the heck are we going to get there?

First of all, a really important question: Why does this matter? For one simple reason: test cost. Testers are expensive, and the longer it takes to test a chip, the more of the tester cost gets ascribed to that chip, and the more expensive the chip is to produce. We won’t go into the whole economic question of “how much testing is enough” here; suffice it to say that, however much testing you decide to do, doing it faster is better. The ITRS sees test requirements increasing to such an extent that much higher compression will be needed to keep costs constant.
The old pre-compression test approach feeds test patterns directly from a tester to the chip being tested. Those test patterns are created by an automatic test pattern generator (ATPG) program that analyzes the circuit according to specific fault models. Older testing typically focused on “stuck-at” faults; today additional faults – bridging, environment-dependent, and transition faults to name a few – are being added to the mix, contributing more test patterns to an already large set.

Because any given test pattern has only a few useful bits to be applied, the rest of the bits are filled with random ones and zeros just to fill out the pattern (or fill up the scan chain). That test pattern set is then provided to the test equipment, which physically scans in the data serially, scans out the result, and compares the few result bits that matter to a desired result to see if the test passed.

One well-used scan architecture is the IEEE 1149.1 JTAG setup, which specifies a 20-MHz clock. At this rate, scanning in 5000 bits and scanning them back out requires 10,000 scan clock cycles at 50 ns each, for a total of 500 µs just for loading and unloading the test (without even performing the test). This is actually conservative, since unloading one scan and loading the next scan can be overlapped, cutting the time more or less in half. But even so, if you consider the bulk of test patterns that would be required to provide good coverage on a multi-million-gate design, the practical empirical fact is that test times can easily be several expensive seconds.

So here we have one huge problem – which is one huge opportunity: we’re using a ton of time to load and unload bits, over 90% of which are useless. Those don’t-care bits also have to be stored in pattern memory on a tester. As the test suites grow, so does the required tester memory. If the memory requirement grows beyond what a tester can handle, then you require multiple loads from hard disk – a completely untenable time adder to your test. Yes, tester memory can be expected to grow. But there are a lot of expensive testers that aren’t going to be thrown out anytime soon. And who wants to pay for more memory just to hold data that’s over 90% useless? It’s like having to provide 9x more truck capacity to ship water that has a bit of nice flavoring.

Break the chain

A first step to reducing the scan chain load time is breaking up the chains to have lots of small chains rather than a few long ones. You can’t go crazy with this, since, as Chris Allsup of Synopsys cautions, having too many chains creates routing congestion on the chip. But even so, having more chains works only if you have more external pins available in parallel to load them – if you’re multiplexing through a single pin, you haven’t really gained anything.
You might simply add lots of scan input pins, but adding to the pin count isn’t a good thing, especially since you can’t allocate many dedicated test pins. If you’re going to have more pins, you generally have to double-up the functions of some pins, making a few normal input or output pins into test pins when you enter a test mode. But as Mentor’s Greg Aldrich points out, on many applications there are fewer simple digital pins available because performance and integration are turning more pins into high-speed serial pins or analog pins.

Mr. Aldrich also notes that another way to improve testing efficiency – and therefore cost – is to test more than one chip at a time on the tester – so-called “multi-site” testing. But a tester has only so many test signals that it can manipulate: if it has to share those amongst multiple chips, and if each chip requires lots of pins, then you can’t test as many chips at the same time. Bottom line, there are lots of reasons to keep the pin count as low as practical.
The real key to saving both load time and tester memory is to relieve the tester of the requirement for loading the don’t-care bits. Instead, add a circuit to the chip itself that can generate the don’t-cares internally. And if it can drive many shorter scan chains in parallel, then you can load the few care bits from the outside on a few pins and have everything else done internally in fewer clock cycles. And voilà: you have “test compression.” Everyone wins. At the cost of some die area for the additional circuitry.

The general architecture of SoCs that use test compression is straightforward: test inputs feed a decompression engine, which feeds a bunch of scan chains in parallel, which are collected back up at a compactor where the results are sent back out. But of course, details matter. There are a number of approaches to making this work, and it’s pretty involved stuff. Especially if you want to wade into the math. Are you one of those people that becomes slightly suspicious at seeing polynomials in the context of digital logic? Have linear feedback shift registers (LFSRs) ever made you just a tad queasy? Well gird up, because much of what goes on here involves variants on those. But we won’t go crazy; my eyes glaze over at least as early as yours do.

Time to decompress

Let’s focus for now on the decompression part, where signals are taken from the tester and converted into useful scan chains. The approach used by Synopsys’ DFTmax is conceptually a bit more accessible, since it involves what they call their “Adaptive Scan” approach. They have a few pins that deliver the care bits through a series of multiplexers onto the appropriate scan chains. These multiplexers can be changed dynamically; the test pattern can tweak them on a shift-by-shift basis. Because each external pin can be directed to multiple internal scan chains, each care bit may also be loaded onto many other scan chains as a don’t-care bit, so in this manner, the relatively small number of care bits satisfy both the needs of the care and don’t-care data. Synopsys claims up to 100x compression possible with this circuit, with a die area cost that is less than some alternative schemes.

Mentor’s approach is somewhat more involved. They use what they call their “Embedded Deterministic Test” (EDT) system, which involves two circuits: a “ring generator” that generates care and don’t-care bits and a phase shifter that distributes them out to the scan chains. This whole system reminds me of a Rubik’s cube, where you move one square somewhere specific by taking all of the squares through a series of manipulations that, when you least expect it, get the square you care about into the right place. And in fact, for a given number and length of scan chains, Mentor refers to the block of bits in those chains as a “test cube.”

The ring generator is a special kind of LFSR that’s folded back on itself, reducing fanout and routing requirements that are apparently a challenge for conventional LFSRs. Such a circuit would normally be used to generate pseudo-random input, but instead they have added several critical points where they “inject” deterministic “seeds.” These seeds get munged through the ring generator, sent through the phase shifter, and magically create the care bits in the right places on the scan chains.

OK, maybe not magically. In fact, the process is rolled backwards by the Mentor’s TestKompress software, knowing which care bits are needed where in the scan chains. The software effectively takes all the steps of the phase shifter and ring generator in reverse to figure out which bits to inject when to make it all work. In so doing, it creates this test cube filled with a few care bits and lots of pseudo-randomly-created don’t-care bits. Mentor claims compression ratios in the 100x – 200x range using this technology.

With either scheme, the don’t-care bits will serendipitously help cover other faults. The compression software can try to take advantage of that, turning those bits from “don’t-care” to “care” for additional coverage, and tweaking other don’t-care bits so that more tests are accomplished in the single test pattern. You now have compression due not only to the reduction of bits, but also to the greater number of faults covered for each test.

But you can’t take this too far. In the Synopsys case, you still have to be able to conform to the configurations that the multiplexers allow; not all arbitrary combinations of bits in a test pattern will be possible. In the Mentor case, the creation of the test cube is essentially a problem of simultaneous equations. The more care bits you have, the more constraints you’re placing on the system of equations. At some point, if you have too many care bits, there’s no way to get it to work – there’s no solution. It’s like trying to get a Rubik’s cube into a configuration that’s physically impossible with any combination of moves.

There are other tools and approaches out there as well; SynTest’s VirtualScan decompressor uses an XOR logic block, and the decompressor used by Cadence’s Encounter Test Architect can be either an XOR block or an On-Circuit Multi-Input Signature Register (OCMISR), another flavor of LFSR.

Putting the squeeze on results

Once you’ve gotten the data into the chip and conducted the test, the other important job is to look at the result. To save cycles here, rather than reading out all bits of all scan chains, the results of all scan chains are combined together, typically using an XOR tree (Mentor, Synopsys, SynTest, and one Cadence option) or a MISR (another Cadence option). This provides an n:1 compaction, where n is the number of scan chains.

But there are some glitches with this. First, multiple faults may get lost. For example, with a simple XOR tree, multiple faults will be detected only if there are an odd number of them. If the number is even, they cancel each other out and everything looks peachy when it isn’t. This is alleviated by the fact that a given fault may be detected in more than one test pattern, making it less likely that the fault escapes entirely.

This also makes it impossible to diagnose a specific failure, since you get a single bit that’s the combination of all scan chains. A bypass around the compactor can be provided for this so that when a failure is detected, the test can be repeated in bypass mode to see the individual bit(s) that failed.

Another well-documented bugaboo is the fact that not all portions of the circuit are in well-known states. Memories, floating busses, and numerous other sources can contribute to unknown states, or “X-states.” Problem is, if you can’t predict all the values being compacted, then you can’t figure out what a successful test looks like. To combat this, numerous X-state masking approaches have been added to allow a scan chain with an X-state to be swapped out of the compaction, replaced by some known constant value.

Mentor recently announced an improved Xpress compactor that allows more flexible X-state masking. Older versions reduced visibility only to single faults when masking; the newer version allows multiple-fault detection. In addition, they use a two-stage compactor, which loads up several cycles’ worth of output from the regular compactor and then compacts that again, increasing the compaction ratio of the result.

One thing you may notice is the fact that test pattern creation is closely tied to the test circuitry. Once you pick a test structure from a given vendor, you must also use their ATPG and compaction software. To address this, Accellera has created the Open Compression Interface (OCI) standard, which was submitted to IEEE last year as IEEE 1450.6-1, a work in progress. This standard provides a way to create a file that describes such things as the nature of the compression and compaction circuits so that, in theory, any ATPG software package could read it and generate compressed tests for it.

Where do we go from here?

So… given this background, where are the opportunities for further compression in the future? If current ratios are in the 50-200x range, practically speaking, what’s going to get us better than a 10x improvement? Seems like the low-hanging fruit has already been picked. Presumably, new ideas will continue to unfold in the seven years remaining until the ITRS forecast comes due; a few examples give an idea of the kinds of opportunities that remain. They’re not all data compression per se, but all aim to reduce test time.

Simultaneous testing of identical blocks. Many SoCs have numerous instances of the same block. Sending the same test at the same time to all of them can save cycles and time.
Over-clocking the test circuitry. The internal decompression engine can be clocked faster than the data being loaded, resulting in a faster loading of the scan chains.
Faster scan frequency. Loading data faster makes things go faster. Duh. But at some point high-speed serial interconnect may be required.
More data-munging ideas. One example is to include run-length encoding. If there are 64 1s in a row, don’t load 64 1s; just specify “64 1s,” which requires less data. OK, it’s a tad more complicated than that, and you need some die space to include a run-length-expanding circuit, but you get the picture.

In the end, compression is a statistical metric. Some test sets will compress better than others. And some designs will lend themselves to certain compression techniques more than others. So there’s no one right answer; it’s going to have to be a combination of compression ideas and options that allow, on average, for us to get that next order of magnitude compression we’re going to need.

There is one other consideration to bear in mind. The ITRS tries to look far ahead to anticipate future needs. The forecast has uncertainty (they revise their projections yearly), and they likely look out further than most chip designers look. While additional compression may be indeed be needed in the future, Mr. Allsup noted that “the majority of our customers have not indicated the need for greater compression levels,” but rather have other priorities for improving testing – things like improved consideration of power and timing in test patterns. This will obviously affect the priority given to compression schemes as test technology evolves.

Links:
Mentor
Synopsys
Cadence
SynTest

Doing More with Less

Related

Leave a Reply Cancel reply

featured paper

Quickly and accurately identify inter-domain leakage issues in IC designs

featured chalk talk