Verification is the science (and art) of asking the question, “What could possibly go wrong?”

If you’ve done a good job – you think – then you would expect that the answer would be, well, not very many things. If you start musing a bit harder, you might come up with some scenarios you hadn’t initially thought of. And, depressingly, the more you think about it, the more problems you can probably come up with.

It’s one thing to identify all the things that could go right or wrong in a design; it’s yet another to itemize all of the conditions leading to those good or bad outcomes, especially when it comes to a full SoC (which, mercifully, isn’t done all by you, which means there are other people you can point to if things go awry). But, in a wide-ranging discussion with Breker’s CEO Adnan Hamid, we discussed the scope of the problem – especially when taking a high-level system view.

Here are some numbers to chew on that get to the size of the space we’re talking about. **Breker’s tool**, which you may recall generates C-level tests for stress-testing an SoC, feeds on graphs that describe scenarios; each node in the graph is “something to do,” and a node may abstract a lower-level graph. Designers and architects put these scenarios together, and, ideally, the collection of all scenarios completely defines the intended behaviors of the design. So how many “states” might exist in a full analysis of all possible graphs?

Let’s start with an IP block, which is much more manageable. Now, I will readily acknowledge that “IP” is a broad term and can include anything from a simple DMA block to a monstrously complex protocol. So one number won’t speak to them all, but we’re talking orders of magnitude here, and let’s not get hung up on trivial examples. According to Mr. Hamid, the number of graph states in an IP block is on the order of… are you ready for this? 10^{30}. Yes: 1,000,000,000,000,000,000,000,000,000,000.

Dare you ask how many are in an entire SoC? How about in the range of 10^{300}-10^{1000}. Forgive me for not writing those out. Dude, that bottom number is a googol cubed! And this doesn’t even include concurrency issues in a multicore system.

I won’t bother with the math that determines how long that would take to run. Oh, the heck with that: let’s do it! At the low end of that range, if it took 1 ns to analyze one graph state, then it would take <snicker> 3.2×10^{283} years to do the whole thing. A mere (two and a quarter x 10^{291}) times the current rough estimate of the age of our universe. The upper end? I guess you’ll have to do the math; Excel can’t – its brain exploded with that number.

So, after that merry little divertimento, let’s come back to what we already knew: we can’t test that much stuff. Mr. Hamid suggests that exercising something more on the order of 10^{9} samples, done properly, can ferret out most of the errors. An enormously tiny (if I may be so contradictory) fraction. And yet, he says that most verification plans cover only on the order of 10^{6} states.

So for those of you gunning for 100% coverage, nuh-uh: not gonna happen. Granted, this notion of coverage is far more expansive than what logicians may consider when using the far more tractable stuck-at fault model.

Stuck-at faults are great – they’re so manageable and traceable. But they reflect only low-level logic and only a portion of what can go wrong. Stuck-to-another-signal (bridging), for example, won’t be caught. More comprehensive models have been introduced to expand the stuck-at notion, but they still operate on a low level.

Such faults are typically analyzed in the realm of abstract logic, where a signal is considered the same everywhere. If you want to take possible manufacturing faults into account for testing, there’s a whole different universe of things that can go wrong. For example, if a signal (which is considered to be a single entity in RTL) has a physical trace that splits, with the two branches going to opposite sides of the die, then faults could occur that have one branch at a different value from the other branch – a condition that the RTL description would say is impossible.

The potential gotchas Breker is trying to test for involve higher-level operations than this, although the abstract/manufacturing fault differences still apply. Things like, oh, reading from this memory while that DMA over there is moving stuff around in a couple of other memories. The problem is constrained by the universe of things that are possible to do in C. (Although, even without taking inline assembly into account, any of us who have programmed in C at less than a professional level will probably agree that that universe includes a few things you can do right and a whole lot of things you can do wrong.)

So how in the world do you figure out what to test? Looked at statistically, you need to identify a sample of around 10^{9} points out of 10^{500} to prove that a design is worthy. To call that a drop in the bucket is like calling the Death Star just a big Sputnik. Where in the heck do you start?

And this is where methodology comes in. So-called “constrained random” testing starts from inputs, applying (constrained) random values and recording the expected outputs. This is a feed-forward approach.

Breker’s approach is the opposite: it’s a “feed-back” method. They take these graphs, these scenarios, and, rather than working from the inputs to outputs to trace issues, they start with outcomes and work back up the – well, at the RTL level you’d say “cone of logic,” but let’s generalize here as the “cone of influence.” You then randomly select some sets of inputs that result in the outcome.

Mr. Hamid uses a simple example of an adder to describe the impact of sampling approach. If you look at the overall scope of what an adder can do, it can have one of three basic outcomes: a normal sum, an overflow, or an underflow. Many different add operations result in a “normal” sum. There are comparatively few that result in over- and underflow.

The typical sampling approach would randomly pick from the universe of possible input combinations – the vast majority of which result in normal sums. In other words, few tests (perhaps even none) would end up testing over/underflow. But from a “stress testing” standpoint, over/underflows represent more of the “exceptional” sort of output that might cause a problem, Breker’s tool builds a sample that represents over/underflows far more than a constrained random approach would do. That’s because, in the outcomes “space,” normal, overflow, and underflow results get equal weighting.

Let’s make this more concrete: The Breker tool takes an output and traverses the graph backwards towards the inputs, selecting some random set out of the universe of inputs that lead to the outcome. Let’s assume we’re going to generate 30 tests for our adder. The tool starts with, say, the “normal sum” outcome and works back to the inputs, randomly coming up with 10 different input combinations that cause a normal sum. Then, using a similar technique, the overflow outcome is explored, yielding 10 tests that result in an overflow condition; and likewise for underflow.

So you have equal weighting of normal, over-, and underflow outcomes. A test can now include these in a larger test that says, oh, “What happens when an interrupt fires right when an overflow occurs?” This would assume, of course, that the conditions resulting in an interrupt have also been analyzed and the appropriate test conditions identified.

Note that, for simplicity, I said that you pick some random set of inputs leading to an outcome. You actually have more control than that; there are a number of knobs that you can use to bias the test generation. For example, when creating scenarios, you can give some branches in a graph more weight than others, meaning more tests will be generated for the branches having higher weight.

I have to confess to having tied myself in some knots with this, which Breker helped to undo. And it largely had to do with one word that tends to be used for analyzing the graph: “solver.” There’s a very basic realm where “cones of influence” and “solvers” come together, and that’s formal analysis. So I assumed that’s what was going on here. And the more I thought about that, the more it didn’t make sense.

Well, it turns out that there’s more than one kind of solver. I was thinking “sat (or satisfiability) solver,” which can be used to prove conditions or outcomes or provide counterexamples if the proof fails. Useful stuff, but that’s not what’s going on here. This is a “graph solver” – perhaps better called a graph “explorer,” they said. Nothing is being proven.

The tool also never sees the actual design – the RTL. All the tool sees is the scenario graphs; it allows tests to be generated that can then be applied to the design to identify issues. It also doesn’t go beyond the scenarios that have been drawn up. If that set of scenarios has big holes in it, then so will the set of tests. The tests can give a good representation of the behaviors identified in the scenarios, but they can’t help you find behaviors that have been left out of the scenarios.

So the burden of “what could possibly go wrong” still falls on you, at least at a block level, although by creating the scenarios, you’re really documenting how things are supposed to work. The tool can then pull these scenarios together to see what happens when the system gets stressed.

And the next time the test or verification folks complain that you have too many tests, run those numbers for them. If they understand that you’re sampling at a rate of about 10^{-491}, they may end up asking for more tests.

More info:

Breker says that the universe of all possible SoC tests is beyond enormous, and that what we actually verify is a tiny sample – and that optimizing that sample is key. What’s your reaction to this?