
Breaking the System

Good Pieces Don’t Always Make a Good Whole

It’s seductive logic. If the pieces are good, then the whole, which is but an assemblage of known-good pieces, must be good.

I used that same logic as a kid. Orange juice is good; Cheerios are good. Ergo, using orange juice instead of milk should provide a delicious breakfast.

Wrong. That was a lesson I remember to this day. I could never quite put my finger on exactly why those two things didn’t work together – there was no obvious reason; they just didn’t.

Or here’s another example: take, oh, eight nice, well-behaved dogs. Put them together with no human around (or at least not one that they respect) and let them roam free. Assuming they don’t fight amongst themselves – in fact, especially if they don’t fight, if they all work well together – then, as a pack, they’re likely to create all kinds of havoc that none of them would consider individually. While that has everything to do with dog pack mentality, it reinforces the fact that combinations of good ingredients don’t always result in good outcomes.

I’m sure we’ve all learned that lesson one way or another. And yet it’s easy to fall back on that logic when it’s convenient. Complex SoCs are created from combinations of blocks that are individually tested. Even the interconnect is tested, so, the story goes, putting it all together should just work.

Frankly, we may know that this is a bit of a cop-out, but, given the complexity of these systems and no obvious way to deal with it, it’s almost like we need this little sleight of hand in order to ship a system without going completely nuts. But the fact remains that the “sum-of-good-things equals good-thing” identity is unproven. In fact, it’s worse than that: it’s all too often proven wrong, as anyone who has received that “we’ve got a problem with the new silicon” phone call can attest.

One element that’s relatively new to chip design adds an entirely new dimension of complexity: software. Software has to be tested before tape-out – and that’s no easy feat, given how many lines of code are required simply to boot Linux. This is where virtual platforms, emulation, and hardware simulation acceleration help – you can run real software to check out how it works in the system.

But, according to Breker’s Adnan Hamid, the typical software that gets tested has some limitations. For the most part, it’s application software – which is good; wringing out the known major use cases is obvious and important. But it tends not to stress the architecture or the hardware. Individual stress tests might challenge the processor or some other localized element, but Breker’s contention is that very few tests work from the chip inputs through the entire system, of which the processor is just a piece, and on to the actual outputs.

This kind of holistic stress testing is what Breker’s TrekSoC tool is about. Given a set of scenarios, the idea is to have the tool automatically create a suite of tests that challenge the corners of operation. The scenarios are where the work is done up front – and this can be done (and is probably best done) long before you approach tape-out. It can be put to use during architectural testing as well, when you’re still working at the transaction level to prove out the system conceptually before implementation starts.

You can think of it as a hierarchy of constrained-random elements. At the bottom are low-level “services” like memory management, interrupts and polling, scheduling, and register access. Randomizable scenarios can be built from these to serve the next level, where drivers will make use of them.

These drivers will also have randomizable scenarios, and they can place constraints on the low-level scenarios so that you don’t end up wasting tests (or getting failing results) on nonsensical situations. The driver scenarios feed application scenarios, and the application scenarios feed performance scenarios.
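To make the layering concrete, here’s a minimal sketch in Python (purely illustrative: the names, structure, and constraints are invented for this example and have nothing to do with TrekSoC’s actual format or API). Low-level services get randomized, a driver-level scenario reuses them while constraining them to sensible values, and an application-level scenario stacks on top of the driver.

```python
import random

# Illustrative toy model of layered, constrained-random scenarios.
# All names and values are hypothetical; this is not TrekSoC's API.

def service_mem_alloc(constraints=None):
    """Low-level 'service' scenario: pick a randomized memory region."""
    c = constraints or {}
    size = random.choice(c.get("sizes", [64, 256, 4096]))
    align = random.choice(c.get("aligns", [4, 16, 64]))
    return {"service": "mem_alloc", "size": size, "align": align}

def service_interrupt(constraints=None):
    """Low-level 'service' scenario: randomized interrupt timing (cycles)."""
    c = constraints or {}
    delay = random.randint(*c.get("delay_range", (0, 1000)))
    return {"service": "interrupt", "delay": delay}

def driver_dma_transfer():
    """Driver-level scenario: reuses the services, constraining them so the
    combination stays sensible (e.g., DMA buffers must be 64-byte aligned)."""
    buf = service_mem_alloc({"sizes": [256, 4096], "aligns": [64]})
    irq = service_interrupt({"delay_range": (10, 200)})
    return {"driver": "dma", "buffer": buf, "completion_irq": irq}

def app_video_decode():
    """Application-level scenario: stacks driver scenarios into a use case."""
    return {"app": "video_decode",
            "transfers": [driver_dma_transfer() for _ in range(3)]}

if __name__ == "__main__":
    random.seed(1)
    print(app_video_decode())
```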

Given these models, tests are created by positing an outcome (i.e., a set of outputs) resulting from some input and then using a Boolean solver to work backward to a test that produces it. The tests will have nothing to do with the actual applications that will be run – they’re simply there to exercise the system as a whole and find any weaknesses.
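As a rough illustration of working backward from an outcome (again a toy sketch, not how Breker’s solver actually operates), you can encode the scenario relationships as Boolean implications and then search for an assignment that makes the posited output true; here, a brute-force enumeration stands in for a real Boolean solver.

```python
from itertools import product

# Toy stand-in for outcome-driven test generation: given implications that
# say which steps enable which results, find an assignment under which the
# desired outcome holds. A real tool would use a proper Boolean/constraint
# solver; brute force over four variables is enough to show the idea.

VARS = ["dma_start", "irq_enabled", "buffer_ready", "frame_decoded"]

def consistent(a):
    """Scenario 'rules' expressed as implications over the variables."""
    rules = [
        (not a["dma_start"]) or a["buffer_ready"],          # DMA needs a buffer
        (not a["frame_decoded"]) or (a["dma_start"] and a["irq_enabled"]),
    ]
    return all(rules)

def solve_for(outcome):
    """Find any consistent assignment that achieves the posited outcome."""
    for values in product([False, True], repeat=len(VARS)):
        assign = dict(zip(VARS, values))
        if consistent(assign) and assign[outcome]:
            return assign
    return None

if __name__ == "__main__":
    # Posit the outcome ("a frame got decoded") and recover the steps a
    # test must drive to make it happen.
    print(solve_for("frame_decoded"))
```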

An example that Mr. Hamid uses is one where a piece of data is written to memory just before an interrupt fires. Because the code causing the data write has been executed prior to the interrupt, it would normally be assumed that the data was, in fact, written and that the interrupt service routine can count on the correct data being in place. But in a system with a network-on-chip interconnect structure, the latency may be such that the data write instruction was requested but did not complete prior to the interrupt. And now you have a data consistency problem.
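A toy timing model makes the hazard easy to see (the numbers here are invented; real NoC latency depends on topology, arbitration, and load): the write is issued before the interrupt, but if it is still in flight across the interconnect when the interrupt fires, the service routine reads stale data.

```python
# Toy model of a posted write racing an interrupt across an interconnect.
# Purely illustrative; the latencies are made up, but the hazard is real.

MEMORY = {"buf": "stale"}   # value sitting in memory before the write lands

def isr_observes(write_latency_cycles, irq_after_cycles):
    """Return what the interrupt service routine sees when it reads 'buf'."""
    write_completes_at = 0 + write_latency_cycles   # write issued at cycle 0
    irq_fires_at = irq_after_cycles
    return "fresh" if write_completes_at <= irq_fires_at else MEMORY["buf"]

if __name__ == "__main__":
    print(isr_observes(write_latency_cycles=5,  irq_after_cycles=20))   # fresh: OK
    print(isr_observes(write_latency_cycles=40, irq_after_cycles=20))   # stale: consistency bug
```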

They also target multicore issues like deadlocks, livelocks, and cache coherency. Such areas are hard to test with application code, which may be very well behaved. TrekSoC is badly behaved: it can specifically create tests that will zero in on these kinds of problems to see whether or not the system survives intact.

The test code itself is run on bare metal, and any OS calls are trapped and rerouted, with TrekSoC acting as an “evil” OS in servicing those calls. This keeps the code as close as possible to the subsystems being tested, and it can also save valuable testing time, since no OS has to boot.

The form the tests take depends on the target. Early on, when doing architectural evaluations, they can be turned into transactions for IP unit testing. Later in the flow, they generate RTL with offloads and a testbench as well as the code to be run. The offloads, implemented as native code libraries, handle memory observation, “printf” commands, input stimulus, and output checks.

From an abstract testing standpoint, the complete model provides stimulus, results checking, and coverage statistics, making it completely self-contained. The actual flow of the tests – in particular, the randomized nature – is determined at test generation time, not at run time. Any resulting coverage gaps can be addressed by directing the tool to create tests for a specific uncovered node on the flow graph used to depict the scenarios.
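The coverage-closure loop can be pictured the same way (another hypothetical sketch, with an invented flow graph): walk the scenario flow graph, find the nodes that no generated test has reached, and produce a path, in effect a directed test, that hits each one.

```python
# Toy sketch of closing coverage on a scenario flow graph: find nodes no
# test has hit yet and produce a path that reaches each one.
# The graph and node names are invented for illustration.

GRAPH = {
    "start":  ["alloc", "config"],
    "alloc":  ["dma"],
    "config": ["dma"],
    "dma":    ["irq", "poll"],
    "irq":    ["check"],
    "poll":   ["check"],
    "check":  [],
}

def path_to(target, node="start", path=None):
    """Depth-first search for a path from 'start' to the target node."""
    path = (path or []) + [node]
    if node == target:
        return path
    for nxt in GRAPH[node]:
        found = path_to(target, nxt, path)
        if found:
            return found
    return None

def close_coverage(covered):
    """For every node the existing tests missed, generate a path that hits it."""
    gaps = [n for n in GRAPH if n not in covered]
    return {n: path_to(n) for n in gaps}

if __name__ == "__main__":
    already_covered = {"start", "alloc", "dma", "irq", "check"}
    print(close_coverage(already_covered))   # paths that hit 'config' and 'poll'
```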

If this all works, then it gives you a way of ensuring that you won’t be throwing out breakfast and starting over. Or that the neighbors won’t all be complaining that their trash is turned over and that their cats are all stuck up in trees. And, best of all, you won’t be getting that dreaded phone call after first silicon.

 

More info:

Breker

