
When Lives Depend on Your Machine-Learned Model

How Do You Prove It’s Correct?

“How to validate training data is an open question that might be addressed by some combination of characterizing the data as well as the data generation or data collection process.”
‘Challenges in Autonomous Vehicle Testing and Validation’, Koopman and Wagner, 2016

Last summer, we published a piece on safety-critical capabilities in EDA. But we left one question unanswered at the end: how do you validate a model created by machine learning (ML)? It wasn’t a trivial topic at the time, so we postponed it. Today, we take it back up. And, as you can see from the quote above, it’s still not a trivial topic.

Why Is This Hard?

There are a number of considerations that make this a vexing subject. After all, you’re trying to verify that some model is a good (or at least a good-enough) representation of some reality. But you do that by training the model with some data set. Change the data – even by replacing good data points with different good data points – and the model will change. Change the order of the data points during training? The model will likely also change. Maybe not in its behavior, but in the dirty details.

So, from a given bunch of data, you could get an enormous number of different models – all valid (at least with respect to the training data). Would they all perform the same when tested against data points that weren’t in the training set? That’s not clear. But let’s come back to the “how to do this” shortly.

Overfitting

Remember design of experiments (DoE)? It’s important to randomize any variables that you’re not testing for. Otherwise your experiments might show false correlations to surprising influences – ones that would disappear with a broader data set.

Well, a similar thing can happen with machine learning. If you’re trying to teach a system to recognize pedestrians, then hair color shouldn’t matter – it shouldn’t be part of the model. But if all the training samples happen to have brunette hair, then that might get built into the model. It effectively becomes a false correlation.

The problem is that overfitting is hard to test for. If there are false correlations, you might not know – unless the testing data set happens to have lots of blond-haired people who don’t get identified as pedestrians. You can’t always count on something that convenient.

The solution – just like with DoE – is to randomize hair color (and anything else that shouldn’t matter). Now… what if you think it shouldn’t matter but, in fact, it does? Then, even with randomization, that relationship will show up in the model. For example, what if all hair colors are included, but the results turn out to significantly favor those with brunette hair? Then you might get a model similar to the one from the brunette-only data set – except that this time it’s not a false correlation.
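
To make that concrete, here’s a minimal sketch in Python (scikit-learn, with entirely made-up “pedestrian” data and a hypothetical hair-color feature). It shows how a feature that shouldn’t matter picks up weight when it happens to be correlated with the label, and how randomizing it drives that weight back toward zero.

```python
# A minimal sketch on synthetic data; the features and labels are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
height = rng.normal(1.7, 0.1, n)            # a feature that genuinely matters
label = (height > 1.6).astype(int)          # toy "is a pedestrian" label

# Case 1: hair color accidentally correlated with the label.
hair_biased = label.copy()
# Case 2: hair color randomized, as DoE would prescribe.
hair_random = rng.integers(0, 2, n)

for name, hair in [("biased", hair_biased), ("randomized", hair_random)]:
    X = np.column_stack([height, hair])
    model = LogisticRegression().fit(X, label)
    print(name, "hair-color weight:", round(model.coef_[0][1], 3))
# The "biased" run should give hair color a noticeably large weight;
# the "randomized" run should drive that weight toward zero.
```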

Edge Cases

In the EDA world, we speak often of “corner cases.” The reason they’re a “corner” is that they involve multiple variables intersecting in some far-flung corner of the dataspace. In the ML world, they speak of “edge cases.” It’s an edge instead of a corner because it involves only one variable.

The challenge here is finding the edges. In the hair color example, what are the edges? Jet black and white? We might think of them as such, but will the model play out that way? You could synthesize or simulate various data points for training, but you run the risk of working any artifacts from the synthesis or simulation engine into the model – overfitting some unintended algorithmic predilections.

Reading the Resulting Model

One final obvious issue that affects everything, including the prior two issues, is the fact that the generated models aren’t legible or human-understandable. They consist of a number of weights in a matrix, or some other such abstracted form. You can’t go in and tease out insights by looking at the model. You can’t see any overfitting, and you can’t figure out where edges are or whether they’ve been correctly incorporated into the model. It’s a model by and for machines, not by or for humans.

So What to Do?

Given that ML models aren’t really testable in the manner we’re used to, there are a couple of ways to handle it: accommodate the uncertainty, or try to reduce it to the satisfaction of some adjudicator in the safety-critical world.

In the first case, you assume that your model is inherently untestable to the levels needed for high-ASIL or other safety-critical designs. So, instead of losing sleep over what might seem like an unsolvable problem, you build a cross-check into the design: a high-ASIL monitor that looks for any out-of-bounds results that the ML model might generate. You haven’t proven the model in that case; you’ve just made arrangements for its failures.
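
As a rough illustration of that cross-check idea (and only an illustration: the bounds, the fallback value, and the stand-in model below are all made up), the monitor can be as simple as a range check wrapped around whatever the model emits.

```python
# A minimal sketch of a bounds-checking monitor. Nothing here comes from a
# real safety standard; the limits and fallback are hypothetical.
def monitored_output(ml_predict, sensor_input, lower, upper, safe_fallback):
    """Run the ML model, but never pass an out-of-bounds result downstream."""
    value = ml_predict(sensor_input)
    if value < lower or value > upper:
        # We haven't proven the model; we've just arranged for its failures.
        return safe_fallback
    return value

# Example: a made-up steering-angle model limited to +/- 30 degrees.
angle = monitored_output(lambda x: 47.0, sensor_input=None,
                         lower=-30.0, upper=30.0, safe_fallback=0.0)
print(angle)  # prints 0.0: the monitor overrides the out-of-bounds prediction
```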

The second approach is to improve the model’s performance by taking great care with the training data and with how you test the model. But doing this means distinguishing between two types of error: training error and testing error.

Training error is determined by taking the exact same data used to train the model and running it back through the model. You might think that’s dumb: if you trained with the data, then the model should always give the correct results, right? The thing is, training engines are software, and they aren’t perfect. So it’s possible to train with a dataset and yet have the model fail on some of the cases in that very dataset. Training error, then, really measures the quality of the training itself – which, I guess, is why they call it training error.

Testing error results from training on one set of data and testing with a completely different set of data. Now you’re testing the model’s ability to work with data points that it hasn’t seen before. Given a total of 100 data points (you’d typically need many, many more than this), you might pick 70 for training, keeping the remaining 30 as the test points.
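
Here’s a minimal sketch of both error measurements, using synthetic data and the hypothetical 70/30 split mentioned above (the scikit-learn calls and the classifier choice are illustrative, not anyone’s qualified flow).

```python
# A sketch of training error vs. testing error on made-up data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                       # 100 points, 5 features
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

# 70 points for training; the remaining 30 are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

train_error = 1 - model.score(X_train, y_train)     # error on data it saw
test_error = 1 - model.score(X_test, y_test)        # error on data it never saw
print(f"training error: {train_error:.2f}  testing error: {test_error:.2f}")
# A flexible model often shows near-zero training error alongside a
# noticeably larger testing error; that gap is what the held-out set exposes.
```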

Sounds simple, but how do you make that split? What if you do a random split but, quite by chance, the training set ends up with all brunette hair and the test set has all blond hair? That model will overfit for hair color – that’s the bad news. The good news in this case is that the test set will catch it. But it still means you have a faulty model.

You could also do a number of random splits and average the results. But, here again, you could end up with unintended artifacts. What if 20 of the data points randomly ended up always in the training set and never in the testing set? Those 20 points will influence the model more than other data points – like ones that might always happen to end up in test sets.
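
You can see that artifact with a quick experiment (a sketch on made-up indices, nothing more): count how often each data point lands in the test set over a series of random 70/30 splits.

```python
# A sketch showing how repeated random splits treat data points unevenly.
import numpy as np
from sklearn.model_selection import train_test_split

n = 100
counts = np.zeros(n, dtype=int)                     # test-set appearances
for seed in range(20):                              # 20 random 70/30 splits
    _, test_idx = train_test_split(np.arange(n), test_size=0.3,
                                   random_state=seed)
    counts[test_idx] += 1

# With perfectly even treatment, every point would be tested about 6 times
# (20 splits x 30 test slots / 100 points); in practice the counts spread
# out, and some points get tested far more often than others.
print("fewest test appearances:", counts.min(),
      " most test appearances:", counts.max())
```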

The solution to this is called k-fold cross-validation. k refers to the number of groups into which you split your data. A typical number might be 5. In that case, you have 5 subsets of your data, and you create 5 models. For each model, you use 4 of the subsets for training and the remaining 1 for testing. You do that 5 times, each time switching which subset is used for testing. In that manner, every data point is used for testing exactly once and for training exactly 4 times. You don’t have the overlaps and biases generated by random splits.
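
As a sketch (same synthetic setup as before; the shuffling seed and classifier are arbitrary), 5-fold cross-validation looks like this.

```python
# A sketch of 5-fold cross-validation on made-up data.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

kfold = KFold(n_splits=5, shuffle=True, random_state=2)
errors = []
for train_idx, test_idx in kfold.split(X):
    # Four folds train the model; the remaining fold tests it.
    model = DecisionTreeClassifier(random_state=2)
    model.fit(X[train_idx], y[train_idx])
    errors.append(1 - model.score(X[test_idx], y[test_idx]))

# Every data point is tested exactly once and trained on exactly four times.
print("per-fold test error:", [round(e, 2) for e in errors])
print("mean test error:", round(float(np.mean(errors)), 2))
```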

Ultimately, these are steps you can take to create the best possible model. But, in the end, you’re not validating the model: at best, you’re validating your training and testing methodology. It’s up to the safety adjudicators to decide whether your model passes based on how you created it and the test error results. But there will still be no way to outright prove the model through the notion of coverage that we use when we validate circuits in the EDA world.

 

More info:

I consulted a number of references on this piece, but these two papers were particularly helpful.

“Challenges in Autonomous Vehicle Testing and Validation,” Philip Koopman & Michael Wagner, Carnegie Mellon University; Edge Case Research LLC

“How to Correctly Validate Machine Learning Models,” RapidMiner whitepaper
