
When Lives Depend on Your Machine-Learned Model

How Do You Prove It’s Correct?

“How to validate training data is an open question that might be addressed by some combination of characterizing the data as well as the data generation or data collection process.”
‘Challenges in Autonomous Vehicle Testing and Validation’, Koopman and Wagner, 2016

Last summer, we published a piece on safety-critical capabilities in EDA. But we left an open, unanswered question at the end: how do you validate a model created by machine learning (ML)? It wasn’t a trivial topic at the time, and so we postponed covering that aspect. So today, we take it back up. And, as you can see from the quote above, it’s still not a trivial topic.

Why Is This Hard?

There are a number of considerations that make this a vexing subject. After all, you’re trying to verify that some model is a good (or at least a good-enough) representation of some reality. But we do that by training the model with some data set. Change the data – even by replacing good data points with different good data points – and the model will change. Change the order of the data points during training? The model will likely also change. Maybe not in effect, but in the dirty details.

So, from a given bunch of data, you could get an enormous number of different models – all valid (at least with respect to the training data). Would they all perform the same when tested against data points that weren’t in the training set? That’s not clear. But let’s come back to the “how to do this” shortly.

Overfitting

Remember design of experiments (DoE)? It’s important to randomize any variables that you’re not testing for. Otherwise your experiments might show false correlations to surprising influences – ones that would disappear with a broader data set.

Well, a similar thing can happen with machine learning. If you’re trying to teach a system to recognize pedestrians, then hair color shouldn’t matter – it shouldn’t be part of the model. But if all the training samples happen to have brunette hair, then that might get built into the model. It effectively becomes a false correlation.

The problem is that overfitting is hard to test for. If there are false correlations, you might not know – unless the testing data set happens to have lots of blond-haired people that don’t get identified as pedestrians. You can’t always count on something that convenient.

The solution – just like with DoE – is to randomize hair color (and anything else that shouldn’t matter). Now… what if you think it shouldn’t matter but, in fact, it does? Then, even by randomizing, that relationship will show up in the model. For example, if all hair colors are included but it turns out that results significantly favor those with brunette hair? Then you might get a model similar to the brunette-only data set – except that this time it’s not a false correlation.
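The randomization idea can be shown with a toy sketch in Python. The feature names and data here are purely illustrative (a single "brunette" flag standing in for the nuisance variable): in a biased training set where every pedestrian happens to be brunette, the irrelevant feature predicts the label perfectly; randomize it, and the false correlation collapses to chance.

```python
import random

# Hypothetical toy data: each sample is (hair_is_brunette, is_pedestrian).
# Biased set: every pedestrian in the training data happens to be brunette.
biased = [(1, 1)] * 50 + [(0, 0)] * 50

def correlation_with_label(samples, feature_idx=0):
    """Fraction of samples where the feature alone predicts the label."""
    hits = sum(1 for s in samples if s[feature_idx] == s[-1])
    return hits / len(samples)

print(correlation_with_label(biased))      # 1.0 -- a false correlation

# Randomizing the nuisance variable breaks the spurious relationship:
rng = random.Random(0)
randomized = [(rng.randint(0, 1), label) for _, label in biased]
print(correlation_with_label(randomized))  # ~0.5 -- near chance
```

If, after randomizing, the correlation *persists*, that's the signal described above that the variable genuinely matters rather than being an artifact of the sample.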

Edge Cases

In the EDA world, we speak often of “corner cases.” The reason they’re a “corner” is that they involve multiple variables intersecting in some far-flung corner of the dataspace. In the ML world, they speak of “edge cases.” It’s an edge instead of a corner because it involves only one variable.

The challenge here is finding the edges. In the hair color example, what are the edges? Jet black and white? We might think of them as such, but will the model play out that way? You could synthesize or simulate various data points for training, but you run the risk of working any artifacts from the synthesis or simulation engine into the model – overfitting some unintended algorithmic predilections.

Reading the Resulting Model

One final obvious issue that affects everything, including the prior two issues, is the fact that the generated models aren’t legible or human-understandable. They consist of a number of weights in a matrix, or some other such abstracted form. You can’t go in and tease out insights by looking at the model. You can’t see any overfitting, and you can’t figure out where edges are or whether they’ve been correctly incorporated into the model. It’s a model by and for machines, not by or for humans.

So What to Do?

Given that ML models aren’t really testable in the manner we’re used to, there are a couple of ways to handle it: accommodate the uncertainty, or try to reduce it – to the satisfaction of some adjudicator in the safety-critical world.

In the first case, you assume that your model is inherently untestable to the levels needed for high-ASIL or other safety-critical designs. So, instead of losing sleep over what might seem like an unsolvable problem, you build a cross-check into the design. You design a high-ASIL monitor that looks for any out-of-bounds results that the ML model might generate. You haven’t proven the model in that case; you’ve just made arrangements for its failures.
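A minimal sketch of that monitor idea, in Python (the names and bounds here are hypothetical, not any particular standard's prescription): the untrusted model is wrapped so that any out-of-bounds output hands control to a deterministic fallback instead.

```python
def monitored(ml_model, lower, upper, fallback):
    """Wrap an untrusted ML model with a simple range monitor.

    If the model's output leaves [lower, upper], return the
    deterministic fallback's answer instead of trusting the model.
    """
    def safe(x):
        y = ml_model(x)
        if not (lower <= y <= upper):
            return fallback(x)
        return y
    return safe

# Illustrative use: a model that misbehaves is caught by the monitor.
wild_model = lambda x: 999          # out-of-bounds result
safe_model = monitored(wild_model, lower=0, upper=100,
                       fallback=lambda x: 0)
print(safe_model(5))                # 0 -- the fallback took over
```

The monitor itself is simple enough to validate by conventional means, which is the whole point: the hard-to-prove component is fenced in by an easy-to-prove one.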
You can improve the model’s performance, however, by taking great care with the training data and how you test the model. But doing this means distinguishing between two types of error: training error and testing error.

Training error is determined by taking the exact same data used to train the model and running it through the model. You might think that’s dumb: if you trained with the data, then the model should always give the correct results, right? The thing is, training engines are software, and they aren’t perfect. So it’s possible to train with a dataset and yet have the model fail on some of the cases in that dataset. So the training error is really testing the quality of the training. Which I guess would be why they call it training error.
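Computing an error rate is the same operation whichever dataset you feed it; it becomes "training error" when the dataset is the one you trained on. A small illustrative sketch (the 1-D threshold "model" and the data points are made up for the example):

```python
def error_rate(model, dataset):
    """Fraction of labeled points (x, label) that the model gets wrong."""
    wrong = sum(1 for x, label in dataset if model(x) != label)
    return wrong / len(dataset)

# Hypothetical: a 1-D threshold classifier "trained" on these points.
train_data = [(0.1, 0), (0.4, 0), (0.45, 1), (0.6, 1), (0.9, 1)]
model = lambda x: 1 if x >= 0.5 else 0   # an imperfect fit

# Running the training data back through the model: nonzero even here.
training_error = error_rate(model, train_data)
print(training_error)                    # 0.2 -- one point of five is wrong
```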

Testing error results from training on one set of data and testing with a completely different set of data. Now you’re testing the model’s ability to work with data points that it hasn’t seen before. Given a total of 100 data points (you’d typically need many, many more than this), you might pick 70 for training, keeping the remaining 30 as the test points.
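The 70/30 split described above is a one-liner's worth of logic; a sketch (the fraction and seed are just illustrative choices):

```python
import random

def train_test_split(data, train_frac=0.7, seed=42):
    """Randomly shuffle data, then split into (training, testing) sets."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

points = list(range(100))       # stand-ins for 100 labeled data points
train, test = train_test_split(points)
print(len(train), len(test))    # 70 30
```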

Sounds simple, but how do you make that split? What if you do a random split but, quite by chance, the training set ends up with all brunette hair and the test set has all blond hair? That model will overfit for hair color – that’s the bad news. The good news in this case is that the test set will catch it. But it still means you have a faulty model.

You could also do a number of random splits and average the results. But, here again, you could end up with unintended artifacts. What if 20 of the data points randomly ended up always in the training set and never in the testing set? Those 20 points will influence the model more than other data points – like ones that might always happen to end up in test sets.

The solution to this is called k-fold cross-validation. k refers to the number of groups into which you split your data. A typical number might be 5. In that case, you have 5 subsets of your data, and you create 5 models. For each model, you use four of the subsets for training and the remaining one for testing. You do that 5 times, each time switching which subset is used for testing. In that manner, every data point is used for testing exactly once and for training exactly four times. You don’t have the overlaps and biases generated by random splits.
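The k-fold scheme above can be sketched in a few lines of Python (a minimal version; production ML frameworks provide more elaborate implementations, e.g. with stratification):

```python
import random

def k_fold_splits(data, k=5, seed=0):
    """Partition data into k disjoint folds; yield (train, test) pairs.

    Every data point lands in the test set exactly once and in the
    training set exactly k-1 times -- none of the overlap bias that
    repeated independent random splits can produce.
    """
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]   # k roughly equal groups
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# With 100 points and k=5, each round trains on 80 and tests on 20.
for train, test in k_fold_splits(list(range(100)), k=5):
    print(len(train), len(test))    # 80 20, five times
```

You would train one model per round and average the five test-error figures to get a more trustworthy estimate than any single split gives.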

Ultimately, these are steps you can take to create the best possible model. But, in the end, you’re not validating the model: at best, you’re validating your training and testing methodology. It’s up to the safety adjudicators to decide whether your model passes based on how you created it and the test error results. But there will still be no way to outright prove the model through the notion of coverage that we use when we validate circuits in the EDA world.


More info:

I consulted a number of references on this piece, but these two papers were particularly helpful.

“Challenges in Autonomous Vehicle Testing and Validation”, Philip Koopman & Michael Wagner, Carnegie Mellon University; Edge Case Research LLC

“How to Correctly Validate Machine Learning Models”, RapidMiner whitepaper
