
When Lives Depend on Your Machine-Learned Model

How Do You Prove It’s Correct?

“How to validate training data is an open question that might be addressed by some combination of characterizing the data as well as the data generation or data collection process.”
‘Challenges in Autonomous Vehicle Testing and Validation’, Koopman and Wagner, 2016

Last summer, we published a piece on safety-critical capabilities in EDA, but we left an unanswered question at the end: how do you validate a model created by machine learning (ML)? It wasn’t a trivial topic at the time, so we postponed covering that aspect. Today, we take it back up. And, as you can see from the quote above, it’s still not a trivial topic.

Why Is This Hard?

There are a number of considerations that make this a vexing subject. After all, you’re trying to verify that some model is a good (or at least good-enough) representation of some reality. But that model is created by training it with some data set. Change the data – even by replacing good data points with different good data points – and the model will change. Change the order of the data points during training? The model will likely also change. Maybe not in effect, but in the dirty details.
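To make that concrete, here’s a minimal sketch (assuming Python with NumPy and scikit-learn, and purely synthetic data) that trains the same kind of classifier twice on the same 100 points, presented in two different orders. The two models may behave much the same, but the learned weights generally won’t match.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # 100 synthetic samples, 5 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # a simple ground-truth rule

order_a = rng.permutation(100)                 # one presentation order
order_b = rng.permutation(100)                 # a different presentation order

# shuffle=False so the presentation order is the only thing that differs
model_a = SGDClassifier(shuffle=False, random_state=0).fit(X[order_a], y[order_a])
model_b = SGDClassifier(shuffle=False, random_state=0).fit(X[order_b], y[order_b])

print("same accuracy on all 100 points?", model_a.score(X, y) == model_b.score(X, y))
print("same learned weights?", np.allclose(model_a.coef_, model_b.coef_))  # usually False
```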

So, from a given bunch of data, you could get an enormous number of different models – all valid (at least with respect to the training data). Would they all perform the same when tested against data points that weren’t in the training set? That’s not clear. But let’s come back to the “how to do this” shortly.

Overfitting

Remember design of experiments (DoE)? It’s important to randomize any variables that you’re not testing for. Otherwise your experiments might show false correlations to surprising influences – ones that would disappear with a broader data set.

Well, a similar thing can happen with machine learning. If you’re trying to teach a system to recognize pedestrians, then hair color shouldn’t matter – it shouldn’t be part of the model. But if all the training samples happen to have brunette hair, then that might get built into the model. It effectively becomes a false correlation.

The problem is that overfitting is hard to test for. If there are false correlations, you might not know – unless the testing data set happens to have lots of blond-haired people that don’t get identified as pedestrians. You can’t always count on something that convenient.

The solution – just like with DoE – is to randomize hair color (and anything else that shouldn’t matter). Now… what if you think it shouldn’t matter but, in fact, it does? Then, even by randomizing, that relationship will show up in the model. For example, if all hair colors are included but it turns out that results significantly favor those with brunette hair? Then you might get a model similar to the brunette-only data set – except that this time it’s not a false correlation.
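One practical precaution – shown here only as a sketch, with hypothetical field names and a hypothetical threshold – is to run a sanity check on the training set before training, flagging any should-not-matter attribute whose values are badly imbalanced:

```python
from collections import Counter

def check_nuisance_balance(samples, attribute, max_share=0.5):
    """Warn if any single value of a should-not-matter attribute dominates the data."""
    counts = Counter(sample[attribute] for sample in samples)
    total = sum(counts.values())
    for value, count in counts.items():
        share = count / total
        if share > max_share:
            print(f"WARNING: {attribute}={value!r} covers {share:.0%} of the training set")
    return counts

# Hypothetical training metadata; a real set would have many thousands of samples.
training_samples = [
    {"image": "ped_001.png", "hair_color": "brunette"},
    {"image": "ped_002.png", "hair_color": "brunette"},
    {"image": "ped_003.png", "hair_color": "blond"},
]
check_nuisance_balance(training_samples, "hair_color")
```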

Edge Cases

In the EDA world, we speak often of “corner cases.” The reason they’re a “corner” is that they involve multiple variables intersecting in some far-flung corner of the dataspace. In the ML world, they speak of “edge cases.” It’s an edge instead of a corner because it involves only one variable.

The challenge here is finding the edges. In the hair color example, what are the edges? Jet black and white? We might think of them as such, but will the model play out that way? You could synthesize or simulate various data points for training, but you run the risk of working any artifacts from the synthesis or simulation engine into the model – overfitting some unintended algorithmic predilections.
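If you do go looking for edges, one simple probe – sketched below with hypothetical feature names, ranges, and a stand-in model – is to sweep a single variable to its extremes while holding everything else at nominal values, and watch for degraded outputs:

```python
import numpy as np

# Hypothetical nominal operating point and the variable we suspect has edges.
NOMINAL = {"height_m": 1.7, "speed_mps": 1.4, "contrast": 0.5}
SWEEP_VARIABLE, SWEEP_VALUES = "contrast", np.linspace(0.0, 1.0, 11)

def probe_edge(predict_fn, nominal, name, values):
    """Sweep one variable across its range, holding the others at nominal."""
    results = []
    for v in values:
        sample = dict(nominal, **{name: v})
        results.append((v, predict_fn(sample)))
    return results

# A stand-in model for illustration only; a real probe would call the trained network.
def dummy_pedestrian_score(sample):
    return 1.0 - abs(sample["contrast"] - 0.5)   # pretends extreme contrast hurts

for value, score in probe_edge(dummy_pedestrian_score, NOMINAL, SWEEP_VARIABLE, SWEEP_VALUES):
    print(f"{SWEEP_VARIABLE}={value:.1f} -> score {score:.2f}")
```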

Reading the Resulting Model

One final obvious issue that affects everything, including the prior two issues, is the fact that the generated models aren’t legible or human-understandable. They consist of a number of weights in a matrix, or some other such abstracted form. You can’t go in and tease out insights by looking at the model. You can’t see any overfitting, and you can’t figure out where edges are or whether they’ve been correctly incorporated into the model. It’s a model by and for machines, not by or for humans.

So What to Do?

Given that ML models aren’t really testable in the manner we’re used to, there are a couple of ways to handle it: accommodate the uncertainty, or try to reduce it – to the satisfaction of some adjudicator in the safety-critical world.

In the first case, you assume that your model is inherently untestable to the levels needed for high-ASIL or other safety-critical designs. So, instead of losing sleep over what might seem like an unsolvable problem, you build a cross-check into the design. You design a high-ASIL monitor that looks for any out-of-bounds results that the ML model might generate. You haven’t proven the model in that case; you’ve just made arrangements for its failures.
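As a rough illustration of that cross-check idea (the names, interface, and limits below are hypothetical, not drawn from any particular standard or product), the monitor can be a simple, separately qualified bounds check that overrides the ML output with a safe fallback:

```python
# Hypothetical hard envelope for a steering command, in radians.
SAFE_STEERING_RANGE = (-0.25, 0.25)
FALLBACK_COMMAND = 0.0   # e.g., hold the current heading

def log_violation(value):
    print(f"out-of-bounds ML output {value:.3f}; using fallback")

def monitored_steering(ml_model, sensor_frame):
    """Pass the ML model's output through a simple high-integrity bounds check."""
    proposed = ml_model.predict(sensor_frame)   # untrusted ML output
    low, high = SAFE_STEERING_RANGE
    if not (low <= proposed <= high):
        # We haven't proven the model; we've just arranged for its failures.
        log_violation(proposed)
        return FALLBACK_COMMAND
    return proposed
```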
You can improve the model’s performance, however, by taking great care with the training data and how you test the model. But doing this means distinguishing between two types of error: training error and testing error.

Training error is determined by taking the exact same data used to train the model and running it back through the model. You might think that’s dumb: if you trained with the data, then the model should always give the correct results, right? The thing is, training is an imperfect process: the model has limited capacity, and real data can be noisy, so the training engine can’t always fit every point exactly. It’s possible, then, to train with a dataset and still have the model get some of those very points wrong. So the training error is really testing the quality of training. Which I guess would be why they call it training error.

Testing error results from training on one set of data and testing with a completely different set of data. Now you’re testing the model’s ability to work with data points that it hasn’t seen before. Given a total of 100 data points (you’d typically need many, many more than this), you might pick 70 for training, keeping the remaining 30 as the test points.
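Here’s a minimal sketch of both measures using scikit-learn, with synthetic data standing in for the real thing (and, again, far fewer points than a real safety case would demand):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))            # 100 synthetic data points
y = (X[:, 0] > 0).astype(int)            # a simple ground-truth rule

# 70 points for training, the remaining 30 held back for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

train_error = 1 - model.score(X_train, y_train)   # error on data the model has seen
test_error = 1 - model.score(X_test, y_test)      # error on data it has never seen
print(f"training error: {train_error:.1%}   testing error: {test_error:.1%}")
```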

Sounds simple, but how do you make that split? What if you do a random split but, quite by chance, the training set ends up with all brunette hair and the test set has all blond hair? That model will overfit for hair color – that’s the bad news. The good news in this case is that the test set will catch it. But it still means you have a faulty model.

You could also do a number of random splits and average the results. But, here again, you could end up with unintended artifacts. What if 20 of the data points randomly ended up always in the training set and never in the testing set? Those 20 points will influence the model more than other data points – like ones that might always happen to end up in test sets.

The solution to this is called k-fold cross-validation. k refers to the number of groups into which you split your data. A typical number might be 5. In that case, you have 5 subsets of your data, and you create 5 models. For each model, you use four of the subsets for training and the remaining one for testing. You do that 5 times, each time switching which subset is used for testing. In that manner, every data point is used for testing exactly once and for training exactly four times. You don’t get the overlaps and biases generated by random splits.
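A minimal 5-fold sketch, again using scikit-learn with synthetic data, might look like the following. Note that every data point lands in a test fold exactly once:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

fold_errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Train on four folds, test on the fifth; repeat until every fold has been the test fold.
    model = DecisionTreeClassifier(max_depth=3).fit(X[train_idx], y[train_idx])
    fold_errors.append(1 - model.score(X[test_idx], y[test_idx]))

print("per-fold test error:", [f"{e:.1%}" for e in fold_errors])
print(f"mean test error: {np.mean(fold_errors):.1%}")
```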

Ultimately, these are steps you can take to create the best possible model. But, in the end, you’re not validating the model: at best, you’re validating your training and testing methodology. It’s up to the safety adjudicators to decide whether your model passes based on how you created it and the test error results. But there will still be no way to outright prove the model through the notion of coverage that we use when we validate circuits in the EDA world.

 

More info:

I consulted a number of references on this piece, but these two papers were particularly helpful.

“Challenges in Autonomous Vehicle Testing and Validation”, Philip Koopman & Michael Wagner, Carnegie Mellon University; Edge Case Research LLC

“How to Correctly Validate Machine Learning Models”, RapidMiner whitepaper (with embedded PDF)
