“How to validate training data is an open question that might be addressed by some combination of characterizing the data as well as the data generation or data collection process.”
‘Challenges in Autonomous Vehicle Testing and Validation’, Koopman and Wagner, 2016
Last summer, we published a piece on safety-critical capabilities in EDA. But we left an open, unanswered question at the end: how do you validate a model created by machine learning (ML)? It wasn’t a trivial topic at the time, and so we postponed covering that aspect. So today, we take it back up. And, as you can see from the quote above, it’s still not a trivial topic.
Why Is This Hard?
There are a number of considerations that make this a vexing subject. After all, you’re trying to verify that some model is a good (or at least a good-enough) representation of some reality. But we do that by training the model with some data set. Change the data – even by replacing good data points with different good data points – and the model will change. Change the order of the data points during training? The model will likely also change. Maybe not in effect, but in the dirty details.
So, from a given bunch of data, you could get an enormous number of different models – all valid (at least with respect to the training data). Would they all perform the same when tested against data points that weren’t in the training set? That’s not clear. But let’s come back to the “how to do this” shortly.
Remember design of experiments (DoE)? It’s important to randomize any variables that you’re not testing for. Otherwise your experiments might show false correlations to surprising influences – ones that would disappear with a broader data set.
Well, a similar thing can happen with machine learning. If you’re trying to teach a system to recognize pedestrians, then hair color shouldn’t matter – it shouldn’t be part of the model. But if all the training samples happen to have brunette hair, then that might get built into the model. It effectively becomes a false correlation.
The problem is that overfitting is hard to test for. If there are false correlations, you might not know – unless the testing data set happens to have lots of blond-haired people that don’t get identified as pedestrians. You can’t always count on something that convenient.
The solution – just like with DoE – is to randomize hair color (and anything else that shouldn’t matter). Now… what if you think it shouldn’t matter but, in fact, it does? Then, even by randomizing, that relationship will show up in the model. For example, if all hair colors are included but it turns out that results significantly favor those with brunette hair? Then you might get a model similar to the brunette-only data set – except that this time it’s not a false correlation.
In the EDA world, we speak often of “corner cases.” The reason they’re a “corner” is that they involve multiple variables intersecting in some far-flung corner of the dataspace. In the ML world, they speak of “edge cases.” It’s an edge instead of a corner because it involves only one variable.
The challenge here is finding the edges. In the hair color example, what are the edges? Jet black and white? We might think of them as such, but will the model play out that way? You could synthesize or simulate various data points for training, but you run the risk of working any artifacts from the synthesis or simulation engine into the model – overfitting some unintended algorithmic predilections.
Reading the Resulting Model
One final obvious issue that affects everything, including the prior two issues, is the fact that the generated models aren’t legible or human-understandable. They consist of a number of weights in a matrix, or some other such abstracted form. You can’t go in and tease out insights by looking at the model. You can’t see any overfitting, and you can’t figure out where edges are or whether they’ve been correctly incorporated into the model. It’s a model by and for machines, not by or for humans.
So What to Do?
Given that ML models aren’t really testable in the manner we’re used to, there are a couple of ways to handle it: accommodate the uncertainty or try to reduce the uncertainty – to the satisfaction of some adjudicator in the safety-critical world..
In the first case, you assume that your model is inherently untestable to the levels needed for high-ASIL or other safety-critical designs. So, instead of losing sleep over what might seem like an unsolvable problem, you built a cross-check into the design. You design a high-ASIL monitor that looks for any out-of-bounds results that the ML model might generate. You haven’t proven the model in that case; you’ve just made arrangements for its failures.
You can improve the model’s performance, however, by taking great care with the training data and how you test the model. But doing this means distinguishing between two types of error: training error and testing error.
Training error is determined by taking the exact same data used to train the model and running it through the model. You might think that’s dumb: if you trained with the data, then the model should always give the correct results, right? The thing is, training engines are software, and they aren’t perfect. So it’s possible to train with a dataset and yet have that model fail in some of those cases of the dataset. So the training error is really testing the quality of training. Which I guess would be why they call it training error.
Testing error results from training on one set of data and testing with a completely different set of data. Now you’re testing the model’s ability to work with data points that it hasn’t seen before. Given a total of 100 data points (you’d typically need many, many more than this), you might pick 70 for training, keeping the remaining 30 as the test points.
Sounds simple, but how do you make that split? What if you do a random split but, quite by chance, the training set ends up with all brunette hair and the test set has all blond hair? That model will overfit for hair color – that’s the bad news. The good news in this case is that the test set will catch it. But it still means you have a faulty model.
You could also do a number of random splits and average the results. But, here again, you could end up with unintended artifacts. What if 20 of the data points randomly ended up always in the training set and never in the testing set? Those 20 points will influence the model more than other data points – like ones that might always happen to end up in test sets.
The solution to this is called k-fold cross-validation. k refers to the number of groups into which you will split your data. A typical number might be 5. In that case, you have 5 subsets of your data, and you create 5 models. For each model created, you use four of the subsets to train the model and then the remaining 1 for testing. You do that 5 times, each time switching which data set is used for testing. In that manner, every data point is used for testing exactly once and training exactly four times. You don’t have the overlaps and biases generated by random splits.
Ultimately, these are steps you can take to create the best possible model. But, in the end, you’re not validating the model: at best, you’re validating your training and testing methodology. It’s up to the safety adjudicators to decide whether your model passes based on how you created it and the test error results. But there will still be no way to outright prove the model through the notion of coverage that we use when we validate circuits in the EDA world.
I consulted a number of references on this piece, but these two papers were particularly helpful.
“Challenges in Autonomous Vehicle Testing and Validation”, Philip Koopman & Michael Wagner, Carnegie Mellon University; Edge Case Research LLC
“How to Correctly Validate Machine Learning Models”, RapidMiner whitepaper (with embedded PDF)