“It’s practically impossible to look at a penguin and feel angry.” – Joe Moore
The headline was straight out of click-bait heaven. “Hungry penguins keep self-driving cars safe from hackers.” It had everything: cute animals, cyber-threats, and scary autonomous vehicles. Throw in a few UFOs and a long-dead mobster and you’d have the perfect Facebook meme.
The reality proved somewhat less sensationalistic, though more useful to actual programmers. A university research team led by Prof. Yiannis Papadopoulos had been exploring ways to make safety-critical software safer and more reliable, and they had found a useful example in… penguins.
Here’s the deal. Safety-critical code is often developed, designed, and regulated under strict guidelines that (we hope) help to prove the code’s reliability. We want to make sure that all possible failures have been anticipated and accounted for. That’s hard enough when you’re banging out “Hello, world!,” but it’s darn near impossible when you’ve got dozens of programmers working on millions of lines of code, all running on different pieces of hardware. How can you possibly prove that everything will work as expected? And, even if it does, how can you ensure that your system fails gracefully in the event of a component failure?
These aren’t hypothetical problems. Programmers working on cars and aircraft and other safety-critical systems deal with them all the time. Thus, we have safety frameworks like ISO 26262 and others. They don’t tell us exactly how to write our code, but they do help to cut the problem into bite-sized pieces so that we can sensibly allocate risk. The standards allow us to ask questions like, “If this piece over here fails, how bad is the outcome?” or “How many things have to fail simultaneously for the system to crash?”
The concept is simple enough – divide and conquer – but the implementation can be devilishly difficult. It doesn’t take a very big program before the number of potential failure points (i.e., software subroutines) skyrockets. Exploring each and every potential weak spot becomes a challenge all by itself – never mind fixing the problems your analysis uncovers.
ISO 26262 and other standards employ the concept of safety integrity levels (SILs). Basically, every component of your system (either a software “component” or a physical hardware component) is assigned a level of importance, from SIL 1 through SIL 4. The failure of an SIL 1 component would be a nuisance, but not catastrophic, while an SIL 4 failure would be life-threatening. Again, simple enough, at least in concept.
But that’s just for single-point failures. Plenty of real-world problems arise only when two or more components fail simultaneously. It’s the kind of “Come on; you’ve got to be kidding!” failure that we often can’t predict. You get a divide-by-zero error at exactly midnight when the power supply glitches and a black cat breaks a mirror under a ladder. That kind of once-in-a-lifetime failure that’s impossible to anticipate. Now the problem gets thorny.
Assigning the risk and predicting the relative likelihood of multiple-point failures is tricky, but, again, ISO 26262 comes to our rescue. It defines an algebraic method of combining the SILs from multiple components to come up with a composite SIL for interrelated groups. Is a failure of Component A more serious than simultaneous failures of Components B and C? And which is more likely to happen? Now you’ve got some calculus on your hands.
Normally, you’d do an exhaustive and recursive analysis through all the branches of the fault tree. It’s just math, right? We have computers for that. Okay, but that just gives you a depressingly large number that enumerates your problems. What you really want to know is how to minimize that number. Or, better yet, how to partition your system in a way that reduces the potential sources of failure. Does combining these two subsystems result in a safer system overall? And if so, is it worth the effort to do that? Decisions, decisions…
Cue the penguins.
A research team at the University of Hull (UK) and the University of Mohammed Cherif Messaadia (Algeria) wanted to find a way to reduce the bone-crushing complexity of searching the entire fault tree looking for optimal solutions. Real-world developers have given up on simple recursive traversals, since those methods don’t scale well with large numbers of nodes. Instead, so-called “meta-heuristic” optimizations like Genetic Algorithm and Tabu Search have cut analysis times by an order of magnitude. But maybe there’s an even better way?
Turns out, penguin colonies do much the same thing all the time. Penguins dive into the water in groups, searching for fish to eat. Maybe they’ll find fish; maybe they won’t. But, as a group, they’re quite efficient at either finding fish or moving on to better hunting grounds. They abandon fish-free areas quickly without wasting a lot of time before moving on to more abundant waters.
Individual penguins are also good at communicating their success (or failure) to other penguins, which helps to rapidly diffuse knowledge to the group. These characteristics make a good template for a weighted risk-assessment algorithm. No potential faults here? Move on quickly. Lots of potential faults over here? Keep digging, at least for a while, searching for even better “fish.”
Moreover, the penguins’ hunting strategy is qualitative, which maps nicely onto the SILs and the analysis of alternatives. The quality of the fishing isn’t binary; some areas have more fish than others, but some are also harder to exploit than others. If the fish are particularly deep or fast-swimming, the penguins have to expend more precious oxygen to catch them. Less fecund waters might be better for the penguins if the fish there are easier to catch. Similarly, when two or more approaches to system partitioning produce equally safe and reliable solutions, one might be cheaper/easier to implement because it involves fewer components.
The resulting algorithm is called Penguins Search Optimization Algorithm (PeSOA).
To test their hypothesis, the research team used an already-available model of an electric car’s hybrid braking system to compare their simulated penguins to popular alternative methods.
(It’s “hybrid” because the simulated car combines conventional mechanical (friction) brakes with electric motors. Normally, the motors power the car, but, when the car is braking, the motors reverse their function, becoming generators that create mechanical resistance (drag) on the wheels, while also conveniently generating a bit of electricity that can be put back into the car’s battery. Why use friction brakes at all and not just the motor/generators exclusively? Two reasons. First, the generators are nearly useless if the battery is already fully charged; and second, the electrical drag isn’t enough to stop a heavy vehicle from high speed. Electric braking is fine for gradual deceleration but not for panic stops.)
This was a test of simple brake function, not exotic auto-navigation or collision avoidance. You just press the brake pedal, the car slows. Simple, right?
Not at all. The researchers found that in the simulated braking system that they and previous researchers had used, there were 60 potential component failures that could cause a fault, most of them serious. Those might represent broken wires, loose connectors, faulty semiconductors, buggy code, and so on.
Furthermore, a fault-tree analysis revealed 11,573 “cut sets,” meaning the number of different ways that two or more components could fail simultaneously to create a fault, even when a single failure wouldn’t.
The bad news? The number of potential re-factorings (that is, the number of different ways the system could be partitioned to deliver the same functionality) meant that something in the neighborhood of 1041 different solutions had to be explored. A recursive search was clearly not feasible.
The good news? The penguins triumphed. The table below shows how the four different algorithms fared using four different weighting methods, measured in relative CPU execution time.
PeSOA was generally about an order of magnitude faster than either the GA or TS algorithms, which were, in turn far faster than a simple recursive minimum-cost search. Each row represents a different way of weighting equivalent solutions. For example, are two SIL 1 components better or worse than a single SIL 2 component, or is one SIL 3 equal to an SIL 1 plus an SIL 2? Different designers will apply different weights to these decisions based on local requirements.
Significantly, all three advanced algorithms (PeSOA, GA, and TS) did find optimal solutions. That is, you would end up with an unequivocal answer to the question of, “What is the best way to partition this complex system?” And you won’t have to go diving in icy water to find it. That ought to be worth a few fish.