November 27, 2013
Cars, Coding, and Carelessness
Sloppy Coding Practices Led to a Fatal Crash
You’ve probably heard by now about the lawsuit against Toyota regarding its electronic engine control. The jury found the automaker liable for writing fatally sloppy code, and, based on what the software forensics experts found, I’d have to agree.
This case is fundamentally different from the “unintended acceleration” fiasco that embroiled a certain German carmaker back in 1986. That scare was entirely bogus and made-up, and it was fueled by an ill-considered “60 Minutes” exposé that aired in the days when Americans watched only three TV channels. Sales of the affected cars plummeted, and it took more than two decades for the company to recover. An engineering spokesman for the carmaker told reporters, “I’m not saying that we can’t find the problem with the cars. I’m saying there is no problem with the cars.” He was dead right – there was no problem with the cars – but the remark was viewed as arrogant hubris, and it just made the situation worse.
In reality, a few drivers had simply been pressing the wrong pedal, which is a surprisingly common mistake. It happens all the time, in all types of cars. Naturally, nobody wants to admit that they just ran over the family cat (or worse, their own child) through momentary stupidity, so they blame the equipment. “I didn’t run over Fluffy. The damn car did it!”
Back then, throttle controls were mechanical. There was a direct mechanical connection (usually a cable) from the gas pedal to the carburetors or fuel-injection system of the car. Unless gremlins got under the hood (no AMC jokes, please), there wasn’t much chance of that system going wrong.
Now cars’ throttles are mostly electronic, not mechanical, and the “drive by wire” system has come under new scrutiny. Unlike a basic steel cable, there are a whole lot of things that can go wrong between the sensor under the gas pedal and the actuator in the fuel injector. Any number of microcontrollers get their grubby mitts on that signal, or the connection itself could go bad. It’s just an embedded real-time system, after all, with all the pros and cons that that implies.
Reset to today. After a years-long legal battle involving an Oklahoma driver whose passenger was killed when their car accelerated suddenly and unexpectedly, the courts ruled in favor of the plaintiff. In other words, the car was defective and its maker, Toyota, was found to be liable.
There was no smoking gun in this case; no dramatically buggy subroutine that caused the fatal crash. Instead, there’s only supposition. A careful examination of the car’s firmware showed that it could have failed in the way described in the case, not necessarily that it did fail. That was enough to convince the jury and to cost the carmaker at least $3 million.
For embedded programmers, the case was both enlightening and cautionary. For years, experts pored over Toyota’s firmware, and what they found was not comforting. Legal cases often bring out dirty laundry, the things we casually accept every day but would rather leave covered or private. In a liability case, privacy is not an option. Every single bit (literally) of Toyota’s code was scrutinized, along with the team’s programming practices. And the final conclusion was: they got sloppy.
It’s not that Toyota’s code was bad, necessarily. It just wasn’t very good. The software team repeatedly hacked their way around safety standards and ignored their own in-house rules. Yes, there were bugs – there will always be bugs. But is that okay in a safety-critical device? It’s nice for novices to say that there should never be bugs in such an important system; that we should never ship a product like a car or a pacemaker until it’s proven to be 100% bug-free. But, in reality, that means the product will never ship. Is that really what we want? If it’s going to be my car or pacemaker, yes. If it’s the car or pacemaker I’m designing… maybe that’s too high a bar. But there is some minimum level of quality and reliability that we as customers have a right to expect.
Toyota’s developers used MISRA-C and the OSEK operating system, both good choices for a safety-critical real-time system. But then they ignored, sidestepped, or circumvented many of the very safety features those standards are designed to enforce. For example, MISRA-C has 93 required coding rules and 34 advisory rules; Toyota adopted only 11 of those rules, and its code still violated five of them. Oh, and they ignored error codes returned by the operating system. You can’t trust a smoke alarm if you remove the battery every time it beeps.
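To make the smoke-alarm point concrete, here is a minimal sketch in C of what acting on a status code looks like. OSEK services report success or failure through a status return value, but everything named below (`os_status_t`, `os_activate_task`, `start_task_checked`, the fault counter, and the safe-state flag) is a hypothetical stand-in invented for illustration. It is not Toyota’s code and not the real OSEK API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status codes; a real RTOS defines its own. */
typedef enum { OS_STATUS_OK = 0, OS_STATUS_BAD_ID = 1 } os_status_t;

#define TASK_COUNT 4u  /* assumed number of tasks in this toy system */

static uint32_t fault_count_value = 0u;
static bool safe_state_flag = false;

/* Stand-in for an RTOS service: fails for task IDs that do not exist. */
os_status_t os_activate_task(uint32_t task_id)
{
    return (task_id < TASK_COUNT) ? OS_STATUS_OK : OS_STATUS_BAD_ID;
}

uint32_t fault_count(void) { return fault_count_value; }
bool in_safe_state(void) { return safe_state_flag; }

/* Activate a task and act on the result instead of discarding it. */
bool start_task_checked(uint32_t task_id)
{
    os_status_t status = os_activate_task(task_id);

    if (status != OS_STATUS_OK) {
        fault_count_value++;      /* record the failure for diagnostics */
        safe_state_flag = true;   /* e.g. fall back to a limp-home mode */
        return false;
    }
    return true;
}
```

The design choice worth noticing is that the caller cannot quietly drop the result: a failure is counted and pushes the system toward a known safe state. What that safe state should be is a system-level decision that this sketch does not try to answer.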
Stack overflows got close scrutiny, because they’re the cause of many a malfunctioning system. Contrary to the developers’ claims that less than half of the allocated stack space was being used, the code analysis showed it was closer to 94%. That’s not a grievous failure in and of itself, but the developers wrote recursive code in direct violation of MISRA-C rules, and recursion, of course, eats stack space. To make matters worse, the Renesas V850 microcontroller they used has no MMU, and thus no hardware mechanism to trap or contain stack overflows.
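One common way to measure stack usage rather than guess at it is to “paint” the stack with a known byte pattern at startup and later check how much of the pattern has been overwritten. The sketch below is only an illustration of that idea, and not necessarily how the experts in this case arrived at their figure. It assumes a descending stack (one that grows from the top of the region toward index 0) and simulates the stack region with a plain byte array; the function names and the fill value are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5u  /* arbitrary, recognizable fill pattern */

/* Fill the whole stack region with the pattern before the task runs. */
void stack_paint(uint8_t *region, size_t size)
{
    for (size_t i = 0u; i < size; i++) {
        region[i] = STACK_FILL;
    }
}

/* Count the bytes that have been overwritten at some point.
 * Assumes a descending stack: the untouched bytes sit at the low end. */
size_t stack_high_water(const uint8_t *region, size_t size)
{
    size_t untouched = 0u;

    while ((untouched < size) && (region[untouched] == STACK_FILL)) {
        untouched++;
    }
    return size - untouched;
}

/* Peak usage as a whole-number percentage of the region. */
unsigned stack_percent_used(const uint8_t *region, size_t size)
{
    if (size == 0u) {
        return 0u;
    }
    return (unsigned)((stack_high_water(region, size) * 100u) / size);
}
```

A watermark like this only reports the deepest point reached so far during testing, and it can undercount slightly if the deepest stack data happens to match the fill byte. It says nothing about the worst case that recursion might reach on a bad day, which is exactly why recursion and a nearly full stack are an uncomfortable combination.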
OSEK is common in automotive systems, almost a de facto standard. It’s portable, it’s widely available, and it’s designed to work on a variety of processors, including ones without an MMU. But because it’s a safety-critical software component, each OSEK implementation must be certified. How else can you tell a good and compliant OSEK implementation from a bad one? Toyota used a bad one. Or, at least, an uncertified one.
Structured-programming aficionados will cringe to learn that Toyota’s engine-control code had more than 11,000 global variables. Eleven thousand. Code analysis also revealed a rat’s nest of complex, untestable, and unmaintainable functions. On a cyclomatic-complexity scale, a rating of 10 is considered workable code, with 15 being the upper limit for some exceptional cases. Toyota’s code had dozens upon dozens of functions that rated higher than 50. Tellingly, the throttle-angle sensor function scored more than 100, making it completely and utterly untestable.
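For readers who haven’t met the metric: for an ordinary structured function, cyclomatic complexity is the number of decision points plus one, which is also the minimum number of test cases needed to exercise every independent path. The toy example below shows how the count climbs with each branch and how a lookup table can bring it back down. The fault codes and severity levels are made up for illustration and have nothing to do with Toyota’s firmware.

```c
/* Four decision points, so cyclomatic complexity is 4 + 1 = 5. */
int severity_branchy(int code)
{
    if (code == 0) { return 0; }
    if (code == 1) { return 1; }
    if (code == 2) { return 1; }
    if (code == 3) { return 2; }
    return 3;  /* unknown codes are treated as most severe */
}

/* Same mapping with a single range check. Tools score this as 2, or 3
 * if they count each condition joined by && separately. */
int severity_table(int code)
{
    static const int table[4] = { 0, 1, 1, 2 };

    if ((code >= 0) && (code < 4)) {
        return table[code];
    }
    return 3;  /* unknown codes are treated as most severe */
}
```

Every new case adds another path to the first version, while the second keeps the same score no matter how long the table grows. Scale the first style up to a function that scores over 100 and it becomes clear why nobody could claim to have tested every path through it.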
Although the Toyota system technically had watchdog timers, they were trivially simple fail-safes in name only. The list goes on and on, but it’s a familiar litany for anyone working in software development. We know better, we’re embarrassed by it, but we do it anyway. Right up until we get caught, and Toyota’s programmers got caught. And people died.
All the basics were there. As far as the legal and code experts could determine, the engine-control system would have worked if more of the safety, reliability, and code-quality features had been observed. And, obviously, the car does work most of the time. It’s not noticeably faulty code. And that’s the problem: it appears to work, even after millions of hours of real-world testing. But those lurking bugs are always there, allowed to creep in through cavalier attitudes about code hygiene, software rules, standards, and testing. Other, more conscientious developers did the hard work of creating MISRA-C, OSEK, and good coding practices. All we have to do is actually follow the rules.
Posted on November 27, 2013 at 9:46 AM
"I think you will find it is more complex than that"
Interestingly, at least two of the accidents with Toyota involved very senior drivers, and it could not be proved that they had not stamped on the wrong pedal.
The code was sloppy, but the expert witness was not able to prove that it caused the accident, just that it might have done.
Toyota appears to have taken an economic decision to pay up rather than appeal.
Are courts the right venue for disentangling these events? I don't think so, and tomorrow I will be discussing this, so watch this space.