This article has been in production for some time. It was going to be so simple: chat to two of the leading pundits on system safety and pull together a quick piece of “compare and contrast.” Just to add to the timeliness, there has been a very genteel firefight over the role of the IEC 61508 standard on the leading system safety mailing list (http://www.cs.york.ac.uk/hise/sc_list.php), and, sadly, Air France flight 447 has disappeared, leading to intense speculation as to whether the cause was related to the fly-by-wire systems that the Airbus A330 (and other Airbus models) uses extensively.
However, the conversation with the experts and the subsequent reading had me reaching for my “I think you will find it’s a bit more complicated than that” t-shirt. In fact, in the words of the old joke, “If you want to get there, I wouldn’t start from here.”
Let’s start with bridges – no electronic content, just a well-understood technology for building structures that stay up and carry traffic. There are tens of thousands of them around the world, and most of them just sit there and do the job. Just occasionally a new bridge goes wrong in some way, and then there is an investigation. The investigation is not looking to assign blame and is not carried out behind closed doors; instead, the investigating team will include a range of professionals, possibly including competitors of the original development team, and the results are quickly put into the public domain so that all future bridge designers can learn from them.
Is this what happens when an electronics system goes wrong?
Martyn Thomas, who has been trying to get us to build safe software for several decades, draws a clear distinction between software development and engineering. In his view, while engineering learns from its mistakes, software developers are still making the same mistakes that were being made twenty or more years ago.
Engineering disciplines normally start as a craft. As the tasks they undertake become more complex, they begin to use mathematics and science to understand what is happening and to provide an underpinning of theory for the craft skills. Software development, except in some restricted areas, has not yet reached this phase. Instead, we have systems developed according to procedural rules (such as IEC 61508), and, provided the rules are followed and the software passes testing, the system is certified as “safe.”
Thomas feels strongly that merely following the process and then testing does not produce good software. In fact, he claims, there is evidence that there is no relationship between following development processes and software quality. Nor does testing guarantee software quality: studies show, for example, that MC/DC (Modified Condition/Decision Coverage) testing, mandated at the most critical level of DO-178B, the avionics software standard and one of the most rigorous in use, “does not significantly increase the probability of detecting any serious defects that remain in the software.”
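To give a flavour of what MC/DC actually demands (the guard and test values below are invented for illustration, not taken from any real avionics code): for a decision built from three conditions, MC/DC requires each condition to be shown, by a pair of tests that differ only in that condition, to independently affect the outcome – achievable here with four tests rather than all eight combinations.

```python
# Hypothetical guard with three conditions (invented for illustration).
def deploy_spoilers(weight_on_wheels: bool, wheel_speed_high: bool, armed: bool) -> bool:
    return armed and (weight_on_wheels or wheel_speed_high)

# A minimal MC/DC test set: four cases instead of the eight needed for
# exhaustive testing. Each adjacent pair differs in exactly one condition
# and in the outcome, showing that condition's independent effect.
tests = [
    ((True,  False, True),  True),    # baseline
    ((False, False, True),  False),   # weight_on_wheels shown to matter
    ((False, True,  True),  True),    # wheel_speed_high shown to matter
    ((False, True,  False), False),   # armed shown to matter
]

for args, expected in tests:
    assert deploy_spoilers(*args) == expected
print("MC/DC achieved with", len(tests), "tests")
```

Even in this toy example the limitation Thomas points to is visible: the criterion measures how thoroughly the decision’s structure has been exercised, but says nothing about whether the decision itself is the right one.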
In Thomas’s view, the task of the programmer is to identify what has to be done and, after writing the program, demonstrate that it fulfils the requirement. This can only be done unambiguously through formal methods: mathematically based techniques that demonstrate that the program as written matches the program as specified. There is a perception that formal methods are both expensive and cumbersome. (There is also strong resistance among programmers to anything that can be seen as constraining their creative skills.) But, more importantly, there are few programming languages that lend themselves to formal methods.
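The flavour of what it means for the program as written to match the program as specified can be sketched in a few lines (an invented Python illustration; industrial tools such as SPARK discharge these obligations by static proof rather than by runtime assertions):

```python
# Invented sketch of specification-as-contract. The asserts here only make
# the contract visible and executable; a formal tool would prove statically
# that they can never fail.
def integer_sqrt(n: int) -> int:
    """Spec: for n >= 0, return r with r*r <= n < (r+1)*(r+1)."""
    assert n >= 0                                # precondition
    r = 0
    while (r + 1) * (r + 1) <= n:                # loop invariant: r*r <= n
        r += 1
    assert r * r <= n < (r + 1) * (r + 1)        # postcondition
    return r

print(integer_sqrt(10))   # 3
```

The point is that the postcondition, not a set of test cases, defines what “correct” means; a prover that shows the assertion can never fail has demonstrated something no finite amount of testing can.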
Thomas sees the economics as the reason why, with certain exceptions, we still accept mediocre software and have not made the transition from craft to engineering. Moore’s law has provided the driver that gives end-users tools they didn’t have before. (The Internet and all its applications, including Google and email; digital photography and desktop editing; digital TV; cell phones – the list is endless.) In return, the users accept that there are occasional problems with these tools (network issues, sites crashing, PC applications hanging, cell phones dropping calls). Until users are no longer prepared to tolerate the results of poor software, there is no economic incentive for developers to improve their products.
It is going to take accidents to change attitudes. Accidents drive changes in engineering practice, and it will be an increase in accidents that moves software development as a whole towards being an engineering discipline.
Much of the work of Nancy Leveson, of MIT, is related to analysing accidents. Traditionally accident analyses look for a single cause. Leveson argues that, with modern systems, it is not possible to identify a single cause; there are multiple elements that interact, including human beings. Safety, or lack of it, is culturally defined, whether within an organisation or in wider society.
An example that didn’t come up when talking to Leveson, but has always intrigued me, is the acceptance of railway/railroad risk in the United States compared with Britain. The 52 miles of Caltrain track sees an average of 10-15 deaths a year – a death for roughly every 4 miles of track. At that rate, the 22,000 miles of British track, with a high proportion of trains running at over 100 miles an hour, would see around 5,500 deaths a year; instead, the total number of deaths, from all causes, is well under 300. Equally, the US has nearly three times as many car fatalities per thousand of population as Britain.
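The back-of-the-envelope arithmetic behind that comparison is simple enough to set out explicitly (using the rough figures quoted above, which are informal rather than official statistics):

```python
# Back-of-the-envelope scaling of the quoted figures (not official statistics).
caltrain_miles = 52
deaths_per_year = 12.5                                # midpoint of the quoted 10-15
miles_per_death = caltrain_miles / deaths_per_year    # roughly 4 miles per death
uk_track_miles = 22_000
print(round(uk_track_miles / miles_per_death))        # about 5,300; rounding to 4 miles per death gives the 5,500 above
```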
Leveson’s website has a vast range of resources on accidents and their contributory factors. One extreme example, again not primarily an electronics or software failure, was the accident at the Union Carbide plant in Bhopal, India, the worst industrial accident in history. Leveson ascribes it to a chain of decisions which meant that, had the accident not happened in December 1984, a similar one would have happened sooner or later. It is also clear that the effects of the accident were far worse than they needed to have been. It would be seriously misleading to blame the individual operator who made the mistake for the many thousands of deaths.
An accident that did have electronic and software issues was the Mars Polar Lander. It is generally accepted that the probable cause of the loss of the Lander in December 1999 was that the software prematurely shut down the descent engines when spurious signals from a touchdown sensor, generated by the shock of the landing legs deploying, were interpreted as the legs touching the Martian surface. If this was the case, every element of the software worked according to its specification, nothing failed, yet millions of dollars’ worth of Lander still disappeared.
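A heavily simplified sketch of that suspected interaction – the structure, names and altitudes below are invented for illustration, not taken from the actual flight software – shows how a system can be lost while every component meets its specification:

```python
# Invented, heavily simplified sketch of the suspected Mars Polar Lander
# interaction; names and altitudes are illustrative only.
CUTOFF_ENABLE_ALTITUDE_M = 40    # engine-cutoff logic is only armed near the surface

class TouchdownMonitor:
    """Spec: latch any touchdown indication from the leg sensors."""
    def __init__(self):
        self.touchdown_latched = False

    def sample(self, leg_sensor_signal: bool):
        if leg_sensor_signal:
            self.touchdown_latched = True

def descent_step(monitor: TouchdownMonitor, altitude_m: float,
                 leg_sensor_signal: bool, engines_on: bool) -> bool:
    """Spec: once below the enable altitude, cut the engines if touchdown is indicated."""
    monitor.sample(leg_sensor_signal)
    if altitude_m < CUTOFF_ENABLE_ALTITUDE_M and monitor.touchdown_latched:
        engines_on = False
    return engines_on

monitor = TouchdownMonitor()
engines_on = True
# Leg deployment high above the surface produces a spurious sensor pulse...
engines_on = descent_step(monitor, 1500.0, leg_sensor_signal=True, engines_on=engines_on)
# ...which is acted on as soon as the cutoff logic is enabled, still well above the ground.
engines_on = descent_step(monitor, 39.0, leg_sensor_signal=False, engines_on=engines_on)
print(engines_on)   # False: engines shut down prematurely
```

Each piece does exactly what its own specification asks: the monitor latches touchdown indications, and the cutoff logic acts on the latched flag once it is enabled. The failure lies in the interaction between the requirements, which is precisely where Leveson argues we should be looking.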
Leveson argues that we have to start looking at safety differently and move organisational and societal culture away from focusing on preventing failure and towards promoting safe behaviour. Organisations should concentrate on making it easier to do things right rather than on stopping people from doing things that are wrong. This requires buy-in from the very top of an organisation, and feedback loops to monitor that the systems are working.
Leveson comments that safety is expensive if you try to do it the wrong way, as we do today. And even where doing it right adds cost, accidents are far more expensive, so in the long term it is better to get it right in the first place.
This brings us back to Thomas. While formal methods of software development are already being deployed in some areas – large parts of aerospace, for example – he feels that we need an effort to build a new generation of programming languages and tools for creating safe systems. It should start with basic language definitions of syntax and semantics from which different flavours of a language amenable to formal methods could be derived. These would also be used to build the other software elements, such as operating systems, database engines, web servers and browsers, that are required for complete systems. Tony Hoare, a significant pioneer in many areas of software development, is now a Principal Researcher at Microsoft’s UK lab, where he is working on formal methods. He has suggested that such a project could be comparable to the Human Genome Project, in which scientists across the world cooperated openly, over more than a decade, to sequence the human genome. Thomas feels that the tool development should be open source, to provide the maximum possible trust. The money needed to kick-start the project would be in the low billions; an organisation like the European Union could provide the funding as an investment which would, over time, be amply repaid as more projects are completed successfully and prove reliable and safe in service.
Some years ago, I had to visit a number of advertising agencies across Europe to introduce a new set of corporate design rules. Advertising people, like programmers, are, often rightly, proud of their skills and regard themselves as creative free spirits. The reaction to the new design rules from the agencies I already held in the highest regard was generally “Just another set of constraints – we will work within them and produce good work.” And they did. One agency, however, threw its toys out of the pram, claiming that even these simple guidelines were an impossible limitation on its creative abilities. Its work was the least inspiring of any of the agencies.
Back to bridges. All bridges have to meet certain constraints, including the laws of physics. A great engineer can use these laws to create a bridge that is an object of wonder in its own right. Look, for example, at le Viaduc de Millau, an exciting bridge over the Tarn river in southern France. Equally, a good programmer should be able to work within the constraints of a formal approach to create a product that is elegant, yet carries out the functions needed.