In Britain a four-year-old boy was allowed to starve to death, and his body wasn’t found for two years. After his mother was sentenced to prison a few months ago, an inquiry was held into how the various local agencies and the police had dealt with the matter. I have no knowledge as to how competent the inquiry was, but when the report was published, it was violently attacked by the press and by government sources because no individual was blamed.
“So what has this to do with electronics?” I can hear you asking. Yesterday my colleague Jim Turley wrote about the problems with Toyota software. The trigger for the article was, you will recall, a court case in which a non-technical judge and jury ruled against Toyota on the grounds that the software was faulty. My concern isn’t whether the evidence on the quality of the software, prepared by the noted industry figure Michael Barr, was correct. Nor is it whether the jury was swayed by a more charismatic lawyer than Toyota could manage to find. My concern is that, as with the child tragedy, whenever an accident takes place today, there is a search for someone to blame. And the software industry, by the way it approaches system development, is potentially putting itself in the firing line for future judgments.
Many people of a certain age in Britain will remember a monologue, written in 1930, called The Lion and Albert. Albert Ramsbottom sees Wallace, a lion, dozing in a zoo and, in an attempt to get a reaction from Wallace, pushes his stick with a horse’s head handle (“the finest that Woolworth’s could sell”) into the lion’s ear. Not unreasonably, Wallace “swallowed the little lad… whole!” His mother’s reaction was that “someone’s got to be summonsed”, but at the court hearing the judge “gave his opinion, that no one was really to blame”. Today Mrs. Ramsbottom’s reaction is universal, and the judge’s verdict would be unacceptable.
The big problem with automatically reaching for a lawyer when something goes wrong is that, for complex systems, like cars, assigning blame does not help to discover what caused the problem and what can be done to stop it happening in the future.
The need to understand, rather than blame, has for many years been behind the way in which governments investigate transport accidents. A typical example of how this is done is the UK’s Rail Accident Investigation Branch. It is an independent agency that investigates every rail accident involving serious damage or loss of life, real or potential. Every one of its reports on “an occurrence” (to use its jargon) includes the rubric:
- The purpose of a Rail Accident Investigation Branch (RAIB) investigation is to improve railway safety by preventing future railway accidents or by mitigating their consequences. It is not the purpose of such an investigation to establish blame or liability.
- Accordingly, it is inappropriate that RAIB reports should be used to assign fault or blame, or determine liability, since neither the investigation nor the reporting process has been undertaken for that purpose.
Other investigations may take place alongside that of the RAIB to see if there are grounds for prosecution under Health and Safety legislation or criminal law. And if there is a death, a formal coroner’s inquest will take place.
The reports follow a standard structure, normally including a narrative of the occurrence, with detailed supporting evidence, and a comparison with other occurrences. Each report then identifies causes, broken into an immediate cause (“the condition, event or behaviour that directly resulted in the occurrence”) and causal factors: other factors, beyond the immediate cause, that contributed to the occurrence. The branch then makes recommendations, actions that would lessen or remove completely the chances of a similar accident occurring in the future, and may also include learning points, safety information not covered by the recommendations.
This whole structure is intended to inform the railway industry and its regulators as to what actually happened, without any emotion or blame. Similar reports are prepared by the Marine Accident Investigation Branch and the Air Accidents Investigation Branch.
In the US, the National Transportation Safety Board has a similar role, and equivalent bodies exist around the world.
This objective approach is increasingly coming up against software, and it is struggling to deal with it. The same thing is happening in other areas, such as medicine. The problem for investigators in these disciplines is that there is no single, objective, professional engineering approach to software development. There are indeed dozens of standards, but these are frequently domain-specific and vary in how they see safety as being achieved. Some concentrate on testing and review of software, others on a broad approach to the process of software development. Some try to define safety integrity levels (SILs), but they do not agree with each other on what a SIL is and use different metrics to evaluate what a SIL actually means. (A good review of the issues is a paper from the University of York, now about ten years old but sadly still horribly relevant: Software Safety: Why is there no Consensus?)
This lack of consensus is the sign of an immature industry. Safety guru Martyn Thomas has for many years compared software development with the practices of the established engineering professions: his argument, simplifying greatly, is that if a civil engineering project goes wrong, the profession as a whole gets together to try to solve the problem, but if a software project goes wrong, the whole thing is covered up. This means that subsequent projects don’t learn from the failure. To see that this is not an exaggeration, you have only to look at the Health Insurance Marketplace online application system, the Obamacare website. It appears to have suffered from being put together against a deadline and from not being thoroughly tested before going live, both of which are known to contribute to software failures.
What is your reaction when you see the phrase “government software projects”? Many people’s reaction is to run screaming to get as far away as possible, remembering the many failures, often on a massive and hugely expensive scale. But we know of these because they are public. We hear of massive software project failures in the commercial sector only when there is a lawsuit, and there is evidence that many software failures never come to light at all. Despite at least 40 years’ experience of software project failures, the same problems keep occurring. Insufficient initial specification, often caused by not involving those who will actually use the software and frequently leading to “mission creep”, is a classic problem. An insistence on designing from scratch rather than looking at how existing services could be used is another. Political pressure to get a system live against an inappropriate schedule is yet another. A Google search for “software project failures” produced 12,800,000 results, with discussion papers and news items spanning many years, all identifying the same contributory factors.
I have just looked at a British computer science degree course, one with extremely demanding entrance requirements, which boasts that it meets the academic requirements for chartered IT professional status. Less than 5% of the course is about “software engineering”. The implication is that the academic requirements for being a chartered IT professional do not include knowing how to run a project. A graduate of that course told me that writing code is a creative art and can’t be constrained.
There is a strong push in Britain for all school students to be “taught to code” so that Britain remains competitive in high technology. The same movement is happening in the US with code.org saying “Bringing computer science to every kid is the gift the tech industry needs to give back to America.”
What is seriously worrying is that these initiatives, backed by industry figures like Bill Gates and Mark Zuckerberg, don’t seem to involve any tools more sophisticated than a compiler or interpreter: if the code runs, then it is correct.
This all ignores the fact that code is only the way to implement the solution to a problem – it is not the way to solve the problem. If you look at any development methodology, the V model, waterfall or whatever, programming occurs only after a long process of analysis and design, and before a process of test and verification. And, I hate to say it, but Raspberry Pi and its competitors are just going to reinforce the idea that hacking together code is the solution to any problem.
Right on cue, as I was finalising this article, I received a press release about a Christmas gift: a package described as an ideal first step to encourage kids aged 8+ to explore the world of the Raspberry Pi and programming, with colourful step-by-step guides to help them at every stage. And the first project is a traffic light control system. Develop safety-critical embedded systems in your own bedroom.
This focus on code writing ignores the fact that there are many projects around the world that don’t have programmers. There are large projects that start by using modelling tools and then use code generators to create code that is correct by construction. Other projects, particularly in the embedded sector, use graphical programming languages such as LabVIEW for complex systems. Of course, underlying both of these approaches is code that eventually runs on a processor, but developing systems in this way does not require coders.
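To make the model-based idea a little more concrete, here is a deliberately tiny sketch, in Python, of the workflow that such tools automate at industrial scale: the behaviour (a traffic light, to stay with the press release’s example) is described as data, the model is checked for basic consistency, and only then is an executable controller derived from it. The MODEL table, the check_model and generate_controller functions, and the state names are all hypothetical illustrations of the principle, not the output or API of any real modelling tool.

```python
# Toy illustration of model-based development: the behaviour is a model,
# the model is checked, and the running code is derived from it.

# Hypothetical traffic-light model: each state has exactly one successor.
MODEL = {
    "states": ["red", "red_amber", "green", "amber"],
    "initial": "red",
    "transitions": {
        "red": "red_amber",
        "red_amber": "green",
        "green": "amber",
        "amber": "red",
    },
}

def check_model(model):
    """Reject malformed models before any controller is produced."""
    states = set(model["states"])
    assert model["initial"] in states, "initial state must be declared"
    for src, dst in model["transitions"].items():
        assert src in states and dst in states, f"undeclared state in {src}->{dst}"
    # Every declared state must have exactly one outgoing transition.
    assert set(model["transitions"]) == states, "every state needs a successor"

def generate_controller(model):
    """'Generate' the controller: a step function derived from the checked
    model rather than written by hand."""
    transitions = dict(model["transitions"])
    def step(state):
        return transitions[state]
    return step

if __name__ == "__main__":
    check_model(MODEL)
    step = generate_controller(MODEL)
    state = MODEL["initial"]
    for _ in range(6):  # walk one and a half cycles of the light
        print(state)
        state = step(state)
```

Even at this toy scale the point stands: the engineering effort goes into the model and the checks applied to it, and the code that finally runs is a by-product rather than the starting point.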
What is also worrying is that there is little or no control over who develops systems or writes code. Jack Ganssle pointed out in his newsletter that in Maryland you need a license to cut hair, but not to write the code for a nuclear reactor.
After the Therac-25 accidents in the 1980s, in which a radiation therapy machine killed some patients and seriously injured others through massive overdoses, you would think that any company in a related area would be hyper-cautious. Yet in January 2013 it was reported that a “software bug as a result of an upgrade” caused a patient to receive 13 times the normal amount of X-rays during a routine screening.
Many other reported issues with medical devices have resulted from poor interface design: again, an area where being able to write code is no help at all, yet programmers still go ahead and devise the interface. With all due respect to the many wonderful people I know who write code, their view of the system cannot be the same as that of the person who is going to use it.
Is it going to take an incident in which a poorly developed system fails with massive loss of life before governments create a regulatory authority with rules about how, and by whom, software should be developed? Or are the various organisations, both technical and professional, going to get their act together and create an environment in which there is a common understanding of what is needed to create safe systems?
Footnote: there are reports that the BART train system in the San Francisco Bay Area stopped working on Friday, November 22nd. A BART spokesman described the technical problem as the computer systems in central control not communicating properly with the track switches. In this case the system failed safe, as railway systems should. But what if it had not? And who would have been to blame?
Do you think that a blame culture gets in the way of solving technical issues? Or do you think that the adversarial nature of the Anglo-Saxon courts of the US and the UK is a fair way of deciding what has happened?