
Biting Bugs Back

System Level Simulation Speeds Software Debugging

In 1949, there was only one processor, and it was in a computer laboratory. Today, processors are ubiquitous; they are found in cars, phones, planes, satellites, routers, phone switches, toys, cameras, refrigerators, and almost everything else. Inside those processors is an exploding quantity of software that is breaking the existing debugging methodologies. Late last year, Toyota recalled its Prius hybrid cars due to a software bug. Satellites are lost due to software bugs. Even heart pacemakers fail due to software bugs. In 2002, the US National Institute of Standards and Technology estimated that software quality problems were costing the US economy $60B annually. It is clearly time to do some things differently.

A New Development Landscape

The design process of complex electronic systems is going through a major change, as an ever-increasing fraction of embedded electronics systems are being implemented in software. Developers are consequently faced with new challenges to debug more complex software without compromising quality or schedule. The impending arrival of multicore embedded processors further complicates the debugging process.

The old methodologies, based around debugging and testing the software on hardware, have been unchanged for decades and are no longer adequate. New solutions are required to meet these challenges, based around moving the entire development process into software on every engineer’s desktop workstation. The key underlying technology for this is full-system simulation. Instead of running the software on the actual hardware to debug code or to run regression tests, a simulation of the hardware is used. Simulation technology is now so accurate that it can run production binaries unchanged, and with performance close to and often exceeding that of the real target hardware.

Software: The Critical Path to Delivery

The critical path for the delivery of a typical working system has moved from finalizing the hardware design to finishing the development of the embedded software. Creating a complex electronic system consists of two main parts: designing the hardware, and writing and debugging the embedded software. Designing the hardware has historically dominated the cost, schedule, and risk of the project, because it was such a complicated and difficult task. However, it is now more economical to implement more functionality in software rather than hardware. As a result, embedded software development of a modern electronic system now encompasses the longest timeline, the bulk of the cost, and most of the risk.

Chip design productivity has improved ten thousand times compared with what it was 20 years ago, due to the immense time and investment that electronic design automation (EDA) companies have spent creating chip-design tools. With a wealth of resources at hand, designing a chip is now a difficult but predictable task in all but the most aggressive circumstances. Similar, but less dramatic, improvements have taken place in the development of board-based systems.

In contrast, the development productivity of the system’s embedded software component has improved by perhaps only 10 times over the same 20 years. Due to a historically narrow focus on hardware, software development has been chronically starved of the investment in development tools necessary to reduce its risk and cost. The results show in cost overruns, missed schedules, and poor-quality software. As the catchphrase goes, “Failure is not an option; it comes bundled with the software.”

As with much of engineering, the pressures at the back end of development are to get started earlier. Full-system simulation does not need hardware to be available or even for the design to be complete. It is straightforward to produce a model of the hardware and then use that model for almost the entire software development and test process. The simulation is so faithful that the system integration phase, when the software is finally brought together with the real hardware, is a quick validation step.
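To make this concrete, the sketch below shows what a functional model of a simple device might look like. The UartModel class, its register offsets, and the bus-dispatch convention are hypothetical illustrations, not the API of any particular simulator; the point is that a behavioral model only needs to capture what the software can observe.

```python
# Minimal sketch of a functional device model of the kind used in
# full-system simulation. Register map and names are hypothetical.

class UartModel:
    """Behavioral model of a simple memory-mapped UART."""
    STATUS_REG = 0x00   # read: bit 0 = transmitter ready
    DATA_REG   = 0x04   # write: byte to transmit

    def __init__(self):
        self.transmitted = []

    def read(self, offset):
        if offset == self.STATUS_REG:
            return 0x1          # always ready to transmit in this model
        return 0

    def write(self, offset, value):
        if offset == self.DATA_REG:
            # The model captures behavior, not gate-level timing: the
            # software under test simply sees a working UART.
            self.transmitted.append(value & 0xFF)

# The simulator's bus would dispatch guest loads/stores to the model:
uart = UartModel()
uart.write(UartModel.DATA_REG, ord("A"))
assert uart.read(UartModel.STATUS_REG) & 0x1
assert uart.transmitted == [ord("A")]
```

Because the model exposes only the software-visible register interface, it can be written long before the silicon design is complete, yet the production driver code runs against it unchanged.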

What’s Driving Software Complexity?

The basic semiconductor technology roadmap, Moore’s Law, is the driving factor behind the dramatic increase in the software complexity of an electronics system. As semiconductors get faster, more of a system is most conveniently implemented as software running on a microprocessor. Few chip specifications exploit faster silicon by delivering higher performance; most instead use it to deliver the same capability more economically.

For example, if a feature of a system is implemented as a specially created block on the chip, then higher-performance silicon tends to simply reduce the duty cycle of the block, which is a poor use of expensive silicon area. If the feature is instead implemented in software, several functions can share one processor, making more effective use of silicon real estate. However, for power reasons, applications that have traditionally used single processors will now have to be parallelized over multiple processors in order to keep increasing their performance. This change in the landscape will put further pressure on software designers to get good performance out of the new architectures.

A second driver of increased software content is the economics of semiconductor manufacturing. The high cost of design and masks and the move to 300-mm wafers increase the minimum production volume necessary for a chip design to be viable, so some form of aggregation is necessary for designs in lower volumes. Even field-programmable gate arrays (FPGAs), which act as a blank canvas, are too power-hungry and often too expensive for many applications. An intermediate level of aggregation creates a standard platform, such as a network processor, and puts the differentiation into software. Different systems ship with different behavior, but the chip is manufactured in efficiently large volumes.

Due to power considerations, faster silicon no longer means faster embedded microprocessors. Instead, it delivers increased computing power through more processor cores rather than higher clock speeds. In the past, systems with multiple microprocessors were uncommon, and where they existed, the software problem was simplified by statically partitioning the algorithm across the processors. With multicore, the software is split into threads that run on any available core. This adds another level of complexity to the software, along with a host of new potential failures if the locking is not handled correctly.
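The hazard is easy to demonstrate. The sketch below (in Python for brevity, though the same read-modify-write race exists in C threads on real cores) has four threads incrementing a shared counter; without the lock, updates can be lost depending on thread scheduling, and the loss varies from run to run, which is precisely what makes such bugs hard to reproduce.

```python
# Illustration of the kind of locking hazard multicore introduces.

import threading

counter = 0
lock = threading.Lock()

def worker(use_lock, iterations=100_000):
    global counter
    for _ in range(iterations):
        if use_lock:
            with lock:
                counter += 1    # protected read-modify-write
        else:
            counter += 1        # unprotected: updates can be lost

for use_lock in (False, True):
    counter = 0
    threads = [threading.Thread(target=worker, args=(use_lock,))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The unlocked run may fall short of 400000, by a different
    # amount each time; the locked run always reaches it.
    print("with lock:" if use_lock else "no lock:  ", counter)
```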

A New Approach to Debugging Embedded Systems

The upshot of these trends is that the size of the software component of embedded electronics systems has increased dramatically. To make matters more difficult, the move to multicore is further increasing the complexity of these systems. Analyst firm Embedded Market Forecasters found the top four problems listed by embedded programmers are: limited ability to see into the whole system, limited ability to trace, limited ability to control, and significant intrusiveness. In the face of these challenges, companies are changing the way they approach the development process by adding new tools and methodologies to the developer’s toolkit to deliver software with a predictable schedule and a high level of quality.

One new approach to software development is to move from executing software on the real hardware, which lacks control and observability, to executing software on a simulated system model of the hardware that is both fast and accurate. People tend to assume that simulation is too slow to serve as the underlying execution and debug engine for software development because of experience with lower-level simulators such as SPICE or Verilog. But advances in system simulation technology, coupled with fast and inexpensive workstations, mean that it is possible to simulate these complex systems at speeds measured in billions of simulated instructions per second, performance well suited to the edit-compile-debug loop that makes up a large part of a programmer’s day.

Full-system simulation technology is accurate at the chip-software boundary and consequently can run production binaries unchanged. With silicon, however, it is difficult to control precisely what the hardware does and to observe all the details of what is happening. This is especially true for multicore processors, where it is not always possible to set a breakpoint that deterministically stops all the processors at the same place each time. Simulation simply has capabilities not available in real hardware.
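As a sketch of what this looks like in practice, the snippet below scripts a global breakpoint through a hypothetical simulator interface; the sim and core objects and their methods are illustrative stand-ins, not a real product API. Because simulated time advances entirely under software control, every core halts at the same point on every run.

```python
# Hedged sketch of a deterministic multicore breakpoint, assuming a
# hypothetical simulator scripting API.

def run_to_global_breakpoint(sim, address):
    """Stop the whole simulated machine the instant any core
    reaches `address`, with every other core frozen at a
    reproducible instruction count."""
    for core in sim.cores():
        core.set_breakpoint(address)
    sim.run()                   # advances all cores under one clock
    hit = sim.stopped_core()    # the core that triggered the stop
    # Deterministic: these counts are identical on every run.
    counts = [core.instruction_count() for core in sim.cores()]
    return hit, counts
```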

Simulation gives the developer full control over the environment, which enables powerful debugging techniques not available through other approaches. For example, simulation can run code backward, known as reverse execution. This is extremely powerful for debugging, since it becomes possible to simply wait for an error to occur and then run backward to determine the root cause. On real hardware, once an instruction has executed, there is no going back through the code to find the root cause; hardware can only run time forward, as controlled by the clock. A developer must instead restart the program from the beginning and re-run it to just before the failure, which can be time-consuming. Firstly, a breakpoint must be defined that halts the processor at just the right point: early enough that the error has not yet occurred and the problem is not missed, yet not so early that it takes too long to advance to the error under manual control. Secondly, time is wasted rebooting the system and running the test script to get back to the error. In a multi-processor system, or one with extensive real-time interactions, it is often difficult to reproduce the error at all, since the hardware is not deterministic.

Reverse execution is implemented using two capabilities of the simulation: the capability to checkpoint an entire system inexpensively and the underlying speed of simulation. Running backward one instruction is then accomplished by restoring a checkpoint and running forward all but one instruction. This process appears almost instantaneous to the user due to simulation’s speed, even though checkpoints are not recorded that frequently.
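A minimal, self-contained sketch of this mechanism is shown below, assuming a toy deterministic machine whose entire state is a program counter and one register. A real simulator checkpoints the complete system state, including CPUs, memory, and devices, but the reverse-step logic is the same: restore the nearest earlier checkpoint, then replay forward all but one instruction.

```python
# Reverse execution from two ingredients: cheap checkpoints plus fast,
# deterministic forward simulation. The "machine" here is a toy.

import copy

class ToyMachine:
    CHECKPOINT_INTERVAL = 1000      # instructions between checkpoints

    def __init__(self):
        self.state = {"pc": 0, "reg": 1}
        self.checkpoints = {0: copy.deepcopy(self.state)}

    def step(self):
        # Deterministic: same state in, same state out, every run.
        self.state["reg"] = (self.state["reg"] * 3 + 1) % (1 << 32)
        self.state["pc"] += 1
        if self.state["pc"] % self.CHECKPOINT_INTERVAL == 0:
            self.checkpoints[self.state["pc"]] = copy.deepcopy(self.state)

    def run_to(self, pc):
        while self.state["pc"] < pc:
            self.step()

    def reverse_step(self):
        """Go back one instruction: restore the nearest earlier
        checkpoint, then replay forward all but one instruction."""
        target = self.state["pc"] - 1
        base = max(p for p in self.checkpoints if p <= target)
        self.state = copy.deepcopy(self.checkpoints[base])
        self.run_to(target)

m = ToyMachine()
m.run_to(12345)
before = dict(m.state)
m.step()            # one instruction forward...
m.reverse_step()    # ...and one instruction back
assert m.state == before
```

At simulation speeds of billions of instructions per second, replaying even a few hundred thousand instructions from the nearest checkpoint appears instantaneous to the user, which is why the checkpoints can be spaced sparsely.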

Regression testing and fault injection offer another area in which simulation has an advantage over actual hardware. It is impossible to inject faults at the hardware level unless special capabilities have been built into the chip. Even a simple process such as transmitting a network packet with a faulty error-correcting code is typically difficult. In comparison, injecting these types of faults in a simulator is simple. In a similar manner, it is possible to script a hardware temperature sensor to report an unacceptably high temperature and check that the system shuts itself down gracefully, or to verify that memory faults are handled correctly.
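As an illustration, the sketch below builds a network frame with a deliberately corrupted CRC-32 and checks that a receiver rejects it. The frame layout and receiver are simplified assumptions for this example; in a full-system simulation, the bad frame would be injected at the simulated network interface to exercise the error path in the unmodified driver binary.

```python
# Fault injection sketch: a frame with a deliberately bad checksum.
# The trailing-CRC frame layout here is an assumption for illustration.

import zlib

def frame_with_crc(payload: bytes, corrupt: bool = False) -> bytes:
    """Append a CRC-32 to the payload, optionally flipping one bit."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    if corrupt:
        crc ^= 0x1                  # single-bit error in the checksum
    return payload + crc.to_bytes(4, "little")

def receiver_accepts(frame: bytes) -> bool:
    """Recompute the CRC and reject frames that do not match."""
    payload, crc = frame[:-4], int.from_bytes(frame[-4:], "little")
    return (zlib.crc32(payload) & 0xFFFFFFFF) == crc

good = frame_with_crc(b"hello")
bad = frame_with_crc(b"hello", corrupt=True)
assert receiver_accepts(good)
assert not receiver_accepts(bad)
```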

Conclusion

Simulation addresses the most urgent and pertinent problems of development today. Firstly, it allows software development to start early, long before hardware is available. Because so much software is required, software development is the longest part of the schedule, and waiting for hardware delays the start of production, the point at which money finally begins flowing back to the electronics system company to amortize its design costs.

Furthermore, it provides new tools to programmers that hardware-based development lacks. It has complete observability and control; it is fully deterministic, and it is completely non-invasive, thus avoiding “Heisenbugs,” in which the behavior of the system changes once it is instrumented to detect a bug. And when additional faults are required during testing, it is straightforward to inject problems into the design to check that they are handled benignly. Finally, with the ability to control time, it is possible to support features such as running time backward or having extremely complex breakpoints, all of which make a programmer’s job a lot easier.

Even as electronics system software radically surpasses previous levels of complexity, the hardware-based development approach has persisted by default, though it is no longer fast enough or productive enough for a world-class system design team. System simulation has been used for the development of the most complex electronics systems for years and will continue to move into the mainstream. Customers with state-of-the-art development processes are now planning all software development and test ahead of hardware availability, with first-customer-ship (FCS) just days after delivery of the first hardware.
