“Eliminate all other factors, and the one which remains must be the truth.” – Sherlock Holmes, The Sign of Four
Three engineers are sitting in a rowboat. None of them has any nautical experience or navigational instruments.
As hope for rescue fades, and with no other ships in sight, they pass the time discussing their careers (as you do). The oldest one is a veteran hardware/software developer with decades of experience. The middle one is in the prime of his career, and the youngest has just left college to start his first job.
“Back in my day, we didn’t have these fancy logic analyzers, simulators, or EDA tools,” says the grizzled veteran engineer. “We debugged computers with a voltmeter, a hair dryer, and a piece of wire. Some days we didn’t even have the wire.”
“Well, things are better now,” replies the man in the middle. “With an oscilloscope and an emulator, I can track down most any problem.”
“You guys are both crazy,” scoffs the youngster. “You can’t possibly debug a modern multiprocessor, networked, RTOS-enabled system without modern simulation tools, profilers, multichannel logic analyzers, and a team of experts. And when are we getting out of this boat, anyway?”
The oldster shakes his head sagely. “Kid, you have no idea how to diagnose and debug a system. You’re too eager to throw your shiny tools at the problem before you even know what you’re looking at.”
“The first rule of debugging,” he continues, warming to the topic, “is knowing where not to look. It’s all about saving time. There are an infinite number of things that can fail, both in hardware and in software. Your only hope of fixing a machine is to eliminate 99.99% of those alternatives, early on. Otherwise, you’re just wasting your employer’s time. Lemme give you some lessons before it’s too late for all of us.”
- Kick the box. Seriously. If physical jostling either starts or stops the failure, you know it’s mechanically related. Look for a loose connector, a wobbly ribbon cable, or a cracked component, especially one with a heat sink glued to it.
- Turn it upside down. Gravity typically doesn’t affect software problems, but it will highlight loose hardware, bad sockets, flexing PC boards, stray screws, or bits of solder and other conductive materials rattling around inside the enclosure.
- Hit it with freeze spray. Semiconductors run fast or slow, depending on temperature (that’s why supercomputers and gaming rigs are liquid cooled). A well-aimed shot of freeze spray can speed up a microprocessor, DRAM, or buffer just enough to highlight a tight timing window. It’ll also weed out marginal components prone to failure.
- Try a heat gun. Or a hair dryer, in a pinch. You can’t target the heat as finely as the freeze spray, but it’ll still highlight temperature-related problems. Don’t forget that PC boards, connectors, and other devices expand slightly with the heat, so your problem might be a mechanical fit, not marginal timing.
- Look for bent or loose pins. Pin-and-socket connectors are pretty good at protecting pins, but they can’t fix stupid. Sometimes a pin gets bent, and when you force the plug into the socket, the pin gets bent over perfectly flat so it’s hard to see. Look closely and mentally count each pin to convince yourself that it’s actually straight and aligned.
- Is Pin #1 where you think it is, or are you plugging in the cable backwards? Are the pins arranged in odd/even rows, or around the circumference like an IC? Is the PCB silkscreen upside down or mirrored?
- Just because you’ve found a bug doesn’t mean you’ve found the bug. After long hours poring over the source code, you finally have an a-ha! moment and discover the missing semicolon that’s been causing all the trouble. At last! You fix the typo, recompile, download, and pat yourself on the back for solving the problem. Except you haven’t. You found a long-buried bug, sure, but not the bug that’s causing the problem. Every program has bugs – plural – and just because you’ve found one doesn’t mean you’ve found them all, or even the right one. Keep going until you can credibly explain why this bug caused that symptom.
- Check your watch. Does the problem crop up at the same time of day, or does it instead appear after a certain amount of running time? Test the system at different times of day to separate the two issues.
- Think outside the box (literally). The problem might be environmental and not inside your system. Is it nearby RF radiation, bad AC power, air filtration, temperature, or contaminants? Remember, hardware bugs got their name when an insect crawled inside a relay.
- Spy on the neighbors. A computer failed every afternoon at the customer’s site but worked fine back in the lab. The culprit was a nearby welding shop that started its high-voltage arc welder at around 2:00 PM every day, sending horrendous spikes through the building’s wiring.
- The old off-by-one error. It’s easy to misjudge the size of a buffer, counter, or iterator by a small amount. Add or subtract 1 from pointers, parameters, and passed values and see what happens.
- What else is running? A lot of systems are connected in some way – Ethernet, USB, Wi-Fi, shared memory, semaphores, whatever – with some other process or system with its own autonomy. The root of your problem might be in another system.
- Caches are evil. On-chip caches speed up our processors, but they also obfuscate the activity inside. They behave entirely nonintuitively and are difficult or impossible to probe from the outside. Cached code and/or data can drastically affect performance – that’s kinda the point – but in unpredictable and nondeterministic ways. Try preloading the cache, freezing its contents, or disabling it during debug.
- Use your MMU. High-end processors have hardware MMUs that can prevent programs or processes from stepping on each other. Configuring the MMU can be a chore in its own right, but, once it’s done, it’s a low-impact way to be sure your code isn’t straying where it shouldn’t.
- Schrödinger’s Bug. Does touching an oscilloscope probe or logic analyzer clip to your hardware make the problem disappear? All probes have measurable capacitance and inductance. If touching a probe to a pin cures a problem, you’ve got marginal signal integrity.
- Explain it to a six-year-old. Talking through a problem, out loud and in simple terms, can trigger interesting insights as your brain struggles to find new ways to paraphrase the problem. Give it a try.
- Don’t assume. The hardest trick of all is to discard assumptions, hearsay, and hunches. People are unreliable witnesses. When someone says, “It fails every time I press this button,” don’t assume it’s button-related. Or that it happens “every time.” Or even that it’s a problem. Start over with your own testing.
After hearing all of this, the youngest engineer decided to step out of the rowboat and back onto the floor of the sporting goods store. It was getting late and the employees were eager for our three engineers to finish “testing” their new boat and go home. What? You thought the boat was adrift on the water? Why would you assume that?
Because, I ASSumed they were in the boat in the picture at the beginning of the article!