“It is easier to write an incorrect program than to understand a correct one.” – programmers’ adage
My first real job was repairing hard disk drives. I’d open up the drives, clean the platters with an alcohol pad, and align the read/write heads using a screwdriver and an oscilloscope. The guy at the bench next to mine had the same job, but he smoked all day and liked to rest his cigarettes on the edge of the drive while he worked. He was careful to blow any ashes off the platters before closing up, though.
After you’d toiled at the repair bench for a while, the company might promote you to a field service job if you were deemed worthy and safe to put in front of customers. Thus did I learn to diagnose and repair computers while their angry owners looked on over my shoulder. We learned to never make the problem look too easy (“You forgot to plug it in”) or too hard (“These machines never work right”), but instead to find an appropriate balance and to radiate confidence even if we had no clue what was wrong.
Intermittent problems are always the worst. As if obeying Heisenberg’s uncertainty principle, the bug a customer reported would disappear the moment you showed up to observe it. Sometimes I’d even turn my back on the computer to fool it into thinking I was leaving. One machine failed reliably until we touched an oscilloscope probe to the faulty signal, so we simply tie-wrapped the probe in place and left it there. Problem solved!
Another machine failed sporadically about once a week, usually in the afternoon. One by one, all of us field technicians tried and failed to find the problem. Is it a buffer overflow? User error? Heat-related problem? Does the janitor accidentally unplug the machine when he’s cleaning? Are the circuit boards flexing? Is there an actual bug living inside? Did someone spill coffee on the fan? We looked for everything.
After another frustrating day trying to get this cursed machine to fail for us, my fellow tech stepped outside and idly watched as the welding shop two doors down started up its massive arc welder – whereupon the computer promptly crashed. Yup, the massive spikes in the shared AC power lines were the culprit. The welders normally used gas torches and only fired up the big electric arc welder about once a week, usually in the afternoon.
The takeaway from that adventure was that bugs in your machine might not originate in your machine. The problem might be environmental. The hardware and software might be perfectly solid in another environment (e.g., your development lab) but fail elsewhere for nonintuitive reasons.
That lesson didn’t help me to track down a software bug years later. I’d hacked together a simple program to read my PC’s real-time clock and toggle an LED on the motherboard. The program was so trivially simple that I wrote it by typing in the hexadecimal opcodes. (Real programmers don’t need compilers.) Naturally, it didn’t work the first time, so I dumped out the executable file to read the disassembled code. Which I must’ve also done wrong because my simple little program was now padded out to 4KB. Must be a limitation of the operating system. No big deal. It’s not like I’m wasting a lot of memory or disk space.
Once I’d solved my RTC bug, I spent some extra time looking at the superfluous stuff padding the end of it. I was expecting random data or uninitialized RAM but this looked like real code. Was I accidentally overwriting something or reusing RAM that already had code in it? I wonder what this extraneous code does…
I could tell that it read from the real-time clock (just like my little hack) but it also accessed the screen buffer (which I didn’t do) and the filesystem. It did some arithmetic on the date, and it also had some hard-coded numbers that it used for comparison. It seemed to be looking for a particular date. Oddly, the code never jumped or branched outside of its little 4KB of space, which I would have expected from a random piece of another program. It appeared to be complete, not just a fragment of something else. And it seemed awkwardly written, like it was meant to be hard to follow. Almost deliberately obfuscated. Sort of like…
A virus. My PC was harboring a virus that would lie dormant until a particular date, then write random data to my hard disk. There was no telling how long it had been there; I’d never known about it. It replicated by attaching itself to every program that ran, adding about 4KB of code to the end of the executable file. If my hack hadn’t been so small, I probably wouldn’t have noticed it there.
Once again, the problem came from outside the system.
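What gave the virus away was a file that was about 4KB bigger than it had any right to be. That check can be automated, at least roughly. Here is a minimal Python sketch, assuming you have recorded a baseline of executable sizes while the machine was known to be clean. The function names `snapshot_sizes` and `grown_files` are hypothetical, invented for this illustration, and a size comparison is only a hint: it says nothing about why a file grew, and it would miss anything that infects a file without changing its length.

```python
from pathlib import Path


def snapshot_sizes(paths):
    """Map each path (as a string) to its current size in bytes."""
    return {str(p): Path(p).stat().st_size for p in paths}


def grown_files(baseline, current, min_growth=1):
    """Return {path: bytes_added} for files that grew by at least min_growth.

    Both arguments are dicts of path -> size in bytes. Files that are
    missing from `current`, or that shrank or stayed the same, are skipped.
    """
    growth = {}
    for path, old_size in baseline.items():
        new_size = current.get(path)
        if new_size is not None and new_size - old_size >= min_growth:
            growth[path] = new_size - old_size
    return growth
```

In use, you would call `snapshot_sizes` once on a clean system, call it again later, and pass both results to `grown_files`. A whole directory of programs that all grew by the same amount would be worth a closer look.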
The other lesson from this was an old one. Just because you’ve found a bug doesn’t mean you’ve found the bug. Always keep looking. Plenty of software experts have said that all programs, no matter how well constructed, have latent bugs that will never be found. That it’s genuinely impossible to write perfect code. The best you can hope for is to squash the ones that will manifest themselves in real use, and hope the remainder go undiscovered. Which raises the Zen-like question, if a bug never appears in real usage, is it still a bug?
Hardware and software bugs are wily things, and they test our powers of observation, logic, and creativity. Let’s hear about your best (or worst) bug-hunting expeditions in the comments below.
2 thoughts on “Tales from the Debugging Crypt”
Years ago I was encountering an intermittent reset on a new circuit board that would only occur once every couple of days. Instrumentation on the board pointed to a noise burst on the power line that occurred only every 2 or 3 days. I set up a scope probe on the AC power (with proper safety barriers and labelling) and the noise would eventually show up after a few days. In fact, you could probe with a scope on any of the building metal in the lab and see it. It turned out a power generator in the building was malfunctioning. We needed to reproduce the noise on demand so we could investigate ways to make the design more immune to it, so we used an old hand-held power drill that generated plenty of electrical noise when turned on. In fact, all we had to do was plug it in and it would generate the necessary noise on the AC power line to reset the board. We got lucky in that a very simple digital filter on the reset in firmware (ignore any active-low reset pulse shorter than 4 clock periods) made the board immune to the noise, so it would keep working through the noise burst. Regards, Grady Muldrow
Nice. It’s surprising (to me, anyway) how many digital problems can be traced to bad AC power.
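The reset filter described in the comment above (ignore any active-low reset pulse shorter than 4 clock periods) is simple enough to model in a few lines. The sketch below is a Python simulation of that idea, not the commenter’s actual firmware. It assumes the reset line is sampled once per clock, and the names `ResetFilter`, `clock`, and `filter_trace` are hypothetical.

```python
class ResetFilter:
    """Model of a glitch filter on an active-low reset line.

    The filtered reset is asserted only after the raw line has been
    sampled low on `min_low_clocks` consecutive clocks, so shorter
    noise pulses are ignored.
    """

    def __init__(self, min_low_clocks=4):
        if min_low_clocks < 1:
            raise ValueError("min_low_clocks must be at least 1")
        self.min_low_clocks = min_low_clocks
        self._low_count = 0  # consecutive low samples seen so far

    def clock(self, reset_n):
        """Feed one sample of the raw line (0 = low, 1 = high).

        Returns True while the filtered reset is asserted.
        """
        if reset_n:
            self._low_count = 0  # line went high: start over
        elif self._low_count < self.min_low_clocks:
            self._low_count += 1  # saturating counter
        return self._low_count >= self.min_low_clocks


def filter_trace(samples, min_low_clocks=4):
    """Run a whole list of raw samples through a fresh filter."""
    reset_filter = ResetFilter(min_low_clocks)
    return [reset_filter.clock(sample) for sample in samples]
```

Feeding it a three-clock glitch such as `[1, 0, 0, 0, 1]` never asserts the filtered reset, while a line held low for four or more clocks does. The trade-off is latency: a genuine reset is recognized a few clocks late, which is usually a small price for riding through a noise burst.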