Tales from the Debugging Crypt

Share Your Best and Worst Debugging Stories

“It is easier to write an incorrect program than to understand a correct one.” – programmers’ adage

My first real job was repairing hard disk drives. I’d open up the drives, clean the platters with an alcohol pad, and align the read/write heads using a screwdriver and an oscilloscope. The guy at the bench next to mine had the same job, but he smoked all day and liked to rest his cigarettes on the edge of the drive while he worked. He was careful to blow any ashes off the platters before closing up, though. 

After you'd toiled at the repair bench, the company might promote you to a field service job if you were deemed worthy and safe to put in front of customers. Thus did I learn to diagnose and repair computers while their angry owners looked on over my shoulder. We learned never to make the problem look too easy (“You forgot to plug it in”) or too hard (“These machines never work right”), but instead to find an appropriate balance and to radiate confidence even if we had no clue what was wrong.

Intermittent problems are always the worst. As if obeying Heisenberg’s uncertainty principle, the bug a customer reported would disappear the moment you showed up to observe it. Sometimes I’d even turn my back on the computer to fool it into thinking I was leaving. One machine failed reliably until we touched an oscilloscope probe to the faulty signal, so we simply tie-wrapped the probe in place and left it there. Problem solved!

Another machine failed sporadically about once a week, usually in the afternoon. One by one, all of us field technicians tried and failed to find the problem. Is it a buffer overflow? User error? Heat-related problem? Does the janitor accidentally unplug the machine when he’s cleaning? Are the circuit boards flexing? Is there an actual bug living inside? Did someone spill coffee on the fan? We looked for everything. 

After another frustrating day trying to get this cursed machine to fail for us, my fellow tech stepped outside and idly watched as the welding shop two doors down started up its massive arc welder – whereupon the computer promptly crashed. Yup, the massive spikes in the shared AC power lines were the culprit. The welders normally used gas torches and only fired up the big electric arc welder about once a week, usually in the afternoon. 

The takeaway from that adventure was that bugs in your machine might not originate in your machine. The problem might be environmental. The hardware and software might be perfectly solid in another environment (e.g., your development lab) but fail elsewhere for nonintuitive reasons. 

That lesson didn’t help me to track down a software bug years later. I’d hacked together a simple program to read my PC’s real-time clock and toggle an LED on the motherboard. The program was so trivially simple that I wrote it by typing in the hexadecimal opcodes. (Real programmers don’t need compilers.) Naturally, it didn’t work the first time, so I dumped out the executable file to read the disassembled code. Which I must’ve also done wrong because my simple little program was now padded out to 4KB. Must be a limitation of the operating system. No big deal. It’s not like I’m wasting a lot of memory or disk space. 

Once I’d solved my RTC bug, I spent some extra time looking at the superfluous stuff padding the end of it. I was expecting random data or uninitialized RAM but this looked like real code. Was I accidentally overwriting something or reusing RAM that already had code in it? I wonder what this extraneous code does… 

I could tell that it read from the real-time clock (just like my little hack) but it also accessed the screen buffer (which I didn’t do) and the filesystem. It did some arithmetic on the date, and it also had some hard-coded numbers that it used for comparison. It seemed to be looking for a particular date. Oddly, the code never jumped or branched outside of its little 4KB of space, which I would have expected from a random piece of another program. It appeared to be complete, not just a fragment of something else. And it seemed awkwardly written, like it was meant to be hard to follow. Almost deliberately obfuscated. Sort of like… 

A virus. My PC was harboring a virus that would lie dormant until a particular date, then write random data to my hard disk. There was no telling how long it had been there; I had never known about it. It replicated by attaching itself to every program that ran, adding about 4KB of code to the end of the executable file. If my hack hadn’t been so small, I probably wouldn’t have noticed it there.

Once again, the problem came from outside the system. 

The other lesson from this was an old one. Just because you’ve found a bug doesn’t mean you’ve found the bug. Always keep looking. Plenty of software experts have said that all programs, no matter how well constructed, have latent bugs that will never be found. That it’s genuinely impossible to write perfect code. The best you can hope for is to squash the ones that will manifest themselves in real use, and hope the remainder go undiscovered. Which raises the Zen-like question, if a bug never appears in real usage, is it still a bug? 

Hardware and software bugs are wily things, and they test our powers of observation, logic, and creativity. Let’s hear about your best (or worst) bug-hunting expeditions in the comments below. 

2 thoughts on “Tales from the Debugging Crypt”

  1. Years ago I was encountering an intermittent reset on a new circuit board that would occur only once every couple of days. Instrumentation on the board indicated a noise burst on the power line that appeared only every 2 or 3 days. I set up a scope probe on the AC power (with proper safety barriers and labelling), and the noise would eventually show up after a few days. In fact, you could probe with a scope on any of the building metal in the lab and see it. It turned out a power generator in the building was malfunctioning. We needed to reproduce the noise on demand so we could investigate ways to make the design more immune to it, so we used an old hand-held power drill that generated plenty of electrical noise when turned on. In fact, all we had to do was plug it in and it would generate the necessary noise on the AC power line to reset the board. We got lucky in that a very simple digital filter on the reset in firmware (ignore any active-low reset pulse shorter than 4 clock periods) made the board immune to the noise, so it would keep working through the noise burst. Regards, Grady Muldrow
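
The reset filter described in the comment above can be sketched in C roughly as follows. This is a minimal sketch under stated assumptions, not the commenter's actual firmware: the names `reset_filter_t`, `reset_filter_step`, and `RESET_MIN_CLOCKS` are hypothetical, and the code assumes the reset line is sampled once per clock period by whatever routine calls it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical threshold: low pulses shorter than this many clock
 * periods are treated as noise and ignored. */
#define RESET_MIN_CLOCKS 4u

/* Hypothetical filter state: how many consecutive samples the
 * active-low reset line has been low. */
typedef struct {
    uint8_t low_count;
} reset_filter_t;

/* Call once per clock period with the sampled level of the reset line
 * (true = high/inactive, false = low/asserted). Returns true only while
 * the line has been low for at least RESET_MIN_CLOCKS consecutive
 * samples; any high sample restarts the count. */
static bool reset_filter_step(reset_filter_t *f, bool line_high)
{
    assert(f != NULL);

    if (line_high) {
        f->low_count = 0;   /* pulse ended: discard what we counted */
        return false;
    }
    if (f->low_count < RESET_MIN_CLOCKS) {
        f->low_count++;     /* saturate so the counter cannot wrap */
    }
    return f->low_count >= RESET_MIN_CLOCKS;
}
```

With this arrangement, a noise burst that pulls the line low for one to three samples never reaches the reset logic, while a deliberate reset held for four or more samples still gets through. The threshold trades noise immunity against how quickly a genuine reset is recognized.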
