feature article
Subscribe Now

The Old Man and the C Debugger

Fixing Bugs is Easier if You Know Where Not to Look

“Eliminate all other factors, and the one which remains must be the truth.” – Sherlock Holmes, The Sign of Four

Three engineers are sitting in a rowboat. None of them has any nautical experience or navigational instruments.

As hope for rescue fades, and with no other ships in sight, they pass the time discussing their careers (as you do). The oldest one is a veteran hardware/software developer with decades of experience. The middle one is in the prime of his career, and the youngest has just left college to start his first job.

“Back in my day, we didn’t have these fancy logic analyzers, simulators, or EDA tools,” says the grizzled veteran engineer. “We debugged computers with a voltmeter, a hair dryer, and a piece of wire. Some days we didn’t even have the wire.”

“Well, things are better now,” replies the man in the middle. “With an oscilloscope and an emulator, I can track down most any problem.”

“You guys are both crazy,” scoffs the youngster. “You can’t possibly debug a modern multiprocessor, networked, RTOS-enabled system without modern simulation tools, profilers, multichannel logic analyzers, and a team of experts. And when are we getting out of this boat, anyway?”

The oldster shakes his head sagely. “Kid, you have no idea how to diagnose and debug a system. You’re too eager to throw your shiny tools at the problem before you even know what you’re looking at.”

“The first rule of debugging,” he continues, warming to the topic, “is knowing where not to look. It’s all about saving time. There are an infinite number of things that can fail, both in hardware and in software. Your only hope of fixing a machine is to eliminate 99.99% of those alternatives, early on. Otherwise, you’re just wasting your employer’s time. Lemme give you some lessons before it’s too late for all of us.”

  • Kick the box. Seriously. If physical jostling either starts or stops the failure, you know it’s mechanically related. Look for a loose connector, a wobbly ribbon cable, or a cracked component, especially one with a heat sink glued to it.
  • Turn it upside down. Gravity typically doesn’t affect software problems, but it will highlight loose hardware, bad sockets, flexing PC boards, stray screws, or bits of solder and other conductive materials rattling around inside the enclosure.
  • Hit it with freeze spray. Semiconductors run fast or slow, depending on temperature (that’s why supercomputers and gaming rigs are liquid cooled). A well-aimed shot of freeze spray can speed up a microprocessor, DRAM, or buffer just enough to highlight a tight timing window. It’ll also weed out marginal components prone to failure.  
  • Try a heat gun. Or a hair dryer, in a pinch. You can’t target the heat as finely as the freeze spray, but it’ll still highlight temperature-related problems. Don’t forget that PC boards, connectors, and other devices expand slightly with the heat, so your problem might be a mechanical fit, not marginal timing.
  • Look for bent or loose pins. Pin-and-socket connectors are pretty good at protecting pins, but they can’t fix stupid. Sometimes a pin gets bent, and when you force the plug into the socket, the pin gets bent over perfectly flat so it’s hard to see. Look closely and mentally count each pin to convince yourself that it’s actually straight and aligned.
  • Is Pin #1 where you think it is, or are you plugging in the cable backwards? Are the pins arranged in odd/even rows, or around the circumference like an IC? Is the PCB silkscreen upside down or mirrored?
  • Just because you’ve found a bug doesn’t mean you’ve found the bug. After long hours poring over the source code, you finally have an a-ha! moment and discover the missing semicolon that’s been causing all the trouble. At last! You fix the typo, recompile, download, and pat yourself on the back for solving the problem. Except you haven’t. You found a long-buried bug, sure, but not the bug that’s causing the problem. Every program has bugs – plural – and just because you’ve found one doesn’t mean you’ve found them all, or even the right one. Keep going until you can credibly explain why this bug caused that symptom.
  • Check your watch. Does the problem crop up at the same time of day, or does it instead appear after a certain amount of running time? Test the system at different times of day to separate the two issues.
  • Think outside the box (literally). The problem might be environmental and not inside your system. Is it nearby RF radiation, bad AC power, air filtration, temperature, or contaminants? Remember, hardware bugs got their name when an insect crawled inside a relay.
  • Spy on the neighbors. A computer failed every afternoon at the customer’s site but worked fine back in the lab. The culprit was a nearby welding shop that started its high-voltage arc welder at around 2:00 PM every day, sending horrendous spikes through the building’s wiring.
  • The old off-by-one error. It’s easy to misjudge the size of a buffer, counter, or iterator by a small amount. Add or subtract 1 from pointers, parameters, and passed values and see what happens.
  • What else is running? A lot of systems are connected in some way – Ethernet, USB, Wi-Fi, shared memory, semaphores, whatever – with some other process or system with its own autonomy. The root of your problem might be in another system.
  • Caches are evil. On-chip caches speed up our processors, but they also obfuscate the activity inside. They behave entirely nonintuitively and are difficult or impossible to probe from the outside. Cached code and/or data can drastically affect performance – that’s kinda the point – but in unpredictable and nondeterministic ways. Try preloading the cache, freezing its contents, or disabling it during debug.
  • Use your MMU. High-end processors have hardware MMUs that can prevent programs or processes from stepping on each other. Configuring the MMU can be a chore in its own right, but, once it’s done, it’s a low-impact way to be sure your code isn’t straying where it shouldn’t.  
  • Schrödinger’s Bug. Does touching an oscilloscope probe or logic analyzer clip to your hardware make the problem disappear? All probes have measurable capacitance and inductance. If touching a probe to a pin cures a problem, you’ve got marginal signal integrity.
  • Explain it to a six-year-old. Talking through a problem, out loud and in simple terms, can trigger interesting insights as your brain struggles to find new ways to paraphrase the problem. Give it a try.
  • Don’t assume. The hardest trick of all is to discard assumptions, hearsay, and hunches. People are unreliable witnesses. When someone says, “It fails every time I press this button,” don’t assume it’s button-related. Or that it happens “every time.” Or even that it’s a problem. Start over with your own testing.

 

After hearing all of this, the youngest engineer decided to step out of the rowboat and back onto the floor of the sporting goods store. It was getting late and the employees were eager for our three engineers to finish “testing” their new boat and go home. What? You thought the boat was adrift on the water? Why would you assume that?

One thought on “The Old Man and the C Debugger”

Leave a Reply

featured blogs
May 21, 2022
May is Asian American and Pacific Islander (AAPI) Heritage Month. We would like to spotlight some of our incredible AAPI-identifying employees to celebrate. We recognize the important influence that... ...
May 20, 2022
I'm very happy with my new OMTech 40W CO2 laser engraver/cutter, but only because the folks from Makers Local 256 helped me get it up and running....
May 19, 2022
Learn about the AI chip design breakthroughs and case studies discussed at SNUG Silicon Valley 2022, including autonomous PPA optimization using DSO.ai. The post Key Highlights from SNUG 2022: AI Is Fast Forwarding Chip Design appeared first on From Silicon To Software....
May 12, 2022
By Shelly Stalnaker Every year, the editors of Elektronik in Germany compile a list of the most interesting and innovative… ...

featured video

Synopsys PPA(V) Voltage Optimization

Sponsored by Synopsys

Performance-per-watt has emerged as one of the highest priorities in design quality, leading to a shift in technology focus and design power optimization methodologies. Variable operating voltage possess high potential in optimizing performance-per-watt results but requires a signoff accurate and efficient methodology to explore. Synopsys Fusion Design Platform™, uniquely built on a singular RTL-to-GDSII data model, delivers a full-flow voltage optimization and closure methodology to achieve the best performance-per-watt results for the most demanding semiconductor segments.

Learn More

featured paper

Intel Agilex FPGAs Deliver Game-Changing Flexibility & Agility for the Data-Centric World

Sponsored by Intel

The new Intel® Agilex™ FPGA is more than the latest programmable logic offering—it brings together revolutionary innovation in multiple areas of Intel technology leadership to create new opportunities to derive value and meaning from this transformation from edge to data center. Want to know more? Start with this white paper.

Click to read more

featured chalk talk

Tackling Automotive Software Cost and Complexity

Sponsored by Mouser Electronics and NXP Semiconductors

With the sheer amount of automotive software cost and complexity today, we need a way to maximize software reuse across our process platforms. In this episode of Chalk Talk, Amelia Dalton and Daniel Balser from NXP take a closer look at the software ecosystem for NXP’s S32K3 MCU. They investigate how real-time drivers, a comprehensive safety software platform, and high performance security system will help you tackle the cost and complexity of automotive software development.

Click here for more information about NXP Semiconductors S32K3 Automotive General Purpose MCUs