
The Old Man and the C Debugger

Fixing Bugs is Easier if You Know Where Not to Look

“Eliminate all other factors, and the one which remains must be the truth.” – Sherlock Holmes, The Sign of Four

Three engineers are sitting in a rowboat. None of them has any nautical experience or navigational instruments.

As hope for rescue fades, and with no other ships in sight, they pass the time discussing their careers (as you do). The oldest one is a veteran hardware/software developer with decades of experience. The middle one is in the prime of his career, and the youngest has just left college to start his first job.

“Back in my day, we didn’t have these fancy logic analyzers, simulators, or EDA tools,” says the grizzled veteran engineer. “We debugged computers with a voltmeter, a hair dryer, and a piece of wire. Some days we didn’t even have the wire.”

“Well, things are better now,” replies the man in the middle. “With an oscilloscope and an emulator, I can track down most any problem.”

“You guys are both crazy,” scoffs the youngster. “You can’t possibly debug a modern multiprocessor, networked, RTOS-enabled system without modern simulation tools, profilers, multichannel logic analyzers, and a team of experts. And when are we getting out of this boat, anyway?”

The oldster shakes his head sagely. “Kid, you have no idea how to diagnose and debug a system. You’re too eager to throw your shiny tools at the problem before you even know what you’re looking at.”

“The first rule of debugging,” he continues, warming to the topic, “is knowing where not to look. It’s all about saving time. There are an infinite number of things that can fail, both in hardware and in software. Your only hope of fixing a machine is to eliminate 99.99% of those alternatives, early on. Otherwise, you’re just wasting your employer’s time. Lemme give you some lessons before it’s too late for all of us.”

  • Kick the box. Seriously. If physical jostling either starts or stops the failure, you know it’s mechanically related. Look for a loose connector, a wobbly ribbon cable, or a cracked component, especially one with a heat sink glued to it.
  • Turn it upside down. Gravity typically doesn’t affect software problems, but it will highlight loose hardware, bad sockets, flexing PC boards, stray screws, or bits of solder and other conductive materials rattling around inside the enclosure.
  • Hit it with freeze spray. Semiconductors generally run faster when cold and slower when hot (that’s one reason supercomputers and gaming rigs are liquid cooled). A well-aimed shot of freeze spray can speed up a microprocessor, DRAM, or buffer just enough to highlight a tight timing window. It’ll also weed out marginal components prone to failure.
  • Try a heat gun. Or a hair dryer, in a pinch. You can’t target the heat as finely as the freeze spray, but it’ll still highlight temperature-related problems. Don’t forget that PC boards, connectors, and other devices expand slightly with the heat, so your problem might be a mechanical fit, not marginal timing.
  • Look for bent or loose pins. Pin-and-socket connectors are pretty good at protecting pins, but they can’t fix stupid. Sometimes a pin gets bent, and when you force the plug into the socket, the pin gets bent over perfectly flat so it’s hard to see. Look closely and mentally count each pin to convince yourself that it’s actually straight and aligned.
  • Is Pin #1 where you think it is, or are you plugging in the cable backwards? Are the pins arranged in odd/even rows, or around the circumference like an IC? Is the PCB silkscreen upside down or mirrored?
  • Just because you’ve found a bug doesn’t mean you’ve found the bug. After long hours poring over the source code, you finally have an a-ha! moment and discover the missing semicolon that’s been causing all the trouble. At last! You fix the typo, recompile, download, and pat yourself on the back for solving the problem. Except you haven’t. You found a long-buried bug, sure, but not the bug that’s causing the problem. Every program has bugs – plural – and just because you’ve found one doesn’t mean you’ve found them all, or even the right one. Keep going until you can credibly explain why this bug caused that symptom.
  • Check your watch. Does the problem crop up at the same time of day, or does it instead appear after a certain amount of running time? Test the system at different times of day to separate the two issues.
  • Think outside the box (literally). The problem might be environmental and not inside your system. Is it nearby RF radiation, bad AC power, air filtration, temperature, or contaminants? Remember, the most famous hardware bug was a real one: a moth found stuck in a relay of the Harvard Mark II in 1947.
  • Spy on the neighbors. A computer failed every afternoon at the customer’s site but worked fine back in the lab. The culprit was a nearby welding shop that started its high-voltage arc welder at around 2:00 PM every day, sending horrendous spikes through the building’s wiring.
  • The old off-by-one error. It’s easy to misjudge the size of a buffer, counter, or iterator by a small amount. Add 1 to, or subtract 1 from, pointers, parameters, and passed values and see what happens.
  • What else is running? A lot of systems are connected in some way – Ethernet, USB, Wi-Fi, shared memory, semaphores, whatever – with some other process or system with its own autonomy. The root of your problem might be in another system.
  • Caches are evil. On-chip caches speed up our processors, but they also obfuscate the activity inside. They behave entirely nonintuitively and are difficult or impossible to probe from the outside. Cached code and/or data can drastically affect performance – that’s kinda the point – but in unpredictable and nondeterministic ways. Try preloading the cache, freezing its contents, or disabling it during debug.
  • Use your MMU. High-end processors have hardware MMUs that can prevent programs or processes from stepping on each other. Configuring the MMU can be a chore in its own right, but, once it’s done, it’s a low-impact way to be sure your code isn’t straying where it shouldn’t.  
  • Schrödinger’s Bug. Does touching an oscilloscope probe or logic analyzer clip to your hardware make the problem disappear? All probes have measurable capacitance and inductance, so attaching one changes the load on the signal. If touching a probe to a pin cures a problem, you’ve got marginal signal integrity.
  • Explain it to a six-year-old. Talking through a problem, out loud and in simple terms, can trigger interesting insights as your brain struggles to find new ways to paraphrase the problem. Give it a try.
  • Don’t assume. The hardest trick of all is to discard assumptions, hearsay, and hunches. People are unreliable witnesses. When someone says, “It fails every time I press this button,” don’t assume it’s button-related. Or that it happens “every time.” Or even that it’s a problem. Start over with your own testing.
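
The “check your watch” lesson gets easier once failures are logged. Here is a minimal Python sketch, assuming each failure record holds both the wall-clock time of the failure and the time the system last booted. The record layout, the function names, and the sample data are all invented for illustration. It asks which clusters more tightly: the time of day or the running time.

```python
from datetime import datetime
from statistics import pstdev

def spreads(failures):
    """failures: list of (failed_at, booted_at) datetime pairs (hypothetical log format).
    Returns (time-of-day spread, uptime spread), both in hours."""
    time_of_day = [f.hour + f.minute / 60 for f, _ in failures]
    uptime = [(f - b).total_seconds() / 3600 for f, b in failures]
    # Naive: treats 23:59 and 00:01 as far apart, so failures near midnight need extra care.
    return pstdev(time_of_day), pstdev(uptime)

def likely_trigger(failures):
    """Name whichever measure varies less across the recorded failures."""
    tod_spread, uptime_spread = spreads(failures)
    return "time of day" if tod_spread < uptime_spread else "running time"

# Invented sample data: failures near 2:00 PM regardless of when the box booted.
by_clock = [
    (datetime(2024, 4, 1, 14, 5), datetime(2024, 4, 1, 8, 0)),
    (datetime(2024, 4, 2, 14, 10), datetime(2024, 4, 2, 11, 30)),
    (datetime(2024, 4, 3, 13, 55), datetime(2024, 4, 2, 20, 0)),
]

# Invented sample data: failures about eight hours after boot, at any time of day.
by_uptime = [
    (datetime(2024, 4, 1, 16, 2), datetime(2024, 4, 1, 8, 0)),
    (datetime(2024, 4, 2, 3, 58), datetime(2024, 4, 1, 20, 0)),
    (datetime(2024, 4, 3, 19, 30), datetime(2024, 4, 3, 11, 30)),
]

print(likely_trigger(by_clock))   # time of day
print(likely_trigger(by_uptime))  # running time
```

Three data points prove nothing, of course. The point is to boot the system at different times on purpose so the two explanations stop overlapping.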
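
For the off-by-one lesson, one cheap habit is to exercise a suspect routine right at its limit and one step to either side. The Python sketch below assumes a hypothetical 8-entry buffer and a reader with a bug planted on purpose. The names `probe_boundary`, `read_slot`, and `BUFFER` are illustrative only.

```python
def probe_boundary(fn, n):
    """Call fn at n-1, n, and n+1; map each input to its result or to the exception name."""
    outcomes = {}
    for k in (n - 1, n, n + 1):
        try:
            outcomes[k] = fn(k)
        except Exception as exc:
            outcomes[k] = type(exc).__name__
    return outcomes

BUFFER = [0] * 8  # hypothetical buffer; valid indexes are 0 through 7

def read_slot(i):
    """Buggy on purpose: '<=' lets index 8 through, one past the last valid slot."""
    if 0 <= i <= len(BUFFER):
        return BUFFER[i]
    return None

result = probe_boundary(read_slot, len(BUFFER))
print(result)  # {7: 0, 8: 'IndexError', 9: None}
```

Python is kind enough to raise an exception at index 8. The same mistake in C would quietly read past the end of the array, which is why probing the boundary deliberately beats waiting for the symptom.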

 

After hearing all of this, the youngest engineer decides to step out of the rowboat and back onto the floor of the sporting goods store. It’s getting late and the employees are eager for our three engineers to finish “testing” their new boat and go home. What? You thought the boat was adrift on the water? Why would you assume that?
