feature article
Subscribe Now

The Old Man and the C Debugger

Fixing Bugs is Easier if You Know Where Not to Look

“Eliminate all other factors, and the one which remains must be the truth.” – Sherlock Holmes, The Sign of Four

Three engineers are sitting in a rowboat. None of them has any nautical experience or navigational instruments.

As hope for rescue fades, and with no other ships in sight, they pass the time discussing their careers (as you do). The oldest one is a veteran hardware/software developer with decades of experience. The middle one is in the prime of his career, and the youngest has just left college to start his first job.

“Back in my day, we didn’t have these fancy logic analyzers, simulators, or EDA tools,” says the grizzled veteran engineer. “We debugged computers with a voltmeter, a hair dryer, and a piece of wire. Some days we didn’t even have the wire.”

“Well, things are better now,” replies the man in the middle. “With an oscilloscope and an emulator, I can track down most any problem.”

“You guys are both crazy,” scoffs the youngster. “You can’t possibly debug a modern multiprocessor, networked, RTOS-enabled system without modern simulation tools, profilers, multichannel logic analyzers, and a team of experts. And when are we getting out of this boat, anyway?”

The oldster shakes his head sagely. “Kid, you have no idea how to diagnose and debug a system. You’re too eager to throw your shiny tools at the problem before you even know what you’re looking at.”

“The first rule of debugging,” he continues, warming to the topic, “is knowing where not to look. It’s all about saving time. There are an infinite number of things that can fail, both in hardware and in software. Your only hope of fixing a machine is to eliminate 99.99% of those alternatives, early on. Otherwise, you’re just wasting your employer’s time. Lemme give you some lessons before it’s too late for all of us.”

  • Kick the box. Seriously. If physical jostling either starts or stops the failure, you know it’s mechanically related. Look for a loose connector, a wobbly ribbon cable, or a cracked component, especially one with a heat sink glued to it.
  • Turn it upside down. Gravity typically doesn’t affect software problems, but it will highlight loose hardware, bad sockets, flexing PC boards, stray screws, or bits of solder and other conductive materials rattling around inside the enclosure.
  • Hit it with freeze spray. Semiconductors run fast or slow, depending on temperature (that’s why supercomputers and gaming rigs are liquid cooled). A well-aimed shot of freeze spray can speed up a microprocessor, DRAM, or buffer just enough to highlight a tight timing window. It’ll also weed out marginal components prone to failure.  
  • Try a heat gun. Or a hair dryer, in a pinch. You can’t target the heat as finely as the freeze spray, but it’ll still highlight temperature-related problems. Don’t forget that PC boards, connectors, and other devices expand slightly with the heat, so your problem might be a mechanical fit, not marginal timing.
  • Look for bent or loose pins. Pin-and-socket connectors are pretty good at protecting pins, but they can’t fix stupid. Sometimes a pin gets bent, and when you force the plug into the socket, the pin gets bent over perfectly flat so it’s hard to see. Look closely and mentally count each pin to convince yourself that it’s actually straight and aligned.
  • Is Pin #1 where you think it is, or are you plugging in the cable backwards? Are the pins arranged in odd/even rows, or around the circumference like an IC? Is the PCB silkscreen upside down or mirrored?
  • Just because you’ve found a bug doesn’t mean you’ve found the bug. After long hours poring over the source code, you finally have an a-ha! moment and discover the missing semicolon that’s been causing all the trouble. At last! You fix the typo, recompile, download, and pat yourself on the back for solving the problem. Except you haven’t. You found a long-buried bug, sure, but not the bug that’s causing the problem. Every program has bugs – plural – and just because you’ve found one doesn’t mean you’ve found them all, or even the right one. Keep going until you can credibly explain why this bug caused that symptom.
  • Check your watch. Does the problem crop up at the same time of day, or does it instead appear after a certain amount of running time? Test the system at different times of day to separate the two issues.
  • Think outside the box (literally). The problem might be environmental and not inside your system. Is it nearby RF radiation, bad AC power, air filtration, temperature, or contaminants? Remember, hardware bugs got their name when an insect crawled inside a relay.
  • Spy on the neighbors. A computer failed every afternoon at the customer’s site but worked fine back in the lab. The culprit was a nearby welding shop that started its high-voltage arc welder at around 2:00 PM every day, sending horrendous spikes through the building’s wiring.
  • The old off-by-one error. It’s easy to misjudge the size of a buffer, counter, or iterator by a small amount. Add or subtract 1 from pointers, parameters, and passed values and see what happens.
  • What else is running? A lot of systems are connected in some way – Ethernet, USB, Wi-Fi, shared memory, semaphores, whatever – with some other process or system with its own autonomy. The root of your problem might be in another system.
  • Caches are evil. On-chip caches speed up our processors, but they also obfuscate the activity inside. They behave entirely nonintuitively and are difficult or impossible to probe from the outside. Cached code and/or data can drastically affect performance – that’s kinda the point – but in unpredictable and nondeterministic ways. Try preloading the cache, freezing its contents, or disabling it during debug.
  • Use your MMU. High-end processors have hardware MMUs that can prevent programs or processes from stepping on each other. Configuring the MMU can be a chore in its own right, but, once it’s done, it’s a low-impact way to be sure your code isn’t straying where it shouldn’t.  
  • Schrödinger’s Bug. Does touching an oscilloscope probe or logic analyzer clip to your hardware make the problem disappear? All probes have measurable capacitance and inductance. If touching a probe to a pin cures a problem, you’ve got marginal signal integrity.
  • Explain it to a six-year-old. Talking through a problem, out loud and in simple terms, can trigger interesting insights as your brain struggles to find new ways to paraphrase the problem. Give it a try.
  • Don’t assume. The hardest trick of all is to discard assumptions, hearsay, and hunches. People are unreliable witnesses. When someone says, “It fails every time I press this button,” don’t assume it’s button-related. Or that it happens “every time.” Or even that it’s a problem. Start over with your own testing.

 

After hearing all of this, the youngest engineer decided to step out of the rowboat and back onto the floor of the sporting goods store. It was getting late and the employees were eager for our three engineers to finish “testing” their new boat and go home. What? You thought the boat was adrift on the water? Why would you assume that?

One thought on “The Old Man and the C Debugger”

Leave a Reply

featured blogs
Sep 26, 2022
Most engineers are of the view that all mesh generators use an underlying geometry that is discrete in nature, but in fact, Fidelity Pointwise can import and mesh both analytic and discrete geometry. Analytic geometry defines curves and surfaces with mathematical functions. T...
Sep 22, 2022
On Monday 26 September 2022, Earth and Jupiter will be only 365 million miles apart, which is around half of their worst-case separation....
Sep 22, 2022
Learn how to design safe and stylish interior and exterior automotive lighting systems with a look at important lighting categories and lighting design tools. The post How to Design Safe, Appealing, Functional Automotive Lighting Systems appeared first on From Silicon To Sof...

featured video

PCIe Gen5 x16 Running on the Achronix VectorPath Accelerator Card

Sponsored by Achronix

In this demo, Achronix engineers show the VectorPath Accelerator Card successfully linking up to a PCIe Gen5 x16 host and write data to and read data from GDDR6 memory. The VectorPath accelerator card featuring the Speedster7t FPGA is one of the first FPGAs that can natively support this interface within its PCIe subsystem. Speedster7t FPGAs offer a revolutionary new architecture that Achronix developed to address the highest performance data acceleration challenges.

Click here for more information about the VectorPath Accelerator Card

featured paper

Algorithm Verification with FPGAs and ASICs

Sponsored by MathWorks

Developing new FPGA and ASIC designs involves implementing new algorithms, which presents challenges for verification for algorithm developers, hardware designers, and verification engineers. This eBook explores different aspects of hardware design verification and how you can use MATLAB and Simulink to reduce development effort and improve the quality of end products.

Click here to read more

featured chalk talk

Small Form Factor Industry Standards for Embedded Computing

Sponsored by Mouser Electronics and Samtec

Trends in today’s embedded computing designs including smart sensors, autonomous vehicles, and edge computing are making embedded computing industry standards more important than ever before. In this episode of Chalk Talk, Amelia Dalton chats with Matthew Burns from Samtec about how standards organizations like PC104, PICMG and VITA s are encouraging innovation in today’s embedded designs, how Samtec supports each one of these standards organizations and how you can utilize Samtec’s high performance interconnects for your next small-form factor embedded computing designs.

Click here for more information about Samtec Industry Standards Solution