“We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know.” – Donald Rumsfeld
Consider the humble ladder. It’s a hardware device that’s elegant in its simplicity. Two parallel side rails, with evenly spaced rungs in between. Everything you need; nothing you don’t. Ladders can be made of wood, metal, Fiberglas, or other materials. There are long ones, short ones, portable ones, and permanent ones. Nobody really needs to be taught how to use a ladder, although there are some standard safety rules that might make your tenure at the top a bit more secure.
Now consider your computer’s word-processing software. Or the email application on your smartphone. Not so simple, are they? In fact, there are a lot of features in most programs that seem to be superfluous. Simple, elegant, just-the-basics apps seem to be the exception. They get good reviews, but people remark on them like they’re some sort of aberration. Maybe they are. “Creeping featurism” is a term that’s almost as old as engineering itself.
But that’s not my point. We don’t need to flog the poor software galley slaves chained to their oars below decks. (That’s what managers are for.) No, the question for today is, “How much of your code does real work, as opposed to just catching errors?”
First-time programmers just starting out learn to create “hello, world!” or something similar. They get the feel for what it’s like to use a programming language to describe what they want and make it into compile-worthy code. At first, they work toward getting their early programs to obey their wishes, pure and simple. Bugs will creep in, sure, but that’s all part of the learning process.
But after the first few successful attempts, we start to learn the other side of programming: the part where you shore up the program to prevent it from failing in the real world. You start to build the safety net, the guard code. You’re no longer creating a program that does what you want. You’re creating additional code that prevents it from doing what you don’t want, and that’s a different process and a different mindset altogether.
What does the program do if the user accidentally types in a bogus date? What happens if it receives a malformed network packet? How does it behave if a pointer is out of bounds? These are all aspects of the “guard code” that we all have to include, even though it doesn’t add anything to the program and it doesn’t (usually) do any useful work. Oftentimes, it’s never called or executed at all. Guard code is there just to keep the program from tipping over in case something stupid happens.
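To make the ratio concrete, here is a minimal sketch in Python of the bogus-date case. The function name days_until and the specific checks are illustrative assumptions, not taken from any real program; the point is only how little of the function is the actual work.

```python
from datetime import date


def days_until(text, today):
    """Return the number of days from `today` to the ISO date in `text`.

    Hypothetical example: the name and the exact checks are assumptions.
    """
    # Guard: the input has to be a string at all.
    if not isinstance(text, str):
        raise TypeError("date must be a string")

    # Guard: reject empty or over-long input before parsing it.
    text = text.strip()
    if not text or len(text) > 10:
        raise ValueError("date must look like YYYY-MM-DD")

    # Guard: reject bogus dates such as 2023-02-30.
    try:
        target = date.fromisoformat(text)
    except ValueError:
        raise ValueError(f"not a real calendar date: {text!r}") from None

    # The real work: one line.
    return (target - today).days


print(days_until("2024-03-01", date(2024, 2, 28)))  # 2
```

One line computes the answer. Everything above it exists only to keep that line from tipping over.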
Even though ladders are pretty safe, they don’t have safety features, per se. There’s no airbag at the bottom that deploys if you fall. There are (usually) no outriggers to prevent tip-overs. Ladders don’t have built-in current sensors to prevent you from using an aluminum ladder on power lines. There are no accelerometers or klaxons to alert you to unstable working angles. I once saw a ladder with a built-in bubble level to help you eyeball the slope, but that’s about it.
Programming isn’t like that. We actually spend a lot of our time adding safety features to our software. It’s like training wheels on a bicycle, except that they never come off. The guard code is always there, ready to catch that malformed packet or that bogus date, even though it may never happen.
On top of all that, we also have to guard against malicious intent, not just dumb mistakes. What if someone deliberately tries to break our program by shrewdly exploiting some weakness in the input buffer? You’ve got to guard against attacks, not just bugs.
And we have to add security features. It’s harder than ever to make software hacker-proof, because the hackers keep getting wilier and craftier. There are accidental bugs, and then there are malicious assaults, and we generally can’t catch them both with the same kind of guard code. You have to consciously look for, and trap, both types: the known unknowns and the unknown unknowns.
So how much of today’s code falls into that “guard code” category, versus the amount that does the real work implementing the program’s putative purpose? Any guesses?
I would take a SWAG that guard code accounts for 75 percent of most modern programs. It’s got to be at least half. Looking at big chunks of open-source code, I see an awful lot of source code that’s there just to trap errors, mistakes, user flubs, and similar non-malicious bugs. It’s sometimes hard to see what a function is actually doing, buried under all that safety net.
I’ll bet that the guard code is also the source of most bugs, ironically. We put it in there to catch stupid errors, and then the bug-catcher itself malfunctions. I don’t have any objective data to back that up, but that’s been my own experience. The real core of the program works fine; it’s all that other stuff in there to prop it up that’s problematic.
When you’re working on a tall ladder, it’s good practice to have a spotter below you. Someone who will – maybe not catch you, exactly, but at least call 911 when you face-plant on the pavement. They’re your safety net.
If we could do something similar with coding, we might get much faster programs and fewer bugs, besides. Let the “real” program run on one processor (or one CPU core of a multicore processor), while the “guard code” runs alongside on a parallel processor. One does the real work; the other checks that nothing is going off the rails. One parses input while the other checks boundary conditions. One calculates results while the other checks the validity of the input parameters. If the sidekick detects an error, we abort the process or restart the function or ask for new data.
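Here is a rough sketch of what that split might look like, using Python’s standard concurrent.futures module. The function names (compute_mean, spot_check, run_with_spotter) are hypothetical, and CPython threads won’t actually run this work on two cores at once, so treat it as an illustration of the structure rather than a performance claim.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_mean(samples):
    # The "real" program: no checks of its own.
    return sum(samples) / len(samples)


def spot_check(samples):
    # The "spotter": returns a list of problems found in the input.
    problems = []
    if not samples:
        problems.append("empty input")
    if any(not isinstance(x, (int, float)) for x in samples):
        problems.append("non-numeric sample")
    return problems


def run_with_spotter(samples):
    with ThreadPoolExecutor(max_workers=2) as pool:
        work = pool.submit(compute_mean, samples)   # does the real work
        check = pool.submit(spot_check, samples)    # watches alongside
        problems = check.result()
        if problems:
            # Abort: ignore whatever the worker produced or raised.
            work.cancel()
            raise ValueError("; ".join(problems))
        return work.result()


print(run_with_spotter([1, 2, 3]))  # 2.0
```

Note that the worker may still crash on bad input (an empty list divides by zero). In this sketch the spotter’s verdict wins, and the worker’s failure is simply discarded.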
Easy to say but hard to do, of course. But imagine how efficient – and fast! – your software would be if you didn’t have to idiot-check every single parameter, input buffer, string, and checksum. Imagine programming the way it used to be, when your only concern was making the program do what you wanted, not second-guessing all the things that could go wrong. That’s what spotters are for. We need code-spotters. That, and multicore processors, can be our ladders.
I think it boils down to the checks necessary to bring up a new system without data corruption and memory faults, versus what’s necessary for safe operation after deployment.
A sane programmer is still using lots of asserts at bring-up … to catch the stupid internal problems. And maybe even running those into initial releases with sane transparent logging and recovery.
After that, trapping unexpected switch defaults and other unexpected state values, with sane recovery and transparent logging, is very low overhead, even in production.
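As a small illustration of those two habits, here is a sketch in Python: an assert for internal mistakes during bring-up, and a catch-all branch that logs and recovers instead of crashing. The State enum, the logger name, and the recovery policy (reset to IDLE) are all assumptions made up for the example.

```python
import logging
from enum import Enum

log = logging.getLogger("motor")  # hypothetical logger name


class State(Enum):
    IDLE = 1
    RUNNING = 2
    FAULT = 3


def next_state(state, start_pressed):
    # Bring-up check for internal mistakes. Python drops assert
    # statements when run with -O, so this costs nothing in that mode.
    assert isinstance(start_pressed, bool), "caller passed a non-bool flag"

    if state is State.IDLE:
        return State.RUNNING if start_pressed else State.IDLE
    elif state is State.RUNNING:
        return State.RUNNING
    elif state is State.FAULT:
        return State.IDLE
    else:
        # The "switch default": log it transparently and recover.
        log.error("unexpected state %r; resetting to IDLE", state)
        return State.IDLE
```

The default branch costs nothing on the normal path, and a corrupted state value becomes a log entry and a reset rather than a crash.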
In most cases, the anal data checking belongs where data enters the system … data import, and user interfaces.
Plus, a good program to regularly “lint/fsck” ALL your data for corruption is almost mandatory for any production sanity. Contrary to other, less clueful views, data does rot, and it will become corrupted at some point by a strange mix of hardware and software failures. The best self-defence is to checksum/hash ALL critical data records/elements … once you know the data was correct when written, and the checksum/hash matches on read, it’s not necessary to sanity-check all the data fields again.
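A minimal sketch of that idea in Python, using the standard hashlib and json modules. The function names (seal, unseal, fsck) and the stored-record layout are assumptions for illustration only; a real system would pick its own format and storage.

```python
import hashlib
import json


def seal(record):
    # Serialize deterministically, then store the hash beside the data.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"payload": payload, "sha256": digest}


def unseal(stored):
    # On read, one hash comparison stands in for per-field checks.
    digest = hashlib.sha256(stored["payload"].encode("utf-8")).hexdigest()
    if digest != stored["sha256"]:
        raise ValueError("record failed checksum; data is corrupted")
    return json.loads(stored["payload"])


def fsck(store):
    # Periodic sweep: return the indexes of records that no longer verify.
    bad = []
    for index, stored in enumerate(store):
        try:
            unseal(stored)
        except ValueError:
            bad.append(index)
    return bad
```

The field-by-field validation happens once, before seal is called. After that, a matching hash on read is the evidence that the record is still what was written.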