
What Is a Compiler, Anyway?

Programming 101 for Smart People

“We still have judgement here, that we but teach bloody instructions which, being taught, return to plague th’inventor.” – Macbeth, 1.7

Today we dive into Computer Programming 101. 

Computers don’t speak English. Or Mandarin, or German, or Spanish, or any other human language. Despite how Siri and Alexa may appear, computers and other gadgets are native speakers of their own binary tongues, languages that we can’t understand. 

That means that if you want to program a computer, you have to speak its language, since it’s not going to speak yours. And a computer’s native – and only – language looks like this: 

00110101101101011101010101010101110001010101011

…and so on. That’s it. Just ones and zeros. No commas, no spaces between words, no colorful illustrations. Just a bunch of “bits,” a word that began as a contraction of “binary digits.” Arranging these bits – these ones and zeros – in a different order will make the computer do different things. It’s your job as the programmer to figure out the right order. Good luck with that. 

Back when the earth was cooling, people really did program computers by organizing the ones and zeros. That was painfully slow and notoriously error-prone, so we’ve come up with shortcuts to make the job easier. We call them programming languages, and there are lots of them. Java, C++, FORTRAN, Python, BASIC, Ada, Pascal, Rust, Swift, and dozens of others are all programming languages in widespread use today. Most programmers “speak” one or two programming languages well and might know a smattering of a few others. As with human languages, people tend to have a favorite programming language, and they also tend to stick with the one(s) they learned early on. 

If you write a short program in the C language, it might look something like this: 

#include <stdio.h>

int main() {
   printf("Hello, world!");
   return 0;
}

Big difference from the 0011010110… earlier, huh? It has actual words in it, even if they’re words we don’t normally recognize, like “printf.” It also has a lot of weird punctuation scattered around and curly braces in funny places. It’s certainly not English, but it’s a lot easier to read than all the ones and zeros. Definitely an improvement. 

As another example, if you write the same program using the FORTRAN language instead of C, it looks more like this: 

program hello
  implicit none
  write(*,*) 'Hello world!'
end program hello

Not a huge difference. Like the C program, it kinda, sorta looks like English, but not any English you learned in third grade. (And yes, all programmers around the world use the same pseudo-English programming words regardless of their native spoken language. There is no Swahili version of C, nor a German FORTRAN.) 

But what about our original problem? If computers can understand only ones and zeros, how does a C program help us? That’s where the compiler comes in. 

A compiler is a translator. It translates the almost-English commands that we saw above into the ones and zeros that the computer understands. It’s a one-way translation. It can’t convert the ones and zeros back into C (or Python, or Java, or Rust, etc.) so a compiler can’t help you understand a program after it’s been translated. This becomes important later. 

Many companies create and sell compilers, plus there are free ones, too. Different compilers will translate the same program differently, just as different dictionaries might define the same words in slightly different ways. That’s okay, as long as the translated program produces the same results. When you give someone directions like, “Go to the stop sign, then turn left,” it’s the same as “Turn left at the stop sign.” Even though the directions are different, they’re effectively identical. Same goes for computer programs. 

There’s more than one right way to write a program, and more than one way to translate it from the programming language into the computer’s ones and zeros. Some compilers strive to produce the shortest, most compact translation possible (“Stop sign, then left.”). Others aim for performance (“Drive as fast as you can toward the stop sign, yank the parking brake when you pass the fire hydrant, lean into the gas, swing around to the left, clip the apex, and keep going!”). You get the idea. 

The same compiler can even produce both types of translations, along with various versions in between. You can control these with compiler switches, optional commands that tell the compiler whether you want a small program, a fast program, or something in between. 

You only need to use the compiler once. After your program has been translated, there’s no need to do it over and over. Compilation (translation) is always done ahead of time like this, not each time you run the program. If you change your program, though, you’ll need to compile it again. 

What if the compiler makes a mistake? What if it mistranslates your commands and produces a bad program? That can happen, but it’s rare. Compilers these days are pretty reliable, and if your program doesn’t work the way you intended, the problem generally lies… elsewhere. 

It’s much more common for the programmer to goof something up. Fortunately, compilers can catch a lot of mistakes for you, from simple typos to bigger problems. For example, if you leave out the first curly brace in the C program (the one at the end of the “int main()” line), your compiler will catch the mistake and notify you. It won’t correct your program for you; you’ll still have to do that yourself. But at least the compiler caught the error and stopped trying to translate something that didn’t make sense (to it). 
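As a sketch of that, here’s the same C program with its first curly brace deleted. GCC (used here as an example compiler) refuses to translate it and reports a syntax error; the exact wording of the error message varies from compiler to compiler:

```shell
# The article's C program, but with the opening curly brace removed.
cat > broken.c <<'EOF'
#include <stdio.h>

int main()
   printf("Hello, world!");
   return 0;
}
EOF

# The compiler catches the mistake, prints an error, and produces no program.
gcc broken.c -o broken
```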

What the compiler can’t catch are subtler errors. It can’t know what you want the program to do, only what you’ve told it to do. If you accidentally type “Help me, world!” instead of “Hello, world!” your compiler can’t fix that. It assumes you did that on purpose – and maybe you did. There’s only so much a compiler can fix. The rest is up to you. 

When you write a program using your favorite programming language, that’s called the source code. After a compiler translates it for you, the resulting ones and zeros are called the object code. The C and FORTRAN programs above are both examples of source code, and they both do pretty much the same thing. 

Which raises an interesting question. Can you write a program using any language you want? Yup, pretty much. The two programs above could have been written in Java, Rust, or any other language and they’d still work the same way. You’d use a different compiler to translate the Rust source code into ones and zeros of object code, but the computer wouldn’t know the difference. Once a program is translated, the computer never “sees” the source code and doesn’t know or care what language you used to write it. Programming languages are entirely for our convenience, not the machine’s. 

Do different computers and different microprocessor chips speak different languages? Nope. Well, yes. Sort of. It’s complicated. No computer understands a programming language like Java or C or Ada or any of the others. That’s why we have compilers. It doesn’t matter at all what programming language you use. There’s no connection between source code language and hardware. It’s not as if Intel x86 processors run only Java programs, or ARM processors run only programs written in Swift. It doesn’t work that way. Programming languages are processor-neutral, and vice versa. Any processor supports any language, because it’s the compiler that translates the source code into the binary object code that the processor understands. 

Having said that, different processor chips do have different binary languages that they understand. That’s called the instruction-set architecture, or ISA. All ARM-based chips, for example, use the same instruction set. The same sequence of ones and zeros will work on any ARM-based chip, but not on an Intel or AMD x86 chip, which has a different ISA. Nor will that binary object code work on a Microchip PIC, or a MIPS processor, or RISC-V, or any other ISA. Source code is processor-neutral; object code is not. 
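If you’re curious which ISA family your own machine belongs to, most Unix-style systems (Linux, macOS) will tell you with one command; this is just an aside, nothing in the article depends on it:

```shell
# Ask the operating system which processor architecture it's running on.
# Typical answers: x86_64 for Intel/AMD chips, arm64 or aarch64 for ARM chips.
uname -m
```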

That’s why having access to a program’s source code is important. It allows you to translate one program for multiple different processor chips, even ones with different ISAs. Source code is the “Rosetta Stone” that allows for infinite translations, even for future processors that haven’t been invented yet. With the source code and the appropriate compiler, you can translate any program to run on any chip. You can also edit, change, and update the program and then recompile (re-translate) it to produce new object code. 

That’s part of the charm behind the “open source” software movement. Programmers freely share their source code so that other programmers can (a) see how it works, (b) look for bugs that the original programmer may not have found, and (c) add their own enhancements to the source code, compile it, and distribute an updated version of the program. And all without spending any money. Without the source code, you can’t do any of that.

Control of the source code implies commercial control. Most companies keep their source code secret, just like Coca-Cola protecting its recipe. They, and only they, can compile their programs from the source code into the object code needed for different chips. That allows the company to control where and how its programs will run. Microsoft retains control of the source code for Windows 10, which means that only Microsoft can decide what types of chips (what microprocessor ISA) can run Windows 10. Apple controls the source code for macOS, and so on. You and I can’t run Windows or macOS on just any computer we want to because we’d have no way to compile it. 

Conversely, the source code for Linux is freely available. Anyone can compile it for any machine, which is part of the reason for its popularity. The same goes for FreeRTOS, Apache, PHP, and dozens of other open-source software projects. Open-source software isn’t necessarily better or worse than commercial software. They’re just differing philosophies about distribution, contribution, and control. 

There, now you’re an expert. We’ll expect a functioning relational database program by the end of the week. Hello, world!
