Hello there. Welcome to 2Q 21C. We hope you’ll enjoy your stay. (2Q 21C is the
notation I’ve invented to indicate the second quarter of the 21st century—you’re welcome.)
Over the past few years, we’ve been introduced to a cornucopia of new processor designs, many of which target artificial intelligence (AI) and machine learning (ML) applications.
Most of these machines feature one or more 32-bit central processing units (CPUs) augmented with neural processing accelerators. I think it’s fair to say we’ll be seeing more of these in the not-so-distant future (I can predict this with a high level of confidence because I’ll be introducing the first interesting offering for 2026 in my very next column).
For the moment, however, my mind is firmly focused on 4-bit machines. In the early days of microprocessors, 4-bit architectures made perfect sense. The first commercially successful CPUs of the 1970s were designed when silicon was expensive, transistor budgets were tiny, and every additional bit came at a real cost in terms of area, power, and complexity.
A 4-bit data path was sufficient for calculators, simple controllers, and embedded logic, enabling designers to build complete, programmable systems with only a few thousand transistors. These processors helped establish many of the architectural concepts we still use today—register files, instruction decoding, arithmetic logic units (ALUs), and program control—albeit on a delightfully compact scale.
As fabrication technologies improved and silicon became cheaper, 4-bit processors rapidly fell out of favor. By the late 1970s and early 1980s, 8-bit CPUs offered dramatically more capability for only a modest increase in cost.
As an aside, many people (even those who consider themselves to be “in the know”) are surprised to learn that the 8-bit Zilog Z80 microprocessor, which was introduced in 1976, actually featured a 4-bit ALU at its core (see Ken Shirriff’s column on this topic).
The early 8-bit processors were soon followed by 16-bit and 32-bit designs that enabled richer software, larger memory spaces, and more sophisticated control systems. From that point on, there was little practical reason to choose 4-bit CPUs for new products, particularly in industrial and commercial applications, where flexibility and longevity mattered more than shaving the last fraction of a cent from the bill of materials.
And yet… and yet… 4-bit processing never truly disappeared—it simply went underground. For example, I’m thinking of ancient pieces of test equipment, instrumentation, and factory automation that originated in the early days and have remained in service, persisting not because they are ideal, but because: “If it ain’t broke (and recertifying it would cost a fortune) don’t touch it.”
Also, inside many modern integrated circuits live tiny, highly specialized controllers and state machines that operate on just a handful of bits, quietly managing startup sequences, calibration routines, fault handling, and housekeeping tasks.
These hidden processors aren’t advertised on datasheets and rarely resemble the standalone CPUs of old, but architecturally they serve the same purpose: small, efficient decision-makers doing narrowly defined jobs extremely well. In that sense, while the age of the visible 4-bit CPU may have passed, its spirit remains very much alive, quietly orchestrating the inner workings of today’s supposedly 32-bit and 64-bit world.
But we digress… The reason 4-bit processors are on my mind is that I intend to build a 4-bit machine as an educational project. In particular, I’m planning to create the ALU portion of the machine using simple 74HC00-series devices such as AND, OR, and XOR gates; 2:1, 4:1, and 8:1 multiplexers; D-type flip-flops, etc. Meanwhile, an Arduino Uno will be used to simulate/emulate the rest of the machine.
Most processor designers are constrained by considerations such as efficiency, performance, and power consumption. In my case, my primary concern is that my design should be (a) educational, (b) interesting, and (c) involve a lot of light-emitting diodes (LEDs).
As another aside, on the off-chance anyone has one they wish to give away to make space in their garage or workshop, I’m still more than willing to provide a good home for an IBM 360 Model 91 front panel, but once again, we digress…
Initially, I assumed I’d be implementing a traditional architecture, something similar to the image I just whipped up below. I’ve omitted the control signals and status logic from this diagram for simplicity. Suffice it to say that the 4-bit status register comprises four flags: O (overflow), N (negative), Z (zero), and C (carry). If this diagram doesn’t make any sense to you, then may I be so bold as to suggest you look for a second-hand copy of Bebop Bytes Back: An Unconventional Guide to Computers.

A traditional 4-bit ALU (Source: Clive “Max” Maxfield)
The “0000” input of the 3:1 multiplexer allows us to pass the current value in the A register untouched (by ORing it with zeros, for example) for use with the shift and rotate instructions.
I was visualizing each of these functional blocks implemented on pieces of stripboard, with all the modules attached to a thin sheet of plywood (or something similar). Each arithmetic/logic function would have LEDs on its outputs. The multiplexers would also have LEDs on their control inputs, as well as additional LEDs to indicate which group of inputs was currently selected.
The advantage of this implementation is that it’s bog-standard, straightforward to explain, and simple to understand. The disadvantage is that it’s “so-so soup” that’s been done to death.
I was chatting with my friend Joe Farr about this just before breaking for the New Year holiday. We decided to each mull things over in our own way, with the goal of coming up with something “interesting” and “unexpected.” We set some ground rules as follows:
The ALU should be capable of supporting 16 operations: ADD, ADDC (add with carry), SUB, SUBB (subtract with borrow), AND, OR, XOR, NOT, CMP (compare), SHL (logical shift left), SHR (logical shift right), SHRA (arithmetic shift right), ROL (rotate left), ROR (rotate right), ROLC (rotate left through carry), and RORC (rotate right through carry).
The CPU will have a 4-bit data bus and a 12-bit (4K nybble/nibble) address bus. The idea is to begin by simulating both the CPU and its ALU inside the Arduino Uno. Later, the ALU can be brought out into the physical world with lots of LEDs (it’s important to remember the LEDs).
The first 256 nybbles of memory will reside in the Arduino. The first-pass goal is to create a program that requires only these 256 nybbles to generate 10 random numbers, store them in 10 memory locations, calculate the average, and store the result in an 11th location.
For more complex tasks, a cheap-and-cheerful 8-pin 24C32 4KB EEPROM device can be attached to the Arduino via I2C to provide additional memory, if required. This device is available in a breadboard-mountable PDIP package, thereby making it accessible to beginners.
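For anyone who wants to follow along at home, talking to a 24C32 from the Uno takes only a few lines of code using the standard Wire library. The sketch below is purely illustrative (it isn’t our working code); it assumes the EEPROM’s A2/A1/A0 address pins are strapped to ground, which puts the device at I2C address 0x50, and it lazily stores one 4-bit nybble per 8-bit EEPROM location.

// Minimal, illustrative sketch for using a 24C32 as extra nybble storage.
// Assumes the EEPROM's A2/A1/A0 pins are grounded, i.e., I2C address 0x50.
#include <Wire.h>

const uint8_t EEPROM_ADDR = 0x50;            // 24C32 with A2..A0 = 000

void eepromWriteNybble(uint16_t addr, uint8_t nyb) {
  Wire.beginTransmission(EEPROM_ADDR);
  Wire.write((uint8_t)(addr >> 8));          // high byte of the 12-bit address
  Wire.write((uint8_t)(addr & 0xFF));        // low byte of the address
  Wire.write((uint8_t)(nyb & 0x0F));         // one nybble per byte, for simplicity
  Wire.endTransmission();
  delay(5);                                  // wait out the internal write cycle
}

uint8_t eepromReadNybble(uint16_t addr) {
  Wire.beginTransmission(EEPROM_ADDR);
  Wire.write((uint8_t)(addr >> 8));
  Wire.write((uint8_t)(addr & 0xFF));
  Wire.endTransmission();
  Wire.requestFrom(EEPROM_ADDR, (uint8_t)1);
  return Wire.available() ? (Wire.read() & 0x0F) : 0;
}

void setup() {
  Wire.begin();                              // the Uno acts as the I2C master
}

void loop() {
  // e.g., eepromWriteNybble(0x100, 0xA); and later eepromReadNybble(0x100);
}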
Joe and I just had a post-New Year Zoom call to share and compare our implementations. For my part, I decided to implement all 16 ALU functions as separate logic blocks, as illustrated below.
I opted to have traditional A and B registers, but to have them drive all 16 functions. Every time a new value is copied into an A or B register, all the functions are evaluated. Each of my functions has its own 4-bit data (D) result register and its own 4-bit status (S) register.

Alternative ALU implementations (Source: Clive “Max” Maxfield)
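To make my scheme a little more concrete, here’s a rough sketch of how the Uno-side model might evaluate all 16 function blocks every time A or B is written. Be warned that all of the details (the function numbering, the ordering of the C, Z, N, and O flags within the status nybble, and the carry conventions for the shifts and rotates) are assumptions of mine for illustration purposes, not the final specification.

// Illustrative model of the "evaluate everything on every register write"
// scheme. Each of the 16 function blocks keeps its own 4-bit data (D) and
// 4-bit status (S) nybbles. Flag layout and carry conventions are assumptions.
#include <stdint.h>

struct FnBlock { uint8_t d = 0, s = 0; };      // D and S nybbles per function
FnBlock fn[16];

enum { ADD, ADDC, SUB, SUBB, AND_, OR_, XOR_, NOT_,
       CMP, SHL, SHR, SHRA, ROL, ROR, ROLC, RORC };

// Called every time a new value is copied into the A or B register.
void evaluateAll(uint8_t a, uint8_t b) {
  for (int op = 0; op < 16; ++op) {
    bool cin = fn[op].s & 0x1;                 // each block's private C flag
    uint16_t r = 0;                            // result; carry-out lands in bit 4
    bool c = false, o = false;
    switch (op) {
      case ADD:  r = a + b;                             break;
      case ADDC: r = a + b + cin;                       break;
      case SUB:                                         // A - B as A + ~B + 1
      case CMP:  r = a + (b ^ 0xF) + 1;                 break;
      case SUBB: r = a + (b ^ 0xF) + !cin;              break;  // cin = borrow
      case AND_: r = a & b;                             break;
      case OR_:  r = a | b;                             break;
      case XOR_: r = a ^ b;                             break;
      case NOT_: r = a ^ 0xF;                           break;
      case SHL:  r = a << 1;                            break;
      case SHR:  r = a >> 1;               c = a & 1;   break;
      case SHRA: r = (a >> 1) | (a & 0x8); c = a & 1;   break;
      case ROL:  r = ((a << 1) | (a >> 3)) & 0xF; c = a & 0x8; break;
      case ROR:  r = (a >> 1) | ((a & 1) << 3);   c = a & 1;   break;
      case ROLC: r = (a << 1) | cin;              c = a & 0x8; break;
      case RORC: r = (a >> 1) | (cin << 3);       c = a & 1;   break;
    }
    if (op == ADD || op == ADDC || op == SHL)  c = (r & 0x10) != 0;  // carry out
    if (op == SUB || op == SUBB || op == CMP)  c = (r & 0x10) == 0;  // borrow out
    if (op <= SUBB || op == CMP) {                                   // signed overflow
      uint8_t b2 = (op == ADD || op == ADDC) ? b : (uint8_t)(b ^ 0xF);
      o = (~(a ^ b2) & (a ^ r) & 0x8) != 0;
    }
    uint8_t d = r & 0xF;
    if (op != CMP) fn[op].d = d;               // CMP updates only its flags
    fn[op].s = (c ? 0x1 : 0) | (d == 0 ? 0x2 : 0)
             | ((d & 0x8) ? 0x4 : 0) | (o ? 0x8 : 0);
  }
}

The point is that A and B fan out to all 16 blocks simultaneously, so “performing an operation” later boils down to reading the D and S nybbles of whichever block you happen to be interested in.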
Addresses 0x000 through 0x00F in my main memory contain hard-coded literal values 0b0000 through 0b1111. My programs automatically start running from address 0x010 in the main memory.
My 16 ALU functions live in a separate address space from the main memory. They are addressed as 0x0 through 0xF.
My CPU supports only four instructions: MOVE, JUMP (unconditional jump), JUMP_IF_0 (conditional jump if the specified status bit is 0), and JUMP_IF_1 (conditional jump if the specified status bit is 1). When coded in a 4-bit nybble, these instructions look like the following:
00?? MOVE
01XX JUMP
10?? JUMP_IF_0
11?? JUMP_IF_1
In the case of a MOVE, the ?? bits decode as follows:
00 = Memory to Memory (Source: 3 nybbles [1], Destination: 3 nybbles [1])
01 = Memory to A or B (Source: 3 nybbles [1], Destination: 1 nybble [2])
10 = Data to Function (Source: 1 nybble [3], Destination: 1 nybble [4])
11 = Function to Memory (Source: 1 nybble [5], Destination: 3 nybbles [1])
[1] An address in main memory.
[2] 0b0000 = A, 0b0001 = B.
[3] This will be the 0b0000 (0) or 0b0001 (1) data value that’s written into this function’s carry (C) status bit.
[4] This will be the address of the function in question (from 0x0 to 0xF). The LSB of the data nybble will be written into the C (carry) bit in this function’s status register.
[5] This will be the address of the function in question (from 0x0 to 0xF). The value read out will be a copy of this function’s data nybble.
In the case of the unconditional JUMP instruction, the XX bits are “don’t care” (we might find a use for them later). A JUMP will be followed by 3 destination address nybbles. The program counter (PC) will be set to this address.
In the case of the conditional jump instructions, the ?? bits decode as follows:
00 = bit 0 in the specified status register
01 = bit 1 in the specified status register
10 = bit 2 in the specified status register
11 = bit 3 in the specified status register
A conditional jump instruction (JUMP_IF_0 or JUMP_IF_1) will be followed by 1 source address nybble and 3 destination address nybbles. The source address nybble will be the address of the function in question (from 0x0 to 0xF); in this case, the system will look at the status register nybble associated with the specified function. The three destination address nybbles will be the target jump address in memory if the condition associated with this jump is met.
Reading the data value from a function doesn’t affect the function’s copy of that value. Performing a conditional jump doesn’t affect the function’s local status register. A program could load values into the A and B registers and then access the outputs—both data and status (via conditional jumps)—from multiple ALU functions.
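Purely to show how little decoding is involved, here’s a sketch of a single fetch/decode/execute step for this four-instruction machine. Once again, the details are mine rather than the final design: the nybble-per-byte memory model, the most-significant-nybble-first ordering of 3-nybble addresses, and the setA()/setB() helpers (which would go on to re-evaluate all 16 function blocks) are all assumptions.

// Sketch of one fetch/decode/execute step for the four-instruction CPU.
// Memory is modeled as 4K nybbles; addresses 0x000-0x00F hold the literals
// 0b0000-0b1111 and programs start at 0x010. Details are illustrative only.
#include <stdint.h>

uint8_t  mem[4096];                              // 4K nybbles, one per byte
uint16_t pc = 0x010;                             // programs start here

struct FnBlock { uint8_t d, s; };
FnBlock  fn[16];                                 // the 16 ALU function blocks
uint8_t  regA, regB;

uint8_t  readMem(uint16_t a)             { return mem[a & 0xFFF] & 0xF; }
void     writeMem(uint16_t a, uint8_t v) { mem[a & 0xFFF] = v & 0xF; }
void     setA(uint8_t v) { regA = v & 0xF; /* re-evaluate all 16 function blocks here */ }
void     setB(uint8_t v) { regB = v & 0xF; /* re-evaluate all 16 function blocks here */ }

uint16_t fetchAddr() {                           // 3 nybbles -> 12-bit address (MSN first)
  uint16_t a = 0;
  for (int i = 0; i < 3; ++i) a = (a << 4) | readMem(pc++);
  return a;
}

void step() {
  uint8_t inst = readMem(pc++);
  uint8_t op   = inst >> 2;                      // top two bits select the instruction
  uint8_t mod  = inst & 0x3;                     // the ?? (or don't-care XX) bits
  switch (op) {
    case 0:                                      // MOVE
      if (mod == 0)      { uint16_t s = fetchAddr(); writeMem(fetchAddr(), readMem(s)); }
      else if (mod == 1) { uint16_t s = fetchAddr(); uint8_t d = readMem(pc++);
                           if (d == 0) setA(readMem(s)); else setB(readMem(s)); }
      else if (mod == 2) { uint8_t v = readMem(pc++); uint8_t f = readMem(pc++);
                           fn[f].s = (fn[f].s & 0xE) | (v & 0x1); }   // LSB -> C flag
      else               { uint8_t f = readMem(pc++); writeMem(fetchAddr(), fn[f].d); }
      break;
    case 1:                                      // JUMP (unconditional; XX ignored)
      pc = fetchAddr();
      break;
    case 2:                                      // JUMP_IF_0
    case 3: {                                    // JUMP_IF_1
      uint8_t  f      = readMem(pc++);           // which function's status register
      uint16_t target = fetchAddr();
      bool     bit    = (fn[f].s >> mod) & 0x1;  // ?? selects status bit 0..3
      if (bit == (op == 3)) pc = target;         // jump if the bit is 0 / 1 as required
      break;
    }
  }
}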
That’s all I have so far.
Interestingly enough, Joe also decided that addresses 0x000 through 0x00F in the main memory will contain hard-coded literal values 0b0000 through 0b1111. Furthermore, Joe also decided to support the concept of A and B registers, but after that, he went in a completely different direction.
Joe’s A and B registers, ALU_Mode, ALU_Out, and ALU_Status are all readable/writable nybbles in his main memory map. As illustrated in the image above, Joe supports only one ALU functional entity, whose mode he populates with whichever function is required at that time. That is, Joe first loads his A and B registers, and then loads his ALU_Mode, after which he can access the ensuing ALU_Out and ALU_Status values.
Another aspect of Joe’s implementation is that it supports only a single instruction: MOVE. Every MOVE instruction has an associated 3-nybble source address and a 3-nybble destination address. Furthermore, since every instruction is a MOVE, there’s no need to represent it with an opcode, which means each instruction is essentially composed only of a source and destination address.
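In other words, Joe has essentially built what the textbooks would call a “move machine” (a transport-triggered architecture), in which computation happens as a side effect of shuffling data around. The sketch below captures the idea; the memory-mapped addresses I’ve assigned to A, B, ALU_Mode, ALU_Out, and ALU_Status are placeholders of my own invention, not Joe’s actual memory map.

// Sketch of a MOVE-only (transport-triggered) execution loop. The register
// addresses are placeholders; writing to A, B, or ALU_Mode triggers the single
// ALU block to recompute ALU_Out and ALU_Status as a side effect of the move.
#include <stdint.h>

const uint16_t REG_A      = 0xFF0;   // placeholder memory-mapped locations
const uint16_t REG_B      = 0xFF1;
const uint16_t ALU_MODE   = 0xFF2;
const uint16_t ALU_OUT    = 0xFF3;
const uint16_t ALU_STATUS = 0xFF4;

uint8_t  mem[4096];                  // 4K nybbles, one per byte
uint16_t pc = 0x010;

void aluRecompute() {
  // Apply the function selected by mem[ALU_MODE] to mem[REG_A] and mem[REG_B],
  // then update mem[ALU_OUT] and mem[ALU_STATUS] (details omitted here).
}

uint16_t fetchAddr() {               // 3 nybbles -> 12-bit address
  uint16_t a = 0;
  for (int i = 0; i < 3; ++i) a = (a << 4) | (mem[pc++ & 0xFFF] & 0xF);
  return a;
}

void step() {                        // no opcode needed -- every instruction is a MOVE
  uint16_t src = fetchAddr();
  uint16_t dst = fetchAddr();
  mem[dst] = mem[src] & 0xF;
  if (dst == REG_A || dst == REG_B || dst == ALU_MODE)
    aluRecompute();                  // the ALU updates as a side effect of the write
}

Jumps in this sort of machine are typically handled by making the program counter itself one of the memory-mapped destinations you can MOVE to.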
To be honest, I’ve only skimmed the surface of Joe’s implementation, which also boasts (“flaunts” might be a better word) two index registers and a stack, along with the ability to push data values onto the stack and pop them back off again. And then things start to get interesting.
I’m still trying to wrap my brain around all of this. I’ll report further in a future column if you are interested in my doing so. In the meantime, I’d be extremely interested to hear your thoughts. What would you do if it fell to you to create a 4-bit processor, especially if the task were to come up with an innovative and/or unusual solution without regard for performance or power?




“…(c) involve a lot of light-emitting diodes (LEDs).”
– That’s what it’s all about.
I know one guy who set out to build a digital computer using only the technologies available circa 1899 — specifically small neon lamps and light-dependent resistors (LDRs), which means the logic gates themselves light up — mega-cool!!!
I remember the NE-77 three-electrode neon lamp.
Joe’s solution has 12 inputs (A[3..0], B[3..0], and F[3..0]) and eight outputs (D[3..0] and S[3..0]). Thus, Joe’s solution could be implemented as a lookup table in a 4096-byte read-only memory.
If the outputs of Max’s solution feed multiplexers that choose which function’s D and S values are used for any given instruction, then Max’s solution plus the multiplexers could also be implemented as a lookup table in a 4096-byte read-only memory.
Hi Peter — Remember that the status flags O, N, Z, and C are also outputs, plus the C flag can act as an input to certain functions (ADDC, SUBB, ROLC, RORC, etc.)
Yes, the ROM would be 8192 bytes to include the C input. Also, for LEDs and for teaching, the discrete gates are much better.
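For anyone who’s curious, generating such a ROM image on a PC is only a screenful of C++. The sketch below builds the 8,192-byte table; the address packing (the incoming C flag in bit 12, F in bits 11 to 8, A in bits 7 to 4, B in bits 3 to 0, with the S nybble in the upper half of each byte and D in the lower half) and the toy alu() stand-in are my own choices purely for illustration.

// Building the 8,192-byte lookup-table ROM on a PC (illustrative packing only).
#include <cstdint>
#include <cstdio>

// Toy stand-in for the full 16-function ALU model -- swap in the real thing.
static void alu(uint8_t f, uint8_t a, uint8_t b, bool cin,
                uint8_t &d, uint8_t &s) {
  uint16_t r = (f == 1) ? (uint16_t)(a + b + cin) : (uint16_t)(a + b); // ADD/ADDC only
  d = r & 0xF;
  s = ((r & 0x10) ? 0x1 : 0) | (d == 0 ? 0x2 : 0) | ((d & 0x8) ? 0x4 : 0);
}

int main() {
  static uint8_t rom[8192];
  for (uint16_t addr = 0; addr < 8192; ++addr) {
    bool    cin = (addr >> 12) & 0x1;   // incoming C flag
    uint8_t f   = (addr >> 8)  & 0xF;   // which of the 16 functions
    uint8_t a   = (addr >> 4)  & 0xF;   // the A operand
    uint8_t b   =  addr        & 0xF;   // the B operand
    uint8_t d, s;
    alu(f, a, b, cin, d, s);
    rom[addr] = (uint8_t)((s << 4) | (d & 0xF)); // S in the high nybble, D in the low
  }
  FILE *out = std::fopen("alu_rom.bin", "wb");   // raw image for a ROM programmer
  if (out) { std::fwrite(rom, 1, sizeof rom, out); std::fclose(out); }
  return 0;
}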
Working on embedded systems in 1980, we were already used to programming software into PROMs. I was tasked with building a tester to reproduce the existing synchronous clocked waveforms of one customer’s system, so another engineer could test our interface board before we had access to the customer’s system. I used a crystal oscillator driving a synchronous counter driving a PROM driving D flip-flops.
We used PROMs to implement many things in the early 1980s, from gathering simple “glue logic” functions to acting as look-up tables to forming the heart of simple state machines. Ah, the good old days 🙂
It has been proven that the best 4-bit ALU is the SN74181.
Perhaps it would be better to follow its inner structure.
Ah, the 74181. That was an iconic 4-bit ALU. I say “was” because it’s now obsolete and no longer actively made by the major manufacturers, but you can still buy them as new old stock (NOS).
It had ~70–75 gate equivalents (~170+ transistors). It boasted 32 total functions (16 logic and 16 arithmetic), but many of these were “odd” functions that fell out naturally from the internal logic after the “real” functions had been implemented.
The 181 offered a masterclass in elegant, transistor-efficient digital design. It enabled bit-slice CPU designs (e.g., 8-bit, 16-bit, 32-bit CPUs built from 4-bit slices); it was used in machines like the PDP-11, VAX, and countless minicomputers; and it taught an entire generation of engineers how ALUs really work.
But where would be the fun in simply replicating something that has been beaten (analyzed) to death? No, I think it’s better to create something innovative and different to give everyone something to argue about LOL
Anyway, designing an ALU out of SN7400-series parts needs a lot of soldering. To minimize it, the CMP, XOR, ADD, and negation functions are usually performed in a single adder by controlling the carry and XOR inputs, etc.
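Indeed. For anyone following along, the gist of that shared-adder trick fits in a few lines of C++ (a sketch of mine, using the common “carry out means no borrow” convention for subtraction):

// The classic shared-adder trick: one 4-bit adder covers ADD, SUB, CMP, and
// negation just by optionally inverting B (an XOR with 0b1111) and forcing
// the carry-in. CMP is SUB with the result discarded; NEG is 0 - B.
#include <stdint.h>

struct AddResult { uint8_t sum; bool carry; };

AddResult adder4(uint8_t a, uint8_t b, bool invertB, bool cin) {
  uint8_t  bEff = invertB ? (uint8_t)(b ^ 0xF) : b;  // the XOR gates in front of the adder
  uint16_t r    = a + bEff + (cin ? 1 : 0);
  return { (uint8_t)(r & 0xF), (r & 0x10) != 0 };
}

// ADD:  adder4(a, b, false, false)
// SUB:  adder4(a, b, true,  true)   -- carry out of 1 means "no borrow"
// CMP:  adder4(a, b, true,  true)   -- keep the flags, throw the sum away
// NEG:  adder4(0, b, true,  true)   -- two's-complement negation of B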