“You say you want a revolution?” – John Lennon
There’s leading edge, there’s bleeding edge, there’s double-edged, and there’s over the edge. It’s hard to say which term applies to a new style of processor beginning to sneak out of the labs.
It’s called EDGE – for Explicit Data Graph Execution – and it’s… different. Microsoft itself has designed a custom EDGE processor and implemented it in hardware. Is the software giant really designing CPUs now?
Evidently so, and the company is pretty far down the path, too. It’s got Windows 10, Linux, FreeRTOS, .NET Core libraries, a C++ compiler, and more running on an FPGA implementation of its EDGE-style processor. Clearly, this is more than just a summer science experiment. The project isn’t exactly secret – Microsoft, along with Qualcomm, has publicly demonstrated Windows 10 running on the new processor – but neither company is saying much about their progress. So, what is EDGE and what does it offer that today’s CPUs don’t have?
If you want the 15-second elevator pitch, think VLIW plus 10 years.
To be honest, I don’t know much about EDGE, and even less about E2, the name of Microsoft’s instantiation of it. But based on the publicly available information (and the parts I’m allowed to share), here’s what we know.
The concept behind EDGE is parallelism – lots and lots of parallelism. You thought your eight-core Ryzen or Core i7 was highly parallel? Try 1000 cores all hammering away at once. EDGE processors have gobs of execution units, typically numbering in the hundreds or the thousands, and EDGE code tries to broadside as many of those execution units as possible all at once. EDGE also explicitly encodes data dependencies into the binary so that the CPU hardware doesn’t have to find them on its own. Both of these characteristics are laudable goals, but both have their problems, too. It’s not clear yet whether EDGE is a big improvement over current CPU design philosophies or whether it’s just another architectural dead-end like so many others.
EDGE’s massive parallelism has, er, parallels to the VLIW (Very Long Instruction Word) craze from the previous decade or so. VLIW processors thrived (theoretically) on spamming lots of instructions at once to lots of hardware. Rather than sip instructions 16 or 32 bits at a time, a VLIW machine would gobble 256 or 512 bits of opcode (or even more) all at once. VLIW compilers were smart enough (again, in theory) to find batches of instructions that could be dispatched simultaneously without generating nasty interlocks, hazards, and data dependencies that the hardware would have to untangle.
In essence, a VLIW machine is a RISC machine turned sideways. It’s wide instead of deep.
That’s all fine and dandy, except that VLIW compilers weren’t very good at finding clusters of instructions that could execute simultaneously, or at packaging those instructions together into a very wide instruction word without a lot of NOPs for padding. The hardware also turned out to be devilishly tricky. Binary compatibility usually suffered, too, because different VLIW machines (even those from the same family) had different execution resources and, therefore, needed a different binary encoding. Recompiling was routine.
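To make the padding problem concrete, here is a minimal Python sketch of a toy in-order bundle packer. It is an illustration only, not any real VLIW compiler’s algorithm: the `Instr` tuple, the `pack_bundles` function, the register names, and the four-slot width are all assumptions invented for the example, and the packer checks only read-after-write and write-after-write conflicts inside a bundle.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    name: str              # label shown in the packed bundle
    dest: str              # register this instruction writes
    srcs: Tuple[str, ...]  # registers this instruction reads


def pack_bundles(instrs: List[Instr], width: int) -> List[List[str]]:
    """Greedy in-order packing into fixed-width bundles (hypothetical toy).

    An instruction joins the current bundle only if a slot is free and it
    neither reads nor overwrites a register written earlier in that bundle.
    Unused slots are padded with "nop".
    """
    bundles: List[List[str]] = []
    current: List[str] = []
    written = set()
    for ins in instrs:
        conflict = ins.dest in written or any(s in written for s in ins.srcs)
        if current and (len(current) == width or conflict):
            # Close the bundle and pad the leftover slots.
            bundles.append(current + ["nop"] * (width - len(current)))
            current, written = [], set()
        current.append(ins.name)
        written.add(ins.dest)
    if current:
        bundles.append(current + ["nop"] * (width - len(current)))
    return bundles


def nop_count(bundles: List[List[str]]) -> int:
    """Count padding slots across all bundles."""
    return sum(bundle.count("nop") for bundle in bundles)


# A short dependency chain: two independent loads, then three dependent ops.
program = [
    Instr("load a", "a", ("x",)),
    Instr("load b", "b", ("y",)),
    Instr("add c", "c", ("a", "b")),
    Instr("mul d", "d", ("c",)),
    Instr("sub e", "e", ("d",)),
]

bundles = pack_bundles(program, width=4)
for bundle in bundles:
    print(bundle)
print("padding slots:", nop_count(bundles), "of", 4 * len(bundles))
```

On this five-instruction chain only the two loads can share a bundle, so 11 of the 16 slots end up as padding. A real scheduler would reorder and unroll to do better, but it cannot invent independence that the code doesn’t have.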
Very few VLIW processors saw the light of day, and fewer still were sold. Intel’s Itanium is probably the world’s most successful VLIW design, and the less said about that, the better.
EDGE’s other neat trick is its hard-coded data dependencies. An EDGE compiler optimizes code like other compilers do, looking for instructions that aren’t dependent on each other’s data. Where instructions do depend on each other, the compiler explicitly tags those dependencies in the binary.
EDGE machines treat entire subroutines as one mega-instruction. Most well-written subroutines have a defined entry and exit point. More importantly, they also have a defined method for passing data in and out, usually by dereferencing pointers. Ideally, code never jumps out of the subroutine and data never sneaks in except through those well-defined interfaces. Encapsulating functions in this way makes each subroutine a self-contained block of code that can (theoretically) be optimized as a whole.
An EDGE processor works on whole subroutines at a time. It’s the compiler’s job to package those subroutines and present them to the hardware in such a way that the processor doesn’t have to check for data dependencies at run-time. With luck, you’ve eliminated all the Byzantine hardware like reorder buffers, reservation stations, and speculative execution that keep the chip honest but that don’t add anything to performance.
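One way to picture “dependencies tagged in the binary” is a block in which each instruction names the instructions that consume its result, and an instruction fires as soon as its operands have arrived. The Python sketch below is a toy under that assumption. It is not E2’s actual encoding, which I haven’t seen; `run_block`, the dictionary layout, and the `(index, slot)` target format are all hypothetical, and every operation is assumed to take exactly two operands.

```python
import operator
from collections import deque

# Hypothetical opcode table; every op here takes exactly two operands.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_block(block, inputs):
    """Execute one block in dataflow order (toy model, not a real ISA).

    block:  list of {"op": str, "targets": [...]} entries. A target is either
            an (instruction_index, operand_slot) pair or a string naming a
            block output.
    inputs: list of (value, target) pairs injected at block entry.
    Returns (outputs, firing_order).
    """
    operands = [{} for _ in block]  # operand slots received so far
    ready = deque()                 # instructions with both operands present
    outputs = {}

    def deliver(value, target):
        if isinstance(target, str):  # named block output
            outputs[target] = value
            return
        idx, slot = target
        operands[idx][slot] = value
        if len(operands[idx]) == 2:  # both operands arrived, so it can fire
            ready.append(idx)

    for value, target in inputs:
        deliver(value, target)

    fired = []
    while ready:
        idx = ready.popleft()
        ins = block[idx]
        result = OPS[ins["op"]](operands[idx][0], operands[idx][1])
        fired.append(idx)
        for target in ins["targets"]:
            deliver(result, target)  # forward straight to the consumers
    return outputs, fired


# (a + b) * (a - b): the add and the sub are independent, the mul needs both.
block = [
    {"op": "add", "targets": [(2, 0)]},
    {"op": "sub", "targets": [(2, 1)]},
    {"op": "mul", "targets": ["result"]},
]
a, b = 7, 3
inputs = [(a, (0, 0)), (a, (1, 0)), (b, (0, 1)), (b, (1, 1))]

outputs, fired = run_block(block, inputs)
print(outputs, fired)
```

Nothing in the loop compares register names or searches for hazards. The order of execution falls out of the targets the “compiler” wrote into the block, which is the work the hardware is being spared.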
Microsoft’s brief online description of the E2 project has been removed, which the company characterizes as both routine and unimportant. They emphasize that E2 is just a research project, not a commercial product in development. Even so, work on E2 has been going on for about eight years, and the team has grown to “dozens of engineers spanning multiple divisions, companies, and countries.” Plus, there’s that public demo at ISCA last month. E2 may not be destined for real products at Microsoft, but it’s not just a casual wheeze, either. You don’t port Windows 10 to a radically new CPU architecture for the laughs.
What about the rest of the world outside of Microsoft? Is EDGE going to be the Next Big Thing™ for microprocessor designs? Magic 8 Ball says… probably not.
EDGE is certainly enticing. The siren call of massive performance through massive parallelism has lured many a designer onto the rocky shoals of despair. Transistors are cheap, so throwing hardware at the problem makes economic sense. But does it make practical sense?
Relatively few programs have the kind of parallelism that EDGE, or VLIW, or even big RISC machines can exploit. That’s just not how we code. Throw all the hardware you want at it; it’s still not going to go much faster because there’s nothing for the machine to do except hurry up and wait. If what you want is a massively parallel machine that can do NOPs in record time, knock yourself out.
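The standard back-of-the-envelope model for this is Amdahl’s law, which I’m borrowing here as an illustration; it isn’t part of any EDGE material. The function name and the sample fractions below are my own assumptions, not measurements of real programs.

```python
def amdahl_speedup(parallel_fraction: float, units: int) -> float:
    """Amdahl's law: best-case speedup when only part of the work scales.

    parallel_fraction: share of run time that can be spread across units (0..1)
    units:             number of execution units available
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / units)


# Illustrative fractions only; real code would have to be profiled.
for p in (0.50, 0.90, 0.99):
    print(f"{p:.0%} parallel on 1000 units: {amdahl_speedup(p, 1000):.1f}x")
```

Under this model, code that is half serial barely doubles in speed on 1000 execution units, and even code that is 90% parallel tops out just under 10x.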
I’ll be the first to admit that I haven’t looked deeply into EDGE instruction sets, reviewed schematics, or pored over detailed block diagrams. There’s a lot I still don’t know. But as it stands, EDGE looks like an old cake with new frosting. It fiddles with the details of implementation, but it doesn’t sidestep any fundamental problems. Compilers just aren’t as omniscient as we’d like them to be, and runtime hazards are too abundant to simply code around them. We want our processors to be fast and efficient, but we’re not giving them problems that can be solved that way. Messy code requires messy CPUs.
What do you think? Are EDGE processors the wave of the future, or just another interesting science project for CPU nerds?
It all depends on the application. The key differentiator here might be the front-end. Design the application from the start as independent parallel processing “processes” (in essence, forget the global state machine) and the rest follows without headaches. Remember the transputer? It was based on the CSP process algebra. Even two instructions could be two parallel processes. This EDGE looks like a hardware version of CSP (the transputer was that too but was still a sequential machine with a very fast context switch). Now, with AI coming to the foreground again, requiring a lot of front-end data-parallel processing, this thing might have a future. GPUs are fine and good at data parallelism but very inefficient when it comes to power consumption. EDGE with a good front-end compiler might do the job better.
We write algorithms based on the communications structure of the hardware they run on. Most problems can be expressed with very different algorithms that are highly optimal for different highly parallel CPU, memory, and communications architectures.
When single processor/memory systems are targeted, we tend to write monolithic algorithms and programs, maybe as several processes and packaged neatly as a large collection of functions/methods.
When the targeted architecture is a closely coupled highly symmetric multiprocessor or multi-core system, high bandwidth shared memory communication is great as long as the caches are coherent. Algorithms suddenly have to become L1 and L2 cache optimized.
When the targeted architecture is a more loosely coupled highly symmetric multiprocessor system, aka NUMA, then local memory optimization becomes important, and we structure our algorithms around optimizing for local memory access with higher costs for processor node to node communications.
When the targeted architecture becomes network based, aka MPI clusters, the communications costs become even more important to optimize around.
Writing good EDGE code will impose similarly important architectural constraints on how algorithms are optimized for these machines.