
Living on the EDGE

Microsoft’s Semi-Secret E2 EDGE Processor Might be the Next Big Thing

“You say you want a revolution?” – John Lennon

There’s leading edge, there’s bleeding edge, there’s double-edged, and there’s over the edge. It’s hard to say which term applies to a new style of processor beginning to sneak out of the labs.

It’s called EDGE – for Explicit Data Graph Execution – and it’s… different. Microsoft itself has designed a custom EDGE processor and implemented it in hardware. Is the software giant really designing CPUs now?

Evidently so, and the company is pretty far down the path, too. It’s got Windows 10, Linux, FreeRTOS, .NET Core libraries, a C++ compiler, and more running on an FPGA implementation of its EDGE-style processor. Clearly, this is more than just a summer science experiment. The project isn’t exactly secret – Microsoft, along with Qualcomm, has publicly demonstrated Windows 10 running on the new processor – but neither company is saying much about their progress. So, what is EDGE and what does it offer that today’s CPUs don’t have?

If you want the 15-second elevator pitch, think VLIW plus 10 years.

To be honest, I don’t know much about EDGE, and even less about E2, the name of Microsoft’s instantiation of it. But based on the publicly available information (and the parts I’m allowed to share), here’s what we know.

The concept behind EDGE is parallelism – lots and lots of parallelism. You thought your eight-core Ryzen or Core i7 was highly parallel? Try 1000 cores all hammering away at once. EDGE processors have gobs of execution units, typically numbering in the hundreds or the thousands, and EDGE code tries to broadside as many of those execution units as possible all at once. EDGE also explicitly encodes data dependencies into the binary so that the CPU hardware doesn’t have to find them on its own. Both of these characteristics are laudable goals, but both have their problems, too. It’s not clear yet whether EDGE is a big improvement over current CPU design philosophies or whether it’s just another architectural dead-end like so many others.
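To make the “explicit data graph” part concrete, here’s a back-of-the-napkin sketch in Python. The three-instruction block and its encoding are entirely made up (nothing here reflects the actual E2 instruction format), but it shows the idea: the compiler records which instructions feed which, so the hardware can see at a glance what’s ready to issue.

```python
# A back-of-the-napkin model of an explicit data-dependence graph for one block.
# Instruction names, format, and the block itself are hypothetical; they do not
# reflect Microsoft's actual E2 encoding.

from dataclasses import dataclass, field

@dataclass
class Instr:
    name: str                                     # e.g. "i0"
    op: str                                       # e.g. "add"
    targets: list = field(default_factory=list)   # instructions that consume this result

# Tiny block: t0 = a + b; t1 = c * d; t2 = t0 - t1
i0 = Instr("i0", "add")   # produces t0
i1 = Instr("i1", "mul")   # produces t1
i2 = Instr("i2", "sub")   # consumes t0 and t1

# The compiler records the edges explicitly: i0 and i1 both feed i2.
i0.targets.append(i2)
i1.targets.append(i2)

# Anything nobody else feeds is ready immediately and can issue in parallel.
block = [i0, i1, i2]
consumers = {t.name for i in block for t in i.targets}
ready = [i.name for i in block if i.name not in consumers]
print("issue in parallel:", ready)    # -> ['i0', 'i1']
```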

EDGE’s massive parallelism has, er, parallels to the VLIW (Very Long Instruction Word) craze from the previous decade or so. VLIW processors thrived (theoretically) on spamming lots of instructions at once to lots of hardware. Rather than sip instructions 16 or 32 bits at a time, a VLIW machine would gobble 256 or 512 bits of opcode (or even more) all at once. VLIW compilers were smart enough (again, in theory) to find batches of instructions that could be dispatched simultaneously without generating nasty interlocks, hazards, and data dependencies that the hardware would have to untangle.

In essence, a VLIW machine is a RISC machine turned sideways. It’s wide instead of deep.

That’s all fine and dandy, except that VLIW compilers weren’t very good at finding clusters of instructions that could execute simultaneously, or at packaging those instructions together into a very wide instruction word without a lot of NOPs for padding. The hardware also turned out to be devilishly tricky. Binary compatibility usually suffered, too, because different VLIW machines (even those from the same family) had different execution resources and, therefore, needed a different binary encoding. Recompiling was routine.
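To see why the packing is hard, here’s a toy bundle packer, again in Python, for a hypothetical four-slot machine. It greedily groups register-independent instructions into bundles and pads the leftover slots with NOPs, which is exactly where VLIW’s wasted encoding space came from.

```python
# A toy VLIW bundle packer: four issue slots, hypothetical three-address code.
# An instruction may not sit in the same bundle as one that produces a value it reads.

NOP = ("nop", None, None)
SLOTS = 4

def pack(instrs):
    bundles, current, written = [], [], set()
    for dest, src1, src2 in instrs:
        depends = src1 in written or src2 in written
        if depends or len(current) == SLOTS:
            # Close the bundle and pad the empty slots with NOPs.
            bundles.append(current + [NOP] * (SLOTS - len(current)))
            current, written = [], set()
        current.append((dest, src1, src2))
        written.add(dest)
    if current:
        bundles.append(current + [NOP] * (SLOTS - len(current)))
    return bundles

# A mostly serial sequence: r2 needs r1, r3 needs r2, so the bundles are mostly NOPs.
code = [("r1", "a", "b"), ("r2", "r1", "c"), ("r3", "r2", "d"), ("r4", "e", "f")]
for bundle in pack(code):
    print(bundle)
```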

Very few VLIW processors saw the light of day, and fewer still were sold. Intel’s Itanium is probably the world’s most successful VLIW design, and the less said about that, the better.

EDGE’s other neat trick is its hard-coded data dependencies. An EDGE compiler optimizes code like other compilers do, looking for instructions that aren’t dependent on each other’s data – or, if they are, explicitly tagging the dependencies in the binary.

EDGE machines treat entire subroutines as one mega-instruction. Most well-written subroutines have a defined entry and exit point. More importantly, they also have a defined method for passing data in and out, usually by dereferencing pointers. Ideally, code never jumps out of the subroutine and data never sneaks in except through those well-defined interfaces. Encapsulating functions in this way makes each subroutine a self-contained block of code that can (theoretically) be optimized as a whole.

An EDGE processor works on whole subroutines at a time. It’s the compiler’s job to package those subroutines and present them to the hardware in such a way that the processor doesn’t have to check for data dependencies at run-time. With luck, you’ve eliminated all the Byzantine hardware like reorder buffers, reservation stations, and speculative execution that keep the chip honest but that don’t add anything to performance.
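Here’s a similarly hypothetical sketch of the execution side: once the compiler has recorded each instruction’s operand sources, “running” a block amounts to letting instructions fire as their operands arrive. No reorder buffer, no runtime dependence check; the block contents below are invented purely for illustration.

```python
# A dataflow-style run of one block: each instruction explicitly lists the sources
# it waits on (the edges the compiler already encoded), and fires the moment they
# have all arrived. The block contents are invented for illustration only.

import operator

OPS = {"add": operator.add, "mul": operator.mul, "sub": operator.sub}

block = {
    "i0": {"op": "add", "needs": ["a", "b"]},     # t0 = a + b
    "i1": {"op": "mul", "needs": ["c", "d"]},     # t1 = c * d
    "i2": {"op": "sub", "needs": ["i0", "i1"]},   # t2 = t0 - t1
}

def run(block, inputs):
    values = dict(inputs)                  # operands that have arrived so far
    fired = set()
    while len(fired) < len(block):
        for name, ins in block.items():
            if name not in fired and all(s in values for s in ins["needs"]):
                x, y = (values[s] for s in ins["needs"])
                values[name] = OPS[ins["op"]](x, y)   # fire: operands are ready
                fired.add(name)
    return values

print(run(block, {"a": 1, "b": 2, "c": 3, "d": 4})["i2"])   # (1+2) - (3*4) = -9
```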

Microsoft has taken down its brief online description of the E2 project, a move the company characterizes as both routine and unimportant. They emphasize that E2 is just a research project, not a commercial product in development. Even so, work on E2 has been going on for about eight years, and the team has grown to “dozens of engineers spanning multiple divisions, companies, and countries.” Plus, there’s that public demo at ISCA last month. E2 may not be destined for real products at Microsoft, but it’s not just a casual wheeze, either. You don’t port Windows 10 to a radically new CPU architecture for the laughs.

What about the rest of the world outside of Microsoft? Is EDGE going to be the Next Big Thing™ for microprocessor designs? Magic 8 Ball says… probably not.

EDGE is certainly enticing. The siren call of massive performance through massive parallelism has lured many a designer onto the rocky shoals of despair. Transistors are cheap, so throwing hardware at the problem makes economic sense. But does it make practical sense?

Relatively few programs have the kind of parallelism that EDGE, or VLIW, or even big RISC machines can exploit. That’s just not how we code. Throw all the hardware you want at it; it’s still not going to go much faster because there’s nothing for the machine to do except hurry up and wait. If what you want is a massively parallel machine that can do NOPs in record time, knock yourself out.
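For a feel of what “nothing for the machine to do” looks like, consider a loop whose iterations form one long dependence chain. This toy Python example wouldn’t get any faster on a thousand execution units, because each step needs the result of the one before it.

```python
# A loop-carried dependence chain: every iteration reads the value the previous
# iteration just produced, so only one useful operation is ever ready at a time,
# no matter how many execution units are waiting around.

def chained(xs, seed=0.0):
    acc = seed
    for x in xs:
        acc = acc * 0.5 + x    # depends on the acc computed one step earlier
    return acc

print(chained([1.0, 2.0, 3.0, 4.0]))   # 6.125
```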

I’ll be the first to admit that I haven’t looked deeply into EDGE instruction sets, reviewed schematics, or pored over detailed block diagrams. There’s a lot I still don’t know. But as it stands, EDGE looks like an old cake with new frosting. It fiddles with the details of implementation, but it doesn’t sidestep any fundamental problems. Compilers just aren’t as omniscient as we’d like them to be, and runtime hazards are too abundant to simply code around them. We want our processors to be fast and efficient, but we’re not giving them problems that can be solved that way. Messy code requires messy CPUs.

3 thoughts on “Living on the EDGE”

    1. It all depends on the application. The key differentiator here might be the front end. Design the application from the start as independent parallel processing “processes” (in essence, forget the global state machine) and the rest follows without headaches. Remember the transputer? It was based on the CSP process algebra. Even two instructions could be two parallel processes. This EDGE looks like a hardware version of CSP (the transputer was that too, but was still a sequential machine with a very fast context switch). Now, with AI coming to the foreground again, requiring a lot of front-end data parallel processing, this thing might have a future. GPUs are fine and good at data parallelism but very inefficient when it comes to power consumption. EDGE with a good front-end compiler might do the job better.

  2. We write algorithms based on the communications structure of the hardware they run on. Most problems can be expressed with very different algorithms that are highly optimal for different highly parallel CPU, memory, and communications architectures.

    When single processor/memory systems are targeted, we tend to write monolithic algorithms and programs, maybe as several processes and packaged neatly as a large collection of functions/methods.

    When the targeted architecture is a closely coupled highly symmetric multiprocessor or multi-core system, high bandwidth shared memory communication is great as long as the caches are coherent. Algorithms suddenly have to become L1 and L2 cache optimized.

    When the targeted architecture is a more loosely coupled, highly symmetric multiprocessor system, aka NUMA, then local memory optimization becomes important, and we structure our algorithms around optimizing for local memory access, with higher costs for processor node-to-node communications.

    When the targeted architecture becomes network based, aka MPI clusters, the communications costs become even more important to optimize around.

    Writing good EDGE code will impose similarly important architectural constraints on how algorithms are optimized for these architectures.

