
Living on the EDGE

Microsoft’s Semi-Secret E2 EDGE Processor Might be the Next Big Thing

“You say you want a revolution?” – John Lennon

There’s leading edge, there’s bleeding edge, there’s double-edged, and there’s over the edge. It’s hard to say which term applies to a new style of processor beginning to sneak out of the labs.

It’s called EDGE – for Explicit Data Graph Execution – and it’s… different. Microsoft itself has designed a custom EDGE processor and implemented it in hardware. Is the software giant really designing CPUs now?

Evidently so, and the company is pretty far down the path, too. It’s got Windows 10, Linux, FreeRTOS, .NET Core libraries, a C++ compiler, and more running on an FPGA implementation of its EDGE-style processor. Clearly, this is more than just a summer science experiment. The project isn’t exactly secret – Microsoft, along with Qualcomm, has publicly demonstrated Windows 10 running on the new processor – but neither company is saying much about their progress. So, what is EDGE and what does it offer that today’s CPUs don’t have?

If you want the 15-second elevator pitch, think VLIW plus 10 years.

To be honest, I don’t know much about EDGE, and even less about E2, the name of Microsoft’s instantiation of it. But based on the publicly available information (and the parts I’m allowed to share), here’s what we know.

The concept behind EDGE is parallelism – lots and lots of parallelism. You thought your eight-core Ryzen or Core i7 was highly parallel? Try a thousand execution units all hammering away at once. EDGE processors have gobs of execution units, typically numbering in the hundreds or thousands, and EDGE code tries to broadside as many of them as possible all at once. EDGE also explicitly encodes data dependencies into the binary so that the CPU hardware doesn’t have to find them on its own. Both of these characteristics are laudable goals, but both have their problems, too. It’s not clear yet whether EDGE is a big improvement over current CPU design philosophies or just another architectural dead end like so many others.
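
To make “explicit data graph execution” concrete, here’s a toy sketch in C. The opcodes, fields, and layout are invented for illustration – this is not Microsoft’s actual E2 encoding – but the core idea is real: each instruction names its consumers instead of its source registers, so results are forwarded directly and the hardware never has to rediscover the dependence graph at run-time.

    /* Toy EDGE-style "explicit data graph" interpreter.  Everything
     * here (opcodes, fields, layout) is invented for illustration;
     * it is NOT Microsoft's E2 encoding.  The key idea: instructions
     * name their consumers, not their source registers. */
    #include <stdio.h>

    enum op { OP_CONST, OP_ADD, OP_MUL, OP_PRINT };

    struct insn {
        enum op op;
        int imm;        /* payload for OP_CONST                     */
        int tgt;        /* index of the consumer insn (-1 = none)   */
        int slot;       /* which of the consumer's operands we feed */
        int operand[2]; /* filled in as producers fire              */
        int pending;    /* operands still awaited                   */
    };

    /* Data graph for print((2 + 3) * 4). */
    static struct insn blk[] = {
        { OP_CONST, 2,  2, 0, {0, 0}, 0 }, /* 0: 2 -> insn 2, slot 0 */
        { OP_CONST, 3,  2, 1, {0, 0}, 0 }, /* 1: 3 -> insn 2, slot 1 */
        { OP_ADD,   0,  4, 0, {0, 0}, 2 }, /* 2: sum -> insn 4       */
        { OP_CONST, 4,  4, 1, {0, 0}, 0 }, /* 3: 4 -> insn 4, slot 1 */
        { OP_MUL,   0,  5, 0, {0, 0}, 2 }, /* 4: product -> insn 5   */
        { OP_PRINT, 0, -1, 0, {0, 0}, 1 }, /* 5: sink                */
    };

    static void fire(int i);

    /* A producer hands its result straight to its consumer's slot. */
    static void deliver(int tgt, int slot, int v)
    {
        if (tgt < 0)
            return;
        blk[tgt].operand[slot] = v;
        if (--blk[tgt].pending <= 0)
            fire(tgt); /* all operands present: consumer fires */
    }

    static void fire(int i)
    {
        struct insn *p = &blk[i];
        int v = 0;
        switch (p->op) {
        case OP_CONST: v = p->imm; break;
        case OP_ADD:   v = p->operand[0] + p->operand[1]; break;
        case OP_MUL:   v = p->operand[0] * p->operand[1]; break;
        case OP_PRINT: printf("%d\n", p->operand[0]); return;
        }
        deliver(p->tgt, p->slot, v);
    }

    int main(void)
    {
        /* Constants have no inputs; firing them starts the cascade.
         * No hazard checks anywhere -- arrival counting is enough. */
        for (int i = 0; i < 6; i++)
            if (blk[i].op == OP_CONST)
                fire(i);
        return 0; /* prints 20 */
    }

Notice that nothing in the main loop checks for hazards; it just lights the fuse on the constants and lets results propagate. That, in miniature, is the pitch: the graph is the program.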

EDGE’s massive parallelism has, er, parallels to the VLIW (Very Long Instruction Word) craze from the previous decade or so. VLIW processors thrived (theoretically) on spamming lots of instructions at once to lots of hardware. Rather than sip instructions 16 or 32 bits at a time, a VLIW machine would gobble 256 or 512 bits of opcode (or even more) all at once. VLIW compilers were smart enough (again, in theory) to find batches of instructions that could be dispatched simultaneously without generating nasty interlocks, hazards, and data dependencies that the hardware would have to untangle.

In essence, a VLIW machine is a RISC machine turned sideways. It’s wide instead of deep.

That’s all fine and dandy, except that VLIW compilers weren’t very good at finding clusters of instructions that could execute simultaneously, or at packaging those instructions into a very wide instruction word without a lot of NOPs for padding. The hardware also turned out to be devilishly tricky. Binary compatibility usually suffered, too, because different VLIW machines (even those from the same family) had different execution resources and therefore needed different binary encodings. Recompiling was routine.
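
To see where the NOPs come from, here’s an invented four-slot bundle layout in C (again, illustrative only – this is no shipping VLIW machine’s format). A short dependent chain like r0 = r1 + r2; r3 = r0 + r0 can’t pack into one bundle, so six of the eight slots end up as padding:

    /* An invented four-slot VLIW bundle, for illustration only.
     * Each "very long instruction" is one bundle; the compiler must
     * fill all four slots per cycle or pad with NOPs. */
    #include <stdio.h>

    enum op { NOP, ADD, MUL, LOAD, STORE };

    struct slot   { enum op op; int dst, src1, src2; };
    struct bundle { struct slot slot[4]; };

    /* r0 = r1 + r2; r3 = r0 + r0.  The second ADD needs the first
     * one's result, so it can't share a bundle with it.  Net result:
     * two useful ops, six NOPs. */
    static const struct bundle program[] = {
        { { { ADD, 0, 1, 2 }, { NOP }, { NOP }, { NOP } } },
        { { { ADD, 3, 0, 0 }, { NOP }, { NOP }, { NOP } } },
    };

    int main(void)
    {
        int useful = 0, total = 0;
        for (unsigned i = 0; i < sizeof program / sizeof program[0]; i++)
            for (int s = 0; s < 4; s++, total++)
                if (program[i].slot[s].op != NOP)
                    useful++;
        printf("slot utilization: %d of %d\n", useful, total); /* 2 of 8 */
        return 0;
    }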

Very few VLIW processors saw the light of day, and fewer still were sold. Intel’s Itanium is probably the world’s most successful VLIW design, and the less said about that, the better.

EDGE’s other neat trick is its hard-coded data dependencies. An EDGE compiler optimizes code like other compilers do, looking for instructions that aren’t dependent on one another’s data and, where they are, explicitly tagging those dependencies in the binary.

EDGE machines treat entire subroutines as one mega-instruction. Most well-written subroutines have a defined entry and exit point. More importantly, they also have a defined method for passing data in and out, usually through parameters and pointers. Ideally, code never jumps out of the subroutine and data never sneaks in except through those well-defined interfaces. Encapsulating functions in this way makes each subroutine a self-contained block of code that can (theoretically) be optimized as a whole.

An EDGE processor works on whole subroutines at a time. It’s the compiler’s job to package those subroutines and present them to the hardware in such a way that the processor doesn’t have to check for data dependencies at run-time. With luck, you’ve eliminated all the Byzantine hardware – reorder buffers, reservation stations, speculative execution – that keeps the chip honest but does no useful work of its own.
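
One way to picture life without a reorder buffer is the block-atomic commit that the academic EDGE work describes: a block’s stores are buffered while it executes and published all at once when the block completes. The sketch below is a loose, hypothetical illustration of that idea in C, not E2’s actual commit machinery:

    /* A loose, hypothetical sketch of block-atomic commit in C --
     * not E2's actual machinery.  Stores made inside a block are
     * buffered and published all at once when the block completes,
     * so there is nothing to unwind if the block never commits. */
    #include <stdio.h>

    #define BUF_MAX 16 /* no overflow handling; illustration only */

    struct store     { int addr, val; };
    struct block_ctx { struct store buf[BUF_MAX]; int n; };

    static int memory[64];

    /* Inside a block, stores land in the buffer, not in memory. */
    static void block_store(struct block_ctx *b, int addr, int val)
    {
        b->buf[b->n].addr = addr;
        b->buf[b->n].val  = val;
        b->n++;
    }

    /* Block finished cleanly: publish every buffered store at once. */
    static void block_commit(struct block_ctx *b)
    {
        for (int i = 0; i < b->n; i++)
            memory[b->buf[i].addr] = b->buf[i].val;
        b->n = 0;
    }

    /* Block faulted or was misspeculated: discard, nothing to undo. */
    static void block_abort(struct block_ctx *b)
    {
        b->n = 0;
    }

    int main(void)
    {
        struct block_ctx b = { .n = 0 };

        block_store(&b, 0, 42); /* block A's stores...             */
        block_store(&b, 1, 43);
        block_commit(&b);       /* ...become visible together      */

        block_store(&b, 2, 99); /* block B goes down a bad path... */
        block_abort(&b);        /* ...and vanishes without a trace */

        printf("%d %d %d\n", memory[0], memory[1], memory[2]); /* 42 43 0 */
        return 0;
    }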

Microsoft’s brief online description of the E2 project has been taken down – a removal the company characterizes as both routine and unimportant. The company emphasizes that E2 is just a research project, not a commercial product in development. Even so, work on E2 has been going on for about eight years, and the team has grown to “dozens of engineers spanning multiple divisions, companies, and countries.” Plus, there’s that public demo at ISCA last month. E2 may not be destined for real products at Microsoft, but it’s not just a casual wheeze, either. You don’t port Windows 10 to a radically new CPU architecture for laughs.

What about the rest of the world outside of Microsoft? Is EDGE going to be the Next Big Thing™ for microprocessor designs? Magic 8 Ball says… probably not.

EDGE is certainly enticing. The siren call of massive performance through massive parallelism has lured many a designer onto the rocky shoals of despair. Transistors are cheap, so throwing hardware at the problem makes economic sense. But does it make practical sense?

Relatively few programs have the kind of parallelism that EDGE, or VLIW, or even big RISC machines can exploit. That’s just not how we code. Throw all the hardware you want at it; it’s still not going to go much faster, because there’s nothing for the machine to do except hurry up and wait. If what you want is a massively parallel machine that can do NOPs in record time, knock yourself out.
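
If that sounds abstract, here’s the kind of everyday C code that starves any wide machine – a generic example, not anything from Microsoft’s materials. Each loop iteration depends on a load from the previous one, so a thousand idle execution units can’t legally start the next step early:

    /* A linked-list walk is inherently serial.  Each iteration's
     * load produces the pointer the next iteration needs, so extra
     * execution units sit idle no matter how many you have. */
    #include <stdio.h>

    struct node { int value; struct node *next; };

    static long sum_list(const struct node *n)
    {
        long sum = 0;
        while (n) {
            sum += n->value; /* iteration i+1 can't begin...     */
            n = n->next;     /* ...until this load has completed */
        }
        return sum;
    }

    int main(void)
    {
        /* Build a short list on the stack: 1 -> 2 -> 3. */
        struct node c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
        printf("%ld\n", sum_list(&a)); /* prints 6 */
        return 0;
    }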

I’ll be the first to admit that I haven’t looked deeply into EDGE instruction sets, reviewed schematics, or pored over detailed block diagrams. There’s a lot I still don’t know. But as it stands, EDGE looks like an old cake with new frosting. It fiddles with the details of implementation, but it doesn’t sidestep any fundamental problems. Compilers just aren’t as omniscient as we’d like them to be, and runtime hazards are too abundant to simply code around. We want our processors to be fast and efficient, but we’re not giving them problems that can be solved that way. Messy code requires messy CPUs.

3 thoughts on “Living on the EDGE”

    1. It all depends on the application. The key differentiator here might be the front end. Design the application from the start as independent parallel “processes” (in essence, forget the global state machine) and the rest follows without headaches. Remember the transputer? It was based on the CSP process algebra. Even two instructions could be two parallel processes. This EDGE looks like a hardware version of CSP (the transputer was that, too, but was still a sequential machine with a very fast context switch). Now, with AI coming to the foreground again and requiring a lot of front-end data-parallel processing, this thing might have a future. GPUs are fine and good at data parallelism but very inefficient when it comes to power consumption. EDGE with a good front-end compiler might do the job better.

  2. We write algorithms based on the communications structure of the hardware they run on. Most problems can be expressed with very different algorithms, each highly optimal for a different highly parallel CPU, memory, and communications architecture.

    When single processor/memory systems are targeted, we tend to write monolithic algorithms and programs, maybe as several processes and packaged neatly as a large collection of functions/methods.

    When the targeted architecture is a closely coupled highly symmetric multiprocessor or multi-core system, high bandwidth shared memory communication is great as long as the caches are coherent. Algorithms suddenly have to become L1 and L2 cache optimized.

    When the targeted architecture is a more loosely coupled, highly symmetric multiprocessor system, aka NUMA, then local memory optimization becomes important, and we structure our algorithms around optimizing for local memory access, with higher costs for processor node-to-node communications.

    When the targeted architecture becomes network based, aka MPI clusters, the communications costs become even more important to optimize around.

    Writing good EDGE code will have similar important architectural constraints on optimized algorithms for these architectures.

