feature article
Subscribe Now

Living on the EDGE

Microsoft’s Semi-Secret E2 EDGE Processor Might be the Next Big Thing

“You say you want a revolution? – John Lennon

There’s leading edge, there’s bleeding edge, there’s double-edged, and there’s over the edge. It’s hard to say which term applies to a new style of processor beginning to sneak out of the labs.

It’s called EDGE – for Explicit Data Graph Execution – and it’s… different. Microsoft itself has designed a custom EDGE processor and implemented it in hardware. Is the software giant really designing CPUs now?

Evidently so, and the company is pretty far down the path, too. It’s got Windows 10, Linux, FreeRTOS, .NET Core libraries, a C++ compiler, and more running on an FPGA implementation of its EDGE-style processor. Clearly, this is more than just a summer science experiment. The project isn’t exactly secret – Microsoft, along with Qualcomm, has publicly demonstrated Windows 10 running on the new processor – but neither company is saying much about their progress. So, what is EDGE and what does it offer that today’s CPUs don’t have?

If you want the 15-second elevator pitch, think VLIW plus 10 years.

To be honest, I don’t know much about EDGE, and even less about E2, the name of Microsoft’s instantiation of it. But based on the publicly available information (and the parts I’m allowed to share), here’s what we know.

The concept behind EDGE is parallelism – lots and lots of parallelism. You thought your eight-core Ryzen or Core i7 was highly parallel? Try 1000 cores all hammering away at once. EDGE processors have gobs of execution units, typically numbering in the hundreds or the thousands, and EDGE code tries to broadside as many of those execution units as possible all at once. EDGE also explicitly encodes data dependencies into the binary so that the CPU hardware doesn’t have to find them on its own. Both of these characteristics are laudable goals, but both have their problems, too. It’s not clear yet whether EDGE is a big improvement over current CPU design philosophies or whether it’s just another architectural dead-end like so many others.

EDGE’s massive parallelism has, er, parallels to the VLIW (Very Long Instruction Word) craze from the previous decade or so. VLIW processors thrived (theoretically) on spamming lots of instructions at once to lots of hardware. Rather than sip instructions 16 or 32 bits at a time, a VLIW machine would gobble 256 or 512 bits of opcode (or even more) all at once. VLIW compilers were smart enough (again, in theory) to find batches of instructions that could be dispatched simultaneously without generating nasty interlocks, hazards, and data dependencies that the hardware would have to untangle.

In essence, a VLIW machine is a RISC machine turned sideways. It’s wide instead of deep.

That’s all fine and dandy, except that VLIW compilers weren’t very good at finding clusters of instructions that could execute simultaneously, or of packaging those instructions together into a very wide instruction word without a lot of NOPs for padding. The hardware also turned out to be devilishly tricky. Binary compatibility usually suffered, too, because different VLIW machines (even those from the same family) had different execution resources and, therefore, needed a different binary encoding. Recompiling was routine.

Very few VLIW processors saw the light of day, and fewer still were sold. Intel’s Itanium is probably the world’s most successful VLIW design, and the less said about that, the better.

EDGE’s other neat trick is its hard-coded data dependencies. An EDGE compiler optimizes code like other compilers do, looking for instructions that aren’t dependent on each other’s data – or, if they are, by explicitly tagging the dependencies in the binary.

EDGE machines treat entire subroutines as one mega-instruction. Most well-written subroutines have a defined entry and exit point. More importantly, they also have a defined method for passing data in and out, usually by dereferencing pointers. Ideally, code never jumps out of the subroutine and data never sneaks in except through those well-defined interfaces. Encapsulating functions in this way makes each subroutine a self-contained block of code that can (theoretically) be optimized as a whole.

An EDGE processor works on whole subroutines at a time. It’s the compiler’s job to package those subroutines and present them to the hardware in such a way that the processor doesn’t have to check for data dependencies at run-time. With luck, you’ve eliminated all the Byzantine hardware like reorder buffers, wait stations, and speculative execution that keep the chip honest but that don’t add anything to performance.

Microsoft’s brief online description of the E2 project has been removed, which the company characterizes as both routine and unimportant. They emphasize that E2 is just a research project, not a commercial product in development. Even so, work on E2 has been going on for about eight years, and the team has grown to “dozens of engineers spanning multiple divisions, companies, and countries.” Plus, there’s that public demo at ISCA last month. E2 may not be destined for real products at Microsoft, but it’s not just a casual wheeze, either. You don’t port Windows 10 to a radically new CPU architecture for the laughs.

What about the rest of the world outside of Microsoft? Is EDGE going to be the Next Big ThingTM for microprocessor designs? Magic 8 Ball says… probably not.

EDGE is certainly enticing. The siren call of massive performance through massive parallelism has lured many a designer onto the rocky shoals of despair. Transistors are cheap, so throwing hardware at the problem makes economic sense. But does it make practical sense?

Relatively few programs have the kind of parallelism that EDGE, or VLIW, or even big RISC machines can exploit. That’s just not how we code. Throw all the hardware you want at it; it’s still not going to go much faster because there’s nothing for the machine to do except hurry up and wait. If what you want is a massively parallel machine than can do NOPs in record time, knock yourself out.  

I’ll be the first to admit that I haven’t looked deeply into EDGE instruction sets, reviewed schematics, or pored over detailed block diagrams. There’s a lot I still don’t know. But as it stands, EDGE looks like an old cake with new frosting. It fiddles with the details of implementation, but it doesn’t sidestep any fundamental problems. Compilers just aren’t as omniscient as we’d like them to be, and runtime hazards are too abundant to simply code around them. We want our processors to be fast and efficient, but we’re not giving them problems that can be solved that way. Messy code requires messy CPUs.

3 thoughts on “Living on the EDGE”

    1. It all depends on the application. The key differentiator here might be the front-end. Design the application from the start as independent parallel processing “processes” (in essence, forget the global state machine) and the rest follows without headaches. Remember the transputer? It was based on the CSP process algebra. Even two instructions could be two parallel processes. This EDGE looks like a hardware version of CSP (the transputer was that too but was still a sequential machine with a very fast context switch). Now, with AI coming to the foreground again, requiring a lot of front end data parallel processing, this thing might have future. GPUs are fine and good at data parallism but very inefficient when it comes to power consumption. EDGE with a good front end compiler might do the job better.

  1. We write algorithms based on the communications structure of the hardware it runs on. Most problems can be expressed with very different algorithms that are highly optimal for different highly parallel cpu, memory, and communications architectures.

    When single processor/memory systems are targeted, we tend to write monolithic algorithms and programs, maybe as several processes and packaged neatly as a large collection of functions/methods.

    When the targeted architecture is a closely coupled highly symmetric multiprocessor or multi-core system, high bandwidth shared memory communication is great as long as the caches are coherent. Algorithms suddenly have to become L1 and L2 cache optimized.

    When the targeted architecture is a more loosely coupled highly symmetric multiprocessor sytem, aka NUMA, then local memory optimization becomes important, and we structure our algorithms around optimizing for local memory access with higher costs for processor node to node communications.

    When the targeted architecture becomes network based, aka MPI clusters, the communications costs become even more important to optimize around.

    Writing good EDGE code will have similar important architectural constraints on optimized algorithms for these architectures.

Leave a Reply

featured blogs
May 16, 2021
https://youtu.be/_wup2MSTVks Made on Communication Hill, San Jose (camera Carey Guo) Monday: Intel eASIC: Linley and DARPA Tuesday: Please Excuse the Mesh: CFD and Pointwise Wednesday: Linley:... [[ Click on the title to access the full blog on the Cadence Community site. ]]...
May 13, 2021
Samtec will attend the PCI-SIG Virtual Developers Conference on Tuesday, May 25th through Wednesday, May 26th, 2021. This is a free event for the 800+ member companies that develop and bring to market new products utilizing PCI Express technology. Attendee Registration is sti...
May 13, 2021
Our new IC design tool, PrimeSim Continuum, enables the next generation of hyper-convergent IC designs. Learn more from eeNews, Electronic Design & EE Times. The post Synopsys Makes Headlines with PrimeSim Continuum, an Innovative Circuit Simulation Solution appeared fi...
May 13, 2021
By Calibre Design Staff Prior to the availability of extreme ultraviolet (EUV) lithography, multi-patterning provided… The post A SAMPle of what you need to know about SAMP technology appeared first on Design with Calibre....

featured video

What’s Hot: DesignWare Logic Library IP for TSMC N5

Sponsored by Synopsys

Designing for N5? Josefina Hobbs details the latest info and customer results on Logic Library IP for TSMC N5. Whether performance, power, area or routability are your key concerns, Synopsys Library IP helps you meet your toughest design challenges.

Click here for more information about DesignWare Foundation IP: Embedded Memories, Logic Libraries, GPIO & PVT Sensors

featured paper

Optimizing an OpenCL AI Kernel for the data center using Silexica’s SLX FPGA

Sponsored by Silexica

AI applications are increasingly contributing to FPGAs being used as co-processors in data centers. Silexica's newest application note shows how SLX FPGA accelerates an AI-related face detection design example, leveraging the bottom-up flow of Xilinx’s Vitis 2020.2 and Alveo U280 accelerator card.

Click to read

Featured Chalk Talk

Keeping Your Linux Device Secure

Sponsored by Siemens Digital Industries Software

Embedded security is an ongoing process, not a one-time effort. Even after your design is shipped, security vulnerabilities are certain to be discovered - even in things like the operating system. In this episode of Chalk Talk, Amelia Dalton chats with Kathy Tufto from Mentor - a Siemens business, about how to make a plan to keep your Linux-based embedded design secure, and how to respond quickly when new vulnerabilities are discovered.

More information about Mentor Embedded Linux®