
Living on the EDGE

Microsoft’s Semi-Secret E2 EDGE Processor Might be the Next Big Thing

“You say you want a revolution?” – John Lennon

There’s leading edge, there’s bleeding edge, there’s double-edged, and there’s over the edge. It’s hard to say which term applies to a new style of processor beginning to sneak out of the labs.

It’s called EDGE – for Explicit Data Graph Execution – and it’s… different. Microsoft itself has designed a custom EDGE processor and implemented it in hardware. Is the software giant really designing CPUs now?

Evidently so, and the company is pretty far down the path, too. It’s got Windows 10, Linux, FreeRTOS, .NET Core libraries, a C++ compiler, and more running on an FPGA implementation of its EDGE-style processor. Clearly, this is more than just a summer science experiment. The project isn’t exactly secret – Microsoft, along with Qualcomm, has publicly demonstrated Windows 10 running on the new processor – but neither company is saying much about their progress. So, what is EDGE and what does it offer that today’s CPUs don’t have?

If you want the 15-second elevator pitch, think VLIW plus 10 years.

To be honest, I don’t know much about EDGE, and even less about E2, the name of Microsoft’s instantiation of it. But based on the publicly available information (and the parts I’m allowed to share), here’s what we know.

The concept behind EDGE is parallelism – lots and lots of parallelism. You thought your eight-core Ryzen or Core i7 was highly parallel? Try 1000 cores all hammering away at once. EDGE processors have gobs of execution units, typically numbering in the hundreds or the thousands, and EDGE code tries to broadside as many of those execution units as possible all at once. EDGE also explicitly encodes data dependencies into the binary so that the CPU hardware doesn’t have to find them on its own. Both of these characteristics are laudable goals, but both have their problems, too. It’s not clear yet whether EDGE is a big improvement over current CPU design philosophies or whether it’s just another architectural dead-end like so many others.

EDGE’s massive parallelism has, er, parallels to the VLIW (Very Long Instruction Word) craze from the previous decade or so. VLIW processors thrived (theoretically) on spamming lots of instructions at once to lots of hardware. Rather than sip instructions 16 or 32 bits at a time, a VLIW machine would gobble 256 or 512 bits of opcode (or even more) all at once. VLIW compilers were smart enough (again, in theory) to find batches of instructions that could be dispatched simultaneously without generating nasty interlocks, hazards, and data dependencies that the hardware would have to untangle.

In essence, a VLIW machine is a RISC machine turned sideways. It’s wide instead of deep.

That’s all fine and dandy, except that VLIW compilers weren’t very good at finding clusters of instructions that could execute simultaneously, or at packaging those instructions together into a very wide instruction word without a lot of NOPs for padding. The hardware also turned out to be devilishly tricky. Binary compatibility usually suffered, too, because different VLIW machines (even those from the same family) had different execution resources and, therefore, needed a different binary encoding. Recompiling was routine.
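To make the NOP problem concrete, here is a toy sketch in Python. It is purely illustrative: the instruction format, the `pack_bundles` name, and the four-slot width are all assumptions of mine, not any real VLIW toolchain. It packs instructions greedily, in program order, and starts a new bundle whenever an instruction depends on something already in the current one.

```python
from typing import List, Set, Tuple

# Hypothetical toy instruction: (text, destination register, source registers).
Instr = Tuple[str, str, Tuple[str, ...]]


def pack_bundles(program: List[Instr], width: int) -> List[List[str]]:
    """Greedily pack instructions, in program order, into fixed-width bundles.

    An instruction joins the current bundle only if it neither reads nor
    overwrites a register written earlier in that same bundle. Unused slots
    are padded with NOPs.
    """
    bundles: List[List[str]] = []
    current: List[str] = []
    written: Set[str] = set()
    for text, dest, srcs in program:
        conflict = dest in written or any(s in written for s in srcs)
        if conflict or len(current) == width:
            # Close the bundle and pad whatever slots are left over.
            bundles.append(current + ["nop"] * (width - len(current)))
            current, written = [], set()
        current.append(text)
        written.add(dest)
    if current:
        bundles.append(current + ["nop"] * (width - len(current)))
    return bundles


# A short dependency chain: only the two loads are independent.
program = [
    ("a = load x", "a", ("x",)),
    ("b = load y", "b", ("y",)),
    ("c = a + b", "c", ("a", "b")),
    ("d = c * 2", "d", ("c",)),
    ("e = d - 1", "e", ("d",)),
]
bundles = pack_bundles(program, width=4)
nops = sum(slot == "nop" for bundle in bundles for slot in bundle)
print(f"{len(bundles)} bundles, {nops} NOP slots out of {len(bundles) * 4}")
```

Five instructions end up spread over four bundles, with 11 of the 16 slots wasted on padding. A real compiler would reorder and schedule far more cleverly than this, but it can't invent independence that isn't in the code.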

Very few VLIW processors saw the light of day, and fewer still were sold. Intel’s Itanium is probably the world’s most successful VLIW design, and the less said about that, the better.

EDGE’s other neat trick is its hard-coded data dependencies. An EDGE compiler optimizes code like other compilers do, looking for instructions that aren’t dependent on each other’s data – or, if they are, explicitly tagging those dependencies in the binary.

EDGE machines treat entire subroutines as one mega-instruction. Most well-written subroutines have a defined entry and exit point. More importantly, they also have a defined method for passing data in and out, usually by dereferencing pointers. Ideally, code never jumps out of the subroutine and data never sneaks in except through those well-defined interfaces. Encapsulating functions in this way makes each subroutine a self-contained block of code that can (theoretically) be optimized as a whole.

An EDGE processor works on whole subroutines at a time. It’s the compiler’s job to package those subroutines and present them to the hardware in such a way that the processor doesn’t have to check for data dependencies at run-time. With luck, you’ve eliminated all the Byzantine hardware like reorder buffers, reservation stations, and speculative execution that keep the chip honest but that don’t add anything to performance.
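Here is a minimal Python sketch of what “explicit dependencies” could look like in practice. It assumes a simplified encoding in which each instruction names the instructions that consume its result. That encoding, along with the `Node` and `run_block` names, is hypothetical and is not E2’s actual binary format, which isn’t described here. The point is that an instruction fires when its operands arrive, so nothing has to discover the dependencies at run-time.

```python
import operator
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Node:
    """One instruction in a hypothetical EDGE-style block."""
    op: Callable[..., int]
    arity: int                       # number of operands to wait for
    targets: List[Tuple[str, int]]   # (consumer name, operand slot), fixed at compile time
    operands: Dict[int, int] = field(default_factory=dict)


def run_block(block: Dict[str, Node],
              inputs: List[Tuple[str, int, int]]) -> Dict[str, int]:
    """Fire each instruction as soon as all of its operands have arrived."""
    results: Dict[str, int] = {}
    pending = list(inputs)           # (node name, operand slot, value) tokens in flight
    while pending:
        name, slot, value = pending.pop()
        node = block[name]
        node.operands[slot] = value
        if len(node.operands) == node.arity:
            args = [node.operands[i] for i in range(node.arity)]
            results[name] = node.op(*args)
            # Forward the result straight to the consumers named in the encoding.
            pending.extend((t, s, results[name]) for t, s in node.targets)
    return results


# Block computing (a + b) * (a - b) with a = 7, b = 3.
block = {
    "add": Node(operator.add, 2, [("mul", 0)]),
    "sub": Node(operator.sub, 2, [("mul", 1)]),
    "mul": Node(operator.mul, 2, []),
}
inputs = [("add", 0, 7), ("add", 1, 3), ("sub", 0, 7), ("sub", 1, 3)]
results = run_block(block, inputs)
print(results["mul"])
```

Notice there is no program counter inside the block: the order in which tokens are processed doesn’t change the answer, because the wiring between instructions was settled by the “compiler” ahead of time.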

Microsoft’s brief online description of the E2 project has been removed, which the company characterizes as both routine and unimportant. They emphasize that E2 is just a research project, not a commercial product in development. Even so, work on E2 has been going on for about eight years, and the team has grown to “dozens of engineers spanning multiple divisions, companies, and countries.” Plus, there’s that public demo at ISCA last month. E2 may not be destined for real products at Microsoft, but it’s not just a casual wheeze, either. You don’t port Windows 10 to a radically new CPU architecture for the laughs.

What about the rest of the world outside of Microsoft? Is EDGE going to be the Next Big Thing™ for microprocessor designs? Magic 8 Ball says… probably not.

EDGE is certainly enticing. The siren call of massive performance through massive parallelism has lured many a designer onto the rocky shoals of despair. Transistors are cheap, so throwing hardware at the problem makes economic sense. But does it make practical sense?

Relatively few programs have the kind of parallelism that EDGE, or VLIW, or even big RISC machines can exploit. That’s just not how we code. Throw all the hardware you want at it; it’s still not going to go much faster because there’s nothing for the machine to do except hurry up and wait. If what you want is a massively parallel machine that can do NOPs in record time, knock yourself out.
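The usual back-of-the-envelope way to put numbers on “hurry up and wait” is Amdahl’s law, which this article doesn’t otherwise invoke. A quick sketch, with the 50% and 95% parallel fractions picked purely for illustration rather than measured from any real workload:

```python
def amdahl_speedup(parallel_fraction: float, units: int) -> float:
    """Best-case speedup when only part of the work can be spread across units."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / units)


# Illustrative assumption: 1000 execution units, two made-up workloads.
half_parallel = amdahl_speedup(0.50, 1000)
mostly_parallel = amdahl_speedup(0.95, 1000)
print(f"50% parallel: {half_parallel:.2f}x   95% parallel: {mostly_parallel:.2f}x")
```

Even with 1000 units on tap, code that is only half parallel can’t quite double in speed, and code that is 95% parallel tops out just short of 20x. The serial part sets the ceiling no matter how much hardware you throw at the rest.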

I’ll be the first to admit that I haven’t looked deeply into EDGE instruction sets, reviewed schematics, or pored over detailed block diagrams. There’s a lot I still don’t know. But as it stands, EDGE looks like an old cake with new frosting. It fiddles with the details of implementation, but it doesn’t sidestep any fundamental problems. Compilers just aren’t as omniscient as we’d like them to be, and runtime hazards are too abundant to simply code around them. We want our processors to be fast and efficient, but we’re not giving them problems that can be solved that way. Messy code requires messy CPUs.

3 thoughts on “Living on the EDGE”

    1. It all depends on the application. The key differentiator here might be the front-end. Design the application from the start as independent parallel processing “processes” (in essence, forget the global state machine) and the rest follows without headaches. Remember the transputer? It was based on the CSP process algebra. Even two instructions could be two parallel processes. This EDGE looks like a hardware version of CSP (the transputer was that too but was still a sequential machine with a very fast context switch). Now, with AI coming to the foreground again, requiring a lot of front end data parallel processing, this thing might have a future. GPUs are fine and good at data parallelism but very inefficient when it comes to power consumption. EDGE with a good front end compiler might do the job better.

  1. We write algorithms based on the communications structure of the hardware they run on. Most problems can be expressed with very different algorithms that are highly optimal for different highly parallel cpu, memory, and communications architectures.

    When single processor/memory systems are targeted, we tend to write monolithic algorithms and programs, maybe as several processes and packaged neatly as a large collection of functions/methods.

    When the targeted architecture is a closely coupled highly symmetric multiprocessor or multi-core system, high bandwidth shared memory communication is great as long as the caches are coherent. Algorithms suddenly have to become L1 and L2 cache optimized.

    When the targeted architecture is a more loosely coupled highly symmetric multiprocessor system, aka NUMA, then local memory optimization becomes important, and we structure our algorithms around optimizing for local memory access with higher costs for processor node to node communications.

    When the targeted architecture becomes network based, aka MPI clusters, the communications costs become even more important to optimize around.

    Writing good EDGE code will have similar important architectural constraints on optimized algorithms for these architectures.
