

Power Parallelism

Ambric Accelerates Algorithms

Computing architectures have reached a critical juncture. The monolithic microprocessor has collided with the thermal wall with a resounding, “Ouch! That’s too hot!” Traditional processor architects have moved on to dual- and multi-core processors, pursuing some parallelism to mitigate their power problems. Other technologies have responded with “Hey, if parallelism is the solution, why not really go for it?” Compute acceleration with devices like FPGAs can boost the number of numbers crunched per second per Watt by orders of magnitude, but programming them is an activity akin to custom hardware design – hardly safe territory for the average software developer seeking to speed up his favorite algorithm.

What so many have been seeking is an architecture that combines efficient parallelizing of the performance core of demanding algorithms with the ease of programming and predictability of traditional von Neumann-based machines. A number of approaches to the problem have emerged over the past couple of years, each working to overcome the key problem of ease of programming while delivering on the promise of parallelism.

This week, Ambric announced their entry into the parallelism party. Their new devices are massively parallel processor arrays based on the globally asynchronous locally synchronous (GALS) architecture. GALS attacks the global synchronization problem created by complex algorithms spanning large semiconductor devices. As our devices get larger, clock propagation times increase, and algorithms grow more complex, the challenge of scheduling events across resources on a large device and keeping all the hardware synchronized with some global control scheme becomes increasingly difficult to manage. GALS keeps the synchronous part of the design local to small areas of the chip and makes interactions between these areas asynchronous. As a result, GALS increases the scalability of algorithms across large, parallel devices by localizing the problem of synchronization.

The closest cousins to Ambric’s architecture are probably FPGAs on one side, and multi-core traditional processors on the other. Like FPGAs, Ambric’s processors are based on a large array of similar processing elements with programmable interconnect. More like multi-core processors, however, each of these objects is a von Neumann-style processor executing a software algorithm. The goal of this architecture is to deliver something that approaches the performance and power efficiency possible with FPGA-based computing, with ease of programming much closer to that of traditional processors.

Ambric approached the task of accelerated computing from the ground up, using the programming model as their ground. This is in sharp contrast to the FPGA-based computing model, where the programming environment is strapped on as an afterthought, yielding something logically akin to strapping rocket packs on a duck. In order to understand what is elegant about the Ambric model, it is probably best to look at what is inelegant about the idealized FPGA-based one. If we had the perfect “ESL” tool for FPGA-based algorithm design, we’d start with our algorithm in some abstract sequential programming language like C or C++. Our algorithm might not be sequential, but the semantics of the most common programming languages are, so we are forced to mangle our algorithm into sequentiality in order to capture it in code in the first place.

Now, with our shiny sequential model all debugged and ready to rumble – nested FOR loops masquerading as arrays of parallel processing elements – we hand our creation off to our idealized FPGA-based compiler. First, this compiler would have to reverse-engineer our previous mangling efforts and re-extract the parallelism from our newly sequential model. It would look through our code, finding data dependencies, and it would construct something akin to a data flow graph (DFG) or a control and data flow graph (CDFG). (Remember these concepts; we’ll come back to them later – there will be a quiz!) With this graph in hand, it would then attempt to map each node on the graph to some datapath element constructed out of hardware available on our target device, in this case, an FPGA.
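To make that dependency analysis concrete, here is a toy sketch in Python – emphatically not any real ESL tool, just an illustration. It walks a list of hypothetical three-address statements and records an edge whenever one statement reads a value an earlier statement produced, yielding a miniature data flow graph:

```python
# Toy DFG extraction: each statement is (destination, operation, source operands).
# An edge (u, v) means statement v consumes a value that statement u produces.
stmts = [
    ("t1", "mul", ("a", "b")),
    ("t2", "mul", ("c", "d")),
    ("t3", "add", ("t1", "t2")),
    ("out", "shr", ("t3", "2")),
]

def build_dfg(stmts):
    producer = {}  # variable name -> index of the statement that last wrote it
    edges = set()
    for i, (dest, _op, srcs) in enumerate(stmts):
        for s in srcs:
            if s in producer:          # data dependency on an earlier statement
                edges.add((producer[s], i))
        producer[dest] = i
    return edges

print(sorted(build_dfg(stmts)))        # -> [(0, 2), (1, 2), (2, 3)]
```

Notice that there is no edge between statements 0 and 1 – the two multiplies are independent, and the graph exposes exactly the parallelism that the sequential code was hiding.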

Then, with its list of datapath computing elements and a nice graph showing how the data should flow through them, our FPGA compiler would set about the daunting challenge of creating a schedule that would allow the data to flow through our synchronous device in an orderly fashion and designing a nice controller that could stand on the big parallel podium conducting our sequential symphony of semiconductor computing magic.
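The scheduling step can also be sketched in miniature. This hypothetical as-soon-as-possible (ASAP) scheduler – again, an illustration rather than how any particular FPGA compiler works – assigns each operation the earliest control step at which all of its inputs are ready, which is the kind of schedule our imaginary controller would have to enforce:

```python
# ASAP scheduling: given n nodes (numbered 0..n-1 in program order) and
# dependency edges (u, v) meaning v must wait for u, assign each node the
# earliest control step at which all of its predecessors have finished.
def asap_schedule(n, edges):
    preds = {v: [] for v in range(n)}
    for u, v in edges:
        preds[v].append(u)
    step = {}
    for v in range(n):                 # program order is a topological order here
        step[v] = max((step[u] + 1 for u in preds[v]), default=0)
    return step

# A small example graph: two independent multiplies (0, 1) feeding an
# add (2), which feeds a shift (3).
print(asap_schedule(4, {(0, 2), (1, 2), (2, 3)}))
# -> {0: 0, 1: 0, 2: 1, 3: 2}
```

Both multiplies land in step 0 – they can execute simultaneously on separate datapath elements, while the add and the shift must wait their turns.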

Once our compiler had worked out all the magic at the logic level, it would hand off its composition as a network of interconnected logic cells, and place and route software would begin the treacherous task of trying to meet timing across a big ol’ FPGA with clock lines running from here to February. If the timing didn’t all work out perfectly, we’d be left with a human-driven debugging problem, trying to fix timing on a logic design that bears absolutely no resemblance whatsoever to our original nice little C program.

As you can see, even with an ideal FPGA-based development flow, the unnatural evolution of our algorithm from parallel to serial – back to parallel – then into sequential logic, then into a synchronous physical timing design yields something less than conceptual elegance.

In Ambric’s process, however, the programming model is closer to the original algorithm, and the actual hardware maps very directly to the programming model. If your algorithm, like many, is a set of small, sequential processes running in parallel and passing data down the line, you’re really already at the GALS level to begin with. Ambric starts you in a graphical programming environment, stitching together algorithmic blocks in a dataflow sequence. Each of these algorithmic objects can be represented by a block whose behavior could be captured in a code snippet. By stitching these objects together in a graph showing data movement we’ve created…
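As a rough illustration of that block-stitching style – hypothetical Python standing in for Ambric’s own structured design environment – each algorithmic object can be a small function, and for a simple linear pipeline the “graph” is just the ordered list of objects the data flows through:

```python
# Each "object" is a small sequential process; the pipeline is the dataflow graph.
def scale(x):      # hypothetical block: scale the sample
    return x * 2

def offset(x):     # hypothetical block: add a bias
    return x + 3

def clamp(x):      # hypothetical block: saturate to 8 bits
    return min(max(x, 0), 255)

pipeline = [scale, offset, clamp]      # stitch the blocks in dataflow order

def run(pipeline, samples):
    for block in pipeline:
        samples = [block(s) for s in samples]   # data moves from block to block
    return samples

print(run(pipeline, [10, 100, 200]))   # -> [23, 203, 255]
```

Each block knows nothing about the others – all composition happens in the graph – which is what lets the blocks be dropped onto independent processing elements.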

OK, put away your textbooks and get out a pencil and a blank sheet of paper. It’s time for that pop-quiz. Essentially, what you’re doing is … anyone? Right. You’re manually creating your own control and data flow graph (CDFG) as we discussed back in paragraph 6. If you can get past that idea, the rest of the process is stunningly smooth. Unlike our FPGA-based example above, Ambric’s compiler doesn’t have a lot of heavy lifting to do to map your source algorithm into its processing elements.

Each of your objects becomes a code fragment running on one of the processors on Ambric’s array. These processing elements run asynchronously with respect to each other, communicating through Ambric’s specialized channels, which are composed of FIFO-style registers with control that stalls each processing element if its input is not yet ready or if its output pipe is full. Because of this asynchronous, buffered design, there are no global synchronization problems to solve, and thus no epic timing closure battle to fight. The abstraction you create in the structured programming environment is very close to the reality that is implemented in the array.
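That stall behavior maps naturally onto bounded queues. In this sketch, plain Python threads stand in for Ambric’s hardware processors and channels – an analogy only, not the real programming model: a bounded queue blocks the producer when the channel is full and the consumer when it is empty, so the two stages stay coordinated with no global schedule at all.

```python
import queue
import threading

CHANNEL_DEPTH = 4                  # like a shallow FIFO of channel registers
DONE = object()                    # end-of-stream marker

def producer(out_ch):
    for x in range(10):
        out_ch.put(x * x)          # blocks (stalls) if the channel is full
    out_ch.put(DONE)

def consumer(in_ch, results):
    while True:
        x = in_ch.get()            # blocks (stalls) until input is ready
        if x is DONE:
            break
        results.append(x + 1)

channel = queue.Queue(maxsize=CHANNEL_DEPTH)
results = []
t1 = threading.Thread(target=producer, args=(channel,))
t2 = threading.Thread(target=consumer, args=(channel, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)                     # -> [1, 2, 5, 10, 17, 26, 37, 50, 65, 82]
```

The ordering of results is guaranteed by the FIFO alone; neither thread ever consults a clock or a central controller, which is the essence of the GALS arrangement.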

Ambric is focusing their new product on the video and image processing segments – wanting to capitalize on the combination of huge computation demands and lucrative end markets. Ambric’s programming environment hits well within the range of methodologies currently in use in that area as well, with some of the most popular design flows already based on The MathWorks Simulink product, which offers graphical dataflow capture similar to Ambric’s.

One of the toughest challenges for Ambric will be to win new programmers over to their structured programming environment and away from traditional, procedural programming in C and C++. Even if the methodology is superior in terms of productivity and performance, learning curves, legacy software IP, and risk aversion on new projects make design teams reluctant to leap into new and relatively unproven design methodologies. The promise of teraOPS performance at a reasonable price and power level, however, is sure to attract the attention of a lot of design teams running out of gas with today’s existing technologies and methodologies.

