
Power Parallelism

Ambric Accelerates Algorithms

Computing architectures have reached a critical juncture. The monolithic microprocessor has collided with the thermal wall with a resounding, “Ouch! That’s too hot!” Traditional processor architects have moved on to dual-core and multi-core processors, pursuing some parallelism to mitigate their power problems. Other technologies have responded with “Hey, if parallelism is the solution, why not really go for it?” Compute acceleration with devices like FPGAs can boost the number of numbers crunched per second per watt by orders of magnitude, but programming them is an activity akin to custom hardware design – hardly safe territory for the average software developer seeking to speed up a favorite algorithm.

What so many have been seeking is an architecture that combines efficient parallelizing of the performance core of demanding algorithms with the ease of programming and predictability of traditional von Neumann-based machines. A number of approaches to the problem have emerged over the past couple of years, each working to overcome the key problem of ease of programming while delivering on the promise of parallelism.

This week, Ambric announced their entry into the parallelism party. Their new devices are massively parallel processor arrays based on the globally asynchronous, locally synchronous (GALS) architecture. GALS attacks the global synchronization problem created by complex algorithms spanning large semiconductor devices. As our devices get larger, clock propagation times increase, and algorithms grow more complex, the challenge of scheduling events across resources on a large device and keeping all the hardware synchronized with some global control scheme becomes increasingly difficult to manage. GALS keeps the synchronous part of the design local to small areas of the chip and makes interactions between these areas asynchronous. As a result, GALS increases the scalability of algorithms across large, parallel devices by localizing the problem of synchronization.

The closest cousins to Ambric’s architecture are probably FPGAs on one side, and multi-core traditional processors on the other. Like FPGAs, Ambric’s processors are based on a large array of similar processing elements with programmable interconnect. More like multi-core processors, however, each of these objects is a von Neumann-style processor executing a software algorithm. The goal of this architecture is to deliver something that approaches the performance and power efficiency possible with FPGA-based computing with ease of programming much closer to traditional processors.

Ambric approached the task of accelerated computing from the ground up, using the programming model as their ground. This is in sharp contrast to the FPGA-based computing model, where the programming environment is strapped on as an afterthought, yielding something logically akin to strapping rocket packs on a duck. In order to understand what is elegant about the Ambric model, it is probably best to look at what is inelegant about the idealized FPGA-based one. If we had the perfect “ESL” tool for FPGA-based algorithm design, we’d start with our algorithm in some abstract sequential programming language like C or C++. Our algorithm might not be sequential, but the semantics of the most common programming languages are, so we are forced to mangle our algorithm into sequentiality in order to capture it in code in the first place.

Now, with our shiny sequential model all debugged and ready to rumble – nested FOR loops masquerading as arrays of parallel processing elements – we hand our creation off to our idealized FPGA-based compiler. First, this compiler would have to reverse-engineer our previous mangling efforts and re-extract the parallelism from our newly sequential model. It would look through our code, finding data dependencies, and it would construct something akin to a data flow graph (DFG) or a control and data flow graph (CDFG). (Remember these concepts; we’ll come back to them later – there will be a quiz!) With this graph in hand, it would then attempt to map each node on the graph to some datapath element constructed out of hardware available on our target device, in this case, an FPGA.
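To make the dependency-finding step concrete, here is a minimal Python sketch of how a data flow graph might be recovered from straight-line code. It is an illustration only, not any real compiler's method: the function name `build_dfg` and the `(target, sources)` statement format are assumptions made for this example, and it tracks only read-after-write dependencies, ignoring control flow, loops, and memory aliasing.

```python
from collections import defaultdict


def build_dfg(statements):
    """Return {producer_index: {consumer_indices}} for read-after-write dependencies.

    `statements` is a list of (target, sources) pairs in program order
    (a hypothetical format chosen for this sketch).
    """
    last_writer = {}           # variable name -> index of the statement that last wrote it
    edges = defaultdict(set)   # producer statement -> statements that consume its result
    for index, (target, sources) in enumerate(statements):
        for var in sources:
            if var in last_writer:
                edges[last_writer[var]].add(index)
        last_writer[target] = index
    return dict(edges)


# Four sequential statements; x, y, and z are external inputs.
program = [
    ("a", ["x", "y"]),   # 0: a = x + y
    ("b", ["x", "z"]),   # 1: b = x * z
    ("c", ["a", "b"]),   # 2: c = a - b
    ("d", ["c", "y"]),   # 3: d = c * y
]

dfg = build_dfg(program)
print(dfg)
```

Statements 0 and 1 share no edge, so although the source lists them one after the other, nothing stops them from running side by side. That is the parallelism the compiler has to rediscover.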

Then, with its list of datapath computing elements and a nice graph showing how the data should flow through them, our FPGA compiler would set about the daunting challenge of creating a schedule that would allow the data to flow through our synchronous device in an orderly fashion and designing a nice controller that could stand on the big parallel podium conducting our sequential symphony of semiconductor computing magic.
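One simple way to picture the scheduling step is an as-soon-as-possible (ASAP) schedule, a common textbook technique that is swapped in here purely for illustration; the article does not say which algorithm a real tool would use. The sketch assumes every operation takes one time step, resources are unlimited, and nodes are numbered so that producers come before their consumers. The name `asap_schedule` and the small example graph are hypothetical.

```python
def asap_schedule(num_nodes, edges):
    """Assign each node the earliest step at which all of its producers have finished.

    Assumes one step per operation and that every edge points from a
    lower-numbered node to a higher-numbered one.
    """
    step = [0] * num_nodes
    for producer in range(num_nodes):
        for consumer in edges.get(producer, ()):
            # A consumer cannot start until one step after its latest producer.
            step[consumer] = max(step[consumer], step[producer] + 1)
    return step


# Example graph (assumed): nodes 0 and 1 feed node 2, which feeds node 3.
edges = {0: {2}, 1: {2}, 2: {3}}
schedule = asap_schedule(4, edges)
print(schedule)
```

Nodes that land on the same step can execute in parallel. A real synchronous design must also respect limited hardware resources and clock timing, which is where the controller and the timing-closure pain come in.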

Once our compiler had worked out all the magic at the logic level, it would hand off its composition as a network of interconnected logic cells, and place and route software would begin the treacherous task of trying to meet timing across a big ol’ FPGA with clock lines running from here to February. If the timing didn’t all work out perfectly, we’d be left with a human-driven debugging problem, trying to fix timing on a logic design that bears absolutely no resemblance whatsoever to our original nice little C program.

As you can see, even with an ideal FPGA-based development flow, the unnatural evolution of our algorithm from parallel to serial – back to parallel – then into sequential logic, then into a synchronous physical timing design yields something less than conceptual elegance.

In Ambric’s process, however, the programming model is closer to the original algorithm, and the actual hardware maps very directly to the programming model. If your algorithm, like many, is a set of small, sequential processes running in parallel and passing data down the line, you’re really already at the GALS level to begin with. Ambric starts you in a graphical programming environment, stitching together algorithmic blocks in a dataflow sequence. Each of these algorithmic objects can be represented by a block whose behavior could be captured in a code snippet. By stitching these objects together in a graph showing data movement we’ve created…

OK, put away your textbooks and get out a pencil and a blank sheet of paper. It’s time for that pop-quiz. Essentially, what you’re doing is … anyone? Right. You’re manually creating your own control and data flow graph (CDFG) as we discussed back in paragraph 6. If you can get past that idea, the rest of the process is stunningly smooth. Unlike our FPGA-based example above, Ambric’s compiler doesn’t have a lot of heavy lifting to do to map your source algorithm into its processing elements.

Each of your objects becomes a code fragment running on one of the processors on Ambric’s array. These processing elements run asynchronously with respect to each other, communicating through Ambric’s specialized channels, which are composed of FIFO-style registers with control that stalls each processing element if its input is not yet ready or if its output pipe is full. Because of this asynchronous, buffered design, there are no global synchronization problems to solve, and thus no epic timing closure battle to fight. The abstraction you create in the structured programming environment is very close to the reality that is implemented in the array.
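The stall-on-empty, stall-on-full behavior described above can be mimicked in software with bounded queues. The following Python sketch is only an analogy built on the standard `threading` and `queue` modules, not Ambric's tools or hardware; the names `stage`, `run_pipeline`, and `DONE`, and the default channel depth of 1, are assumptions made for this example.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the data stream


def stage(func, inbox, outbox):
    """One processing element: apply func to each input item and pass it on."""
    while True:
        item = inbox.get()          # blocks (stalls) while the input channel is empty
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(func(item))      # blocks (stalls) while the output channel is full


def run_pipeline(funcs, inputs, depth=1):
    """Chain the stages with bounded FIFO channels and collect the final outputs."""
    channels = [queue.Queue(maxsize=depth) for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage, args=(f, channels[i], channels[i + 1]))
        for i, f in enumerate(funcs)
    ]

    def feed():
        for value in inputs:
            channels[0].put(value)
        channels[0].put(DONE)

    workers.append(threading.Thread(target=feed))
    for worker in workers:
        worker.start()

    results = []
    while True:
        item = channels[-1].get()
        if item is DONE:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


out = run_pipeline([lambda v: v + 1, lambda v: v * 2], range(5))
print(out)
```

No stage knows anything about a global schedule. Each one simply waits when its neighbor is not ready, which is the property that lets the design scale without a chip-wide controller.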

Ambric is focusing their new product on the video and image processing segments – wanting to capitalize on the combination of huge computation demands and lucrative end markets. Ambric’s programming environment also fits well within the range of methodologies currently in use in that area, with some of the most popular design flows already based on The MathWorks’ Simulink product, which uses graphical dataflow capture similar to Ambric’s offering.

One of the toughest challenges for Ambric will be to win new programmers over to their structured programming environment and away from traditional, procedural programming in C and C++. Even if the methodology is superior in terms of productivity and performance, learning curves, legacy software IP, and risk aversion on new projects make design teams reluctant to leap into new and relatively unproven design methodologies. The promise of teraOPS performance at a reasonable price and power level, however, is sure to attract the attention of a lot of design teams running out of gas with today’s existing technologies and methodologies.
