
Power Parallelism

Ambric Accelerates Algorithms


Computing architectures have reached a critical juncture. The monolithic microprocessor has collided with the thermal wall with a resounding, “Ouch! That’s too hot!” Traditional processor architects have moved on to dual- and multi-core processors, pursuing some parallelism to mitigate their power problems. Other technologies have responded with “Hey, if parallelism is the solution, why not really go for it?” Compute acceleration with devices like FPGAs can boost the number of numbers crunched per second per Watt by orders of magnitude, but programming them is an activity akin to custom hardware design – hardly safe territory for the average software developer seeking to speed up his favorite algorithm.

What so many have been seeking is an architecture that combines efficient parallelization of the performance-critical core of demanding algorithms with the ease of programming and predictability of traditional von Neumann-based machines. A number of approaches to the problem have emerged over the past couple of years, each working to overcome the key obstacle of ease of programming while delivering on the promise of parallelism.

This week, Ambric announced their entry into the parallelism party. Their new devices are massively parallel processor arrays based on a globally asynchronous, locally synchronous (GALS) architecture. GALS attacks the global synchronization problem created by complex algorithms spanning large semiconductor devices. As our devices get larger, clock propagation times increase, and algorithms grow more complex, scheduling events across resources on a large device and keeping all the hardware synchronized with some global control scheme becomes increasingly difficult. GALS keeps the synchronous part of the design local to small areas of the chip and makes interactions between those areas asynchronous. As a result, GALS improves the scalability of algorithms across large, parallel devices by localizing the problem of synchronization.

The closest cousins to Ambric’s architecture are probably FPGAs on one side and traditional multi-core processors on the other. Like FPGAs, Ambric’s devices are based on a large array of similar processing elements with programmable interconnect. Like multi-core processors, however, each of those elements is a von Neumann-style processor executing a software algorithm. The goal of this architecture is to deliver something approaching the performance and power efficiency of FPGA-based computing with ease of programming much closer to that of traditional processors.

Ambric approached the task of accelerated computing from the ground up, using the programming model as their ground. This is in sharp contrast to the FPGA-based computing model, where the programming environment is strapped on as an afterthought, yielding something logically akin to strapping rocket packs on a duck. In order to understand what is elegant about the Ambric model, it is probably best to look at what is inelegant about the idealized FPGA-based one. If we had the perfect “ESL” tool for FPGA-based algorithm design, we’d start with our algorithm in some abstract sequential programming language like C or C++. Our algorithm might not be sequential, but the semantics of the most common programming languages are, so we are forced to mangle our algorithm into sequentiality in order to capture it in code in the first place.
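To make that mangling concrete, here’s a hypothetical sketch (the function name and data are invented for illustration): an element-wise operation that is parallel by nature, yet captured as sequential nested loops because that’s all the language’s semantics allow.

```python
# An element-wise operation is parallel by nature: every output element
# could, in principle, be computed at the same time. Sequential C-style
# semantics force us to serialize it as nested loops.

def scale_and_offset(matrix, gain, offset):
    """Element-wise op captured sequentially, though no iteration
    depends on any other -- the parallelism is now only implicit."""
    rows, cols = len(matrix), len(matrix[0])
    result = [[0] * cols for _ in range(rows)]
    for r in range(rows):          # these nested FOR loops are really
        for c in range(cols):      # an array of independent operations
            result[r][c] = matrix[r][c] * gain + offset
    return result

print(scale_and_offset([[1, 2], [3, 4]], 2, 1))  # [[3, 5], [7, 9]]
```

Nothing in the loop body depends on a previous iteration – that independence is exactly what the idealized compiler in the next step has to rediscover.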

Now, with our shiny sequential model all debugged and ready to rumble – nested FOR loops masquerading as arrays of parallel processing elements – we hand our creation off to our idealized FPGA-based compiler. First, this compiler would have to reverse-engineer our previous mangling efforts and re-extract the parallelism from our newly sequential model. It would look through our code, finding data dependencies, and it would construct something akin to a data flow graph (DFG) or a control and data flow graph (CDFG). (Remember these concepts – we’ll come back to them later; there will be a quiz!) With this graph in hand, it would then attempt to map each node on the graph to some datapath element constructed out of hardware available on our target device, in this case, an FPGA.
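As a rough illustration of the dependency analysis involved – a toy sketch, not any real compiler’s algorithm, with invented names throughout – a single pass over straight-line code can recover data-flow edges by tracking which statement last defined each value:

```python
# Toy data-flow-graph extraction for straight-line code: each statement
# becomes a node, and an edge runs from the statement that defines a
# value to each statement that reads it.

def build_dfg(statements):
    """statements: list of (target, [operands]) tuples, in program order."""
    last_def = {}    # variable name -> index of the node that defined it
    edges = []       # (producer_node, consumer_node) pairs
    for node, (target, operands) in enumerate(statements):
        for op in operands:
            if op in last_def:                 # data dependency found
                edges.append((last_def[op], node))
        last_def[target] = node                # this node now defines target
    return edges

# t0 = b * c;  t1 = t0 + d;  t2 = t1 * t0
program = [("t0", ["b", "c"]), ("t1", ["t0", "d"]), ("t2", ["t1", "t0"])]
print(build_dfg(program))    # [(0, 1), (1, 2), (0, 2)]
```

The edge list is the skeleton of the DFG: nodes with no path between them are candidates for parallel execution.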

Then, with its list of datapath computing elements and a nice graph showing how the data should flow through them, our FPGA compiler would set about the daunting challenge of creating a schedule that would allow the data to flow through our synchronous device in an orderly fashion and designing a nice controller that could stand on the big parallel podium conducting our sequential symphony of semiconductor computing magic.

Once our compiler had worked out all the magic at the logic level, it would hand off its composition as a network of interconnected logic cells, and place and route software would begin the treacherous task of trying to meet timing across a big ol’ FPGA with clock lines running from here to February. If the timing didn’t all work out perfectly, we’d be left with a human-driven debugging problem, trying to fix timing on a logic design that bears absolutely no resemblance whatsoever to our original nice little C program.

As you can see, even with an ideal FPGA-based development flow, the unnatural evolution of our algorithm from parallel to serial – back to parallel – then into sequential logic, then into a synchronous physical timing design yields something less than conceptual elegance.

In Ambric’s process, however, the programming model is closer to the original algorithm, and the actual hardware maps very directly to the programming model. If your algorithm, like many, is a set of small, sequential processes running in parallel and passing data down the line, you’re really already at the GALS level to begin with. Ambric starts you in a graphical programming environment, stitching together algorithmic blocks in a dataflow sequence. Each of these algorithmic objects can be represented by a block whose behavior could be captured in a code snippet. By stitching these objects together in a graph showing data movement we’ve created…
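A loose analogy – assuming nothing about Ambric’s actual tools, with block names invented here – is chaining Python generators: each generator is a small sequential code snippet, and the way we compose them is the data-movement graph itself.

```python
# Each "object" is a small sequential code snippet; stitching their
# streams together IS the dataflow graph. Python generators stand in
# for the algorithmic blocks.

def source(n):                 # block 1: produce a sample stream
    for i in range(n):
        yield i

def scale(stream, k):          # block 2: small sequential kernel
    for x in stream:
        yield x * k

def clamp(stream, lo, hi):     # block 3: another kernel downstream
    for x in stream:
        yield max(lo, min(hi, x))

# Composing the blocks describes the data movement between them.
pipeline = clamp(scale(source(6), 3), 0, 10)
print(list(pipeline))          # [0, 3, 6, 9, 10, 10]
```

Each block reads from its upstream neighbor and knows nothing about the rest of the graph – the same locality that makes the GALS mapping so direct.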

OK, put away your textbooks and get out a pencil and a blank sheet of paper. It’s time for that pop quiz. Essentially, what you’re doing is … anyone? Right. You’re manually creating your own control and data flow graph (CDFG) as we discussed back in paragraph 6. If you can get past that idea, the rest of the process is stunningly smooth. Unlike our FPGA-based example above, Ambric’s compiler doesn’t have a lot of heavy lifting to do to map your source algorithm into its processing elements.

Each of your objects becomes a code fragment running on one of the processors on Ambric’s array. These processing elements run asynchronously with respect to each other, communicating through Ambric’s specialized channels, which are composed of FIFO-style registers with control that stalls each processing element if its input is not yet ready or if its output pipe is full. Because of this asynchronous, buffered design, there are no global synchronization problems to solve, and thus no epic timing closure battle to fight. The abstraction you create in the structured programming environment is very close to the reality that is implemented in the array.
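Here’s a rough software model of that channel behavior – not Ambric’s actual hardware or API, just a sketch of the idea – using a bounded queue between two free-running workers. A blocking `put` on a full queue plays the role of the output stall, and a blocking `get` on an empty queue plays the role of the input stall.

```python
# Model of a stalling FIFO channel between two asynchronous "processors":
# queue.Queue blocks the producer when the channel is full and the
# consumer when it is empty -- no global clock or schedule required.

import queue
import threading

channel = queue.Queue(maxsize=2)     # small FIFO between the workers
results = []

def producer():
    for x in range(5):
        channel.put(x)               # stalls here if the FIFO is full
    channel.put(None)                # end-of-stream marker

def consumer():
    while True:
        x = channel.get()            # stalls here if the FIFO is empty
        if x is None:
            break
        results.append(x * x)

workers = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in workers:
    t.start()
for t in workers:
    t.join()
print(results)                       # [0, 1, 4, 9, 16]
```

The two threads never agree on a schedule in advance; correctness falls out of the channel’s local stall discipline, which is the essence of the GALS approach described above.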

Ambric is focusing their new product on the video and image processing segments – wanting to capitalize on the combination of huge computation demands and lucrative end markets. Ambric’s programming environment also fits well within the range of methodologies currently in use in that area, with some of the most popular design flows already based on The MathWorks’ Simulink product, which offers graphical dataflow capture similar to Ambric’s.

One of the toughest challenges for Ambric will be to win new programmers over to their structured programming environment and away from traditional, procedural programming in C and C++. Even if the methodology is superior in terms of productivity and performance, learning curves, legacy software IP, and risk aversion on new projects make design teams reluctant to leap into new and relatively unproven design methodologies. The promise of teraOPS performance at a reasonable price and power level, however, is sure to attract the attention of a lot of design teams running out of gas with today’s technologies and methodologies.
