Hardware Acceleration in C: Breaking Performance Barriers

by Amelia Dalton

Sometimes you need your C code to run really, really fast. But for most teams, hiring hardware experts to design accelerators just isn’t feasible. In this episode of Chalk Talk, Amelia Dalton chats with Eric Ma of Xilinx about hardware acceleration for your C applications – with no hardware design team required.

Click here for more information about Xilinx’s All Programmable SoC portfolio.

Click here for more information about Xilinx’s SDSoC Development Environment

6 thoughts on “Hardware Acceleration in C: Breaking Performance Barriers”

A decade ago, Xilinx staff openly ridiculed running C code on their FPGAs as an accelerator. They then slowly warmed up to the idea, with modest 3rd party support for these tools.

Xilinx also actively blocked open source tool development, by declaring every internal interface off limits. And still does. Strange how much Xilinx uses open source products, and still blocks 3rd party open source development efforts trying to advance and support their product line.

http://www.edaboard.co.uk/so-xilinx-is-xdl-and-related-libraries-an-available-open-so-t416773,start,15.html

A decade ago the Xilinx chips were not ready to be C accelerators. It’s not clear they are today either; a few simple tests would determine whether their power distribution is ready for 85% active switching. A 100% black/white checkerboard test would settle that quickly.

Several early designs during the development of FpgaC yielded resets, lockups, and data corruption due to more logic switching than the power rails could handle. This was particularly true on wire bond parts, and better on die attach interposer packages. At the time Xilinx denied this was a problem, until I finally managed to pin Peter down in the forums about it. I later verified it by floating some of the power and ground vias for differential signal analysis.

Normal logic designs are not very active, frequently with no more than 5-15% of the logic switching on any given clock cycle. Across the die the figure is lower still, with multi-phase clocks and multiple clock domains running at different frequencies.
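To make the switching-activity figures concrete, here is a minimal software sketch in C. It assumes a circular shift register as a stand-in for the "checkerboard" stress pattern; the function names are hypothetical and nothing here comes from FpgaC or any Xilinx tool. It only counts what fraction of the flip-flops change state on one clock.

```c
#include <assert.h>
#include <stdint.h>

/* Portable population count: number of set bits in v. */
static unsigned popcount64(uint64_t v) {
    unsigned n = 0;
    while (v) { v &= v - 1; n++; }
    return n;
}

/* Mask selecting the low `width` bits (hypothetical helper). */
static uint64_t width_mask(unsigned width) {
    assert(width >= 2 && width <= 64);
    return (width == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << width) - 1u);
}

/* One clock of a `width`-stage circular shift register. */
static uint64_t shift_clock(uint64_t s, unsigned width) {
    uint64_t mask = width_mask(width);
    s &= mask;
    return ((s << 1) | (s >> (width - 1u))) & mask;
}

/* Percentage of the `width` flip-flops that toggle on one clock. */
static unsigned toggle_percent(uint64_t s, unsigned width) {
    uint64_t now = s & width_mask(width);
    return popcount64(now ^ shift_clock(now, width)) * 100u / width;
}
```

With an even width, an alternating pattern such as 0xAAAAAAAAAAAAAAAA flips every stage on every clock, which is the worst case described here, while a register holding a single set bit toggles only two stages per clock.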

When you pack an FPGA full of a highly optimized pipelined design, a significant portion of the chip will become active on every clock edge, leading to power and ground rail bounces. While we were able to show our client a functional prototype, at reduced capacity, there was not a viable path to market at full performance due to this switching limitation. It’s hard to tell your client that they can only use 15% of their $8,000 FPGA accelerator board.

If you don’t pipeline a data intensive design with significant combinational logic delays, then only a small fraction of the logic will be active, with significantly poorer performance in both latency and dollars. This includes loop expansion optimizations as well.

One of the early C systems translated sequential C code into a large sequential state machine, and avoided the switching problem … while placing a hard limit on performance. In FpgaC we fully exploited available parallelism.

Fully parallel pipelining is really easy in C (if the C synthesis does its job right). Just break the code into small blocks per variable assignment, and invert the sequential execution path. This may require storage of variables that are not updated on every loop, so the correct time-shifted value is available. Long sequential dependencies leading to long combinational paths are broken up, leading to many concurrent blocks with short combinational latencies. There are examples in the FpgaC example sources.
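One way to read that recipe is sketched below as a plain C software model. This is an interpretation for illustration, not code from the FpgaC sources; the three-step arithmetic and the names `chain`, `pipe_regs`, `pipe_clock`, and `pipe_run` are all hypothetical. Each assignment becomes its own stage with a stored value, and the stages are evaluated last-to-first so that each one reads what its predecessor held on the previous clock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sequential reference: one long dependency chain per sample. */
static uint32_t chain(uint32_t x) {
    uint32_t a = x * 3u;
    uint32_t b = a + 7u;
    uint32_t c = b ^ (b >> 3);
    return c;
}

/* One stored value per assignment: these model the pipeline registers. */
typedef struct { uint32_t a, b, c; } pipe_regs;

/* One clock. The assignments run in inverted order, so stage c uses the
 * b from the previous clock and stage b uses the a from the previous
 * clock. In hardware all three stages would switch concurrently. */
static uint32_t pipe_clock(pipe_regs *r, uint32_t x) {
    r->c = r->b ^ (r->b >> 3);
    r->b = r->a + 7u;
    r->a = x * 3u;
    return r->c;
}

/* Feed n samples through the model. The result for in[i] appears two
 * clocks after it enters, so two extra clocks flush the pipeline. */
static void pipe_run(const uint32_t *in, uint32_t *out, size_t n) {
    assert(in != NULL && out != NULL);
    pipe_regs r = {0u, 0u, 0u};
    for (size_t i = 0; i < n + 2; i++) {
        uint32_t y = pipe_clock(&r, i < n ? in[i] : 0u);
        if (i >= 2) out[i - 2] = y;
    }
}
```

The model produces the same results as `chain`, only delayed by two clocks. The trade is a latency of three short stages in exchange for one result per clock, with every stage active at once, which is exactly the high switching load discussed above.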

While we did a few Xilinx FPGA projects after this, the poor professionalism on Xilinx’s part left a huge bad taste. Austin freely bashed failing competitive concerns, and crowed loudly at their failures due to the competitive stance of Xilinx. There is nothing right about being happy that other engineers have lost their jobs, or that many moms’ and pops’ retirement money was lost or reduced in the failures … there is plenty of business and money to go around.

Austin made many false statements, including that Xilinx didn’t abruptly EOL or remove support for products, like “other” competitors did. Xilinx did abruptly abandon tool chain support for the XC4k parts, which were a significant part of the educational support market. Overnight, no new XC4k synthesis licenses were available from Xilinx, leaving a large number of educational and proto boards unusable in the market.

At the time I was sitting on thousands of XC4k parts, ready to produce some high end hobby/educational boards. This abrupt change forced a loss of over $70K after liquidating most of the higher value parts, because no available/affordable tool chain remained in the educational/hobby market.

Xilinx/Austin/Peter practiced a scorched earth policy toward anyone that protested … hardly wins/keeps customers. We stopped FpgaC development, and it’s been a very long time since I’ve purchased another Xilinx FPGA. No reason to support companies that actively try to burn you when you complain about poor practices.

Took a while but I found the original thread on what Xilinx called “perverse designs” where there were high switching loads in an otherwise fully valid design for Virtex series parts that caused system level resets due to power rail issues. This was one of many threads where Austin and Peter tried to bully and derail the subject of Xilinx problems with direct personal attacks. When the personal attacks failed, both Peter and Austin later acknowledged the product defects for designs of this “perverse” class.

Austin’s warning is that, while resets are unlikely, there were still serious issues with noise induced jitter, at least as late as Virtex 5 parts, even with the best possible external power system.

http://compgroups.net/comp.arch.fpga/fastest-fpga/68304

Austin (of Xilinx) wrote on 8/25/2006 2:45:00 PM:

So, even this perverse design (in V4, or V5) is now able to run, in the
largest part.

I still do not recommend it, as the THUMP from all that switching leads
to not being able to control jitter, even with the best possible power
distribution system. I believe that V4 and V5 will require some
internal ‘SSO’ usage restrictions, as they are not like earlier devices
which would configure, DONE would go high, and then immediately reset
and go back to configuring if you tried to instantiate and run a full
device shift register.

Austin

So three questions for Xilinx:

1) Which large FPGAs will Xilinx stand behind for dense, highly optimized designs with high switching loads?

2) Will Xilinx continue the “kill the messenger” “scorched earth” burning of engineers that have problems with these high switching load designs?

3) Will Xilinx finally open up internal interfaces for 3rd party open source initiatives to advance the state of the art in reconfigurable computing?

It’s been painful to wait a decade … and still not know if these chips can even perform in a data center.

Hopefully Intel is listening, and can correct these issues with Altera parts.

I guess from the silence, the answer from Xilinx is simply that NONE of their devices are ready.
