Collision of Two Worlds

We are always trying to make machines that think faster. Before we finish building computers that can solve the last generation of problems, our imagination expands and we have a whole new set of challenges that require a new level of computing power. Emerging applications like machine vision can seemingly consume all of the computing power we could possibly throw at them – and then some.

For the past couple of decades, a quiet but radical minority has seen FPGAs as a magic bullet in the quest for more computing power. However, the challenges of programming FPGAs for software-like tasks were daunting, and the inertial progress of von Neumann machines surfing the seemingly eternal wave of Moore’s Law was sufficient to keep our appetites sated.

However, the monolithic von Neumann machine ran out of steam a few years ago. Instead of making larger, faster processors, we had to start building arrays of processors to work in parallel. Once we crossed that line into multiprocessor land, our happy simple time of easy-to-program hardware vanished forever. Programmers now have to understand at least some of the characteristics of the underlying computing machinery in order to write software to efficiently take advantage of many processors working in parallel.

So, software engineers – already doing some of the most complex engineering work in human history – had their job made even more unmanageable by the introduction of multiple parallel processing elements. The complexity of writing “conventional” software is gradually approaching parity with the complexity of developing custom parallel computing hardware from scratch.

So, entering the ring from the left – the newly hardware-savvy software engineer looking to take advantage of the fine-grained parallelism of FPGAs for a giant leap in computational efficiency.

Hardware engineers entered the arena from a different perspective. The rampage of Moore’s Law has kept hardware engineers perpetually working to raise their level of design abstraction. When gate-by-gate design using schematics got too complicated to manage, we went to hardware description languages (HDLs) and logic synthesis. When our HDL flow fell short of the productivity we required, we brought in blocks of pre-designed IP. When plugging IP blocks together didn’t give us the flexibility we needed, we entertained the idea of high-level synthesis (HLS).

Now we have reached a point where both hardware and software engineers can take advantage of the performance and power efficiency of FPGAs. Because of high-level synthesis, hardware engineers now have a design tool at a sufficiently high level of abstraction that they can implement many (admittedly not the most complex) algorithms directly in hardware with an incredible degree of performance and power efficiency. The complexity limitations are very real, however, and therefore the work of hardware engineers is often to isolate only the most severe computing bottlenecks in the overall algorithm, and then to implement those as something like hardware accelerators, with conventional processors handling the less demanding parts of the problem.

Typically, this means some deeply nested looping structure within the algorithm is converted into a pipelined datapath/control/memory subsystem that takes advantage of the fine-grained parallelism of the FPGA to do the heavy lifting, while some kind of conventional processor navigates the complex control structures of the larger application.
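As a rough illustration of the kind of loop nest that gets pulled out for acceleration, here is a minimal sketch of a direct-form FIR filter in plain C. The function name `fir_filter` and the tap count `TAPS` are hypothetical, chosen only for this example, and no vendor-specific HLS directives are shown. The inner loop has a fixed trip count, which is the sort of structure a tool could plausibly unroll into parallel multipliers while pipelining the outer loop.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 4  /* hypothetical filter length, chosen for illustration */

/* Direct-form FIR filter: out[i] = sum over k of coeff[k] * in[i - k].
 * Samples before the start of the input are treated as zero. */
void fir_filter(const int *in, int *out, size_t n, const int coeff[TAPS])
{
    assert(in != NULL && out != NULL && coeff != NULL);

    for (size_t i = 0; i < n; i++) {         /* outer loop: one output per input sample */
        int acc = 0;
        for (size_t k = 0; k < TAPS; k++) {  /* inner loop: fixed trip count */
            if (i >= k)
                acc += coeff[k] * in[i - k];
        }
        out[i] = acc;
    }
}
```

On a conventional processor this runs as an ordinary sequential loop; the same source is the starting point a hardware flow would restructure into a datapath.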

The hardware/software divide viewed from the software engineering perspective is very similar. Software engineers also isolate portions of the algorithm where the conventional processor is heavily loaded and move those into FPGAs for faster processing.

Since each group is approaching the hardware/software tradeoff space of heterogeneous computing from a different direction, different methodologies have emerged catering to each. The software engineer is courted by flows like OpenCL, where the message is “Come on in, the water’s fine. We can take that OpenCL code you wrote for your GPU and magically ‘compile’ it for FPGAs. You don’t need to go back to hardware school to use it. Just bring in your existing software and you’re good to go.” (Of course, this is a very rosy version of the scenario. One might even characterize it as “marketing.”)

Hardware engineers are entranced by the siren song of HLS. At first glance, it looks like a power tool for language-based design. Of course the “language” is some dialect of C (ANSI C/C++, SystemC, etc.) but the rest of the process is in the vernacular of hardware design. You are constructing datapaths and controllers, managing memory access schemes, constructing pipelines, managing the chain of register-to-register combinational logic – pretty much the same things you’d do writing RTL code, but with greatly increased speed and aplomb. You can explore twenty microarchitectures in minutes, where each one would take days to code in RTL. HLS, in the marketing sense, is truly a superpower for hardware engineers.

Today, however, neither the software- nor the hardware-centric design flow can deliver on the rosy vision painted by the marketers. OpenCL code optimized for GPUs is most certainly not optimal for FPGAs, and you’ll need to do some monkeying around to get the results you want. HLS can’t handle just any old C code, and you hardware engineers will need to train yourselves to the somewhat-limited capabilities of the tools.

Both of these flows exist today, however. People have used both approaches to successfully amp up the processing speed and decrease the power consumption of everything from radar algorithms to financial analysis to gene sequencing. With hardware and software engineers staring almost nose-to-nose at each other across the conceptual divide, we are on the verge of a convergence of epic proportions – with FPGAs at the center.

Recently, we talked to Allan Cantle (founder of Nallatech) about the state of affairs in FPGA-based computing. Cantle’s team were pioneers in FPGA-based supercomputing more than a decade ago, and both their hardware and software have continued to be among the most effective at putting FPGAs to work in challenging computing problems. Cantle pointed out that, thanks to current language and tool technologies like OpenCL and HLS, we are now arriving at a state where teams can access the benefits of FPGA-based acceleration without having resident hardware experts.

One problem that needed to be addressed, however, was form factor. Software-centric teams are not accustomed to buying chips. In fact, most software-centric teams are uncomfortable even with the idea of boards. While hardware engineers live every day in the land of silicon and FR4, software engineers like their computers to come in nice sturdy boxes – preferably with a built-in power supply. Nallatech is currently attacking that problem as well – building FPGA-based accelerator cards as well as full rack-mounted systems (with partners like HP and IBM) that take advantage of Altera’s FPGAs and OpenCL flow.

These types of platforms have FPGAs knocking on the door of the data center – a land of riches previously off limits to our LUT-laden lads. With the extreme challenges faced by data centers today in terms of computational power efficiency, there is the potential for an explosive invasion of FPGA technology. It all hinges on the success of current efforts at taming the programming model problem for these heterogeneous computing systems, and getting those hardware and software disciplines to meet nicely in the middle.

One thought on “Collision of Two Worlds”

TotallyLost says:

May 23, 2014 at 8:04 am

In the last two decades we have learned that it’s quite energy efficient to execute multi-threaded C code in LUTs to obtain highly parallel execution. Early projects like TMCC, Handel-C, and Streams-C provided the pioneering work to prove the point. From those early starting points we have seen various commercial products become successful for certain applications.

A decade ago I took the open source TMCC, renamed it FpgaC, and with the help of several others evolved it so that the language presents only a standards-compliant subset of ANSI C99. By focusing on using independent LUT memories as small arrays, and adding concurrency by several means, including a pipeline coding style, impressive degrees of parallelism are easily available with code that will also run correctly (highly portable) on a standard von Neumann machine.
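To make "pipeline coding style" and "small arrays" concrete, here is a generic sketch in portable C. It is not taken from FpgaC or its documentation; the names `pipe_t`, `pipe_clock`, and `square_lut` are hypothetical. Each call to `pipe_clock` models one clock edge of a three-stage pipeline, and the 16-entry table is the kind of small array that could plausibly map onto a LUT memory.

```c
#include <assert.h>
#include <stddef.h>

/* Small lookup table of squares (16 entries, illustrative only). */
static const int square_lut[16] = {
    0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225
};

/* Stage registers s1..s3 and the valid bits that travel with the data. */
typedef struct {
    int s1, s2, s3;
    int v1, v2, v3;
} pipe_t;

/* One call models one clock edge. Stages are updated last-to-first so that
 * each stage reads the value its predecessor held before this edge.
 * Stage 1: table lookup. Stage 2: add one. Stage 3: shift right by one. */
void pipe_clock(pipe_t *p, int x, int x_valid)
{
    assert(p != NULL);

    p->s3 = p->s2 >> 1;          p->v3 = p->v2;
    p->s2 = p->s1 + 1;           p->v2 = p->v1;
    p->s1 = square_lut[x & 0xF]; p->v1 = x_valid;
}
```

Starting from a zero-initialized `pipe_t`, a valid input appears at `s3` three calls later with `v3` set, and a new input can be accepted on every call. The same source compiles and runs unchanged on an ordinary CPU.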

http://en.wikipedia.org/wiki/FpgaC…………

With that project, we reached the limits of the very simple synthesis strategies we inherited from TMCC, while making some significant progress toward using only standards-compliant ANSI C99.

The project needed a ground-up rewrite based around a standards-compliant, OpenMP-based ANSI C language, combined with a modern synthesis algorithm optimized around a new one-hot state machine that allowed code blocks with deeper logic (longer latency from inputs to outputs) to use one or more clocks tuned for the fastest state execution, not the slowest state execution.

Even with the simple synthesis borrowed from TMCC, with some careful embedded C coding styles (plus a little manual netlist optimization), the results are outright impressive for certain algorithms.

At the time there was a LOT of hostility from the hardware engineering community for even attempting to use ANSI C99 as an HDL. In the decade that has passed, the market has clearly voted that C synthesis as an HDL is a necessary and important path.

The other limitation on using C synthesis for computing is the horrible hardware-engineer-centric tools that take hours or days to place and route the huge nets produced by large C algorithms being compiled to LUTs.

With higher-level knowledge, a good C synthesis algorithm should be able to make better subroutine/function placement and routing decisions based around the implicit name space locality of the code and variables, especially if the synthesis is optimized using profile data collected for the algorithm using real data on another CPU.

There is every reason to allow C synthesis to produce highly optimized RPMs (Relatively Placed Macros) for each subroutine or function, so that the synthesis algorithm can match the number of one-hot states needed to each code block using some standard expected latency values.

There needs to be a VERY FAST version of this compile, place, route, and run execution environment, on the order of a minute or two, that is at least 50% execution-speed optimized at -O3. If the code is a critical bottleneck in system-level performance, then sure, compiling with -O99 should allow a mode to iteratively compile, place, and route, with feedback from block latency timings used to recompile, place, and route with optimized one-hot state machine timings and synthesis-provided constraints in both placement and timing.

I still strongly believe this level of synthesis, place, and route for RPMs should be open source and shared across all FPGA vendors, with a common EDIF interface to the low-level routing and LUT optimizations provided by each vendor’s architecture-dependent LUT packing and routing tools.

This becomes particularly effective if each higher-level C function (with small inline functions subsumed) is locally synchronous and globally asynchronous. Streams-C took one attack at this; others are available, borrowing from some combination of FIFOs, PThreads, MPI, and implicit hardware arbitration functions on shared resources. Using OpenMP provides a robust, standards-compliant framework too.
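For readers unfamiliar with OpenMP, here is a minimal sketch of what standards-compliant annotation looks like in C. It shows only the standard `parallel for` directive with a `reduction` clause on a CPU; whether any particular C synthesis tool accepts this form is an assumption, and the function name `dot_product` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with the loop marked as parallelizable. The pragma states
 * that iterations are independent apart from the sum reduction. */
long dot_product(const int *a, const int *b, int n)
{
    assert(a != NULL && b != NULL && n >= 0);

    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += (long)a[i] * b[i];
    return sum;
}
```

A compiler without OpenMP support ignores the pragma and runs the loop sequentially with the same result, so the parallelism is declared in the source without giving up portability.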

In the end, it’s not using LUTs and registers for high performance computing that will make a difference; it’s being able to compile high-level, compute-intensive, common C functions directly into VLSI to be included in custom FPGA fabrics as hard IP, just as DSP, memory, and SerDes functions are today.
