“It was the best of times, it was the worst of times,
it was the age of hardware design, it was the age of programming,
it was the epoch of synthesis, it was the epoch of compilation,
it was the season of optimization, it was the season of acceleration,
it was the spring of flexibility, it was the winter of automation,
we had everything before us, we had nothing before us,
we were all going direct to Hardware, we were all going direct
the other way–in short, the period was so far like the present
period, that some of its noisiest authorities insisted on its
being received, for good or for evil, in the superlative degree
of comparison only.”
– Apologies to Charles Dickens
Our tale of two cities begins with researchers and engineers from two distinct camps with distinct goals attacking the same technical challenge from two different directions. On one side, electronic design automation (EDA) is working to raise the level of design abstraction for hardware engineers. High-level synthesis tools like Celoxica’s DK Design Suite and Mentor’s Catapult C flow forth from EDA, promising to revolutionize hardware design. Across the technological channel (the one that separates software and hardware engineering), high-performance reconfigurable computing companies like SRC are looking for a way to open up the awesome power of hardware acceleration to programmers needing new performance levels they can’t achieve with modern Von Neumann machines. The two camps meet at the FPGA.
It would seem on the surface like these two groups are laboring along the same lines to solve the same problem. Both want to start with an algorithm described in an inherently sequential programming language (like C) and generate parallelized hardware architectures to accelerate execution. With the arrival of the FPGA, both groups are now even targeting the same final platform. Both have had to address the daunting behavioral synthesis problems of allocation and scheduling. Both have figured out how to generate interfaces to pass data efficiently between the hardware embodiment of the algorithm and the outside world.
As in many endeavors, though, the devil is in the details. In this case, the “details” relate back to the starting assumptions that each group made about the hardware environment, the user’s expertise, and the criteria for success. These items have conspired to create a situation where seemingly similar solutions diverged dramatically.
On SRC’s side, the hardware platform is essentially fixed. Their MAP processor contains Xilinx XC2V6000 devices for control and “user logic” backed up and buffered by dual-port memory. Adequate power supply and cooling are already supplied. Their hardware was designed and built before any specific user’s application was known. In the process of creating a compiler to compile target software to their reconfigurable processing element, SRC didn’t need to worry about many of the normal hardware design issues like power, logic density (as long as the design fit), speedgrade, or device size.
[Figure: SRC reconfigurable MAP processing element.]
SRC’s goal was to build a compiler for their high-performance computers that could take advantage of the speed and reconfigurability of FPGAs to parallelize and accelerate complex, performance-critical pieces of software applications. SRC’s “Implicit+Explicit” approach combines implicitly controlled elements like traditional microprocessors and DSPs with explicitly controlled devices like FPGAs. The mission of their “Carte” programming environment is to take C and FORTRAN applications and distribute them intelligently between microprocessors and SRC’s “MAP” FPGA-based processing elements in the most seamless fashion possible.
On the EDA side, the goal is to design new hardware. If an application can be squeezed into fewer logic elements or made to consume less power, the final product will cost less and work better. Profit margins are up for grabs. If the design team using the tool is targeting ASIC, every gate counts. Even when targeting FPGAs, a size level or speedgrade can change the silicon price in 30% increments. Clock frequency is a critical design constraint as it goes hand-in-hand with power consumption and heat dissipation requirements.
SRC’s typical target user is a software developer proficient in C or Fortran and often expert at optimizing code for maximum performance on traditional compilers targeting Von Neumann machines. SRC’s typical user is not a digital hardware designer. Their customer’s closest brush with parallelism probably came from learning object-oriented design or dealing with event-based systems like graphical user interfaces (GUIs). This user needs a tool that looks and acts like a conventional compile/debug environment while secretly designing custom hardware behind the scenes.
EDA’s target user is more often an EE with hardware description language (HDL) experience. He is used to thinking of algorithms in dataflow terms with datapath, memory, and control elements such as finite state machines or microcontrollers. Design of these elements at the register-transfer level (RTL) is a tedious and exacting process replete with subtle hazards. An algorithm that might take only 100 lines of C code can easily explode into 5,000 lines of RTL HDL correctly detailing a hardware architecture that implements it efficiently. This user needs a tool that can accelerate the process of architectural exploration and facilitate the rapid generation of optimized hardware without manual RTL coding.
Back on the SRC side, the user’s criterion for success is performance, plain and simple. If the algorithm executes faster after hardware acceleration than it did in pure software, the operation is an unqualified success. The degree of reward depends on whether the performance-critical components improve by a factor of two, twenty, or two thousand, but satisfaction kicks in at a very low threshold.
The EDA user has many success criteria and, as is usually the case in engineering, some of them are in conflict. The EDA user is trading quality of results for productivity in a delicate balance. Is it worthwhile to get to market twice as fast with a new product if it requires a 200% cost premium or burns four times the power? The EDA user wants a tool that can give results very, very close to hand-coded RTL based on a human-designed micro-architecture. If a tool can’t provide that automatically, most users are willing to manually intervene right up to the level of hand-crafted RTL.
When these differences translate into tool requirements, reconfigurable computing’s priorities are maximum ease-of-use for a programmer and maximum algorithm acceleration from a given, fixed hardware platform. EDA’s priorities are productivity improvement for an HDL-schooled hardware designer and maximum quality of hardware results in area, performance, and power. As we will see later, these differences have a substantial impact on the nature of the technology each group has developed.
To understand the impact of these requirements on the design of hardware architectures, let’s consider a typical performance-critical loop in a hypothetical software application. In our loop (it’s a nested set, of course), we are traversing large two-dimensional arrays and performing multiplication and addition of pairs of elements. If we think of this problem in hardware terms, we could conceive of a completely combinational network of multipliers and adders (where the number of multiplier units might be in the thousands) doing the operation in one giant chunk. We could also conceive of a very different solution where a single multiplier unit (since they’re expensive in hardware terms) is shared and scheduled into as many clock cycles of a sequential solution as would be required to complete all of the array operations.
Because both solutions are using hardware multipliers and control, both probably achieve a significant performance improvement over a software-only solution. The two hardware architectures could be a factor of thousands apart, however, on cost and raw throughput. For the hardware engineer, any point along the performance/cost continuum could conceivably be the optimum mix, depending on his specific design goals. Perhaps a hundred-multiplier solution with partial resource sharing might give optimal acceleration while still fitting on a cost-effective FPGA. Doubling or halving the number of multipliers might double or halve the cost of the silicon. The end-user of the EDA tool needs a manual control to aim his results at or near the optimal point in this 1000X range.
For the reconfigurable computing customer, however, the problem is much simpler. The degree of parallelism is limited by the hardware resources already available in the reconfigurable processing unit. In SRC’s case, this would be the number of hardware multipliers in the Virtex devices contained in their MAP processor. Using fewer resources than those available leaves performance on the table and gains nothing. A manual control for the reconfigurable customer would be distracting and potentially destructive.
For the hardware designer that uses EDA’s tool, the interface from the target hardware to the outside world is a blank canvas. Anything from a simple parallel interface with primitive handshaking to a complex DMA, bus protocol, or SerDes interface might be needed to connect the design module to the outside world. The bandwidth and sequential nature of the interface chosen could have significant impact on the scheduling requirements of the datapath accelerator itself. There’s clearly no point in having a datapath with ten times the throughput capacity of the I/O channel. The design of the datapath is best completed with a detailed understanding of the I/O pipes at each end.
Since SRC’s customer isn’t designing their own interface to the FPGA, SRC’s MAP compiler isn’t required to adapt to diverse I/O environments. It can optimize each design for the I/O standard already connected to the FPGA. Like the other “known” elements of the design, this helps the compiler optimize the remaining variables more easily.
Back in the hardware/EDA environment, there’s always a latency/throughput/power space to consider for design optimization. Adding pipeline stages to a design increases throughput, but at the cost of increased latency (and sometimes power, due to higher operating frequencies). This decision can have deep implications on the nature of the results, so the high-level synthesis tools for hardware designers almost always make this controllable by the user.
With a guaranteed adequate power supply and cooling, however, and a relatively fixed Fmax specified by the FPGA implementation, SRC’s decision space is much more constrained. Instead of leaving clock frequency and pipeline parameters as controls for the user, SRC assumes a frequency close to the maximum at which they can reliably meet timing constraints with Xilinx’s place-and-route. This assumption nails down yet another corner of the tent, and makes automation of SRC’s compilation all the more reasonable.
At some future juncture, all these high-level language tools may converge. The EDA tools may attain the ease-of-use of a typical software compiler, and the reconfigurable computing compiler may get quality-of-results equivalent to custom-designed hardware. For now, however, the tools are not the same, regardless of how similar they may sound on the surface. SRC may well lead the pack in making reconfigurable computing acceleration both transparent and effective for the programmer. Hardware-centric tools like Mentor’s Catapult C may give hardware engineers the power tool they need for designing optimized modules directly from algorithmic code. Middle-ground tools like Celoxica’s may offer the embedded systems designer a superb platform for making hardware/software tradeoffs at the system level, and may bleed over into the high-performance computing space because of their more software-oriented bias.
Today, SRC’s MAP compilers are able to take advantage of these special assumptions to provide a clean path from software to hardware acceleration for programmers wanting to take advantage of their unique high-performance reconfigurable computing capabilities. In an industry that’s always had to race to stay ahead of Moore’s Law, reconfigurable computing with FPGAs combined with automated hardware compilation from software programming languages may give high-performance computing the breathing room it needs for a new revolution.