Accelerating C Software Applications Using a CompactFlash FPGA Accelerator Card

As the cost per gate of FPGAs continues to plummet, developers of embedded software applications are being presented with increased opportunities to create high performance, hardware-accelerated systems. These systems—which represent applications in domains ranging from image processing and DSP to larger-scale applications for scientific computing—benefit from the massive levels of parallelism that are available when FPGAs are used as alternatives to traditional processor architectures.

This article describes how the convergence of easier-to-use, more powerful FPGA-based computing platforms and software-to-hardware design tools can make the design of accelerated FPGA-based algorithms easier and more practical for software application developers. The article also presents actual performance numbers for a benchmark algorithm, helping to illustrate the speedups that are possible when FPGAs are included in an overall embedded computing strategy. These results are easily achievable for other computationally-intensive algorithms using current-generation FPGA platforms and available tools.

Overview

It is widely understood that FPGAs have the potential to radically change the way that high-performance embedded applications are created. These devices can accelerate many applications by multiple orders of magnitude, by virtue of their massively parallel structures. But to achieve this level of acceleration it has, until recently, been necessary to apply low-level, FPGA-specific hardware design skills and tools. This fact has greatly limited the penetration of FPGAs as mainstream computing platforms.

On the hardware side, FPGAs have presented numerous challenges to those not intimately familiar with hardware engineering concepts. FPGA prototyping boards and systems have helped to some extent, but the sheer complexity of creating and programming a system based on an FPGA has remained challenging for all but the most hardware-savvy software programmers. This is changing, however. With the introduction of simplified, CompactFlash form-factor FPGA computing platforms and streamlined tools for platform building, software engineers now find that most of the hardware barriers to FPGAs have been removed.

On the software side, the introduction of software-to-hardware tools for FPGAs has dramatically improved the practicality of these devices as genuinely software-programmable computing platforms.

Software programming tools for FPGAs add value and improve results of applications by providing an appropriate and easily understood abstraction of the target platform, whether that platform is a single FPGA accelerator connected to an external processor or a larger-scale FPGA-based grid computer. As with any software tool designed for any computing target, a good abstraction of the FPGA-based platform allows software developers to create, test and debug relatively portable applications while encouraging them to use programming methods that will result in the highest practical performance within the constraints of the target.

Software tools also help by automating the process of compilation from higher levels of abstraction (C-language, for example) into an optimized low-level equivalent that can be implemented (loaded and executed) on the target platform. In an ideal tool flow, the specific steps of this process would be of no concern to the programmer; the application would simply operate at its highest possible efficiency through the magic of automated tools. In practice this is rarely the case: any programmer seeking high performance on any type of computing platform, including traditional processors, must have at least a rudimentary understanding of how the optimization and code generation or mapping process works, and must exert some level of control over the process either by adjusting the flow (specifying compiler options, for example) or by revisiting the original application and optimizing at the algorithm level, or both.

When FPGA-based platforms are the target, software-to-hardware tools must address both the automatic compilation/optimization problem and the need for appropriate programming abstractions, or programming models, for non-traditional, highly parallel computing devices.

A CompactFlash Interface FPGA Accelerator Card

Providing software programmers with an easy-to-use hardware platform is a critical first step in making FPGAs viable as general-purpose computing devices. With this in mind, Pico Computing set out to develop a card that would use the latest-generation FPGA devices but require minimal hardware understanding on the part of the user. The result is the Pico E-12, a PCMCIA-compatible, CompactFlash form-factor card that draws less than one watt and is supported by platform development and programming tools appropriate for use by software developers. This platform advances desktop and portable computing to a new level by providing massively parallel hardware computing resources in a low-power, self-contained package (Figure 1).

Figure 1. The Pico Computing EP card offers an extremely powerful embedded system in a CompactFlash package, built around a Xilinx Virtex-4 FX12 FPGA that includes a 450 MHz PowerPC processor and 12,312 configurable logic cells. The E-12 EP also comes equipped with Gigabit Ethernet, 64 MB of Flash ROM, 128 MB of PC133 RAM, and a 16-bit CompactFlash interface, making this tiny machine capable of accelerating many algorithms in hardware with the flexibility of software.

The Pico E-12 is based on the latest-generation Xilinx Virtex-4™ FPGA. There are two versions of the Pico E-12: Logic Optimized (LO) and Embedded Processor (EP). The Logic Optimized version offers the most user-configurable logic, while the Embedded Processor version provides a reduced amount of FPGA logic but adds an embedded PowerPC™ processor. In either case, the FPGA device is completely reconfigurable through the E-12’s CompactFlash interface, with no external power or cabling required, either for programming or for normal operation.

C-to-Hardware Tools Increase Design Productivity

Having a hardware platform that is physically accessible to the software programmer is one thing, but it is of little use to that programmer without a method of creating FPGA-based applications that does not require hardware design knowledge. Impulse C and the Impulse CoDeveloper tools give software programmers access to FPGAs by allowing hardware accelerators to be compiled directly from software descriptions.

Figure 2. The CoDeveloper tools simplify the conversion of higher-level software algorithms to lower-level FPGA logic, and provide necessary software-to-hardware communications.

Impulse C enables true software programming of FPGA devices using the C language. The Impulse C tools allow FPGA algorithms to be developed and debugged using popular C and C++ development environments including Microsoft Visual Studio™ and GCC-based tools. The CoDeveloper software-to-hardware compiler translates specific C-language processes to low-level FPGA hardware, while optimizing the generated logic and identifying opportunities for parallelism. The Impulse C compiler analyzes the user’s C code and collapses multiple statements and operations into single-clock instruction stages. The compiler is also capable of unrolling loops and generating loop pipelines to exploit the extreme levels of parallelism possible in an FPGA. Instrumentation and monitoring functions generate debugging visualizations for highly parallel, multi-process applications, helping system designers identify dataflow bottlenecks and other areas for acceleration.
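To make the idea of loop unrolling concrete, the following hand-written C sketch shows a simple multiply-accumulate loop next to a manually four-way unrolled equivalent. It only illustrates the kind of transformation described above; it is not output from the Impulse C compiler, and the function names mac_rolled and mac_unrolled4 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop: one multiply-accumulate per iteration. */
int32_t mac_rolled(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (int32_t)a[i] * b[i];
    return sum;
}

/* Hand-unrolled by four. The four products in each iteration do not
 * depend on one another, so parallel hardware could compute them in
 * the same clock cycle. Assumes n is a multiple of 4 (a simplification
 * made for this sketch). */
int32_t mac_unrolled4(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        s0 += (int32_t)a[i]     * b[i];
        s1 += (int32_t)a[i + 1] * b[i + 1];
        s2 += (int32_t)a[i + 2] * b[i + 2];
        s3 += (int32_t)a[i + 3] * b[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

Both functions return the same result for the same inputs; the unrolled form simply exposes four independent operations per iteration, which is the property a hardware compiler can exploit.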

For applications involving embedded processors, the Impulse C compiler automates the creation of hardware/software interfaces and generates outputs compatible with popular FPGA synthesis and system-builder tools including Xilinx Platform Studio™ and Altera SOPC Builder™. This makes it possible to create high performance, mixed hardware/software applications for FPGA-based platforms without the need to write low-level VHDL or Verilog.

The Impulse tools optimize the C code to exploit the FPGA’s parallel processing capability, resulting in potentially large factors of acceleration. The Impulse tools also generate the required software-to-hardware interfaces, allowing data to be moved efficiently between the FPGA and the optional on-board PowerPC™ processor, and between the E-12 card and a host PC (see Figure 3).

Figure 3. The Impulse C CoDeveloper tools provide software programming, from C-language, of both the embedded PowerPC™ processor and the on-board FPGA logic. Automated extraction and optimization of statement-level parallelism coupled with a programming model supporting hardware/software partitioning—all in the C language—makes true software programming of FPGA accelerators a reality.

Benchmarking FPGA Acceleration

Using the Pico E-12 in conjunction with Impulse C, we set out to compare the relative performance of three potential computing targets (a Virtex-4 FPGA, an embedded PowerPC 405 processor and a desktop Pentium processor) on a computationally intensive problem, using only C programming techniques.

The algorithm selected was a fractal image generator, which involves a large number of iterative multiply operations. The algorithm was originally implemented using floating point on a standard Pentium-based desktop computer, and can be summarized by the following C code:

    // Calculate points
    c_imag = ymax;
    for (j = 0; j < YSIZE; j++) {
        c_real = xmin;
        for (i = 0; i < XSIZE; i++) {
            z_real = z_imag = 0;
            // Calculate z0, z1, ... until divergence or maximum iterations
            k = 0;
            do {
                tmp = z_real*z_real - z_imag*z_imag + c_real;
                z_imag = 2.0*z_real*z_imag + c_imag;
                z_real = tmp;
                result = z_real*z_real + z_imag*z_imag;
                k++;
            } while (result < 4.0 && k < MAX_ITERATIONS);

            // Map points to gray scale: change to suit your preferences
            B = G = R = 0;
            if (k != MAX_ITERATIONS) {
                R = G = B = k > 255 ? 255 : k;
            }

            putc(B, outfile); putc(G, outfile); putc(R, outfile);
            pixelCount++;
            c_real += dx;
        }
        c_imag -= dy;
    }

The code above implements the fractal image generator using double-precision floating point. For each point in a specified X-Y plane representing a given range of values, the real and imaginary components are iteratively calculated to determine whether they remain bounded or diverge toward infinity. The maximum number of iterations allowed for establishing divergence determines the accuracy of the generated image.

When run with a MAX_ITERATIONS value of 10,000, this algorithm produces an image such as the one displayed in Figure 4. (The pattern generated depends on the range values provided in the variables ymax, ymin, xmax and xmin.)

Figure 4. A generated fractal image (without color mapping).

To implement this same algorithm in both an FPGA and in an embedded PowerPC processor, it was first necessary to convert the algorithm from floating point to fixed point. This is necessary because FPGAs available today are not equipped with hardware-accessible floating point units. Floating-point to fixed-point conversion (which for this example required a few hours of work, using macros supplied by Impulse) results in C code which can be compiled to an FPGA with a reasonable level of efficiency. (The fixed-point version of the algorithm is available online at www.ImpulseC.com.)
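As a rough illustration of what such a conversion involves, the sketch below rewrites the inner loop of the floating-point listing using 32-bit fixed-point values. The fixed_t type, the Q8.24 format, and the FX_FROM_DOUBLE and FX_MUL macros are assumptions made for this example; they are not the macros supplied by Impulse, and the actual fixed-point version of the benchmark may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical Q8.24 format: 8 integer bits (including sign) and
 * 24 fractional bits, giving a range of roughly -128 to +128. */
#define FRAC_BITS 24
typedef int32_t fixed_t;

/* Convert a double constant to fixed point (hypothetical macro). */
#define FX_FROM_DOUBLE(d) ((fixed_t)((d) * (double)(1 << FRAC_BITS)))

/* Multiply two fixed-point values using a 64-bit intermediate.
 * Division is used instead of a shift so the result is well defined
 * for negative products; hardware would typically use a shift. */
#define FX_MUL(a, b) \
    ((fixed_t)(((int64_t)(a) * (int64_t)(b)) / ((int64_t)1 << FRAC_BITS)))

/* Fixed-point version of the inner do/while loop. Returns the number
 * of iterations performed for the point (c_real, c_imag).
 * Assumes |c_real| and |c_imag| are at most 2.0 so that every
 * intermediate value stays within the Q8.24 range. */
int mandel_iterations_fixed(fixed_t c_real, fixed_t c_imag, int max_iterations)
{
    const fixed_t four = FX_FROM_DOUBLE(4.0);
    fixed_t z_real = 0, z_imag = 0, result;
    int k = 0;

    assert(max_iterations > 0);
    do {
        fixed_t tmp = FX_MUL(z_real, z_real) - FX_MUL(z_imag, z_imag) + c_real;
        z_imag = 2 * FX_MUL(z_real, z_imag) + c_imag;
        z_real = tmp;
        result = FX_MUL(z_real, z_real) + FX_MUL(z_imag, z_imag);
        k++;
    } while (result < four && k < max_iterations);

    return k;
}
```

The choice of 24 fractional bits is a trade-off assumed here for illustration: more fractional bits give finer image detail at deep zoom levels, while the integer bits must be wide enough to hold the largest intermediate value before the divergence test.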

Using the fixed-point version of the algorithm as a baseline, numerous tests were performed in which the same C code was compiled to both the FPGA (in this case the Xilinx Virtex-4 found in the Pico E-12 card) and to an embedded PowerPC processor (also found on the E-12 card, embedded within the Virtex-4). The same 1024 by 768 fractal image was then generated, using successively larger iteration values ranging from 100 (which produced a poor-quality fractal image) to 100,000. At the highest iteration level, the PowerPC running at 400 MHz was unable to complete the calculations in a practical amount of time (generating an image took hours), while the FPGA logic running at just 25 MHz was able to perform the task approximately 18 times faster. This difference in speed is due to the fact that the multiply and other operations are parallelized and pipelined by the Impulse C compiler.

Cranking it Up for Higher Performance

To achieve even higher levels of performance, we then created two parallel processes, each operating on alternate lines of the generated image. This was possible because there was adequate space available in the FPGA for multiple instances of the algorithm, which is by its nature scalable with little or no performance penalty. (In fact, there is room in the LX device for as many as four parallel instances of this process.) This doubling of FPGA resources resulted in a corresponding doubling of performance. At this time we also began some relatively straightforward optimizations of the C code, including the addition of Impulse C Pipeline and StageDelay pragmas, which increased the algorithm’s cycle-by-cycle throughput and doubled its clock speed (from 25 MHz to 50 MHz), leading to the final results shown in Figure 5. As the chart shows, at 10,000 iterations the algorithm running in the FPGA is roughly 40% faster than the Pentium (with its floating point unit) when I/O is excluded, even though the Pentium’s 3.6 GHz clock is over 70 times faster. And when compared to the embedded PowerPC processor, the FPGA code shows an impressive 147X increase in performance (not including I/O overhead), again while running at a significantly slower clock rate.
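The alternate-line partitioning can be modeled in ordinary C as follows. This is a software sketch of the idea only, not the Impulse C process code used in the benchmark: the render_rows function, its parameters, and the small image dimensions are hypothetical, and the two "processes" are represented by two calls that each cover every second row.

```c
#include <assert.h>

#define XSIZE 16            /* small hypothetical image for illustration */
#define YSIZE 8
#define MAX_ITERATIONS 64

/* Iteration count for one point, as in the floating-point listing. */
static int escape_count(double c_real, double c_imag)
{
    double z_real = 0.0, z_imag = 0.0, result;
    int k = 0;
    do {
        double tmp = z_real * z_real - z_imag * z_imag + c_real;
        z_imag = 2.0 * z_real * z_imag + c_imag;
        z_real = tmp;
        result = z_real * z_real + z_imag * z_imag;
        k++;
    } while (result < 4.0 && k < MAX_ITERATIONS);
    return k;
}

/* Render rows first_row, first_row + stride, first_row + 2*stride, ...
 * Coordinates are computed from the row and column indices rather than
 * accumulated, so each row is independent of every other row. */
void render_rows(unsigned char *image, int first_row, int stride,
                 double xmin, double ymax, double dx, double dy)
{
    assert(first_row >= 0 && stride > 0);
    for (int j = first_row; j < YSIZE; j += stride) {
        double c_imag = ymax - j * dy;
        for (int i = 0; i < XSIZE; i++) {
            int k = escape_count(xmin + i * dx, c_imag);
            unsigned char gray = 0;
            if (k != MAX_ITERATIONS)
                gray = (unsigned char)(k > 255 ? 255 : k);
            image[j * XSIZE + i] = gray;
        }
    }
}
```

Calling render_rows with (first_row, stride) set to (0, 2) and (1, 2) fills the same image as a single call with (0, 1). Because neither call reads anything the other writes, the two halves can run side by side, which is what makes the algorithm scale with additional hardware instances.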

                          Maximum iterations
                          100      1000     2000     10000     100000    Acceleration (10K iterations)
FPGA, 50 MHz, w/o I/O     0.71     3.03     4.69     16.48     144.71    147X
FPGA, 50 MHz, with I/O    15.61    15.47    15.32    22.74     149.40    106X
Pentium, 3.6 GHz          0.64     2.51     5.32     23.11     199.55    104X
PPC405, 400 MHz           24.20    241.83   483.67   2418.30   n/a       1X

Notes:
Results obtained on Xilinx Virtex-4 (LX and FX), Pico E-12 card, CoDeveloper Version 2.01

Figure 5. Test results for a range of maximum iteration values demonstrate substantial speedup of the algorithm (147X when using two parallel processes) compared to an embedded processor implementation.
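The acceleration column in Figure 5 appears to be the ratio of the PPC405 run time to each target's run time at 10,000 iterations. The small helper below, whose name speedup is hypothetical, shows that arithmetic under the assumption that all times in the chart are in the same units.

```c
#include <assert.h>

/* Speedup of a target relative to a baseline, given run times
 * measured in the same units. */
double speedup(double baseline_time, double target_time)
{
    assert(target_time > 0.0);
    return baseline_time / target_time;
}
```

With the 10,000-iteration values from the chart, speedup(2418.30, 16.48) is about 146.7 for the FPGA without I/O, speedup(2418.30, 22.74) is about 106.3 for the FPGA with I/O, and speedup(2418.30, 23.11) is about 104.6 for the Pentium, in line with the 147X, 106X and 104X entries.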

Lastly, the algorithm was modified in such a way that the results of the image generation (a 1024 x 768 collection of pixel values) could be transmitted via the Pico E-12 CompactFlash interface, as an Impulse C stream, for display on the PC. As the chart shows, this communication overhead has a significant impact at lower iteration values, but a minimal impact at higher iteration counts. This suggests that computationally intensive problems are well suited for FPGA acceleration, but problems requiring a relatively high ratio of data communication to computation may be less well suited.
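One way to quantify this effect is to compare the two FPGA rows of Figure 5 and treat the difference between them as time attributable to I/O. The helper below is a sketch of that calculation; the function name io_overhead_fraction is hypothetical, and reading the difference as pure communication time is an assumption.

```c
#include <assert.h>

/* Fraction of the total run time attributable to I/O, given the run
 * time measured with I/O and the run time measured without it. */
double io_overhead_fraction(double time_with_io, double time_without_io)
{
    assert(time_with_io > 0.0 && time_with_io >= time_without_io);
    return (time_with_io - time_without_io) / time_with_io;
}
```

Using the chart values, io_overhead_fraction(15.61, 0.71) is about 0.95 at 100 iterations, while io_overhead_fraction(149.40, 144.71) is about 0.03 at 100,000 iterations. In other words, communication dominates the short runs and becomes a small share of the long ones.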

Summary

This article has demonstrated how it is possible, using an FPGA-based accelerator card and C-to-hardware tools, to create highly accelerated systems without the need for low-level hardware design skills. Cards such as the E-12 promise to revolutionize the way that FPGA devices are applied for high-performance embedded computing. Software-to-hardware tools such as Impulse C make programming for such cards practical and efficient.