Developers have long been intrigued by the potential of reconfigurable computing (RC) to accelerate some computationally-intensive high-performance computing (HPC) applications. But the barriers to achieving the order-of-magnitude performance gains RC can theoretically provide are well known: the complexity of programming for RC devices and the limitations of the hardware and software traditionally used to support them. As a result, software developers have focused on fine tuning applications to run faster on standard microprocessors, and have achieved important percentage gains.
Now, emerging systems like the Cray XD1 are bringing RC application acceleration into the real world and laying the groundwork to make order of magnitude performance gains a reality. At Cray, we wanted to see just how far we could push this technology.
We employed the Cray XD1 System’s Xilinx FPGA coprocessors to accelerate the Smith-Waterman life sciences application, and explored several techniques to optimize the implementation. Following is a description of our experience. While these steps describe Smith-Waterman specifically, the process we followed and the techniques we employed are broadly applicable to accelerating other applications.
Starting with the Right Application
The first requirement in exploiting RC for HPC applications is determining if your application is well suited to acceleration. We found the Smith-Waterman algorithm—a life sciences application for comparing DNA and amino acid sequences against known genes and proteins—to be an ideal candidate. (The version of the algorithm we chose was the ‘ssearch34’ program from the FASTA sequence comparison package, http://fasta.bioch.virginia.edu/.)
When run on a conventional system, Smith-Waterman spends as much as 98.6 percent of its time repeating a small kernel of code. (The kernel calculates a scoring matrix to compare query sequences against database sequences. Theoretically, an accelerated system could calculate many scoring matrix values in parallel.) This kernel is also extremely stable, so the effort required to accelerate the algorithm won’t be rendered useless by frequent code changes (a real possibility with some applications). And, with so much time spent repeating the same calculation, Smith-Waterman is very well suited for parallelization.
Smith-Waterman also offered a high degree of instruction efficiency. The application’s basic data types (sequences of characters representing nucleotides or amino acids) can be represented by as little as two to five bits. Additionally, all of the computations can be implemented using 24-bit integers. A conventional 64-bit processor would devote all its processing power to these simple equations, whereas an FPGA can use only the logic required, and devote the remaining resources to exploit more parallelism.
RC-friendly applications also rely primarily on local data (so the FPGA is not required to continually access external memory at lower bandwidths). In the Smith-Waterman kernel, calculations are based largely on the results of neighboring processing elements. In short, Smith-Waterman presented an ideal candidate.
A High-bandwidth Architecture
Once you’ve determined that your application is a good candidate for acceleration, you need a system capable of delivering it—one that provides close coupling between the conventional processor and the FPGA, and high bandwidth between the FPGA and its memory. The Cray XD1 system architecture places the FPGA ‘inside the box’ with a high-bandwidth, low-latency internal network connecting the FPGA to the AMD Opteron processors. The FPGAs are tightly integrated into the Opteron’s address space, so the system treats the acceleration FPGAs as co-processors to the Opteron. (See Figures 1 and 2.)
Figure 1. Cray XD1 System Chassis
Figure 2. FPGA Acceleration Processor
Implementing the Algorithm on the FPGA
Deciding how to partition the software and hardware implementation of an algorithm is a complex process, guided by the unique code profile and its memory and bandwidth requirements. Within the ssearch34 code, it was immediately clear that one function (FLOCAL_ALIGN) was an ideal candidate for acceleration. (See Figure 3.)
Figure 3. Smith-Waterman Algorithm Software-Only Code Profile
To understand our next steps, let’s review the algorithm itself. In order to gauge how closely a query string matches a database string, Smith-Waterman produces an alignment array and calculates a score for each square in the matrix. Scores are calculated through a combinatorial function that incorporates the scores of three surrounding squares. (See Figure 4.)
Figure 4. Smith-Waterman Algorithm Example
Whereas a serial processor would calculate one square of the matrix per clock cycle, the FPGA can be designed as a linear systolic array. One query character is preloaded into each processing element, which then applies the algorithm’s equations to calculate the score for that square. (See Figure 5.)
Figure 5. Smith-Waterman Pipeline
Our design uses 48 elements. When arranged in a single pipeline, the system can calculate 48 squares of the scoring matrix (or the results of 48 columns) each clock cycle. (See Figure 6.)
Figure 6. Calculating Smith-Waterman Scores
Once we had created and verified this basic implementation, it became apparent that we could employ a number of a number of optimizations to greatly improve the design’s performance.
Dual Input Buffers
While the latency of the Cray XD1 system architecture is quite low, there is still a fixed cost (approximately 600 nanoseconds) associated with sending data to and from the FPGA. This latency is virtually unnoticeable when calculating longer database strings, but it becomes a major factor in short strings (which make up a significant portion of real-world work).
To reduce the impact of this latency, we implemented two input FIFO buffers to store database strings for processing by the FPGA. The system processes the first database string from one FIFO buffer, while writing a second string to the second FIFO buffer. This overlap allows the design to use all available bandwidth, and makes a huge difference (millions of cell updates per second) when processing short database strings.
Pipelined Processing Elements
As discussed, Smith-Waterman performs a combinatorial function to calculate one square of the array, using scores that have already been calculated (and stored in state registers) from surrounding squares. Ultimately, the performance of the processing elements (and the overall performance of the system) depends on how long it takes the logic to perform these calculations.
One common way to enhance FPGA performance is to pipeline logic wherever possible—in this case, to split the combinatorial function into two smaller, staged functions, separated by registers. Since each function becomes smaller, the system has less delay, and a faster maximum clock speed—approximately double that of a non-pipelined design. However, there is a catch: Since a score in any column depends on the final score of the previous column (not just one stage of the function), this implementation now takes two clocks to produce a given score.
To circumvent this problem, processing elements in our design work on two scores at once. Each clock cycle, the first stage of the pipelined combinatorial function [ fa()] calculates the intermediate results for a square in one section of the array, while the second stage of the function [ fb()] completes the calculation for another section of the array (using fa()‘s results from the previous clock cycle). In practice, each processing element calculates two columns simultaneously, and produces a valid score for each column on alternating clock cycles—while preserving the clock speed improvements.
As described above, a dual input buffer design can offer major performance improvement for short database sequences. However, the strategy has one complication: The technique functions perfectly when Smith-Waterman is comparing non-DNA sequences, since each sequence is compared against all databases. But when working with DNA, the algorithm compares both the query sequence and its complement (the same sequence in reverse) to each database successively. As a result, short DNA queries can’t benefit from the dual input buffers, and suffer from the fixed latency issue.
We solved this problem by taking advantage of the pipelined processing element’s ability to process two queries at once. For these queries, instead of the processing element working on two sections of the array each clock, it works on the original DNA sequence and its complement.
Reduced Score Size
Surprising as it seems to many software designers, the speed of any FPGA processing element depends on the size of the score value being calculated. (Scores stored as larger integers require more logic—and thus, are larger and slower—than scores stored as smaller integers.) When processing elements require less logic, designers can place more elements on the FPGA, and achieve higher clock frequencies and better performance.
To achieve peak FPGA performance, the system design ideally should be based on the smallest score value possible. By designing our Smith-Waterman implementation with 24-bit integers (sufficient to store the maximum possible score of the algorithm) we were able to use 16 more processing elements than would have been possible with 32 bits, and run approximately 20 MHz faster.
Theoretically, even better performance could be achieved using even smaller score values. But smaller score values may not be sufficient to hold the maximum values possible. And, depending on the application, the extra effort and system resources required to cope with occasional overruns may not be worth the cost.
Accelerating the Application
Using these techniques, we substantially increased the performance of the Smith-Waterman algorithm. As the code profile shows (Figure 7), the percentage of time spent on the application kernel was cut by nearly a third, with most of the “do_work” function consisting of calls to the FPGA. For the data set shown below, the absolute time spent on the algorithm dropped from 6,461 seconds to a little over 100 seconds—approximately 64 times faster than the equivalent software-only implementation.
Figure 7. Accelerated Code Profile
While performance gains like this are impressive, what’s more impressive is the fact that they can be duplicated in a wide range of applications. With the right code, the right system architecture, and a little ingenuity in the implementation, system designers can use any of the optimization strategies described above (and others) to extend RC performance improvements into the real world.
Steve Margerm is the team leader for reconfigurable computing on the Cray XD1 system. Prior to joining Cray, Margerm worked in the telecommunications and networking fields for more than 10 years.