feature article
Subscribe Now

Reconfigurable Computing in Real-World Applications

Developers have long been intrigued by the potential of reconfigurable computing (RC) to accelerate some computationally-intensive high-performance computing (HPC) applications. But the barriers to achieving the order-of-magnitude performance gains RC can theoretically provide are well known: the complexity of programming for RC devices and the limitations of the hardware and software traditionally used to support them. As a result, software developers have focused on fine tuning applications to run faster on standard microprocessors, and have achieved important percentage gains.

Now, emerging systems like the Cray XD1 are bringing RC application acceleration into the real world and laying the groundwork to make order of magnitude performance gains a reality. At Cray, we wanted to see just how far we could push this technology.

We employed the Cray XD1 System’s Xilinx FPGA coprocessors to accelerate the Smith-Waterman life sciences application, and explored several techniques to optimize the implementation. Following is a description of our experience. While these steps describe Smith-Waterman specifically, the process we followed and the techniques we employed are broadly applicable to accelerating other applications.

Starting with the Right Application

The first requirement in exploiting RC for HPC applications is determining if your application is well suited to acceleration. We found the Smith-Waterman algorithm—a life sciences application for comparing DNA and amino acid sequences against known genes and proteins—to be an ideal candidate. (The version of the algorithm we chose was the ‘ssearch34’ program from the FASTA sequence comparison package, http://fasta.bioch.virginia.edu/.)

When run on a conventional system, Smith-Waterman spends as much as 98.6 percent of its time repeating a small kernel of code. (The kernel calculates a scoring matrix to compare query sequences against database sequences. Theoretically, an accelerated system could calculate many scoring matrix values in parallel.) This kernel is also extremely stable, so the effort required to accelerate the algorithm won’t be rendered useless by frequent code changes (a real possibility with some applications). And, with so much time spent repeating the same calculation, Smith-Waterman is very well suited for parallelization.

Smith-Waterman also offered a high degree of instruction efficiency. The application’s basic data types (sequences of characters representing nucleotides or amino acids) can be represented by as little as two to five bits. Additionally, all of the computations can be implemented using 24-bit integers. A conventional 64-bit processor would devote all its processing power to these simple equations, whereas an FPGA can use only the logic required, and devote the remaining resources to exploit more parallelism.

RC-friendly applications also rely primarily on local data (so the FPGA is not required to continually access external memory at lower bandwidths). In the Smith-Waterman kernel, calculations are based largely on the results of neighboring processing elements. In short, Smith-Waterman presented an ideal candidate.

A High-bandwidth Architecture

Once you’ve determined that your application is a good candidate for acceleration, you need a system capable of delivering it—one that provides close coupling between the conventional processor and the FPGA, and high bandwidth between the FPGA and its memory. The Cray XD1 system architecture places the FPGA ‘inside the box’ with a high-bandwidth, low-latency internal network connecting the FPGA to the AMD Opteron processors. The FPGAs are tightly integrated into the Opteron’s address space, so the system treats the acceleration FPGAs as co-processors to the Opteron. (See Figures 1 and 2.)


Figure 1. Cray XD1 System Chassis


Figure 2. FPGA Acceleration Processor

Implementing the Algorithm on the FPGA

Deciding how to partition the software and hardware implementation of an algorithm is a complex process, guided by the unique code profile and its memory and bandwidth requirements. Within the ssearch34 code, it was immediately clear that one function (FLOCAL_ALIGN) was an ideal candidate for acceleration. (See Figure 3.)


Figure 3. Smith-Waterman Algorithm Software-Only Code Profile

To understand our next steps, let’s review the algorithm itself. In order to gauge how closely a query string matches a database string, Smith-Waterman produces an alignment array and calculates a score for each square in the matrix. Scores are calculated through a combinatorial function that incorporates the scores of three surrounding squares. (See Figure 4.)


Figure 4. Smith-Waterman Algorithm Example

Whereas a serial processor would calculate one square of the matrix per clock cycle, the FPGA can be designed as a linear systolic array. One query character is preloaded into each processing element, which then applies the algorithm’s equations to calculate the score for that square. (See Figure 5.)


Figure 5. Smith-Waterman Pipeline

Our design uses 48 elements. When arranged in a single pipeline, the system can calculate 48 squares of the scoring matrix (or the results of 48 columns) each clock cycle. (See Figure 6.)


Figure 6. Calculating Smith-Waterman Scores

Once we had created and verified this basic implementation, it became apparent that we could employ a number of a number of optimizations to greatly improve the design’s performance.

Dual Input Buffers

While the latency of the Cray XD1 system architecture is quite low, there is still a fixed cost (approximately 600 nanoseconds) associated with sending data to and from the FPGA. This latency is virtually unnoticeable when calculating longer database strings, but it becomes a major factor in short strings (which make up a significant portion of real-world work).

To reduce the impact of this latency, we implemented two input FIFO buffers to store database strings for processing by the FPGA. The system processes the first database string from one FIFO buffer, while writing a second string to the second FIFO buffer. This overlap allows the design to use all available bandwidth, and makes a huge difference (millions of cell updates per second) when processing short database strings.

Pipelined Processing Elements

As discussed, Smith-Waterman performs a combinatorial function to calculate one square of the array, using scores that have already been calculated (and stored in state registers) from surrounding squares. Ultimately, the performance of the processing elements (and the overall performance of the system) depends on how long it takes the logic to perform these calculations.

One common way to enhance FPGA performance is to pipeline logic wherever possible—in this case, to split the combinatorial function into two smaller, staged functions, separated by registers. Since each function becomes smaller, the system has less delay, and a faster maximum clock speed—approximately double that of a non-pipelined design. However, there is a catch: Since a score in any column depends on the final score of the previous column (not just one stage of the function), this implementation now takes two clocks to produce a given score.

To circumvent this problem, processing elements in our design work on two scores at once. Each clock cycle, the first stage of the pipelined combinatorial function [ fa()] calculates the intermediate results for a square in one section of the array, while the second stage of the function [ fb()] completes the calculation for another section of the array (using fa()‘s results from the previous clock cycle). In practice, each processing element calculates two columns simultaneously, and produces a valid score for each column on alternating clock cycles—while preserving the clock speed improvements.

Dual Queries

As described above, a dual input buffer design can offer major performance improvement for short database sequences. However, the strategy has one complication: The technique functions perfectly when Smith-Waterman is comparing non-DNA sequences, since each sequence is compared against all databases. But when working with DNA, the algorithm compares both the query sequence and its complement (the same sequence in reverse) to each database successively. As a result, short DNA queries can’t benefit from the dual input buffers, and suffer from the fixed latency issue.

We solved this problem by taking advantage of the pipelined processing element’s ability to process two queries at once. For these queries, instead of the processing element working on two sections of the array each clock, it works on the original DNA sequence and its complement.

Reduced Score Size

Surprising as it seems to many software designers, the speed of any FPGA processing element depends on the size of the score value being calculated. (Scores stored as larger integers require more logic—and thus, are larger and slower—than scores stored as smaller integers.) When processing elements require less logic, designers can place more elements on the FPGA, and achieve higher clock frequencies and better performance.

To achieve peak FPGA performance, the system design ideally should be based on the smallest score value possible. By designing our Smith-Waterman implementation with 24-bit integers (sufficient to store the maximum possible score of the algorithm) we were able to use 16 more processing elements than would have been possible with 32 bits, and run approximately 20 MHz faster.

Theoretically, even better performance could be achieved using even smaller score values. But smaller score values may not be sufficient to hold the maximum values possible. And, depending on the application, the extra effort and system resources required to cope with occasional overruns may not be worth the cost.

Accelerating the Application

Using these techniques, we substantially increased the performance of the Smith-Waterman algorithm. As the code profile shows (Figure 7), the percentage of time spent on the application kernel was cut by nearly a third, with most of the “do_work” function consisting of calls to the FPGA. For the data set shown below, the absolute time spent on the algorithm dropped from 6,461 seconds to a little over 100 seconds—approximately 64 times faster than the equivalent software-only implementation.


Figure 7. Accelerated Code Profile

While performance gains like this are impressive, what’s more impressive is the fact that they can be duplicated in a wide range of applications. With the right code, the right system architecture, and a little ingenuity in the implementation, system designers can use any of the optimization strategies described above (and others) to extend RC performance improvements into the real world.

Steve Margerm is the team leader for reconfigurable computing on the Cray XD1 system. Prior to joining Cray, Margerm worked in the telecommunications and networking fields for more than 10 years.

7 thoughts on “Reconfigurable Computing in Real-World Applications”

  1. Pingback: cpns kemenkumham
  2. Pingback: GVK Biosciences
  3. Pingback: friv 1
  4. Pingback: DMPK ADME

Leave a Reply

featured blogs
Sep 30, 2022
When I wrote my book 'Bebop to the Boolean Boogie,' it was certainly not my intention to lead 6-year-old boys astray....
Sep 30, 2022
Wow, September has flown by. It's already the last Friday of the month, the last day of the month in fact, and so time for a monthly update. Kaufman Award The 2022 Kaufman Award honors Giovanni (Nanni) De Micheli of École Polytechnique Fédérale de Lausanne...
Sep 29, 2022
We explain how silicon photonics uses CMOS manufacturing to create photonic integrated circuits (PICs), solid state LiDAR sensors, integrated lasers, and more. The post What You Need to Know About Silicon Photonics appeared first on From Silicon To Software....

featured video

PCIe Gen5 x16 Running on the Achronix VectorPath Accelerator Card

Sponsored by Achronix

In this demo, Achronix engineers show the VectorPath Accelerator Card successfully linking up to a PCIe Gen5 x16 host and write data to and read data from GDDR6 memory. The VectorPath accelerator card featuring the Speedster7t FPGA is one of the first FPGAs that can natively support this interface within its PCIe subsystem. Speedster7t FPGAs offer a revolutionary new architecture that Achronix developed to address the highest performance data acceleration challenges.

Click here for more information about the VectorPath Accelerator Card

featured chalk talk

TE APL: Flexibility for Any Use

Sponsored by Mouser Electronics and TE Connectivity

Connectors can make a big difference when it comes to reducing system complexity and ease of use but did you know they can also help with automation and sustainability as well? In this episode of Chalk Talk, Amelia Dalton and Anita Costamagna from TE discuss TE’s APL Connectivity solutions. They dig into the details of these connector solutions and how you can get started using these connector solutions in your next design.

Click here for more information about TE Connectivity Appliance Solutions