feature article
Subscribe Now

Reconfigurable Computing in Real-World Applications

Developers have long been intrigued by the potential of reconfigurable computing (RC) to accelerate some computationally-intensive high-performance computing (HPC) applications. But the barriers to achieving the order-of-magnitude performance gains RC can theoretically provide are well known: the complexity of programming for RC devices and the limitations of the hardware and software traditionally used to support them. As a result, software developers have focused on fine tuning applications to run faster on standard microprocessors, and have achieved important percentage gains.

Now, emerging systems like the Cray XD1 are bringing RC application acceleration into the real world and laying the groundwork to make order of magnitude performance gains a reality. At Cray, we wanted to see just how far we could push this technology.

We employed the Cray XD1 System’s Xilinx FPGA coprocessors to accelerate the Smith-Waterman life sciences application, and explored several techniques to optimize the implementation. Following is a description of our experience. While these steps describe Smith-Waterman specifically, the process we followed and the techniques we employed are broadly applicable to accelerating other applications.

Starting with the Right Application

The first requirement in exploiting RC for HPC applications is determining if your application is well suited to acceleration. We found the Smith-Waterman algorithm—a life sciences application for comparing DNA and amino acid sequences against known genes and proteins—to be an ideal candidate. (The version of the algorithm we chose was the ‘ssearch34’ program from the FASTA sequence comparison package, http://fasta.bioch.virginia.edu/.)

When run on a conventional system, Smith-Waterman spends as much as 98.6 percent of its time repeating a small kernel of code. (The kernel calculates a scoring matrix to compare query sequences against database sequences. Theoretically, an accelerated system could calculate many scoring matrix values in parallel.) This kernel is also extremely stable, so the effort required to accelerate the algorithm won’t be rendered useless by frequent code changes (a real possibility with some applications). And, with so much time spent repeating the same calculation, Smith-Waterman is very well suited for parallelization.

Smith-Waterman also offered a high degree of instruction efficiency. The application’s basic data types (sequences of characters representing nucleotides or amino acids) can be represented by as little as two to five bits. Additionally, all of the computations can be implemented using 24-bit integers. A conventional 64-bit processor would devote all its processing power to these simple equations, whereas an FPGA can use only the logic required, and devote the remaining resources to exploit more parallelism.

RC-friendly applications also rely primarily on local data (so the FPGA is not required to continually access external memory at lower bandwidths). In the Smith-Waterman kernel, calculations are based largely on the results of neighboring processing elements. In short, Smith-Waterman presented an ideal candidate.

A High-bandwidth Architecture

Once you’ve determined that your application is a good candidate for acceleration, you need a system capable of delivering it—one that provides close coupling between the conventional processor and the FPGA, and high bandwidth between the FPGA and its memory. The Cray XD1 system architecture places the FPGA ‘inside the box’ with a high-bandwidth, low-latency internal network connecting the FPGA to the AMD Opteron processors. The FPGAs are tightly integrated into the Opteron’s address space, so the system treats the acceleration FPGAs as co-processors to the Opteron. (See Figures 1 and 2.)

20060207_cray_fig1.jpg

Figure 1. Cray XD1 System Chassis

20060207_cray_fig2.jpg

Figure 2. FPGA Acceleration Processor

Implementing the Algorithm on the FPGA

Deciding how to partition the software and hardware implementation of an algorithm is a complex process, guided by the unique code profile and its memory and bandwidth requirements. Within the ssearch34 code, it was immediately clear that one function (FLOCAL_ALIGN) was an ideal candidate for acceleration. (See Figure 3.)

20060207_cray_fig3.jpg

Figure 3. Smith-Waterman Algorithm Software-Only Code Profile

To understand our next steps, let’s review the algorithm itself. In order to gauge how closely a query string matches a database string, Smith-Waterman produces an alignment array and calculates a score for each square in the matrix. Scores are calculated through a combinatorial function that incorporates the scores of three surrounding squares. (See Figure 4.)

20060207_cray_fig4.jpg

Figure 4. Smith-Waterman Algorithm Example

Whereas a serial processor would calculate one square of the matrix per clock cycle, the FPGA can be designed as a linear systolic array. One query character is preloaded into each processing element, which then applies the algorithm’s equations to calculate the score for that square. (See Figure 5.)

 20060207_cray_fig5.jpg

Figure 5. Smith-Waterman Pipeline

Our design uses 48 elements. When arranged in a single pipeline, the system can calculate 48 squares of the scoring matrix (or the results of 48 columns) each clock cycle. (See Figure 6.)

20060207_cray_fig6.jpg

Figure 6. Calculating Smith-Waterman Scores

Once we had created and verified this basic implementation, it became apparent that we could employ a number of a number of optimizations to greatly improve the design’s performance.

Dual Input Buffers

While the latency of the Cray XD1 system architecture is quite low, there is still a fixed cost (approximately 600 nanoseconds) associated with sending data to and from the FPGA. This latency is virtually unnoticeable when calculating longer database strings, but it becomes a major factor in short strings (which make up a significant portion of real-world work).

To reduce the impact of this latency, we implemented two input FIFO buffers to store database strings for processing by the FPGA. The system processes the first database string from one FIFO buffer, while writing a second string to the second FIFO buffer. This overlap allows the design to use all available bandwidth, and makes a huge difference (millions of cell updates per second) when processing short database strings.

Pipelined Processing Elements

As discussed, Smith-Waterman performs a combinatorial function to calculate one square of the array, using scores that have already been calculated (and stored in state registers) from surrounding squares. Ultimately, the performance of the processing elements (and the overall performance of the system) depends on how long it takes the logic to perform these calculations.

One common way to enhance FPGA performance is to pipeline logic wherever possible—in this case, to split the combinatorial function into two smaller, staged functions, separated by registers. Since each function becomes smaller, the system has less delay, and a faster maximum clock speed—approximately double that of a non-pipelined design. However, there is a catch: Since a score in any column depends on the final score of the previous column (not just one stage of the function), this implementation now takes two clocks to produce a given score.

To circumvent this problem, processing elements in our design work on two scores at once. Each clock cycle, the first stage of the pipelined combinatorial function [ fa()] calculates the intermediate results for a square in one section of the array, while the second stage of the function [ fb()] completes the calculation for another section of the array (using fa()‘s results from the previous clock cycle). In practice, each processing element calculates two columns simultaneously, and produces a valid score for each column on alternating clock cycles—while preserving the clock speed improvements.

Dual Queries

As described above, a dual input buffer design can offer major performance improvement for short database sequences. However, the strategy has one complication: The technique functions perfectly when Smith-Waterman is comparing non-DNA sequences, since each sequence is compared against all databases. But when working with DNA, the algorithm compares both the query sequence and its complement (the same sequence in reverse) to each database successively. As a result, short DNA queries can’t benefit from the dual input buffers, and suffer from the fixed latency issue.

We solved this problem by taking advantage of the pipelined processing element’s ability to process two queries at once. For these queries, instead of the processing element working on two sections of the array each clock, it works on the original DNA sequence and its complement.

Reduced Score Size

Surprising as it seems to many software designers, the speed of any FPGA processing element depends on the size of the score value being calculated. (Scores stored as larger integers require more logic—and thus, are larger and slower—than scores stored as smaller integers.) When processing elements require less logic, designers can place more elements on the FPGA, and achieve higher clock frequencies and better performance.

To achieve peak FPGA performance, the system design ideally should be based on the smallest score value possible. By designing our Smith-Waterman implementation with 24-bit integers (sufficient to store the maximum possible score of the algorithm) we were able to use 16 more processing elements than would have been possible with 32 bits, and run approximately 20 MHz faster.

Theoretically, even better performance could be achieved using even smaller score values. But smaller score values may not be sufficient to hold the maximum values possible. And, depending on the application, the extra effort and system resources required to cope with occasional overruns may not be worth the cost.

Accelerating the Application

Using these techniques, we substantially increased the performance of the Smith-Waterman algorithm. As the code profile shows (Figure 7), the percentage of time spent on the application kernel was cut by nearly a third, with most of the “do_work” function consisting of calls to the FPGA. For the data set shown below, the absolute time spent on the algorithm dropped from 6,461 seconds to a little over 100 seconds—approximately 64 times faster than the equivalent software-only implementation.

20060207_cray_fig7.jpg

Figure 7. Accelerated Code Profile

While performance gains like this are impressive, what’s more impressive is the fact that they can be duplicated in a wide range of applications. With the right code, the right system architecture, and a little ingenuity in the implementation, system designers can use any of the optimization strategies described above (and others) to extend RC performance improvements into the real world.

Steve Margerm is the team leader for reconfigurable computing on the Cray XD1 system. Prior to joining Cray, Margerm worked in the telecommunications and networking fields for more than 10 years.

7 thoughts on “Reconfigurable Computing in Real-World Applications”

  1. Pingback: cpns kemenkumham
  2. Pingback: GVK Biosciences
  3. Pingback: friv 1
  4. Pingback: DMPK ADME

Leave a Reply

featured blogs
Dec 3, 2021
Believe it or not, I ran into John (he told me I could call him that) at a small café just a couple of evenings ago as I pen these words....
Dec 3, 2021
The annual Design Automation Conference (DAC) is coming up December 5th to 9th, next week. It is in-person in San Francisco's Moscone Center West. It will be available virtually from December... [[ Click on the title to access the full blog on the Cadence Community site...
Dec 1, 2021
We discuss semiconductor lithography and the importance of women in engineering with Mariya Braylovska, Director of R&D for Custom Design & Manufacturing. The post Q&A with Mariya Braylovska, R&D Director, on the Joy of Solving Technical Challenges with a...
Nov 8, 2021
Intel® FPGA Technology Day (IFTD) is a free four-day event that will be hosted virtually across the globe in North America, China, Japan, EMEA, and Asia Pacific from December 6-9, 2021. The theme of IFTD 2021 is 'Accelerating a Smart and Connected World.' This virtual event ...

featured video

Imagination Uses Cadence Digital Full Flow for GPU Development

Sponsored by Cadence Design Systems

Learn how Imagination Technologies uses the latest Cadence digital design and simulation solutions to deliver leading-edge GPU technology for automotive, mobile, and data center products.

Click here to learn more about Cadence’s digital design and signoff solutions

featured paper

Enable faster real-time control with high-precision position sensors

Sponsored by Texas Instruments

The demand for highly automated industrial systems is continuing to increase and often requires advanced, reliable position sensing solutions to control equipment performance and collect factory-level data. Learn how the industry’s most accurate linear 3D Hall-effect position sensor, the TMAG5170, enables faster real-time control and increased diagnostic capabilities to help design safer, more reliable, automated industrial systems.

Click to read more

featured chalk talk

How Trinamic's Stepper Motor Technologies Improve Your Application

Sponsored by Mouser Electronics and Analog Devices

Stepper motor control has come a long way in the past few years. New techniques can give greater control, smoother operation, greater torque, and better efficiency. In this episode of Chalk Talk, Amelia Dalton chats with Lars Jaskulski about Trinamic stepper solutions and how to take advantage of micro stepping, load measurement, and more.

Click here for more information about Trinamic TMCM-6110 6-Axis Stepper Motor Driver Board