Computational Bottlenecks and Hardware Decisions for FPGAs

The High Performance Computing (HPC) community recognized the inherent limits of serial processing long ago. In the drive to continually improve the performance of HPC codes, programmers have explored a variety of alternatives, including employing new types of processors, coupling multiple processors, and parallel processing. Initially, these strategies were hindered by slow inter-processor communications and limited ability to parallelize algorithms. As a result, early attempts at alternatives to serial computation could employ only tens of processors. Over time, several innovations (including higher-bandwidth inter-processor links, algorithmic improvements that reduced the amount of data sent across those links, coarse-grain parallelism, and the emergence of lower-cost microprocessors) allowed programmers to effectively employ hundreds of microprocessors. Current state-of-the-art systems can apply thousands of microprocessors with high-bandwidth, low-latency interconnects to the most challenging HPC problems. Nevertheless, even these systems have limitations for certain types of HPC applications. Eventually, the extra overhead required for parallel processing overcomes the benefits that the additional processors provide, and the performance gains that one would expect approach an asymptotic limit.

To address this problem and attempt to push the performance of HPC applications even further, HPC programmers once again are looking to new types of processors, such as Field Programmable Gate Arrays (FPGA). Unlike previous approaches, which relied on coarse-grain parallelism across many microprocessors, FPGAs provide an opportunity for massive fine-grain parallelism and pipelining within individual FPGA processors. FPGAs allow users to build a temporary custom circuit that will solve an explicit problem. Since the size, number, and type of function units (for example, add, sub, mult, div) are defined by the programmer, algorithms can be parallelized on the instruction level. With substantial local block RAM (BRAM) available, programmers can have many of these instructions execute on the same clock cycle. In this way, users can achieve tremendous performance gains on certain types of problems for which traditional parallel processing techniques have offered few advantages.

However, FPGA processors carry their own costs and present programmers with their own tradeoffs. For example, FPGAs are not well suited for performing conventional microprocessor tasks (such as running the operating system (OS) and performing mandatory serial work such as connecting to networks and writing/reading data to/from disk drives), so a conventional serial microprocessor is still required, with the FPGA functioning as a co-processor. FPGAs can indeed accelerate HPC applications in such an architecture, but the advantages that the FPGA provides will depend upon several factors, including where the FPGA is attached relative to the serial microprocessor, and the bandwidth limitations of both the algorithm and the hardware. To accurately predict the value of employing FPGA acceleration for a given application, programmers need to carefully consider the algorithm being implemented to evaluate how well it performs on both the microprocessor and FPGA, and to assess the time it takes to move data across the interface and back. After all, the FPGA may be hundreds of times faster than a serial microprocessor, but if you can’t get your data there and back fast enough, it’s not worth using. Let’s explore some simple strategies for evaluating whether a given HPC application will benefit from FPGA acceleration.

Hardware Decisions

The first consideration in assessing the value of FPGA acceleration for an application is to understand how the FPGA will be attached to the system, and the inherent limitations different architectures may present. FPGAs have been around for a long time and were typically used in embedded applications, replacing groups of components with a single FPGA. As FPGAs grew in size, engineers started putting groups of them on dedicated circuit boards and attached them to standard microprocessor-based systems. This allowed engineers to use the flexibility of a serial microprocessor for more conventional tasks, while offloading work to the attached board. The FPGA board might filter data coming into the system and then send the results to the microprocessor for further analysis. Situated on an attached board plugged into a relatively slow VME or PCI bus (and unable to effectively support HPC programmers’ preferred 64-bit arithmetic), FPGAs were not applicable for most HPC applications. Recently though, hardware vendors have been addressing these issues.

The current crop of FPGAs from chip vendors offers enough logic to support tens of 64-bit function units. For example, by dividing the logic requirements for the 64-bit floating point cores from Xilinx[1] into the available logic on the Virtex-4 LX200[2], we see that, theoretically, this chip could support about 70 floating point mult units or 128 floating point add units. However, the programmer is still saddled with a bandwidth issue, because their data is located in the microprocessor’s memory and needs to be moved across an interface to the FPGA. To address this problem, vendors have placed FPGAs on much faster PCI-X and HyperTransport (HT) interfaces.

Early FPGA architectures placed the chips on a PCI card and attached them to where the system would normally talk to a network card. PCI runs at 33 MHz and is 32-bits wide, providing a total bandwidth of 133 MB/s—greatly limiting the effectiveness of the attached FPGA for any application that needs to stream data in and out of the FPGA. PCI-X improves the picture, running at 133 MHz and 64-bits wide, yielding a bandwidth of 1064 MB/s. Alternatively, HT is an open protocol for attaching devices directly to the processor. It can support bandwidths up to 3.2 GB/s in each direction. Using HT, the FPGA can be attached directly to the processor with minimal latency and optimal bandwidth.

To illustrate the performance differences among these architectures, let’s consider how each architecture—FPGA connected via PCI interface, FPGA connected via PCI-X, FPGA connected via HT, and a conventional microprocessor architecture—addresses a theoretical problem. Let’s assume we have 10,000 64-bit floating point values in the microprocessor’s memory that we would like to add a floating point constant too. That means we need to transmit a total of 80,000 bytes of input (10,000 x 8 bytes per value) to the FPGA, and a total of 80,000 bytes of output from the FPGA back to the processor’s memory.

Using a PCI interface, it would take 574 microseconds (80,000 bytes / 133*1024*1024 bytes per second) to transmit the data and another 574 microseconds to get the results back, for a total of 1148 microseconds. PCI-X improves the situation with an input and output transfer time of 72 microseconds (80,000 bytes / 1064*1024*1024 bytes per second) for a total round trip of 144 microseconds. Due to an interface (I/O pins) limitation on the FPGA, the fastest HT transfer rate supported is 1.6 GB/s in each direction. (Future generations of FPGA devices should remove this limitation.) Since HT can move data in both directions at the same time, the total input/output transfer time would be 48 microseconds (80,000 bytes / 1600*1024*1024 bytes per second). Now, consider the time it would take to compute those same results in situ on the microprocessor, calculating one 64-bit result per clock cycle and running at 2.5 GHz. In this conventional architecture, the 10,000 adds would take 4 microseconds (10,000 operations / 2500*1024*1024 ops per second), assuming all the data was available every clock cycle. The bandwidth to memory is 6.4 GB/s, so moving the data from memory to the processor would take 12 microseconds (80,000 bytes / 6400*1024*1024 bytes per second).

Figure 1: Processing Times for Various Architectures Using 64-bit Integers

*assuming zero calculation time on the FPGA

For this simple example, the 12 microseconds required to perform the operation on a conventional microprocessor (even when the data is in main memory) is faster than any FPGA solution—even if the FPGAs take zero time to perform the calculations. However, unlike conventional processors, FPGA performance also can be impacted by the data itself. Let’s consider the same example, except using 8-bit integers instead of 64-bit floating point data.

Using 8-bit data reduces the roundtrip transfer times by a factor of eight. As a result, the total processing time becomes 143 microseconds for the PCI architecture (1148/8), 18 microseconds for the PCI-X architecture (144/8), and 6 microseconds for the HT architecture (48/8). Now, the HT-connected FPGA could potentially be twice as fast as the serial microprocessor. However, the true processing time difference would depend on how the microprocessor handles the 8-bit data. It could pack the data, eight to a word, but would then have to perform shifts and masks to access each 8-bit sub-word. That would make the calculation time longer for the microprocessor—still giving the FPGA an advantage.

Application Requirements

Picking a non-computationally intensive algorithm with one word of input, one floating point operation, and one word of output was a poor example for exploiting FPGAs. Obviously, for such a straightforward problem, a conventional microprocessor performs adequately. However, as the computational intensity increases, the advantages of FPGAs become more and more compelling. Using the three interfaces mentioned above (PCI, PCI-X, and HT), let’s calculate the computational intensity required in order to make FPGA acceleration advantageous. Let’s consider the two Basic Linear Algebra Subroutines (BLAS) algorithms, DAXPY and DGEMM, and compare their theoretical performance on a conventional microprocessor-based system with a system using a single FPGA connected via our three interfaces.

DAXPY is the 64-bit vector/vector linear algebra operation of the form:

     for i = 1 to n
       y(i)=a*x(i)+y(i)
     end

where x(i) and y(i) are vectors and a is a scalar. We can see that for every result, we have two inputs (x(i) and y(i)), two operations (mult and add), and one result (y(i)). Using the same trip count of 10,000 values, we can calculate the round trip transfer times for PCI, PCI-X, and HT as 1721 microseconds, 215 microseconds, and 95 microseconds respectively. The typical microprocessor can perform a mult and add in the same clock, so the time will be 4 microseconds (2* 10,000 operations / 2*2500*1024*1024 ops per second). Looking at the data transfer time from memory to the microprocessor, we have 24 microseconds (2*80,000 bytes / 6400*1024*1024 bytes per second). Clearly, the microprocessor is the winner. We need to move less data and perform more operations for the FPGA to be advantageous.

Figure 2: DAXPY Processing Times for Various Architectures

*assuming zero calculation time on the FPGA

To increase the amount of calculations on the FPGA, the algorithm needs to have nested loops. DGEMM is the 64-bit BLAS routine that performs a matrix/matrix multiplication and has the form:
     for i = 1 to n
     for j = 1 to n
         c(i,j) = 0
         for k = 1 to n
           c(i,j) = c(i,j) + a(i,k) * b(k,j)
         end
       end
     end

Here, there are two input matrices (a and b) that are both n by n in size, and one output matrix (c) that is also n by n. So, we have 2n2 values to transfer in and n2 values to transfer out. As for the number of operations, we can see one add and one mult in the inner of the triply nested loops, which yield 2n3 operations. Clearly this routine has a much higher computational intensity.

In this scenario, the roundtrip transfer times for our problem size of 10,000 values (2*80,0002 bytes in and 80,0002 bytes out) using PCI, PCI-X and HT are 138 seconds, 17 seconds, and 8 seconds respectively. The computational time for the operations on the microprocessor (2*10,0003) would be 382 seconds, again assuming two operations per clock. The transfer time to get that data from memory into the microprocessor is 2 seconds (2*80,0002 / 6400*1024*1024). So the microprocessor can perform these operations in 382 seconds.

Figure 3: DGEMM Processing Times for Various Architectures

*assuming zero calculation time on the FPGA

Assuming a zero calculation time on the FPGA, the best possible speedup we can hope for using a PCI interface is 2.8 times faster (382/138). Using a similar calculation for PCI-X and HT yields speedups of 22.4 (382/17) and 47.7 (382/8) times, respectively. From a data transfer point of view, we can see that the FPGA has the potential to offer a substantial speedup over the microprocessor for this type of application. However, we have been assuming a zero FPGA calculation time. Now we need to focus our attention on the actual speed of the FPGA.

FPGA Calculation Speed

To get an idea of how many operations we can compute per clock, we need to understand the memory bandwidth within the FPGA. The Xilinx Virtex-4 LX200 chip has 336 dual-ported 18-bit-by-1024 BRAMs. To achieve one 64-bit number per clock, we need to read twice from each of two BRAMs for a total of four 18-bit values. So, the peak 64-bit word bandwidth from BRAM is 168 words per clock (336/2). For each calculation of the inner loop of the DGEMM routine, we need two input words (a(i,k) and b(k,j)) and one output word (c(i,j)). Dividing 168 by 3 yields 56 trips of the inner loop that could potentially be calculated every clock cycle.

Recall that, theoretically, the LX200 can support only 70 64-bit mults or 128 adds. That figure assumes no other logic on the chip, so a more realistic number would be about 30 64-bit adds and 30 64-bit mults for the LX200. The DGEMM routine had 2*10,0003 operations. Performing 60 operations per clock (with an assumed clock rate for the FPGA at about 200 MHz), it would take 159 seconds (2*10,0003 / 60*200*1024*1024) to perform this calculation. Adding in the HT transfer time of 8 seconds, we get a total FPGA calculation time of 167 seconds. Comparing this to the 382 seconds for the microprocessor yields a more realistic potential speedup of about 2.3 times faster (382/167) for the HyperTransport-connected LX200. (Note that now the PCI-attached board’s speedup has dropped to 1.3 (382/(159+138))—an acceleration that most programmers will feel does not justify the effort required to achieve it.)

Our DGEMM example shows that this algorithm has the potential to be twice as fast on a HT-connected FPGA relative to the microprocessor, however, the more calculations we can add to the FPGA side of the equation, the more attractive the FPGA solution will look. For example, consider an algorithm that has an iterative solver with DGEMM within it. In this scenario, the FPGA could iterate on the bulk of the data—making the computational intensity much higher, and the advantages of FPGA processing even greater. Today, we can see that we are already reaching the capacity limits of the LX200, but Xilinx has already come out with the Virtex-5 LX330, which gives the programmer more room and a faster clock, making the FPGA even more attractive.

For algorithms that perform non-byte aligned calculation, the FPGA will offer a large speedup over any microprocessor. Imagine an algorithm that wants to calculate on groups of bits within each word. The microprocessor would have to perform at least a shift and mask operation to extract each group of bits, whereas the FPGA could examine many groups in parallel. You can see the effects of this on the bioinformatics algorithm Smith-Waterman, where the FPGA achieves large speedups relative to conventional microprocessors.[3] In this example, the author used 5-bit data to represent the amino acid data, and was able to implement the design on the FPGA using only 24-bit integer units, which yielded a 64 times speedup over the microprocessor. Clearly some algorithms are made for FPGA acceleration.

Programmers also may employ other tricks to increase the advantages of the FPGA, such as trying reduced precision to decrease the amount of data transferred and to squeeze more operations per clock cycle out of the FPGA. Using single precision, 32-bits, would halve the amount of data transferred and would allow more than twice as many function units and words in BRAM on the FPGA. (In a paper presented at the 2005 Cray User Group (CUG) meeting, a programmer converted 32-bit single precision floating point data to 16-bit fixed point data. Then, when calculating the Faster Fourier Transform (FFT) on the FPGA, carried an extra bit at every stage of the calculations, resulting in 32-bit fix point results. When these 32-bit fix point results were truncated back down to the 23-bits mantissa used in 32-bit single precision floating point numbers, we observed that the FPGA algorithm had the same accuracy as standard 32-bit floating point results on the microprocessor.[4])We should realize that all of these examples are providing only a simple first-pass calculation. The real work begins in designing the circuit and trying to get it all to fit on the FPGA and run at a reasonable speed.

FPGAs in Practice

The question of whether FPGA acceleration will benefit a given application requires a rigorous but straightforward process. When evaluating an FPGA solution, programmers need to look at several aspects: the bandwidth between the microprocessor’s memory and the FPGA, the size and speed of the FPGA itself, the amount of memory attached to the FPGA, and most importantly, the requirements of the algorithm that will be mapped onto the FPGA. By performing some rudimentary calculations, programmers can get a much clearer picture of the potential benefits of FPGA acceleration for their applications. For example, while the calculations presented here have omitted many details, they still have provided enough information to conclude that coding the DGEMM routine for the FPGA will deliver some speedup, especially when using a HT-attached FPGA. (Programmers who want to experiment with a HT-attached FPGA architecture may wish to look at the Cray XD1 supercomputer with either the Xilinx Virtex-II Pro 50 or Virtex-4 LX160 attached per node, or the DRC Development System 1000 from DRC Computer Corporation with one or more attached Virtex-4 LX60 or LX200.)

After concluding that an FPGA solution can deliver a speedup, programmers need to consider other issues that can impact (and potentially impede) their use of FPGAs. Among these issues is choosing the right development tool. Our simple DGEMM calculation showed that it is possible to get a speedup, but what if the FPGA developer doesn’t know how to write efficient code, or a tool they use cannot extract the ultimate performance from the FPGA itself? The FPGA is only as good as the circuit design it is programmed with. However, programmers must navigate the many FPGA development tool vendors, some of whom are moving from the embedded FPGA market into the HPC/FPGA market, and some of whom are entering the market now for the first time. In a recent FPGA overview paper [5], the authors cited 11 different FPGA development tools available today, and even that list was not comprehensive.

At the lowest level, programmers can develop the design in a Hardware Description Language (HDL) such as VHDL or Verilog, which offers the most control over the design and, for experienced programmers, the best performance. But, HDL experts tend to be hardware designers, not HPC programmers. On the other end of the spectrum are high-level languages that abstract the complexities of HDL to try to make the job of programming the FPGA as familiar as possible for HPC programmers. These languages can greatly improve the productivity of the HPC programmer, but usually at the cost of some performance. In the case of our DGEMM example, these performance losses might render the entire effort moot.

Finally, prospective FPGA programmers should remember that the field of Reconfigurable Computing is very dynamic. There are many FPGA chip vendors operating today, and new specialized FPGA vendors (as well as new development tools) entering the market all the time. Programmers working on a project would do well to occasionally resurvey the marketplace to see what has changed and what new tools may be available.

________________________________

Xilinx Floating-point Operators v2.0 – DS335, Xilinx, January 18, 2006
Xilinx Virtex-4 Family Overview – DS112(v1.5), Xilinx, February 10, 2006
Reconfigurable Computing in Real-World Applications, Steve Margerm, FPGA Journal, February 7, 2006
Evaluation of running FFTs on the Cray XD1 with attached FPGAs, Mike Babst, Roderick Swift, Dave Strenski, CUG, May 2005
An Overview of FPGA and FPGA programming; Initial experiences at Daresbury, Richard Wain, Ian Bush, Martyn Guest, Miles Deegan, Igor Kozin, and Christine Kitchen. CCLRC DL-TR-2006-007, June 2006

Dave Strenski is an applications analyst for Cray Inc., which designs and manufactures high performance computing systems. Prior to Cray, Dave held a variety of technical positions at several computer and research organizations. He holds degrees in Land Surveying, Civil and Mechanical Engineering. His publications include works in the areas of parallel computing, numerical consistency, genomic data searching algorithms, and optimization. He also holds a patent on a meshing algorithm for threaded fasteners. As a hobby, Dave plays with solar power.

Computational Bottlenecks and Hardware Decisions for FPGAs

Related

Leave a Reply Cancel reply

featured chalk talk