Extracting higher performance from today’s FPGA-based systems involves much more than just cranking up the clock rate. Typically, one must achieve a delicate balance between a complex set of performance requirements – I/O bandwidth, fabric logic, memory bandwidth, DSP and/or embedded processing performance – and critical constraints such as power restrictions, signal integrity and cost budgets. Moore’s Law notwithstanding, to maximize performance while maintaining this balance, the FPGA designer must look beyond the clock frequency altogether.
Overcoming Performance Bottlenecks
Each new generation of process technology brings with it the potential for substantially higher clock frequencies and the inherent promise of a proportional increase in processing performance. However, today’s system performance challenges go well beyond glue logic performance and maximized clock rates. The degree to which today’s FPGA designers can increase performance depends more on their ability to accurately identify and overcome system-level performance bottlenecks.
The personal computer (PC) clearly illustrates this point. The real system performance bottleneck in a PC lies not in the processor’s clock frequency, but in the ability of all of the various functional blocks within the PC to work together at the desired performance levels.
Similarly, the best way to understand system performance in an FPGA design is to assess the interrelationships of the parameters that exhibit the greatest collective influence on system performance – i.e., Data Rate In, Data Rate Out, Data Rate External Memory, and FPGA Processing Rate. In the broadest view of an FPGA-based system, these parameters should relate as follows:
Data Rate In = Data Rate Out
- The rate at which data enters must equal the rate at which data exits to avoid overflows.
Data Rate External Memory ≥ 2 x Data Rate In
- The rate of the external memory interface must be at least twice the rate of the incoming data to avoid overflows (data must be written and read).
FPGA Processing Rate ≥ Data Rate In
- The internal data processing rate at which FPGA-based algorithmic calculations are performed must equal or exceed the rate at which data enters.
- These algorithms can be implemented in the programmable fabric logic, dedicated DSP slices, or in embedded processors.
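The three relationships above can be checked mechanically for a candidate design. The following Python sketch is only an illustration: the `SystemRates` class, the `find_bottlenecks` function and the example figures are hypothetical and do not come from any vendor tool. It treats an output rate lower than the input rate as the overflow case of the first relationship.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SystemRates:
    """Aggregate rates, all in the same unit (for example Gbps)."""
    data_in: float
    data_out: float
    external_memory: float
    processing: float


def find_bottlenecks(r: SystemRates) -> List[str]:
    """Return the names of the relationships that the design violates."""
    problems = []
    # Data leaving more slowly than it arrives eventually overflows a buffer.
    if r.data_out < r.data_in:
        problems.append("output")
    # Buffered data is written once and read once, hence the factor of two.
    if r.external_memory < 2 * r.data_in:
        problems.append("external memory")
    # Internal processing must keep up with the incoming data.
    if r.processing < r.data_in:
        problems.append("processing")
    return problems


# Illustrative numbers only: a 10 Gbps stream with a 16 Gbps memory interface.
print(find_bottlenecks(SystemRates(10, 10, 16, 12)))  # ['external memory']
```

In this made-up case the fabric and the output keep up with the stream, but the memory interface would need at least 20 Gbps.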
The I/O rates (Data Rate In, Data Rate Out, and Data Rate External Memory) are functions of the data rates per pin and the available pins. Although the development of high-performance networking and memory interfaces has dramatically increased these I/O rates, signal integrity and/or board layout constraints may still limit the actual sustainable I/O performance levels.
Generally, the FPGA Processing Rate is determined by the width of the internal data path and the manner in which the algorithms are implemented. Internal architectures employing wide data buses and substantial amounts of hardware parallelism (whether in the fabric or in dedicated DSP slices) can process several data streams (or channels) simultaneously. Consequently, the designer can maintain the appropriate interrelationships while running these internal data processing functions at a fraction of the clock rates required by the external I/O structures (Figure 1).
Figure 1 – Hardware parallelism provides substantial performance improvements at a fraction of external clock rates
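The trade-off between data path width and internal clock rate is simple arithmetic. As a rough sketch, assuming one full-width word is processed per clock cycle (the function name and the 10 Gbps figure are illustrative, not taken from a specific device):

```python
def required_internal_clock_mhz(data_rate_gbps: float, datapath_bits: int) -> float:
    """Clock needed to sustain data_rate_gbps when datapath_bits are
    processed in parallel on every cycle (assumes no idle cycles)."""
    return data_rate_gbps * 1000.0 / datapath_bits


# A hypothetical 10 Gbps stream at several internal data path widths.
for width in (16, 64, 128):
    print(width, "bits ->", required_internal_clock_mhz(10.0, width), "MHz")
# 16 bits -> 625.0 MHz
# 64 bits -> 156.25 MHz
# 128 bits -> 78.125 MHz
```

Doubling the width halves the required clock, which is why wide, parallel internal structures can keep pace with fast external interfaces.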
The inherent ability of FPGAs to enhance performance through the use of massive parallelism is clearly illustrated in image, signal or packet processing applications. For these intensive data processing applications, simply increasing the internal clock frequency of pipelined processor architectures cannot produce sufficient performance increases to satisfy the application.
Figure 2 illustrates a systolic time-multiplexed mode filter implementation where both pipelining and parallelism are implemented to take full advantage of the dedicated DSP slices and achieve the required performance with better resource utilization.
Figure 2 – Filter implementations using dedicated DSP slices
In many cases, I/O bandwidth represents the most critical system bottleneck in FPGA-based systems. Because of limited pin counts, limited external memory bandwidth or physical layer constraints, designers often find it substantially more difficult to achieve sufficiently high transfer rates in and out of the FPGA than to sustain internal processing performance. Thus, I/O bandwidth, whether implemented with parallel LVDS interfaces for chip-to-chip links, external single-ended memory interfaces, or serial multi-gigabit interfaces for backplanes, often becomes the limiting factor in the quest for performance.
The great majority of systems use a data buffer external to the FPGA for temporary storage, thereby making this buffer’s bandwidth a critical factor in determining overall performance. Memory interfaces like DDR2 SDRAM, QDR II SRAM, or RLDRAM II are source-synchronous, with per-pin data rates of more than 533 Mbps. However, memory bandwidth is determined not only by the per-pin data rate but also by the width of the bus.
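To make the dependence on bus width concrete, the sketch below multiplies the per-pin rate by the number of data pins and then applies the earlier rule that buffered data must be both written and read. The 64-bit width and the `efficiency` parameter are assumptions for illustration; real interfaces lose some peak bandwidth to overheads such as refresh and bus turnaround.

```python
def peak_bandwidth_gbps(per_pin_mbps: float, bus_width_bits: int) -> float:
    """Peak interface bandwidth: per-pin data rate times number of data pins."""
    return per_pin_mbps * bus_width_bits / 1000.0


def max_buffered_input_gbps(per_pin_mbps: float, bus_width_bits: int,
                            efficiency: float = 1.0) -> float:
    """Highest input rate the buffer can absorb, since each word is
    written once and read once (Data Rate External Memory >= 2 x Data Rate In).
    `efficiency` (0..1) is an assumed derating factor, not a datasheet value."""
    return peak_bandwidth_gbps(per_pin_mbps, bus_width_bits) * efficiency / 2.0


# Hypothetical 64-bit interface at 533 Mbps per pin.
print(round(peak_bandwidth_gbps(533, 64), 3))      # 34.112
print(round(max_buffered_input_gbps(533, 64), 3))  # 17.056
```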
Some FPGAs have dedicated logic circuitry built into every I/O that simplifies the physical layer interface and provides the capability to implement buses with bandwidths that scale linearly with the number of data I/Os. The greatest challenge, however, for source synchronous interfaces like DDR2 SDRAM or QDR II SRAM is meeting the various Read Data capture requirements. For instance, as the data valid window becomes shorter it becomes more important and more challenging to align the received clock with the center of the data.
To enable reliable data capture, I/O circuitry needs to include built-in delay or phase shift elements, ideally adjustable in increments of less than 100 picoseconds, to ensure proper alignment between clock and data signals. The capability of this I/O circuitry to calibrate timing at run time (rather than at design time) can also substantially improve design margins.
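One common way to use such delay elements at run time is to sweep the delay taps, record at which settings a known training pattern is captured correctly, and then park the delay in the middle of the widest passing range. The sketch below illustrates only that selection step. It is a generic approach written for this article, not a description of any vendor's calibration logic, and the pass/fail sweep is made-up data.

```python
from typing import Optional, Sequence


def center_of_widest_window(passes: Sequence[bool]) -> Optional[int]:
    """Given one pass/fail result per delay tap, return the tap index at the
    center of the longest run of passing taps, or None if no tap passes."""
    best_start, best_len = -1, 0
    start = None
    # A trailing False closes a window that runs to the last tap.
    for i, ok in enumerate(list(passes) + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    if best_len == 0:
        return None
    return best_start + (best_len - 1) // 2


# Hypothetical sweep over ten taps: taps 2..6 capture the pattern correctly.
sweep = [False, False, True, True, True, True, True, False, False, True]
print(center_of_widest_window(sweep))  # 4
```

Choosing the center rather than the first passing tap leaves margin on both sides of the data valid window as voltage and temperature drift.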
Chip-to-chip interfaces may require 1 Gbps per pin LVDS buses to support the highest bandwidths. One way to accommodate such rates is to employ wider parallel bus interfaces. However, the major challenge with these high-speed transfer rates is matching the external clock rates with those employed by the internal circuits. Some I/O technologies simplify the design of differential parallel bus interfaces with embedded SERDES blocks that serialize and de-serialize parallel interfaces to match the data rate to the speed of the internal FPGA circuits. Additionally, the I/O technology needs to provide per-bit and per-channel de-skew for increased design margins, enabling the design of interfaces such as SPI-4.2, XSBI, and SFI-4, as well as RapidIO. Only when these capabilities are built into every I/O can wider LVDS buses be implemented, thereby boosting the I/O bandwidth to more than 400 Gbps for the larger BGA packages.
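The rate matching performed by a SERDES block can be pictured with a small behavioral model: bits arriving serially on one pin are grouped into words, and the fabric only needs to clock once per word. The 1:8 ratio, the MSB-first bit order and the function names below are assumptions chosen for illustration, not properties of a particular device.

```python
from typing import List, Sequence


def deserialize(bits: Sequence[int], ratio: int) -> List[int]:
    """Group a serial bit stream into ratio-bit words (first bit received is
    the MSB, an assumed convention). Incomplete trailing bits are dropped."""
    words = []
    for i in range(0, len(bits) - ratio + 1, ratio):
        word = 0
        for b in bits[i:i + ratio]:
            word = (word << 1) | b
        words.append(word)
    return words


def fabric_clock_mhz(line_rate_mbps: float, ratio: int) -> float:
    """Word clock seen by the fabric for a given per-pin line rate."""
    return line_rate_mbps / ratio


stream = [1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print([hex(w) for w in deserialize(stream, 8)])  # ['0xa1', '0xf0']
print(fabric_clock_mhz(1000.0, 8))               # 125.0
```

With these assumed numbers, a 1 Gbps pin is handled by fabric logic running at 125 MHz.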
As designs move to faster interface speeds, serial interconnect can provide a chip-to-chip and backplane solution that saves power and board space while reducing design complexity. For example, parallel buses are better suited for short-distance interfaces where board space is not critical, while serial interfaces can provide a better solution in backplane or daisy-chained chip-to-chip implementations. Some FPGA vendors offer Multi-Gigabit Transceivers (MGTs) with performance from 622 Mbps to 10.3125 Gbps. These transceivers are fully programmable and can implement a myriad of speeds and serial standards.
Logic Fabric Performance
Today’s FPGA vendors have enhanced the performance of the programmable logic fabric by building devices with advanced 90 nm technology. The result of scaling down is not only a reduced die size but also higher clock frequency, with some of today’s fabric logic clocks running at 500 MHz. However, the fabric logic clocking is often not the determining factor in the overall system design performance. Maximum clocking frequency is determined by the worst-case path delay between two registers in the implementation, and is often below the maximum frequency specified by the FPGA vendor. The ability to mix and match the right building blocks, usually implemented as hard IP (DSP slices, on-chip memory and I/O logic blocks), combined with proper design techniques and tool settings that maximize both parallelism and pipelining, will generate results that are closer to the full potential of the FPGA device.
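The link between worst-case path delay and achievable clock frequency, and the effect of pipelining on it, can be sketched in a few lines. The path delays are invented for illustration, and the model is deliberately simplified: it folds register setup time and clock skew into the path delay figure.

```python
from typing import Sequence


def fmax_mhz(path_delays_ns: Sequence[float]) -> float:
    """Maximum clock frequency set by the slowest register-to-register path."""
    return 1000.0 / max(path_delays_ns)


# Hypothetical design whose slowest path takes 4.0 ns.
before = [1.8, 2.0, 4.0]
# Pipelining: an extra register splits the 4.0 ns path into two 2.0 ns paths.
after = [1.8, 2.0, 2.0, 2.0]

print(fmax_mhz(before))  # 250.0
print(fmax_mhz(after))   # 500.0
```

The extra register adds one cycle of latency, but the design can now be clocked twice as fast.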
On-Chip Memory Performance
On-chip memory is also essential to achieve higher system performance because it is used extensively to store data between algorithmic processes. On-chip memory, whether distributed LUT-based memory, block RAM or FIFO, is limited in size and is used for relatively small buffer storage. Different applications require different memory sizes and access times. Choosing the right memory hierarchy and fully utilizing the on-chip memory can significantly affect system performance. For example, distributed RAM is best suited for smaller sizes (less than 2 Kb) and fast clock-to-data-out times, while block RAM can accommodate larger buffers at frequencies of up to 500 MHz.
Many image, signal and data processing applications need dedicated logic with increased parallelism capable of implementing arithmetic algorithms at higher rates. Some of the latest FPGA offerings enable the designer to configure the DSP slices to implement multipliers, counters, multiply-accumulators, adders and many more functions, all without consuming logic fabric resources.
For example, architecting the Finite Impulse Response (FIR) filters as adder chains rather than adder trees will remove the bottleneck from the fabric while exploiting the full capabilities of the dedicated DSP slices, enabling performance of up to 500 MHz for systolic filter implementations (Figure 2).
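A cycle-by-cycle software model helps show why an adder chain computes the same result as the textbook FIR sum while placing only one multiply and one add between registers. The sketch below is a simplified behavioral model written for this article: each stage's running sum is registered once, and the input data is delayed by two cycles per stage. It omits the additional pipeline registers that a real DSP slice provides, and the function names and test data are hypothetical.

```python
from typing import List, Sequence


def direct_fir(samples: Sequence[int], coeffs: Sequence[int]) -> List[int]:
    """Reference FIR: y[n] = sum over k of coeffs[k] * x[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]
        out.append(acc)
    return out


def systolic_fir(samples: Sequence[int], coeffs: Sequence[int]) -> List[int]:
    """Adder-chain FIR model: stage k adds coeffs[k] * x[n - 2k] to the
    registered sum of stage k - 1, so each stage has one multiply-add
    between registers. The output lags direct_fir by len(coeffs) - 1 cycles."""
    n_taps = len(coeffs)
    history = [0] * (2 * n_taps - 1)  # history[j] holds x[n - j]
    sums = [0] * n_taps               # registered partial sum of each stage
    out = []
    for x in samples:
        history = [x] + history[:-1]
        prev = sums                   # register values from the previous cycle
        sums = [(prev[k - 1] if k > 0 else 0) + coeffs[k] * history[2 * k]
                for k in range(n_taps)]
        out.append(sums[-1])
    return out


x = [3, -1, 4, 1, -5, 9, 2, 6]
c = [1, 2, 3, 4]
latency = len(c) - 1
# Apart from the pipeline latency, both structures give identical results.
print(systolic_fir(x, c)[latency:] == direct_fir(x, c)[:len(x) - latency])  # True
```

An adder tree would instead sum all products in one combinational cone, so its register-to-register delay grows with the number of taps unless it is pipelined in the fabric.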
Traditionally, designers have implemented CPU-based processing systems on standalone ASSPs. Platform FPGAs provide the ability to integrate high-performance processors in programmable fabric logic, opening the door for higher levels of integration, as well as higher performance levels through better coupling. On-chip buses, memory, and peripherals enable the embedded systems designer to eliminate bottlenecks between the fabric logic, DSP slices, I/O blocks and the CPUs.
Programmable logic is simply better suited to perform numerically intensive algorithm calculations, while traditional processors are a better fit for decision-making algorithms. Today’s real-time systems require both of these processing structures to deliver higher levels of performance. Some of today’s more capable FPGAs provide up to two hard IP processor cores, each delivering more than 700 DMIPS performance with minimal power consumption. The hard IP cores are a better alternative for delivering higher performance at lower power than soft cores due to the efficiency of the dedicated circuitry. These hard cores deliver an order-of-magnitude better transistor utilization for the same function.
Another important challenge in the quest for higher embedded processing performance is the coupling between the embedded CPU and co-processors or hardware accelerators like the DSP slices. Some FPGA vendors have implemented dedicated circuitry that provides a low-latency path for connecting co-processor modules to the embedded processor. These user-defined, configurable hardware accelerator functions operate as extensions to the embedded processor, offloading the CPU from demanding computational tasks. For example, implementing floating-point calculations in hardware can improve performance by a factor of 20 over software emulation.
Composite Performance Metrics
As system requirements become more complex and their interrelationships become more critical, assessing overall performance requires a more comprehensive, fine-grained method of defining and comparing relative performance values. To better quantify the composite system-level performance value of today’s leading FPGAs, one must conduct an individual point-by-point comparison of the seven performance criteria most critical to system-level FPGA designs: logic fabric, embedded processing, DSP, on-chip RAM, high-speed serial interface, I/O memory bandwidth, and I/O LVDS bandwidth.
Consider a specific example: a simple FM receiver design from www.opencores.org. This design includes a traditional low-pass filter implemented as a 16-tap averaging filter. A designer utilizing the Virtex-4 DSP48 slices with some RTL modifications can implement it as an adder chain. More traditional FPGA implementations that lack the DSP48 pipelined MAC structure use adder trees and therefore are slowed down by the bottleneck in the fabric. Figure 3 shows architectural differences and the improvement for the FM receiver design with the DSP48 compared to traditional FPGA implementations. The DSP48 pipelined structure significantly reduces the register-to-register delay, enabling the 500 MHz performance.
Figure 3 – FPGA performance comparison for a DSP application
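For readers unfamiliar with the filter in this example, a 16-tap averaging filter simply outputs the mean of the most recent 16 samples, which is a FIR filter with all coefficients equal. The following reference model is a sketch written for this article and is not taken from the opencores.org design. It uses integer floor division, an assumed rounding choice.

```python
from typing import List, Sequence


def averaging_filter(samples: Sequence[int], taps: int = 16) -> List[int]:
    """Moving average over the last `taps` samples (zeros before the start).
    Uses floor division, so negative sums round toward minus infinity."""
    window = [0] * taps
    out = []
    for x in samples:
        window = [x] + window[:-1]  # shift the newest sample into the window
        out.append(sum(window) // taps)
    return out


# Step response: the output ramps up over 16 samples and then settles.
y = averaging_filter([16] * 20)
print(y[:3], y[15], y[19])  # [1, 2, 3] 16 16
```

The ramp in the step response is the low-pass behavior: a sudden change in the input is spread over 16 output samples.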
In today’s FPGA designs, the quest for higher performance has become a multi-faceted battle. It is no longer sufficient to run the fabric logic at higher clock rates. One must also build a system with higher I/O throughput and external memory bandwidth, and the right functional blocks for hardware and algorithmic parallelism through DSP, on-chip memory and embedded processing. Ultimately, all of these elements must work together to deliver the desired performance level while satisfying the specific signal integrity, power and cost budget restrictions for the particular system.
About the author: Adrian Cosoroaba joined Xilinx in 2004 as Marketing Manager for Virtex Solutions, where he is responsible for worldwide marketing activities. He brings to Xilinx over 19 years of semiconductor experience in system applications and marketing. Prior to joining Xilinx, he held a range of applications engineering and strategic marketing positions at Fujitsu. He holds an M.S. in Electrical Engineering from Ohio State University and a B.S. in Engineering Physics from the University of California at Berkeley.