Image Processing Applications On New Generation FPGAs

The new generation of FPGAs, with DSP resources and embedded processors, is attracting the interest of the image processing market. With these enhanced capabilities, most of the DSP processing work can be offloaded from the software stack to the embedded processors and DSP resources on the FPGA, improving performance and reducing the cost of the whole system.

The traditional way of implementing algorithms in software limits performance because the data is processed serially. The operating frequency can be raised to a certain extent to reach the data rate required to process the image data, but raising it beyond certain limits causes system-level and board-level issues that become a bottleneck in the design.

With the current image processing applications moving towards consumer markets, the amount of data to be processed has increased at a fast pace. The new compression algorithms on the market are keeping up with the increasing data requirement.

The DSP processors are also trying to keep up with these requirements. With the ever-increasing need for processing, parallel processing in hardware can help reduce the processing overhead and offer better system performance. An FPGA’s flexible architecture enables parallel processing, providing a proper balance between the performance and the cost of the system, and the ability to reprogram the device gives a quick turnaround time.

With the right combination of IP cores, fast time-to-market can be achieved with real-time prototyping. At the same time, FPGAs provide the flexibility to upgrade to new standards. Figure 1 shows a group of image IP cores that can be easily reshuffled to quickly create applications such as video cell phones, set-top boxes, LCD projectors, Keyboard, Video and Mouse over IP (KVMIP), and digital cameras and camcorders. Along with parallel processing and high data rates, these IP cores also provide high configurability, which helps to fine-tune the system to reach a target performance.

The image processing blocks can be divided into two subsections, namely pixel processing blocks and frame processing blocks. Pixel processing blocks work directly on the incoming pixel data, whereas frame processing blocks work on images stored as frames. Color space converters, gamma correction and brightness control are examples of pixel processing blocks. Static Huffman, AES, DCT and interlace/de-interlace blocks fall into the category of frame processing blocks.

Figure 1. Image IP Blocks

Implementing pixel processing blocks, such as color space conversion, can be a simple job if FPGA resources are freely available and the pixel data frequency is low. On an FPGA with at least nine multipliers, the implementation is fairly simple: the matrix coefficients and offsets are stored in ROM or loaded dynamically through an external host configuration interface, and conversions are performed using a generic 3×3 matrix multiplication. If the incoming pixel frequency is low, the same can be achieved with just three multipliers by running the internal core at three times the pixel frequency.
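
The two multiplier budgets described above can be modelled in software before committing to hardware. The following Python sketch is only an illustration under stated assumptions: the coefficient table uses full-range BT.601 RGB to YCbCr values scaled to 8 fractional bits, which stands in for whatever the ROM or host interface would actually hold, and the function names are hypothetical.

```python
# Fixed-point software model of a generic 3x3 colour space converter.
# ASSUMPTION: 8 fractional bits and full-range BT.601 RGB -> YCbCr
# coefficients are example "ROM" contents, not values from the article.
FRAC_BITS = 8

COEFFS = [
    [77, 150, 29],      # Y  row:  0.299,  0.587,  0.114  (x 256)
    [-43, -85, 128],    # Cb row: -0.169, -0.331,  0.500  (x 256)
    [128, -107, -21],   # Cr row:  0.500, -0.419, -0.081  (x 256)
]
OFFSETS = [0, 128, 128]
ROUND = 1 << (FRAC_BITS - 1)  # adds 0.5 before truncation


def clamp8(value):
    """Saturate a result to the 8-bit output range."""
    return max(0, min(255, value))


def convert_parallel(pixel):
    """Nine multipliers: every product is formed in one pixel clock."""
    out = []
    for row, offset in zip(COEFFS, OFFSETS):
        acc = ROUND + sum(c * p for c, p in zip(row, pixel))
        out.append(clamp8((acc >> FRAC_BITS) + offset))
    return tuple(out)


def convert_time_multiplexed(pixel):
    """Three multipliers, core clock at 3x the pixel clock.

    On each core cycle the three multipliers handle one input
    component, and three accumulators collect the partial sums.
    """
    acc = [ROUND, ROUND, ROUND]
    for cycle in range(3):          # three core cycles per pixel
        for mult in range(3):       # three physical multipliers
            acc[mult] += COEFFS[mult][cycle] * pixel[cycle]
    return tuple(clamp8((a >> FRAC_BITS) + o) for a, o in zip(acc, OFFSETS))


if __name__ == "__main__":
    white = (255, 255, 255)
    print(convert_parallel(white))          # (255, 128, 128)
    print(convert_time_multiplexed(white))  # same result, fewer multipliers
```

Both functions return the same values for every pixel; the model only makes visible that the three-multiplier version trades multipliers for core clock cycles and accumulators.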

When implementing a simple color space conversion IP, it really helps to keep a couple of hooks in the design to make it flexible, for example to add or remove pipeline registers or to configure the number of multipliers to be used. When you move to the Xilinx Virtex-4, the multiplier and accumulator blocks available on the device simplify the process even more. All of these considerations reduce the time it takes to customize the IP for a particular FPGA while integrating it with other cores for different applications, saving the engineering time needed to modify the core and avoiding human error in doing so.

Image resize and image rotation blocks can be treated either as pixel processing blocks, with the overhead of line storage and a low operating frequency, or as frame processing blocks. A real-time image resize can take up to 8 block RAMs, plus adder and multiplier trees, and offers only limited upscaling and downscaling capability. This can be a good solution for a system that requires real-time image processing without frame storage in an external memory such as DDR. But if you are targeting applications that are constrained by FPGA area and have a high pixel frequency, real-time image resize might not be feasible, and external DDR/SDRAM storage offers a better solution.

Proper partitioning of the logic between hardware and software is one of the factors that decide the overall system efficiency. To implement image resizing entirely in hardware, the hardware would have to calculate the complete image size and derive the horizontal and vertical displacements that it needs to read and compress. This logic uses a good amount of hardware and arithmetic blocks. Since the calculation of the image size and displacements is not performed often, it can be moved to software, which then just provides the image size, vertical displacement and horizontal displacement values in the control registers.
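
As a rough illustration of this partitioning, the software side might reduce to a few integer divisions performed once per resize request. The sketch below is an assumption-laden example: the register names, the 16.16 fixed-point step format and the helper `source_index` are all hypothetical and are not taken from any particular IP core.

```python
# Software-side setup for a hardware image resizer (illustrative only).
# ASSUMPTION: displacements are per-output-pixel steps through the source
# image, held as 16.16 fixed-point values; register names are hypothetical.
STEP_FRAC_BITS = 16


def resize_registers(src_w, src_h, dst_w, dst_h):
    """Return the values software would write to the control registers."""
    if min(src_w, src_h, dst_w, dst_h) <= 0:
        raise ValueError("image dimensions must be positive")
    return {
        "OUT_WIDTH": dst_w,
        "OUT_HEIGHT": dst_h,
        "H_DISPLACEMENT": (src_w << STEP_FRAC_BITS) // dst_w,
        "V_DISPLACEMENT": (src_h << STEP_FRAC_BITS) // dst_h,
    }


def source_index(out_index, displacement):
    """What the hardware then does per pixel: one multiply and one shift."""
    return (out_index * displacement) >> STEP_FRAC_BITS


if __name__ == "__main__":
    regs = resize_registers(1920, 1080, 960, 540)  # downscale by 2
    print(regs)
    print(source_index(10, regs["H_DISPLACEMENT"]))  # output column 10 reads source column 20
```

The divisions stay in software, where they run only when the image size changes, while the hardware keeps just the cheap per-pixel address step.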

Video compression plays a vital role in image processing applications. Uncompressed high-definition pictures can easily take 1920 × 1080 pixels × 24 bits per pixel × 30 frames per second ≈ 1.49 Gbps. Reserving the compression blocks to work on stored data helps in tolerating the latency that the IP cores might introduce while running at a high system frequency.
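
The data rate quoted above is easy to reproduce and to rerun for other formats. This is a minimal sketch; the function name is made up for illustration, and the figure ignores blanking intervals and any chroma subsampling.

```python
def raw_video_bits_per_second(width, height, bits_per_pixel, frames_per_second):
    """Raw data rate of uncompressed video, counting active pixels only."""
    return width * height * bits_per_pixel * frames_per_second


if __name__ == "__main__":
    rate = raw_video_bits_per_second(1920, 1080, 24, 30)
    print(rate)                       # 1492992000
    print(f"{rate / 1e9:.2f} Gbps")   # 1.49 Gbps
```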

However, when many cores, in particular frame processing blocks, fetch data from memory such as DDR/SDRAM/SRAM, it becomes crucial to define the interfaces between all the IP cores and the memory controllers.

Any standard bus such as PLB, OPB or AMBA can be an easy solution to this, but does the IP really need the overhead? A very simple protocol with the least overhead can be defined in the form of request and grant. Mechanisms such as multiplexing the address, data and burst length over the same lines can reduce the number of lines, which matters far more when implementing a design in an FPGA than in an ASIC, where every route is created as required. Standard buses are attractive because of their standardization, but remember, they were NOT developed for implementation in FPGAs. Image IP solutions should not take more than a couple of weeks to integrate. Since bus standards are not optimized for FPGAs, we developed our own bus. Using this bus, we can optimize the design for frequency and area with the FPGA architecture in mind.
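
The article does not describe the bus itself, so the following Python model is only a sketch of what a minimal request/grant scheme with multiplexed lines could look like. The round-robin policy, the class and function names, and the ADDR/LEN/DATA phase order are all assumptions made for illustration.

```python
# Behavioural model of a minimal request/grant memory bus (illustrative).
# ASSUMPTION: round-robin arbitration and an ADDR, LEN, DATA phase order;
# the article does not specify either.


class RoundRobinArbiter:
    """Grants the shared memory port to one requesting core at a time."""

    def __init__(self, num_masters):
        self.num_masters = num_masters
        self.last_granted = num_masters - 1  # so master 0 is checked first

    def grant(self, requests):
        """requests: one bool per master. Returns the granted index or None."""
        for offset in range(1, self.num_masters + 1):
            candidate = (self.last_granted + offset) % self.num_masters
            if requests[candidate]:
                self.last_granted = candidate
                return candidate
        return None


def multiplexed_beats(address, burst_data):
    """Serialise one write burst onto a single shared set of lines."""
    beats = [("ADDR", address), ("LEN", len(burst_data))]
    beats.extend(("DATA", word) for word in burst_data)
    return beats


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(3)
    print(arbiter.grant([True, True, False]))   # 0
    print(arbiter.grant([True, True, False]))   # 1
    print(multiplexed_beats(0x1000, [0xAA, 0xBB]))
```

Sharing one set of lines for address, burst length and data costs two extra beats per burst, which is the kind of trade between routing resources and cycles that the paragraph above describes.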

One of the key points in developing image processing IP for FPGAs is the reusability and efficiency with which the hardware is implemented. This yields an efficient system in terms of cost and performance and also helps to reduce time-to-market, because the IP blocks can be integrated quickly without being modified again, which avoids repeating the verification cycle.