FPOAs Meet the Challenges of H.264 Encoding of High Definition Video

Introduction

Insatiable demand for high definition video and rapidly proliferating video production and distribution methods are driving the need for advanced video encoding and compression schemes. Several international standards and forums have been established in recent years to deal with various aspects of digital video encoding. The MPEG-4 Part 10/H.264 encoding standard is useful in a wide range of professional video applications, including broadcast head-end, IPTV, multi-stream encoding/decoding, and image processing.

The standard provides a higher level and a wider range of compression and compressed-video quality than previous MPEG standards, which in turn demands more processing performance and flexibility from the embedded processors in professional video equipment. Current programmable logic architectures may not be up to the task. The field programmable object array™ (FPOA™) is a high performance programmable logic device that operates at speeds up to 1 GHz and is programmed at the object level. FPOAs are especially well suited to the increased performance and flexibility requirements of HD H.264 encoding applications. What follows is an overview of one such implementation on a MathStar Arrix™ FPOA platform.

The MPEG-4 Part 10/H.264 Standard

The MPEG-4 Part 10/H.264 standard (commonly referred to as H.264) was jointly authored and is jointly maintained by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). The first version of the standard was completed in May 2003, and it was revised in March 2005 to include the full range of video quality encoding options available today.

The wide range of video quality and compression options comprehended in the H.264 standard reflects the increasingly diverse needs for management of high definition video. Two applications illustrate this point. In production studios, HD video is often encoded and stored for later use. Preservation of video quality in the encoded content is critical, and storing the content is relatively inexpensive, so lower compression ratios are often chosen. Conversely, for cable television head-end encoders, preservation of video quality is less critical, but the transmission cost-per-bit is high, so high compression ratios are preferred.

These encoding options are grouped into profiles and levels. Profiles specify video quality. Levels specify screen resolution, frame rate, and maximum bit rate at each profile. Of particular interest for this discussion is High 4:2:2 Profile at Levels 4.0 and 4.1 (commonly referred to as 1080i or 720p HD). High 4:2:2 Profile targets professional applications that use interlaced or progressive video, supporting the 4:2:2 chroma sampling format with up to 10 bits per component sample of decoded picture precision. Level 4.1 expresses support for picture resolutions ranging from 1280×720 @ 60 frames per second (720p) to 1920×1088 @ 60 fields (30 frames) per second (1080i).

How an H.264 Encoder Works

Figure 1 describes the data flow for an H.264 Encoder.

Figure 1. H.264 Encoder Block Diagram.

The basic steps in H.264 encoding are prediction, transform and quantization, and entropy encoding. Prediction takes advantage of spatial and temporal data redundancies in video frames to greatly reduce the amount of data to be encoded. Transform and quantization further compact the data by applying mathematical techniques that express the energy in the predicted video as a matrix of frequency coefficients, many of which will be zero. Finally, entropy encoding substitutes binary codes for strings of repeating coefficients to achieve the final, compressed bitstream.

The encoder consists of an encoding path and an embedded decoder that reconstructs encoded frames for prediction. Because the actual decoder at the other end of the wire is presented only with encoded frames, the encoder must itself decode the video data to isolate prediction errors that would otherwise cause “drift” in the actual decoding process.
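The drift problem can be illustrated with a one-dimensional toy codec. This is a minimal sketch and not H.264 itself: each sample is predicted from the previous one, the residual is coarsely quantized, and the names `encode`, `decode`, `max_error` and `STEP` are hypothetical. When the encoder predicts from original samples, the decoder's error accumulates; when it predicts from its own reconstruction (the role of the embedded decoder), the error stays within half a quantizer step.

```python
STEP = 8  # hypothetical quantizer step size


def quantize(value):
    """Round a residual to the nearest multiple of STEP (lossy)."""
    return STEP * round(value / STEP)


def encode(samples, closed_loop):
    """Return quantized residuals for a previous-sample predictor.

    closed_loop=True predicts from the encoder's own reconstruction
    (the embedded decoder); False predicts from the original samples.
    """
    residuals = []
    reference = 0
    for sample in samples:
        q = quantize(sample - reference)
        residuals.append(q)
        # Closed loop tracks exactly what the far-end decoder will hold.
        reference = reference + q if closed_loop else sample
    return residuals


def decode(residuals):
    """Rebuild samples by adding each residual to the running prediction."""
    output = []
    reference = 0
    for q in residuals:
        reference += q
        output.append(reference)
    return output


def max_error(samples, closed_loop):
    """Largest absolute difference between input and decoded samples."""
    decoded = decode(encode(samples, closed_loop))
    return max(abs(a - b) for a, b in zip(samples, decoded))


samples = [3 * i for i in range(1, 41)]  # slow ramp: 3, 6, ..., 120
print("open loop max error:  ", max_error(samples, closed_loop=False))
print("closed loop max error:", max_error(samples, closed_loop=True))
```

In the open-loop case every residual is small enough to quantize to zero, so the decoder never follows the ramp; in the closed-loop case the encoder sees the same growing mismatch the decoder would and corrects for it.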

FPOAs Meet the Challenge of H.264 Encoding

MathStar Arrix FPOAs provide an ideal platform for implementing an H.264 encoder. They deliver up to four times the performance of today’s top FPGAs and combine high performance and re-programmability to meet a wide variety of high definition application needs. FPOAs are composed of hundreds of objects that pass data and signals to each other through a patented 1 GHz interconnect fabric. Each FPOA in the Arrix family provides 256 arithmetic logic unit (ALU), 80 register file (RF), and 64 multiply accumulator (MAC) objects. The objects and the interconnect fabric run on a common clock, operating deterministically at frequencies up to 1 GHz. This deterministic performance eliminates the tedious timing closure steps associated with FPGAs, reducing design iterations and development time.

The H.264 High 4:2:2 profile, Level 4.1 encoder is mapped to the Arrix FPOA architecture as shown in Figure 2 below. The High 4:2:2 profile supports full 4:2:2 color space video and up to 10 bits-per-color component. Note the partitioning of the encoder is such that FPOA1 performs the functions necessary for “I-frame only” encoding, while FPOA2 is added to provide a full I/P/B encoder. This design is inherently flexible depending on the quality / bitrate tradeoffs of the intended application.

Figure 2. H.264 Encoder Implementation in MathStar FPOAs

This encoder design utilizes roughly 50 percent of the total FPOA resources. The cost of this High 4:2:2 profile HD platform, including external memory and a small FPGA to perform the entropy coding, is less than half of other re-programmable solutions. The nature of the FPOA is such that it can be repurposed in the field to perform other video codec functions, such as MPEG2 or JPEG2000. This “agile codec” platform is useful for any system that values high performance, quality and the flexibility to repurpose the existing hardware to perform another codec function.

The following sections describe video frame types and the major blocks in the encoder in more detail.

Frame Types and Partitioning

Video data is composed of sequences of picture frames. A picture frame is simply a collection of pixels in a certain color format. The first step in reducing the amount of information in the video stream is to take advantage of the redundant information within a given picture frame and between successive picture frames. In an H.264 encoder, intra frames (I-frames) are compressed using only the spatially redundant information within the frame itself. Predicted frames (P-frames) are compressed using spatially redundant information within the frame, temporally redundant information in frames that precede it in the video stream, or a combination of both. P-frames typically require fewer encoding bits than I-frames. Bi-predictive frames (B-frames) are like P-frames, except that both preceding and following frames are used for maximum compression. B-frames typically require fewer encoding bits than I- or P-frames.

Picture frames are subdivided into macroblocks for processing. A macroblock is a 16×16 block of luma (brightness) samples, and two corresponding blocks of chroma (color) samples.

Inter-Prediction (Motion Estimation and Motion Compensation)

Inter-prediction analyzes motion in the video sequence, taking advantage of the redundant information between successive frames. Motion estimation is one of the most computationally intensive blocks in the entire encoder. For each 16×16 pel macroblock, the motion estimation algorithm attempts to find as close a match as possible for the current macroblock in either a past or future reference frame. The quality of a match is calculated using a sum of absolute differences (SAD) algorithm. To reduce computational load, motion estimation is performed only on the luma component of the macroblock. Once a best match has been found, the displacement between the macroblock and its comparison macroblock in the reference frame is coded as a motion vector. The motion compensator uses the motion vector to predict the position in the current frame, generating the prediction macroblock (P in Figure 1).
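The SAD-based search described above can be sketched as an exhaustive integer-pel search over a small window. This is an illustration under stated assumptions, not the FPOA implementation: the function names, the ±4 search range, and the synthetic frames are all hypothetical, and the motion vector is expressed as the offset from the current macroblock's position to the matching block in the reference frame.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between two size x size luma blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[cy + dy][cx + dx] - ref[ry + dy][rx + dx])
    return total


def full_search(cur, ref, cx, cy, size=16, search_range=4):
    """Return (best_sad, (mvx, mvy)) for the block at (cx, cy) in cur."""
    height, width = len(ref), len(ref[0])
    best = None
    for mvy in range(-search_range, search_range + 1):
        for mvx in range(-search_range, search_range + 1):
            rx, ry = cx + mvx, cy + mvy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, (mvx, mvy))
    return best


# Synthetic frames: the current frame is the reference shifted right by 2
# and down by 1, so the matching block lies at offset (-2, -1).
W, H = 48, 48
ref_frame = [[(x * 7 + y * 13 + x * y) % 256 for x in range(W)]
             for y in range(H)]
cur_frame = [[ref_frame[max(y - 1, 0)][max(x - 2, 0)] for x in range(W)]
             for y in range(H)]

cost, motion_vector = full_search(cur_frame, ref_frame, cx=16, cy=16)
print(cost, motion_vector)
```

The cost of this approach grows with the square of the search range, which is one reason practical encoders narrow the search area before searching exhaustively.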

The motion estimation algorithm may adaptively choose to further subdivide the 16×16 macroblock into smaller search partitions to find better matches and thus achieve better compression. In this case, the motion estimator will generate multiple motion vectors, each of which is used by the motion compensator to construct the 16×16 prediction macroblock.

Finally, the prediction macroblock is subtracted from the current macroblock to produce a residual (Dn in Figure 1, also referred to as a difference macroblock or prediction error). Since the residual and motion vector(s) typically need far fewer bits than the original set of 16×16 pixels, substantial compression can be achieved.
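The subtraction that produces the residual, and the matching addition on the decoder side, can be sketched as follows. The 4×4 blocks and their sample values are made up for brevity (a real macroblock is 16×16), and the function names are hypothetical.

```python
def residual_block(current, prediction):
    """Element-wise difference Dn = current - prediction."""
    return [[c - p for c, p in zip(crow, prow)]
            for crow, prow in zip(current, prediction)]


def reconstruct_block(prediction, residual):
    """Decoder side: add the residual back onto the prediction."""
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(prediction, residual)]


current = [[120, 122, 125, 127],
           [121, 123, 126, 128],
           [122, 124, 127, 129],
           [123, 125, 128, 130]]
prediction = [[119, 122, 124, 127],
              [121, 124, 126, 127],
              [122, 124, 128, 129],
              [124, 125, 128, 131]]

dn = residual_block(current, prediction)
print(dn)  # values of magnitude 0 or 1 instead of values near 125
```

With a good prediction the residual values cluster around zero, which is what makes the later transform, quantization and entropy coding stages effective.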

The performance of the motion estimation implementation guarantees real-time processing for HD video. For 1080i @ 60 fields/sec, the maximum rate is 244,800 macroblocks per second. The motion estimation engine searches across two reference frames and four search partitions to find the best match. The algorithm used to define the search area is a combination of predicting the likely motion vector(s) and performing a full search around those areas down to quarter pixel resolution. The core of the motion estimation block requires approximately 50 FPOA objects. This is supplemented by objects dedicated to external memory controllers, configuration and I/O.
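The macroblock rate quoted above follows directly from the frame dimensions and frame rate. The short calculation below reproduces it; the function names are hypothetical. The cycles-per-macroblock figure assumes the 800 MHz operating frequency cited in the summary and ignores the parallelism across objects, so it is only a rough per-macroblock time budget.

```python
MB_SIZE = 16  # macroblock edge in luma samples


def macroblocks_per_frame(width, height):
    """Count 16x16 macroblocks, rounding partial blocks up."""
    mb_wide = -(-width // MB_SIZE)   # ceiling division
    mb_high = -(-height // MB_SIZE)
    return mb_wide * mb_high


def macroblocks_per_second(width, height, frames_per_second):
    """Macroblock throughput required for real-time encoding."""
    return macroblocks_per_frame(width, height) * frames_per_second


rate_1080i = macroblocks_per_second(1920, 1088, 30)  # 60 fields = 30 frames
rate_720p = macroblocks_per_second(1280, 720, 60)
cycles_per_mb = 800_000_000 // rate_1080i  # assumes an 800 MHz clock

print(rate_1080i)     # 244800
print(rate_720p)      # 216000
print(cycles_per_mb)  # 3267
```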

Intra-Prediction

The MPEG-4 Part 10/H.264 standard introduces intra-prediction to video encoding, enabling additional frame compression by utilizing spatial redundancy (for example, areas of blue sky) within the frame itself. I-frames are compressed using only intra-predicted macroblocks. P- and B-frames are compressed using either intra- or inter-predicted macroblocks, depending on which prediction mode generates the best match. As described in the MPEG-4/H.264 standard, the intra-prediction algorithm performs a matching process for each macroblock using only the pixel information directly above and to the left of the current macroblock. The comparison uses a sum of absolute differences (SAD) algorithm and provides the best intra-prediction mode match for the current macroblock for both luma and chroma components. There are up to four intra-prediction modes available for 16×16 macroblocks and many more modes available for intra-prediction in 8×8 and 4×4 partitions. This encoder implementation assumes the use of four modes for 16×16 macroblocks but allows for incremental resources for the addition of the other modes. This block requires approximately 30 objects.
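A SAD-based mode decision for 16×16 intra-prediction can be sketched as below. The sketch covers only three of the four 16×16 modes (vertical, horizontal and DC); the plane mode is omitted for brevity, and the DC case assumes both the top and left neighbours are available. Function and variable names are hypothetical.

```python
def predict_vertical(top, left):
    """Each column copies the pixel directly above the macroblock."""
    return [list(top) for _ in range(16)]


def predict_horizontal(top, left):
    """Each row copies the pixel directly to the left of the macroblock."""
    return [[left[y]] * 16 for y in range(16)]


def predict_dc(top, left):
    """Every pixel is the rounded mean of the 32 neighbouring pixels."""
    dc = (sum(top) + sum(left) + 16) >> 5
    return [[dc] * 16 for _ in range(16)]


MODES = {
    "vertical": predict_vertical,
    "horizontal": predict_horizontal,
    "dc": predict_dc,
}


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def best_intra_mode(block, top, left):
    """Return (mode_name, sad) for the lowest-cost prediction."""
    costs = {name: sad(block, fn(top, left)) for name, fn in MODES.items()}
    name = min(costs, key=costs.get)
    return name, costs[name]


# Vertical stripes: every row repeats the row above the macroblock.
top = [(x * 16) % 256 for x in range(16)]
left = [128] * 16
block = [list(top) for _ in range(16)]
print(best_intra_mode(block, top, left))  # ('vertical', 0)
```

Because each mode needs only the 32 neighbouring pixels, all candidate predictions can be evaluated independently, which suits a parallel hardware implementation.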

Multiple search partitions per macroblock, quarter pixel interpolation, and intra-prediction are several of the new features in the MPEG-4/H.264 standard that enable higher compression levels but introduce significantly higher computational demands, for which the Arrix FPOA is ideally suited.

Deblocking Filter

The deblocking filter is applied to every decoded macroblock in order to reduce blocking distortion. The deblocking filter is applied after the inverse transform and has two benefits: 1) block edges are smoothed, improving the appearance of decoded images – particularly at higher compression ratios – and, 2) the filtered macroblock is stored for use with motion-compensated prediction of further frames in the encoder. This approach results in a smaller residual after prediction. Intra-coded macroblocks are also filtered, but intra-prediction is carried out using unfiltered, reconstructed macroblocks. This block requires approximately 35 FPOA objects.

Integer Transform and Quantization

The residual for each macroblock is further compressed by DCT-like integer transforms and quantization. The transforms convert the spatial data in the residual into frequency data, resulting in a matrix of coefficients. The coefficients are then quantized to further reduce the number of bits required to represent each coefficient. This is the one step in the encoding process where information is lost.
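Why quantization loses information can be seen with a plain uniform quantizer. This is a simplified sketch: the standard's quantizer uses QP-dependent scaling rather than a single step size, and the coefficient values and step below are invented for illustration.

```python
def quantize(coefficients, step):
    """Map each coefficient to an integer level, rounding to nearest."""
    levels = []
    for row in coefficients:
        levels.append([(1 if c >= 0 else -1) * ((abs(c) + step // 2) // step)
                       for c in row])
    return levels


def dequantize(levels, step):
    """Scale levels back to coefficient magnitudes (decoder side)."""
    return [[level * step for level in row] for row in levels]


coefficients = [[520, -61, 14, -3],
                [-45, 22, -6, 2],
                [11, -7, 3, -1],
                [-4, 2, -1, 0]]
step = 20  # hypothetical step size

levels = quantize(coefficients, step)
restored = dequantize(levels, step)
zeros = sum(level == 0 for row in levels for level in row)

print(levels)    # small integers, mostly zero at high frequencies
print(restored)  # close to, but not equal to, the original coefficients
print(zeros)     # 10 of the 16 levels are zero
```

A larger step produces more zeros and fewer bits at the cost of a larger reconstruction error, which is the quality versus bit rate tradeoff the encoder controls.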

The encoder implementation performs the transform and quantization processes defined by the MPEG-4/H.264 standard. In H.264 compression, three transforms are used depending on the type of residual data that is to be coded: a transform for the 4×4 array of luma DC frequency coefficients in intra macroblocks, a transform for the 2×2 array of chroma DC coefficients, and a transform for all other 4×4 blocks in the residual data. The transform for the intra macroblock requires the computation of DC coefficients in addition to the operations for AC frequency coefficients. The transform and quantization block requires approximately 20 FPOA objects.
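Two of these transforms can be sketched with small integer matrix products. The 4×4 matrix below is the commonly published form of the H.264 forward core transform, and the 2×2 matrix is a Hadamard transform; the scaling that the standard folds into quantization is left out, and the helper names are hypothetical.

```python
# Forward core transform matrix for 4x4 residual blocks.
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]

# 2x2 Hadamard matrix used for the chroma DC coefficients.
H2 = [[1, 1],
      [1, -1]]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def core_transform_4x4(residual):
    """Y = Cf . X . Cf^T using integer arithmetic only."""
    return matmul(matmul(CF, residual), transpose(CF))


def chroma_dc_transform_2x2(dc):
    """2x2 Hadamard transform of the chroma DC coefficients."""
    return matmul(matmul(H2, dc), H2)


flat = [[5] * 4 for _ in range(4)]
print(core_transform_4x4(flat))                  # only the DC term is nonzero
print(chroma_dc_transform_2x2([[4, 4], [4, 4]]))  # [[16, 0], [0, 0]]
```

Because the matrix entries are limited to ±1 and ±2, the transform reduces to additions, subtractions and shifts, which maps naturally onto ALU objects.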

Other Blocks

Additional objects are used for memory interfaces, input/output blocks, configuration and “glue logic.” These extra functions combine to bring the total object count to about one half of the available 800 FPOA objects across two Arrix FPOAs. In addition, the CABAC entropy encoder is implemented in a small FPGA. The CABAC block is relatively small and is completely defined by the MPEG-4/H.264 standard.
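The object budget can be tallied from the figures given in this article. The per-block counts are the approximate values quoted in the sections above; the figure for the remaining functions is derived here by subtraction from "about one half" of the total and is an estimate, not a number stated by the design.

```python
OBJECTS_PER_FPOA = {"ALU": 256, "RF": 80, "MAC": 64}
NUM_FPOAS = 2

# Approximate object counts quoted for the major encoder blocks.
CORE_BLOCKS = {
    "motion estimation": 50,
    "intra-prediction": 30,
    "deblocking filter": 35,
    "transform and quantization": 20,
}

total_objects = NUM_FPOAS * sum(OBJECTS_PER_FPOA.values())
core_objects = sum(CORE_BLOCKS.values())
used_objects = total_objects // 2            # "about one half"
other_objects = used_objects - core_objects  # derived estimate

print(total_objects)  # 800
print(core_objects)   # 135
print(other_objects)  # 265, memory interfaces, I/O, configuration, glue
```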

Summary

The MPEG-4 Part 10/H.264 standard comprehends both the dramatically increasing demand for high definition video and the broad range of compression requirements across professional video market segments. The MathStar Arrix FPOA provides a high performance, highly flexible, cost effective platform for implementing an H.264 encoder. The overall H.264 encoder implementation uses approximately 50 percent of the resources of two Arrix FPOAs operating at 800 MHz and supports High 4:2:2 profile at HD resolutions (Level 4.1, 1080i/720p). The FPOA is completely re-programmable in the field, enabling an implementation of MPEG2, JPEG2000 or other encoders and decoders on the same hardware platform. The total implementation cost of the H.264 encoder is less than half that of alternative re-programmable solutions.

Tom Diamond joined MathStar, Inc. as Marketing Director in September 2006. Tom has over 20 years of experience in marketing and management roles in leading consumer electronics, test and measurement, and high volume silicon manufacturing companies. Most recently, he managed the roadmap planning and product definition for Intel Corporation’s LAN Access Division.