Multicore processing chips address the power and energy challenge while increasing processing capability. For many years, the computational power of processing chips was raised by increasing the core clock frequency. However, the required input power and cooling grow with the cube of the frequency. Superscalar and single-instruction multiple-data (SIMD) architectures, together with larger memory caches, provide a partial answer. Multicore technology takes this trend a step further through the use of multiple full-function cores, each with its own local memory, on a single chip. Surprisingly, the movement toward multicore technology has not been driven by military applications, but instead by the video gaming industry.
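The cubic relationship can be made concrete with a short calculation. The following is a minimal sketch that assumes power is exactly proportional to the cube of the clock frequency, as stated above, and that the workload parallelizes perfectly across cores; the function names are illustrative only, and static leakage and interconnect overhead are ignored.

```python
def relative_power(freq_ratio: float, cores: int = 1) -> float:
    """Power of `cores` cores, each clocked at `freq_ratio` times a baseline.

    Assumes the cubic power law quoted in the text (P proportional to f**3).
    """
    return cores * freq_ratio ** 3


def relative_throughput(freq_ratio: float, cores: int = 1) -> float:
    """Idealized throughput: linear in both clock rate and core count."""
    return cores * freq_ratio


# Option A: double the throughput by doubling the clock of a single core.
single_fast = relative_power(2.0, cores=1)
# Option B: double the throughput with two cores at the baseline clock.
dual_base = relative_power(1.0, cores=2)

print(single_fast, dual_base)  # 8.0 2.0
```

Under these idealized assumptions, both options deliver twice the baseline throughput, but the two-core option does so at a quarter of the power of the faster single core.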
Circular Synthetic Aperture Radar
Airborne Synthetic Aperture Radar (SAR) has been used for reconnaissance in environments where electro-optic and infrared technologies cannot be used, such as at night or when the scene is obscured by weather. SAR provides the war fighter with a flyby snapshot of an area that can be used for terrain mapping, targeting, and assessment of troop movement. In addition, because SAR uses coherent processing, it can literally find “footprints in the sand” where the other technologies cannot. Given the power of today’s processors, many of these operations can be performed on a single processor core. However, the mission is changing.
Today’s counter-insurgency operations require more than reconnaissance. They require surveillance. The ability to observe an area on a 24/7 basis enables the war fighter to monitor real-time activity and perform forensic analysis. This capability is provided by Circular-SAR (CSAR). Instead of a straight-line flyby, the airborne platform circles the area of observation, continuously illuminating an area with radar (see figure 1). Instead of snapshots, a SAR movie is created, increasing the required processing by the frame rate. This new mission requires much more processing in order to reformat and coherently integrate the radar signals. Additionally, this must be accomplished in the power- and energy-constrained airborne environment.
Figure 1: CSAR has increased data bandwidth and processing requirements when compared to conventional SAR due to high-PRF and non-linear flight path.
CSAR processing and data-bandwidth requirements are much more intense than those of conventional stripmap or spotlight SAR. First, CSAR applications require more data per frame. This is partly because a typical CSAR application requires a higher pulse rate, and partly because CSAR collects an entire cone of data rather than a small annular section. Second, instead of using Fourier-transform-based image processing, CSAR must currently use backprojection-based image processing to compensate for the non-linear flight path. This takes CSAR processing beyond the reach of a single-core processor and requires multiple processing elements connected by a high-speed fabric. In fact, these requirements exceed the capabilities of today’s multicore chips and call for a scalable multicore-based multicomputer.
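To illustrate why backprojection is so demanding, the following is a minimal sketch of time-domain backprojection in plain Python. It is not the implementation discussed in this article: it assumes range-compressed pulses, a nearest-neighbor range lookup (production systems typically interpolate), and a simple two-way phase model. All names and parameters are illustrative.

```python
import cmath
import math


def backproject(pulses, platform_positions, pixels,
                wavelength, range_start, range_step):
    """Time-domain backprojection with nearest-neighbor range lookup.

    pulses: list of range-compressed pulses, each a list of complex samples
    platform_positions: one (x, y, z) antenna position per pulse
    pixels: list of (x, y, z) ground positions to image
    Returns one complex value per pixel.
    """
    image = [0j] * len(pixels)
    for samples, antenna in zip(pulses, platform_positions):
        for i, pixel in enumerate(pixels):
            rng = math.dist(antenna, pixel)
            bin_index = round((rng - range_start) / range_step)
            if 0 <= bin_index < len(samples):
                # Remove the two-way propagation phase so that returns
                # from this pixel add coherently across pulses.
                phase = cmath.exp(1j * 4.0 * math.pi * rng / wavelength)
                image[i] += samples[bin_index] * phase
    return image
```

The nested loops make the cost proportional to the number of pulses times the number of pixels, and each pixel's range is computed from the actual antenna position, which is what allows an arbitrary (for example, circular) flight path.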
The IBM Cell Broadband Engine
A prime example of this new generation of chips is the IBM Cell Broadband Engine (BE) processor developed originally by IBM, Sony, and Toshiba for the Sony Playstation 3 video game console and other consumer electronics devices (see figure 2).
Figure 2: The IBM Cell Broadband Engine (BE) processor offers dramatically improved performance for graphic-intensive workloads and computationally intensive applications.
The Cell BE is essentially a distributed-memory, multiprocessing system on a single chip. It consists of a ring bus that connects a single PowerPC Processing Element (PPE), eight Synergistic Processing Elements (SPEs), a high-bandwidth memory interface to the external XDR main memory, and a coherent interface bus to connect multiple Cell processors together.
The SPE is the heart of the computational capability of the Cell processor. Each SPE incorporates a 128-bit wide SIMD vector processing unit, 128 128-bit general-purpose registers, a 256 KB Local Store (LS) memory, and a Memory Flow Controller (MFC) that controls the DMA transfer of data between the off-chip XDR memory and the on-chip SPE Local Store.
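The 256 KB Local Store bounds how much data an SPE can work on at once. As a rough sketch, the calculation below estimates the largest tile that fits; the 8-byte complex single-precision sample size, the 64 KB reserved for code and stack, and the four-buffer layout are all assumptions chosen for illustration, not figures from the system described here.

```python
LOCAL_STORE_BYTES = 256 * 1024   # SPE Local Store size given in the text
BYTES_PER_SAMPLE = 8             # assumption: complex single precision (2 x 4 bytes)


def max_tile_samples(reserved_bytes: int, buffers: int) -> int:
    """Largest tile (in samples) when `buffers` equally sized tile buffers
    share the Local Store left over after code, stack, and tables."""
    available = LOCAL_STORE_BYTES - reserved_bytes
    if available <= 0 or buffers <= 0:
        return 0
    return available // (buffers * BYTES_PER_SAMPLE)


# Hypothetical budget: 64 KB reserved, double-buffered input and output.
print(max_tile_samples(reserved_bytes=64 * 1024, buffers=4))  # 6144
```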
The PPE and the eight SPEs are connected together by the Element Interconnect Bus (EIB) that can transfer up to 205 GBps. The EIB also connects the processing elements to an XDR memory controller that can provide 25.6 GBps to off-chip memory, and a 20-GBps Coherent Interface Bus that permits two Cell chips to be connected for symmetric multiprocessing (SMP).
Building a Scalable System
Modern radar imaging systems require a scalable approach to accommodate the size, weight, power, and cooling constraints imposed by various airborne platforms. Configuring a hardware solution around the IBM BladeCenter offers a commercial off-the-shelf (COTS) solution to this scalability. The resulting optimized solution reduces racks of commodity servers into a single rack with a dramatic reduction in power, space, and cost, and with significantly superior performance levels not economically possible before. The system can be configured with up to 28 blades in a 25U rack or up to 42 blades in a 42U rack. External network connectivity is provided via Gigabit Ethernet interfaces or optional high-speed fabric connections, including InfiniBand and 10 Gigabit Ethernet.
The Mercury Dual Cell-Based Blade 2 (DCBB2) has two Cell BE processors operating in SMP mode with full cache and memory coherency (see figure 3). Each processor running at 3.2 GHz has 205 single-precision GFLOPS of performance in the SPE array, for a total of 410 GFLOPS on each blade. The maximum configuration, containing 42 DCBB2 boards, provides 17.2 single-precision TFLOPS of performance in a 42U cabinet, with room to spare for additional support components.
Figure 3: Mercury’s Dual Cell-Based Blade 2 is a flexible solution based on the Cell BE processor and the open BladeCenter standard for managed server chassis.
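The peak performance figures quoted for the DCBB2 can be reproduced from the SPE parameters. This sketch assumes that each SPE completes one single-precision fused multiply-add (counted as two floating-point operations) per SIMD lane per clock cycle; that counting convention is an assumption not stated in the text, while the clock rate, SPE count, vector width, and blade count are taken from it.

```python
CLOCK_GHZ = 3.2          # from the text
SPES_PER_CELL = 8        # from the text
SIMD_LANES = 128 // 32   # 128-bit vector of 32-bit floats gives 4 lanes
FLOPS_PER_LANE = 2       # assumption: one fused multiply-add per lane per cycle
CELLS_PER_BLADE = 2      # from the text
BLADES_PER_RACK = 42     # from the text

gflops_per_spe = CLOCK_GHZ * SIMD_LANES * FLOPS_PER_LANE
gflops_per_cell = gflops_per_spe * SPES_PER_CELL
gflops_per_blade = gflops_per_cell * CELLS_PER_BLADE
tflops_per_rack = gflops_per_blade * BLADES_PER_RACK / 1000

print(round(gflops_per_cell, 1), round(gflops_per_blade, 1),
      round(tflops_per_rack, 1))  # 204.8 409.6 17.2
```

These values round to the 205 GFLOPS per processor, 410 GFLOPS per blade, and 17.2 TFLOPS per cabinet cited above.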
Each Cell BE processor has a “companion chip” that adds rich functionality and high performance to complement the processor. Each companion chip incorporates a high-performance, multi-channel DMA engine with striding and list DMA support. It also includes a low-latency mailbox mechanism for intra-blade event notification between Cell processors. Each companion chip implements 24 lanes of PCI Express with a total sustained theoretical bandwidth to the blade of almost 10 GB/s simultaneously in each direction.
The DCBB2 includes 1 GB of XDR DRAM per Cell BE processor. The XDR DRAM device architecture enables the highest sustained bandwidth for multiple, interleaved randomly addressed memory transactions. In addition, two DDR2 DIMM sockets are attached to each companion chip supporting up to 1 GB DIMMs each. In total, the DCBB2 can support up to 10 GB of memory.
Input/output connectivity is provided by two PCI Express x8 interfaces connecting to a high-speed daughtercard site that can accept cards such as the InfiniBand 4X HCA expansion card, which provides dual 4X InfiniBand to the midplane. In addition, a dedicated Gigabit Ethernet controller chip is connected to the dual Gigabit Ethernet ports on the BladeCenter midplane.
Harnessing the Power of the Cell BE
Challenges to implementing a parallelized SAR algorithm on the Cell BE include efficiently partitioning and moving the data and tasks across the SPEs while maintaining high computational efficiency through efficient instruction pipelining, register use, and exploitation of the SIMD processing engines.
In order to harness the full performance potential of the Cell BE processor, developers need the help of a software framework that supports its computation model and heterogeneous, distributed-memory architecture. Mercury’s MultiCore Framework (MCF) middleware gives developers precise control of fine-grained data distribution and assignment of processing resources on a multicore processor, while relieving them of the hardware-specific details involved. It supports resource allocation and data distribution for the implementation of real-time processing goals:
- Minimizing latency
- Efficiently utilizing resources
- Overlapping computation and communication
- Controlling computational granularity
- Maximizing SPE computational efficiency
MCF provides a simplified parallel-programming model based on the Data Reorganization Interface (http://www.data-re.org/). It offers multiple levels of programming that trade off efficiency against programming simplicity. This permits implementers to focus on those parts of the algorithm with the highest value. It provides this multi-level support through high-level data channels that simplify data distribution and reconstitution, and through lower-level functions such as message queues, barriers, semaphores, and DMA instructions that permit direct transfer of data between SPE Local Store memories without going through XDR. MCF also provides functions to allocate data buffers that are suitably aligned in memory to achieve optimum DMA performance.
To test the applicability of the Cell BE to SAR, key components of SAR algorithms such as interpolation, corner-turns, and range and azimuth compression were selected for parallelization. The interpolation algorithm performs non-uniform 2D coherent interpolation that can dominate the processing cycle, so its implementation was the particular focus of this effort.
The parallelized SAR algorithms were implemented using a “function-offload” programming model, where the PPE acts as a manager directing the work of the SPEs. Sections of the algorithm were loaded into the SPEs as individual “tasks.” Input frames residing in XDR memory were divided into “tiles” and distributed to the SPEs, where they were processed using the “tile channel” construct (see figure 4). The tile channel automatically partitions and synchronizes the scatter-gather dataflow between the main XDR memory and the SPEs’ Local Store memory. Processed tiles were then gathered back to the XDR memory using a return channel. In general, there is a one-to-one relationship between the input and output tiles, which results in a simple get-tile/put-tile loop being executed on each SPE.
Figure 4: In the MultiCore Framework middleware, tile channels facilitate the communication and data movement between the manager and the workers for lower latency and efficient resource utilization.
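The manager/worker dataflow described above can be sketched with ordinary threads and queues. This is only an analogy and does not use or reproduce the MCF API: the functions `worker` and `offload` are hypothetical, Python threads stand in for SPEs, and in-memory queues stand in for the tile channel and return channel.

```python
import queue
import threading


def worker(tile_in, tile_out, kernel):
    """Worker loop: get a tile, process it, put the result (one in, one out)."""
    while True:
        item = tile_in.get()
        if item is None:          # sentinel: no more tiles for this worker
            break
        index, tile = item
        tile_out.put((index, kernel(tile)))


def offload(frame, tile_size, num_workers, kernel):
    """Manager: scatter `frame` as tiles, run workers, gather results in order."""
    tile_in, tile_out = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tile_in, tile_out, kernel))
               for _ in range(num_workers)]
    for t in threads:
        t.start()

    tiles = [frame[i:i + tile_size] for i in range(0, len(frame), tile_size)]
    for index, tile in enumerate(tiles):
        tile_in.put((index, tile))
    for _ in threads:
        tile_in.put(None)         # one sentinel per worker
    for t in threads:
        t.join()

    # All workers have finished, so every processed tile is already queued.
    results = [None] * len(tiles)
    while not tile_out.empty():
        index, processed = tile_out.get()
        results[index] = processed
    return [sample for tile in results for sample in tile]
```

For example, `offload(list(range(10)), 3, 4, lambda t: [2 * x for x in t])` scatters four tiles to four workers and returns the doubled samples in their original order. Tagging each tile with its index is one simple way to preserve the one-to-one mapping between input and output tiles when workers finish out of order.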
Parallelization of the algorithms also required the data tiles to overlap by 12.5%; this was easily accomplished using the built-in overlap feature of the tile channel. (Essentially, the parallelized version had to process 12.5% more data.) Utilization of the SPEs was maximized by minimizing the number of trips the data must take to the XDR memory.
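As an illustration of overlapped tiling, the sketch below computes tile boundaries for a one-dimensional data set. It assumes that the 12.5% overlap is expressed relative to the tile's fresh samples and is appended to the trailing edge of each tile; the text does not specify either detail, and `overlapped_tiles` is a hypothetical helper rather than part of the tile channel interface.

```python
def overlapped_tiles(length, core, overlap_fraction=0.125):
    """Return (start, stop) index pairs for tiles of `core` fresh samples.

    Each tile is extended by overlap_fraction * core samples that it shares
    with its successor. The final tile is clipped to the data length.
    """
    halo = int(core * overlap_fraction)
    return [(start, min(start + core + halo, length))
            for start in range(0, length, core)]


print(overlapped_tiles(256, 64))
# [(0, 72), (64, 136), (128, 200), (192, 256)]
```

With 64 fresh samples per tile, each interior tile reads 72 samples, which is the 12.5% of additional data noted above.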
These algorithm benchmarks were then run for various image sizes and numbers of SPEs to examine scalability. The image sizes were varied independently in both range and azimuth. The results showed a high degree of linear scalability of the algorithm both with the size of the problem and the number of SPEs dedicated. When compared to a non-parallelized algorithm implemented on a single-core 500 MHz PowerPC, the parallelized algorithm ran 42x faster and was 3x more energy efficient on the 3.2 GHz Cell processor, clearly showing the advantages of multicore processing.