
Cray Goes FPGA

Algorithm Acceleration in the New XD1

When I was in college, I knew the future of supercomputing. The supercomputers of the 21st century would be massive, gleaming masterpieces of technology. They would not be installed into buildings, but rather buildings would be designed and constructed around them – particularly to house the cooling systems. The design specifics were fuzzy, but I was reasonably sure that very low temperatures would be involved for either superconducting connectivity, SQUIDs, or Josephson junction-esque switching. Silicon would certainly have been long abandoned in favor of gallium arsenide or some even more exotic semiconductor material. I believed that Cray, Inc., as the preeminent developer of supercomputers, would be able to leverage these techniques to gain perhaps a full order of magnitude of computing performance over the machines of the day.

A few years later, when Xilinx rolled out their first FPGAs, I could see the future of that technology as well. FPGAs would act as a sort of system-level silicon super-glue, sitting at the periphery of the circuit board and stitching together incompatible protocols. With the simple addition of an FPGA, anything could be made to connect to anything else, and programmability ensured that we could adapt on the fly and change our design to leverage any new, improved component without having to abandon the rest of our legacy design.

As I gazed into my crystal ball (looking way out past the distorted reflection of my feathered hair, Lacoste polo shirt, and Wayfarer sunglasses), I could not envision any connection between these two seemingly unrelated technology tracks. Supercomputers would be designed and built from the ground up, using carefully matched and optimized homogeneous components, while FPGAs would be the duct tape of electronics design, helping to hold together aging multi-generational systems for a few more years of life in the field before they were retired altogether. In my crystal ball, the two paths were obviously diverging.

I was right about the Cray part.

The Cray XD1, one of the latest innovations from the world’s best-known supercomputer manufacturer, leverages Xilinx’s FPGA technology to provide massive algorithm acceleration through hardware-based implementation of compute-intensive algorithmic tasks. While we in the editorial community were idly debating whether FPGAs might be useful as reconfigurable computing engines after all, Cray was busy at work back in the lab building the thing. “We are continually researching new ways to gain greater application performance for our customers,” says Geert Wenes, business manager responsible for emerging markets at Cray. “With the Cray XD1 direct connect architecture combined with the new generation of FPGAs, we saw an opportunity to gain orders of magnitude speed-up for some of our customers’ most challenging applications. Applications that are highly parallel on a fine-grained level and spend much of their computation time on integer and fixed-point calculations, such as adaptive optics simulations, seismic imaging, or even molecular docking applications in life sciences, stand to gain 10 times or more overall application performance improvement with FPGA application acceleration. In many cases, such speed-ups are necessary to make the application a viable one for our customers.”

“Our alliance with Cray was a natural fit for Xilinx,” said Sandeep Vij, vice president of worldwide marketing at Xilinx. “Both companies have established technical leadership in our respective markets, and we share the same fundamental values of providing customers with leading-edge products and unprecedented service. We were extremely impressed with the technical prowess of the people at Cray. This was one of those rare instances where collaboration with a customer directly benefited our own product.”

The XD1 architecture takes advantage of what Cray calls the “RapidArray Interconnect” to couple Xilinx Virtex-II Pro devices directly to the AMD Opteron processors in each blade through a 3.2 GB/s bi-directional connection. “By treating the FPGAs as integral system components rather than peripherals and linking them directly to the processors through high-speed connections, we were able to remove one of the biggest bottlenecks to FPGA co-processing, which is the loss of speed transferring data to and from the co-processor,” Wenes continues. “You also need Linux-like commands for administering the FPGA-based computer. We’ve developed a set of about twenty commands that manage, monitor, and check the FPGA.”

The XD1 also includes a copious 16 MB of 12.8 GB/s quad-data-rate (QDR) SRAM cache connected to both the Opteron and the FPGA to keep both ends of the processor/co-processor pipe as busy as possible. The cache is memory-mapped into the Opteron’s user space, so the application running on the AMD processor can access the FPGA’s cache at speed.
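To put those figures in perspective, here is a rough back-of-envelope sketch of how long it would take to move one full cache's worth of data at each of the quoted rates. It assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and treats the quoted numbers as sustained one-way throughput, neither of which the vendor figures confirm; the function and constant names are ours.

```python
# Back-of-envelope transfer times for the figures quoted in the article.
# Assumptions (not from Cray): decimal units, and the quoted rates are
# sustained one-way throughput with no protocol overhead.

CACHE_BYTES = 16 * 10**6          # 16 MB of QDR SRAM per FPGA
SRAM_BYTES_PER_S = 12.8 * 10**9   # quoted SRAM bandwidth
LINK_BYTES_PER_S = 3.2 * 10**9    # quoted processor-to-FPGA link rate


def transfer_ms(num_bytes, bytes_per_s):
    """Return the time in milliseconds to move num_bytes at bytes_per_s."""
    return num_bytes / bytes_per_s * 1000.0


if __name__ == "__main__":
    print(f"fill cache from SRAM side: {transfer_ms(CACHE_BYTES, SRAM_BYTES_PER_S):.2f} ms")
    print(f"fill cache over the link:  {transfer_ms(CACHE_BYTES, LINK_BYTES_PER_S):.2f} ms")
```

Under these assumptions the SRAM can be swept in a little over a millisecond, while the processor link needs about four times as long, which suggests why keeping working data next to the FPGA matters.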

Viewed from the top level, the XD1 architecture consists of a backplane/chassis that accommodates six blades. Each blade includes two 64-bit AMD Opteron 200 series processors and a single Virtex-II Pro XC2VP50-7 with 16 MB of QDR SRAM attached. Each Virtex-II Pro can be accessed by any Opteron in the cluster, offering maximum flexibility for algorithm acceleration. The whole thing is coupled to the outside world with big pipes, including four PCI-X slots that can take dual-port Gigabit Ethernet cards or dual-port Fibre Channel HBAs. It’s all running under the control of Cray’s HPC-enhanced version of Linux.

From a hardware perspective, the system has incredible performance potential, which is perhaps limited at this time primarily by the rather Rube-Goldbergian requirements for taking advantage of the huge performance boost possible with FPGA-based acceleration. The soft-focus vision of the future is seamless compilation of high-level algorithmic code into an optimized mix of sequential software and parallelized hardware accelerators. Today, however, the super-advanced tool capability required for that vision is not yet in place. What we have instead is an awesome hardware platform that requires a significant time investment to fully harness. Algorithms must be carefully analyzed by experts and their innermost compute-intensive loops carved out for potential FPGA implementation. These chunks must then be tackled by hardware-savvy engineers who can either create a suitable hardware architecture in RTL or leverage one of the fledgling technologies for converting algorithmic descriptions (such as those written in C or C++) into hardware architectures.

To address these issues, Cray is working with companies like Xilinx to provide development tools that fill the gap, and is promoting the idea of reusable IP for common algorithm acceleration. At the same time, they’re also tracking the advances of algorithmic synthesis technologies such as those offered by Star Bridge Systems, Celoxica, and Mentor Graphics for compiling software directly into optimized hardware architectures. While algorithm compilation into optimized hardware is the most significant engineering challenge posed by this architecture, the potential gain is so large that several companies are actively developing technology to address the issues.

But who needs all this computing performance anyway? After all, the computing speed most of us always dreamed about is apparently available in the laptops we carry around to keep up with e-mail. For some applications, however, the pace of Moore’s Law on commodity machines simply isn’t getting the job done. Seismic imaging customers, for example, typically employ vast arrays of Linux boxes flying in formation to seek out subtle patterns in voluminous sensor data. An accelerated high-performance computing (HPC) technology like the XD1 can have a profound impact on the cost of processing the huge amount of data they gather in trying to generate images and models of the sub-surface world. With the FPGA-based acceleration in the XD1, they can obtain a 40X-50X improvement over conventional processors for certain algorithms.
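The gap between a 40X-50X kernel speed-up and the "10 times or more" overall application gain quoted earlier is what Amdahl's law would predict: only the part of the run time that moves into the FPGA gets faster. The sketch below is our own illustration, not a Cray model. It ignores data-transfer overhead, and the function names are hypothetical.

```python
# Amdahl's law applied to FPGA offload. This is an illustrative model only:
# it assumes the accelerated fraction speeds up uniformly and that moving
# data to and from the FPGA costs nothing.

def overall_speedup(fraction, kernel_speedup):
    """Whole-application speed-up when `fraction` of run time is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def required_fraction(target, kernel_speedup):
    """Fraction of run time that must be offloaded to reach `target` overall.

    Obtained by solving 1 / ((1 - p) + p / s) = target for p.
    """
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / kernel_speedup)


if __name__ == "__main__":
    for s in (40, 50):
        p = required_fraction(10, s)
        print(f"kernel {s}X: offload {p:.1%} of run time for 10X overall")
    print(f"half the run time at 50X: {overall_speedup(0.5, 50):.2f}X overall")
```

In this simplified model, a 10X overall gain from a 50X kernel requires roughly 92 percent of the original run time to sit in the accelerated loops, which is why the careful carving-out of innermost loops matters so much.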

In life sciences, biomedical applications like DNA sequence alignment are extremely compute-intensive and have algorithms that are well suited to hardware acceleration. As these traditional research areas intersect high-commercial-value domains like drug discovery, a large market may be created for commercial applications of supercomputing. In a more universal sense, even relatively routine tasks like random number generation can be accelerated to great benefit in many simulation and modeling applications.
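Random number generation is a handy illustration of why some tasks map so naturally onto FPGA fabric. A linear-feedback shift register (LFSR) is a textbook pseudo-random bit source that, in hardware, amounts to a shift register and a few XOR gates producing one bit per clock. The Python model below is a software sketch of a standard 16-bit Fibonacci LFSR with taps at bits 16, 14, 13, and 11; it is our example, not anything described by Cray, and production simulations would need a much stronger generator.

```python
# Software model of a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11).
# Illustrative only: chosen because the same logic is a handful of flip-flops
# and XOR gates in hardware, not because the XD1 is known to use it.

def lfsr16_step(state):
    """Advance the register by one shift and return the new 16-bit state."""
    # XOR the tapped bits to form the bit that is fed back into the top.
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def lfsr16_period(seed=0xACE1):
    """Count the steps until the register returns to its (nonzero) seed."""
    state = lfsr16_step(seed)
    count = 1
    while state != seed:
        state = lfsr16_step(state)
        count += 1
    return count


if __name__ == "__main__":
    print(f"first state after seed: {lfsr16_step(0xACE1):#06x}")
    print(f"period: {lfsr16_period()}")
```

An FPGA could instantiate many such registers side by side and clock them all at once, which is the kind of fine-grained integer parallelism a sequential processor has to emulate one step at a time.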

It is clear that, to date, the potential of FPGAs as reconfigurable computing enablers has barely been touched. High-performance hardware implementations like Cray’s XD1 open the door for development of breakthrough synthesis and compilation technology that will make algorithm acceleration a routine and seamless process, much like high-level language compilation is today. If that eventually comes to pass, we may forget what a pure von Neumann architecture computer even looks like as we enter a new era of performance with programmable logic acceleration as a key component of the computer of the future.

But what about superconductivity and exotic materials? Do you feel a secret longing for the omniscient aesthetic of Dr. Forbin’s Colossus? If it will help, you can always install a soft-serve machine behind your XD1 to approximate the sounds of a high-powered cooling system. Also, since the XD1 doesn’t have the seven-, eight-, or nine-digit price tag we expected in our fantasy supercomputer, you’ll have plenty of budget left over to adorn yours with some snappy-looking blinking lights and a really nice monitor.
