Building on commodity hardware and industry-standard software programming methods, a Baylor group architects a $10,000 reconfigurable computing cluster.
Accelerated computing – using programmable logic and other non-traditional processing resources to augment clusters – has become increasingly popular. Most recently, the NSF announced that $20,000,000 in funding is available for research groups to push beyond petascale computing. Accelerated computing represents the leading edge of the high-performance computing wave, as evidenced by the world’s fastest supercomputer, the Roadrunner cluster located at Los Alamos National Laboratory. Roadrunner couples commodity processors with IBM Cell processors, making it a hybrid, accelerated computing system. Field Programmable Gate Array (FPGA) devices also show great promise for accelerated computing, which is why a group of students at Baylor is architecting an FPGA-based reconfigurable cluster system. Unlike the world’s top-500 supercomputers, however, the Baylor system uses technology one generation removed from the cutting edge and costs under $10,000 in total. This approach promises fast bring-up from existing, proven technologies at a much lower cost. Advances in FPGA programming tools also provide access to millions of lines of legacy C code that can be readily refactored for FPGA-based parallel processing.
Algorithms with significant non-sequential processing elements, large I/O requirements, and less need for closely coupled memories are good candidates for FPGA-based reconfigurable computing (RC). In FPGA RC configurations, the underlying programmable hardware can be optimized for specific algorithm types and geometries. For this reason, high-end, hardware-accelerated servers and clusters are now being offered by companies including SGI and Cray. But what can the poor graduate student do in the meantime? Existing RC systems are also frequently hardware dependent: they are built from very specific components and often feature programming models that are neither easily accessible nor easily portable. Clusters, by contrast, are attractive because they are scalable and can grow as needed.
Baylor’s Department of Electrical and Computer Engineering began research in the area of reconfigurable computing using an SRC-6e Reconfigurable Computer located at the Naval Postgraduate School in Monterey, California. The SRC-6e is built by SRC Computers, Inc. (http://www.srccomp.com/). Its Carte C language provides an untimed C programming environment that we found to be easy to learn and use. However, the SRC-6e was an expensive and proprietary solution. The hardware platform was not under our control and could not be expanded easily. As our interest in reconfigurable computing grew, we decided to build our own low-cost reconfigurable computing platform to serve as a vehicle for learning and research.
The first step in the process was to set design goals for Baylor’s system. The goals that we developed were:
- The system should be powerful enough and versatile enough to support a wide range of useful research projects.
- The system should be as inexpensive and non-proprietary as possible to make it affordable for us and easier for others to mimic.
- The system should be scalable and support easy upgrades as Moore’s Law results in technological advances.
- The system should be “user friendly”. At a minimum it should allow students in both the computer engineering and computer science fields to perform research with a minimal learning curve. Preferably, it should support ease of use in other fields, such as the physical sciences and others. This is particularly important as Baylor’s School of Engineering and Computer Science does not currently have a PhD program. This means that Master’s students must be able to get up to speed quickly with little support from more experienced students.
- The system should support easy porting of existing applications.
After exploring multiple potential system architectures, we decided to base our platform on the cluster computer model popularized by the Beowulf cluster concept (http://www.beowulf.org/). This choice was made as Beowulf clusters have proven to support low-cost, scalable, high performance computing applications for over ten years. They also enjoy widespread use in academia. Given this decision, we needed to determine how to add the “reconfigurable” component to the cluster. To address this issue we needed to select both the FPGA hardware and the programming environment.
The Beowulf philosophy of using low-cost commodity computing components led us to select standard FPGA development boards as the platform of choice. These boards are typically mass produced by FPGA vendors to allow developers to gain experience with their latest FPGAs. They are often sold at or below the bill-of-materials cost. They typically contain FPGAs with embedded processors, a variety of memory and interface options, and a high-speed Ethernet interface, all of which are needed for our application. We chose the Xilinx University Program (XUP) Virtex-II Pro development board as our reconfigurable computing node. The board is low cost and ubiquitous, so other universities can easily duplicate our configuration and research. When we purchased these boards, the FPGAs represented the state of the art.
The board is versatile enough to support many potential research projects. The FPGA contains two PowerPC processors that can be used to run the operating system and application software. The board also supports up to 2 GB of Double Data Rate (DDR) SDRAM and various Flash ROM options. The board supports multiple interfaces including low-speed parallel, high-speed differential parallel, and multi-gigabit serial interfaces that utilize low cost IDE and SATA cables. These can be used to explore alternatives for directly connecting multiple development boards together. We plan to upgrade to newer FPGA development boards periodically in the future. As later noted, the Impulse C programming tools we selected are intrinsically device independent so migrating to Virtex-4 or -5 boards will be an incremental effort rather than a restart.
The XUP boards are the computing nodes of the cluster. The remaining hardware components, shown in Figures 1 and 2, are typical for a cluster computer. They include:
- A Mini-ITX form factor PC used as the head node of the cluster
- D-Link Ethernet Broadband Router DL-704uP
- Netgear 24-port 10/100/1000 Mbit/s Gigabit Ethernet Switch JGS524
- Lambda HK150A-5/A Power Supplies
- Off-the-shelf cables and a custom rack that can hold all of the equipment and up to 16 XUP boards.
Figure 1. Reconfigurable Cluster Components
Figure 2. Block Diagram
The resulting cluster is homogeneous, with an identical architecture at each node. There is also an argument for heterogeneous computing, in which reconfigurable clusters become part of a larger grid composed of standard microprocessors. In this configuration, the reconfigurable cluster would act as a configurable resource on the grid.
While one group assembled the hardware, other researchers worked on developing the software environment and test applications for the cluster. The software environment consists of three major pieces:
- The operating system for the PowerPC processors
- Software to support normal parallel programming
- Software to translate a high level language to hardware to take advantage of the reconfigurable circuitry of the FPGAs
We are currently working with both QNX (http://www.qnx.com/) and MontaVista Linux (http://www.mvista.com/) for the operating system. We plan to use a minimal version of MPI as our initial parallel programming environment. After experimenting with various system design languages, we have chosen the Impulse C language (http://www.impulsec.com/) to develop our reconfigurable applications. Our choices for the operating system and MPI are based on both our familiarity with them and the fact that they are used by a wide variety of standard clusters. We chose Impulse C for several reasons:
- It is implemented as a set of libraries that works with any ANSI-C development environment, minimizing the learning curve and maximizing portability
- It is based on the Communicating Sequential Processes (CSP) model, which is the same model used by MPI
- It uses untimed C, which is more natural for programmers from all disciplines
- It is not targeted to proprietary systems, but it does come with support for many platforms and interfaces. In the case of Xilinx, it supports FSL, OPB and PLB interfaces through a drop-down menu selection.
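To make the process-and-stream style of programming concrete, here is a minimal sketch written in Python rather than Impulse C. It does not use the Impulse C API: the names `producer`, `scale_stage`, `consumer`, and `run_pipeline`, the `END` sentinel, and the gain operation are all hypothetical stand-ins chosen for illustration. Each thread plays the role of an independent process, and each bounded queue plays the role of a stream (FIFO) connecting two processes.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker, not an Impulse C construct


def producer(samples, out_stream):
    """Write each input sample to the stream, then signal end-of-stream."""
    for sample in samples:
        out_stream.put(sample)
    out_stream.put(END)


def scale_stage(in_stream, out_stream, gain):
    """Stand-in for a compute process; it only talks to its two streams."""
    while True:
        sample = in_stream.get()
        if sample is END:
            out_stream.put(END)  # propagate end-of-stream downstream
            return
        out_stream.put(sample * gain)


def consumer(in_stream, results):
    """Collect processed samples until the stream closes."""
    while True:
        sample = in_stream.get()
        if sample is END:
            return
        results.append(sample)


def run_pipeline(samples, gain=2):
    """Connect three processes with two bounded streams and run them."""
    a = queue.Queue(maxsize=4)  # bounded, loosely like a hardware FIFO
    b = queue.Queue(maxsize=4)
    results = []
    threads = [
        threading.Thread(target=producer, args=(samples, a)),
        threading.Thread(target=scale_stage, args=(a, b, gain)),
        threading.Thread(target=consumer, args=(b, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The processes share no state and interact only through blocking reads and writes on streams, which is the property that lets a tool map some of them to hardware and leave others in software. The sketch models only that communication pattern, not FPGA timing or resource use.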
Results from initial applications are very encouraging. To test the performance of the cluster, we implemented a sonar application that we had previously run on the SRC-6e computer. On the SRC-6e, we used a mix of Carte C code and hand-coded VHDL to obtain a 65x speedup over a 1.8 GHz Pentium processor. Using hand-coded VHDL, we were able to match this speedup with just one XUP board. Using three boards nearly tripled the speedup.
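The scaling arithmetic behind these figures can be sketched as follows. The single-board speedup of 65 comes from the measurements above; the exact three-board figure is not reported here, so the value of 185 used in the example is purely hypothetical and stands in for "nearly tripled". The function names are likewise illustrative.

```python
def ideal_speedup(single_board_speedup, n_boards):
    """Speedup expected if adding boards scaled perfectly linearly."""
    if single_board_speedup <= 0 or n_boards < 1:
        raise ValueError("speedup must be positive and n_boards at least 1")
    return single_board_speedup * n_boards


def parallel_efficiency(single_board_speedup, measured_speedup, n_boards):
    """Fraction of ideal linear scaling actually achieved on n_boards."""
    return measured_speedup / ideal_speedup(single_board_speedup, n_boards)


SINGLE_BOARD_SPEEDUP = 65          # reported: one XUP board vs. 1.8 GHz Pentium
HYPOTHETICAL_THREE_BOARD = 185.0   # assumption for illustration, not measured data

if __name__ == "__main__":
    print("ideal 3-board speedup:", ideal_speedup(SINGLE_BOARD_SPEEDUP, 3))
    eff = parallel_efficiency(SINGLE_BOARD_SPEEDUP, HYPOTHETICAL_THREE_BOARD, 3)
    print("efficiency under the assumed figure: %.2f" % eff)
```

Perfect scaling on three boards would give a speedup of 195 over the Pentium baseline, so "nearly tripled" corresponds to an efficiency somewhat below 1.0.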
Preliminary results with Impulse C indicate that we should be able to obtain the same speedup, although, pending refinement, we are using more boards to do so. Currently the application is split between two boards, and we use a combination of hardware and software to communicate between them. This communication method will need to change to achieve the desired speedup: the inter-board communication will need to be implemented at the hardware level. Impulse C supports this by allowing the user to define custom interfaces, and we are in the process of implementing several communication methods in VHDL. Once these interfaces have been defined, the rest of the application, and other similar applications, should require no additional VHDL knowledge.
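A toy model illustrates why the inter-board link matters so much. It assumes the work divides evenly across boards and that each of the hand-offs between neighboring boards costs a fixed time that is not overlapped with computation. Both assumptions, the function name, and all of the numbers are hypothetical; none of them are measurements from the cluster.

```python
def multi_board_speedup(compute_s, n_boards, link_s):
    """Speedup over one board under a simple, non-overlapped cost model.

    compute_s: time for one board to do all the work (seconds)
    n_boards:  number of boards sharing the work evenly
    link_s:    cost of each of the (n_boards - 1) board-to-board hand-offs
    """
    if compute_s <= 0 or n_boards < 1 or link_s < 0:
        raise ValueError("invalid model parameters")
    parallel_time = compute_s / n_boards + link_s * (n_boards - 1)
    return compute_s / parallel_time


if __name__ == "__main__":
    # Hypothetical figures: a slow software-assisted link vs. a fast hardware link.
    for label, link_s in (("software-assisted link", 0.25), ("hardware link", 0.01)):
        print(label, round(multi_board_speedup(1.0, 2, link_s), 2))
```

Under this model, a second board pays off only when the hand-off cost is small relative to the compute time saved, which is consistent with the plan to move inter-board communication into hardware.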
The reconfigurable computing cluster was assembled for a cost of less than $10,000. Initial testing demonstrated that for applications with significant non-sequential processing elements, large I/O requirements, and less need for closely coupled memories, the cluster can provide performance matching or exceeding that obtained from a much more expensive commercial reconfigurable computer. As more hardware and software components are developed for the cluster, we anticipate that it will become easier to port existing parallel programs to the cluster and take advantage of the significant increases in computing power provided by the FPGA computing elements.
About the Author
Associate Professor Russ Duren of Baylor University. Russell W. Duren (S’76–M’78–SM’96) received the B.S. degree in electrical engineering from the University of Oklahoma, Norman, in 1978 and the M.S. and Ph.D. degrees in electrical engineering from Southern Methodist University, Dallas, TX, in 1985 and 1991, respectively. He spent 17 years in industry. The majority of this time was spent designing avionics at the Lockheed Martin Aeronautics Company, Fort Worth, TX. After that, he spent seven years teaching and performing research in the fields of avionics and reconfigurable computing at the Naval Postgraduate School, Monterey, CA. Currently, he is an Associate Professor in the Department of Electrical and Computer Engineering, Baylor University, Waco, TX. He is the author of over 30 publications. His research interests include avionics, embedded systems, FPGA digital design, and reconfigurable computing. Dr. Duren is the recipient of the 1991 Frederick E. Terman Award for Outstanding Electrical Engineering Graduate Student from Southern Methodist University, the 1991 Myril B. Reed Outstanding Paper Award from the 34th IEEE Midwest Symposium on Circuits and Systems, and the 2002 Naval Postgraduate School Award for Outstanding Instructional Performance.