The news has been slowly leaked like the plot to an upcoming summer blockbuster movie. First, there is “the teaser.” In movies, this is a 30-second preview that gives only the most basic hint of the film. In our case, this was Xilinx’s ASMBL architecture announcement that came out in December 2003. Xilinx outlined the next-generation floorplan, explaining that it would be rich in hard IP, grouped into what the company called “columns”. They also revealed that the new family would enable a number of product variants focused on different application domains. Each variant would have a different mix of hard IP optimized for a particular type of application.
As the summer approaches, the audience is treated to “the trailer,” which is a more involved preview, typically up to a couple of minutes in length, showing many of the plot elements of the final film. Virtex 4’s “trailer” came out in June, announcing details of the family and putting much of the speculation generated by the architecture announcement to rest. No, there would not be a substantial change to the LUT-based FPGA fabric architecture. No, there would not be hundreds of variations of Virtex 4 devices aimed like ASSPs at particular target applications. There would be three flavors initially: one optimized for DSP, one optimized for high-speed serial I/O and embedded processing, and one more general-purpose “classic” FPGA.
In movies, the next stage is usually a “test screening.” This is a limited showing of the final film to a selected audience to gauge reaction and to solidify plans for the eventual production launch. Virtex 4 has now completed its test screening, and Xilinx is evidently pleased with the results. Their early access customers have completed their first designs, and they’re ready to move on to the next level of deployment.
The final step before the big release is, of course, the “sneak preview.” This is a version of the show for general audiences who can’t wait to be the first to experience the film. They often wait in long lines late at night, or perhaps compete in radio station contests to snag one of the early admission passes. In FPGAs, this is the engineering sampling stage. As Xilinx announced this week, the first members of the Virtex 4 family are now at that point.
With Virtex 4 now joining Altera’s Stratix II, the 90nm flagship derby can begin in earnest. While the two companies’ strategies may sound different, the underlying technologies and target audiences are remarkably similar, with only a few serious differences of opinion separating them. This market is one place where competition definitely improves the breed. As a result, today’s high-end FPGAs are far cheaper, faster, lower power, higher density, and more capable than their predecessors.
The real advantages of this new generation aren’t just “bigger, faster, cheaper,” though. A lot of features are packed into these devices, and each generation gets smarter with their design and allocation. Picking an FPGA now is a lot like picking a new car for your family. You don’t just check the specs and pick the fastest top speed or the best gas mileage. There are a lot of decision factors that correlate to your individual needs, such as number of passengers, type of driving, how much cargo you carry, and whether you plan to tow a boat. With the advent of complex configurations of hard-IP blocks in FPGAs, the job of picking the best platform for any application is certainly more difficult than in the past, but the results will also be much more rewarding. Let’s take a look at some of the more important categories and compare Virtex 4 with the current competition and with its predecessors.
Multiplication is an expensive logic function, so a few years ago vendors started experimenting with throwing a few hardware multipliers into the mix in high-end FPGAs. While the idea sounded great on paper, these optimized blocks didn’t make much difference at first. In fact, many synthesis tools couldn’t recognize when to infer the new multipliers from the HDL code, so they’d often sit unused while slower, soft implementations of multipliers soaked up precious LUTs. Over time, however, synthesis tools improved and began to make some use of the new hardware. At about the same time, a new type of customer, the DSP designer, was attracted to FPGAs specifically by the promise of squadrons of multipliers crunching away in formation. This kind of parallelism promised to deliver DSP performance more than an order of magnitude greater than even the fastest DSP processors.
Simple multipliers don’t quite fit the bill, however. Many DSP applications work better with a dedicated multiply-accumulate (MAC). In the previous technology generation (Xilinx’s Virtex II and Altera’s Stratix), Stratix had the advantage of offering a MAC, while Virtex II offered only multipliers with the accumulate function implemented in the FPGA fabric. Virtex 4, however, rectified that deficiency and upped the ante with a sophisticated cascade/carry/control network that allows many DSP functions to be implemented using only the dedicated hardware with no LUTs.
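To make the multiply-accumulate idea concrete, here is a minimal software sketch of one FIR filter output computed as a chain of MAC steps. It is only a behavioral model: the function names `mac` and `fir_output` are illustrative assumptions, and nothing here reflects how the Virtex 4 or Stratix II blocks are actually configured or cascaded.

```python
from typing import Sequence


def mac(acc: int, a: int, b: int) -> int:
    """One multiply-accumulate step: add the product a*b to the running sum."""
    return acc + a * b


def fir_output(samples: Sequence[int], coeffs: Sequence[int]) -> int:
    """Compute a single FIR output as a chain of MAC steps (hypothetical helper).

    In hardware, each step could map to one dedicated MAC unit, with the
    accumulator passed along from unit to unit; here it is a plain loop.
    """
    if len(samples) != len(coeffs):
        raise ValueError("samples and coeffs must have the same length")
    acc = 0
    for x, c in zip(samples, coeffs):
        acc = mac(acc, x, c)
    return acc


if __name__ == "__main__":
    # 1*4 + 2*5 + 3*6 = 32
    print(fir_output([1, 2, 3], [4, 5, 6]))
```

With multiplier-only hardware, the `acc +` part of each step has to be built from general-purpose logic; a dedicated MAC absorbs it into the hard block.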
The DSP version of the Virtex 4 family (labeled SX) has 3 members with 128, 192, and 512 MACs. Altera’s Stratix II groups MACs in “DSP Blocks,” each of which contains four MACs, so the members of the family have 48, 64, 144, 192, 252, and 384 MACs. Xilinx claims that their DSP blocks run at 500MHz, while Altera rates theirs at 370MHz. Curiously, Altera claims a maximum throughput of up to 284 GMACs (384 MACs running at 370MHz), while Xilinx claims 256 GMACs (512 MACs at 500MHz). Confused? It seems the math breaks down like this: Virtex 4 has a maximum of 512 18X18 MACs at 500MHz = 256 GMACs. Stratix II is more complicated. A Stratix II block can support up to 8 9X9 MACs, so 96 blocks X 8 MACs = a maximum of 768 (9X9) MACs. At 370MHz that gives 284 GMACs. Comparing apples-to-apples, however, Stratix II’s blocks can each do 4 18X18 MACs, which yields a total theoretical throughput of 142 GMACs. Stratix II wins for flexibility, Virtex 4 for overall horsepower.
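The arithmetic above is easy to check with a few lines of Python. This is a rough sketch that assumes every MAC unit completes one operation per clock cycle at the vendor-quoted frequency; the helper name `peak_gmacs` is made up for illustration, and the inputs are the claimed figures from the text, not measurements.

```python
def peak_gmacs(mac_count: int, clock_mhz: float) -> float:
    """Theoretical peak throughput in GMACs, assuming one MAC per unit per cycle."""
    return mac_count * clock_mhz / 1000.0


# Vendor-claimed figures quoted in the article.
virtex4_18x18 = peak_gmacs(512, 500)       # 512 18x18 MACs at 500MHz
stratix2_9x9 = peak_gmacs(96 * 8, 370)     # 96 blocks x 8 9x9 MACs at 370MHz
stratix2_18x18 = peak_gmacs(96 * 4, 370)   # 96 blocks x 4 18x18 MACs at 370MHz

if __name__ == "__main__":
    print(f"Virtex 4, 18x18:   {virtex4_18x18:.0f} GMACs")
    print(f"Stratix II, 9x9:   {stratix2_9x9:.0f} GMACs")
    print(f"Stratix II, 18x18: {stratix2_18x18:.0f} GMACs")
```

The two headline numbers differ mainly because they count multipliers of different widths, which is why the 18X18 comparison is the fairer one.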
Which one will win a race on your design? There’s no way to tell without taking them for a test drive. Either family will more than double the DSP throughput of the previous generation in a much lower-cost device. Regardless of GMAC claims, both families will also turn in performance significantly faster than the most expensive DSP processors. This is good news for DSP design with FPGAs. In any case, the actual results seen by DSP designers are likely to be more a function of the design methods and tools used than of the subtle differences between these architectures. An upcoming feature article will look at these DSP tools in detail.
One of the hottest new features to land in FPGA country is high-speed serial I/O. The challenge for vendors in providing this functionality boils down to standards and performance. A vast number of standards have emerged, and it seems that more are arriving every day. A winning solution will support the most useful set of those standards while simultaneously achieving the highest (and most scalable) throughput. Virtex 4 has improved on its predecessor by offering a range of throughputs from 11.1 Gbps down to 622 Mbps, pushing the envelope on both ends.
On the high end of throughput, Xilinx has been touring the world demonstrating 10Gbps working over actual backplanes to prove that the performance is real and can be achieved with today’s devices. Altera has yet to announce a “GX” (their label for Stratix SERDES-equipped devices) version of the Stratix II family, so at the moment Virtex 4 is left to compete with Altera’s previous generation Stratix GX.
Virtex 4 has an impressive SERDES capability with support for a plethora of protocols and a wide and scalable range of bandwidths. The programmable RocketIO transceivers enable and simplify the design of complex switching applications such as multi-rate line cards. For more conventional connectivity, Virtex 4 also offers up to four integrated Tri-Mode EMACs, which provide 10/100/1000 Ethernet right on your FPGA.
We could write several articles, or maybe a book, on embedded processing in FPGAs (and we certainly intend to do the former). For now, suffice it to say that the addition of high-performance processors to FPGAs created the most compelling embedded computing platform available on the market today. We are now in at least the second generation of processor-laden programmable logic, and the lessons of the past are clearly informing today’s designs. Virtex 4 offers a selection of three different processor options: the 8-bit soft-core PicoBlaze, the 32-bit (also soft-core) MicroBlaze, and the 32-bit hard-core PowerPC. The PowerPC appears only in the “FX” version of the Virtex 4 family, while the PicoBlaze and MicroBlaze processors can be implemented in any Virtex 4 device.
Looking at the competitive picture, Altera’s Stratix II family is compatible with their highly successful Nios and Nios II 32-bit soft-cores. This is one area where the two companies’ strategies diverge somewhat. Altera has put tremendous development and marketing emphasis on their Nios line and achieved an early market lead in the soft-core game. Xilinx, meanwhile, has worked to bring a range of processors (both hard and soft) to the table and is now starting to get more traction in the hard-core space with their PowerPC.
For raw single-processor performance, Virtex 4’s PowerPC gets the nod. It gives up flexibility to the soft-core processors but excels in straight-ahead performance. To boost performance even more, Xilinx offers a new auxiliary processing unit (APU) that makes it easy (or at least easier) to connect co-processors or hardware accelerators to the PowerPC. Judicious use of the APU can give a dramatic speed boost to the already speedy 450MHz PowerPC 405. The FX series of Virtex 4 devices has 6 members with varying densities, and the largest 4 all have dual PowerPC cores.
On the small end of the processor scale, Xilinx sees huge demand for its 8-bit PicoBlaze soft-core processor. The PicoBlaze was one of the first soft processors to hit the FPGA market, and, interestingly, it started life as an application note. The app-note processor apparently gained a steadily growing following until Xilinx decided to formalize it and market it as an official product.
Again, in the processor game, it’s the tools that matter most. We’ll have a future article series looking exclusively at the tools for embedded system design with FPGAs.
With all these processors, serial I/O, and DSP going on, you’re going to need a lot of memory. The Virtex 4 family has internal block RAM (that can be configured as RAM or FIFOs) ranging from 864Kb on the smallest “LX” device to 9,936Kb on the largest “FX” model. The internal RAM in each family has been chosen to balance the target applications for that group, keeping in mind buffering, software/firmware, and temporary storage requirements.
Looking over the competitive fence, Altera’s Stratix II family offers a range of on-chip memory from 419Kb to 9,383Kb over its 6 sizes. Stratix II’s memory blocks are an assorted mix of sizes and configurations, including 512-bit blocks, 4Kb blocks, and 512Kb blocks.
Connecting to external memory is also a critical challenge for FPGAs, given the demanding specifications of the latest-generation RAMs. To facilitate external memory interface, Xilinx now offers dedicated circuitry in the FPGA to ease data capture and frequency division, and a development board and kit aimed specifically at that task. The development board allows various memory interfaces to be tested in hardware using your actual design on the FPGA. Given the number and variety of RAM interfaces now on the market, this should prove to be a very popular development tool.
While less and less marketing emphasis is placed on the LUT fabric itself, the architecture and performance of the basic logic blocks is still improving, generation after generation. Virtex 4 has taken advantage of smaller geometry, flip-chip configuration, and the lessons of previous generations to weigh in with their best performance and flexibility ever, including better signal integrity, improved clocking, and a host of other advantages.
Altera’s Stratix-II introduced a completely revised variable-width LUT structure that marked a radical departure from their previous generations. At the moment, we have no benchmark data to show us how the two architectures perform for various design types, and the results are likely to depend heavily on how well the synthesis and place-and-route software is able to take advantage of the features of each.
Everyone expected 90nm to be faster than 130nm, and Virtex 4 bears out that assumption. Xilinx claims that the Virtex 4 fabric can clock at 500MHz, and, given the array of hard-core IP available, total system performance is likely to be many times better than with their previous Virtex II/Pro family. Competitively, Altera also claims 500MHz for their Stratix II offering. Our hunch is that few applications on either platform will achieve this operating frequency, as there are a number of variables that conspire to de-rate the data rate. You can be sure that your designs will run faster with this generation, however, probably by a factor of at least two.
What everyone did not expect at 90nm was better power consumption. Industry lore and speculation ventured that 90nm devices would be unusable power hogs. As usual, however, these harbingers of doom were shortsighted. A number of technical advances led to a significant improvement in power efficiency and held off the demons of leakage current for at least another process node. Among the innovations being disclosed by Xilinx is a new “triple-oxide” technology where transistors can have one, two, or three times the standard oxide thickness, depending on the speed demands placed upon them. Configuration transistors, for example, do not operate at high frequency and can benefit substantially from leakage-reducing triple-thick oxide. As configuration transistors are thought to be in a majority on the device, the power saving potential is obvious.
Altera’s Stratix II line uses a low-K dielectric process as a dynamic power reduction strategy. While the benefits of the two process optimizations are different (dynamic power reduction for low-K versus static power reduction for triple oxide), the overall goal is similar. What is not obvious is the “behind the scenes” effect on yield, which could manifest itself in delayed delivery schedules or higher costs if either vendor’s fabrication partners stumble. Industry-wide, low-K has more mileage than triple-oxide, but it’s too early to tell the actual impact of either process choice in FPGAs.
Overall, Xilinx claims that Virtex 4 is more than 50% more power efficient than their previous generation, with 40% lower quiescent power and 50% better dynamic power. Additionally, any functions that can now be implemented using the available hard IP will be inherently more power efficient than those same functions implemented in LUT fabric in previous generations.
How will the battle of the 90nm flagships play out? It’s too soon to tell, really. One can’t know at the preview stage whether a movie will flop like “Ishtar” or set sales records like “Lord of the Rings”. Only the first few weeks’ ticket sales can provide the answer. As with last week’s article, if you have experience with either of these device families, we’d love to hear from you. One thing we suspect we’ll hear is what our FPGA market survey has already told us: the most important factor in choosing an FPGA device is previous experience with the vendor. Overall, though, with the lower cost, higher performance, and incredible IP wealth of these platforms, we’re betting on four-star reviews.