Toward Ten TeraFLOPS

Altera Kicks Up Floating Point

The Cray-2, the world’s fastest computer until about 1990, was capable of almost 2 GigaFLOPS (Billion Floating Point Operations per Second) – at an inflation-adjusted price of over $30 million. A decade later, ASCI Red – selling for a cool $70 million or so – topped one teraFLOPS (Trillion Floating Point Operations per Second). The machine was twice as expensive, but the price per performance had dropped from ~$15M/GFLOPS (Cray) to ~$70K/GFLOPS (ASCI Red). That’s a shocking improvement. Moore’s Law would have us believe in a ~32x gain over the course of a decade, but real-world supercomputers delivered over 200x in just ten years. Take that, Dr Moore!

Sometime in 2015, according to Altera, we will have a single FPGA (yep, that’s right, one chip) – designed by Altera and manufactured by Intel – capable of approximately TEN teraFLOPS. Let’s do some math on that, shall we? We don’t know exactly what a Stratix 10 FPGA will cost, but it almost doesn’t matter. This device should put us in the realm of $1/GFLOPS. Or, compared to ASCI Red, an additional 70,000x improvement in cost per performance. Compared to the 1990-era Cray-2 (a quarter century earlier), that’s a 15,000,000x improvement – in a time span when an optimistic interpretation of Moore’s Law says we should have less than 10,000x improvement. This is all very fuzzy math, but it appears that high-performance computing will have outpaced Moore’s Law by some 1,500x since 1990.

Whoa!
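For anyone who wants to check the fuzzy math, here is a minimal sketch of the arithmetic using only the round figures quoted above. The two-year doubling period for Moore's Law is an assumption (it is what the "~32x per decade" figure implies), the $1/GFLOPS Stratix 10 number is the article's guess rather than a published price, and all variable names are made up for illustration.

```python
# Rough figures quoted in the article; order-of-magnitude estimates only.
CRAY2_PRICE_USD = 30e6            # inflation-adjusted, "over $30 million"
CRAY2_GFLOPS = 2.0                # "almost 2 GigaFLOPS"
ASCI_RED_PRICE_USD = 70e6         # "$70 million or so"
ASCI_RED_GFLOPS = 1000.0          # one teraFLOPS
STRATIX10_USD_PER_GFLOPS = 1.0    # the article's guess, not a real price


def moore_gain(years, doubling_period_years=2.0):
    """Expected improvement if capability doubles every doubling_period_years.

    The two-year period is an assumption, not a figure from Altera.
    """
    return 2.0 ** (years / doubling_period_years)


cray2_usd_per_gflops = CRAY2_PRICE_USD / CRAY2_GFLOPS
asci_red_usd_per_gflops = ASCI_RED_PRICE_USD / ASCI_RED_GFLOPS

# Improvement in cost per performance between machines.
cray_to_asci = cray2_usd_per_gflops / asci_red_usd_per_gflops
asci_to_stratix = asci_red_usd_per_gflops / STRATIX10_USD_PER_GFLOPS
cray_to_stratix = cray2_usd_per_gflops / STRATIX10_USD_PER_GFLOPS

# How far ahead of the article's "optimistic" 10,000x Moore's Law bound.
outpaced_by = cray_to_stratix / 10_000

print(f"Cray-2:     ${cray2_usd_per_gflops:,.0f}/GFLOPS")
print(f"ASCI Red:   ${asci_red_usd_per_gflops:,.0f}/GFLOPS")
print(f"Cray-2 -> ASCI Red:   {cray_to_asci:,.0f}x (Moore, 10 yr: {moore_gain(10):,.0f}x)")
print(f"ASCI Red -> Stratix:  {asci_to_stratix:,.0f}x")
print(f"Cray-2 -> Stratix:    {cray_to_stratix:,.0f}x (Moore, 25 yr: {moore_gain(25):,.0f}x)")
print(f"Ahead of the 10,000x bound by {outpaced_by:,.0f}x")
```

Changing the doubling period to 18 months moves the Moore's Law figures substantially, which is a good reminder of just how fuzzy this comparison is.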

Now, before you all pull out your slide-rules and start shouting about everything from our underlying assumptions to Altera’s marketing “techniques,” let’s see what’s changed to make that possible. We all know that the underlying technology – semiconductors – has tracked pretty straight with Moore’s Law (if you live in a bizarre logarithmic land where you count a 50-year exponential as “straight”). That means our computing hardware has made some serious gains in places other than the number of transistors packed onto a single die.

What kinds of engineering innovation give us this extra three orders of magnitude of “goodness”? The case we’re examining – the most recent innovation announced just this week – is Altera’s hardening of their floating point arithmetic units. IEEE 754 Single Precision Floating Point is now fully supported in optimized hardware – in the DSP blocks of both the current Arria 10 and the upcoming Stratix 10 FPGAs. This brings a major performance boost to floating point applications targeting FPGAs.

Hey there Horace, haven’t hardware multipliers been around for at least three decades?

Yes they have. Even back in the 1980s, the venerable 8086 shamelessly rode the coattails of its lesser-known but harder-working sibling, the 8087, to floating point fame and fortune. What Altera has done, however, is to combine the fine-grained massively-parallel capabilities of a modern FPGA with a very large number of floating-point-capable DSP blocks. While FPGAs have been routing von Neumann processors for years on fixed-point datapath throughput, their supercomputing Achilles’ heel was always their floating point architecture (or, more precisely, their lack thereof).

Modern FPGAs sometimes contain thousands of DSP units. You can construct a massively parallel datapath/controller architecture using the FPGA fabric that can significantly outperform even the fastest DSP processors in big math-crunching algorithms. Even more significant is the extreme power savings of an FPGA-based implementation compared with a software solution executed by conventional processors. Numerous benchmarks have demonstrated the superiority of FPGAs compared to DSPs, conventional processors, and even GPUs for datapath-oriented computing – both in raw performance and in computational power efficiency.

However, there have always been two major barriers to the adoption of FPGAs for high-performance computing. First is the difficulty of programming. Where a conventional processor or a DSP requires software expertise in a high-level language like C++ (or, even FORTRAN, believe it or not, for some high-performance computing projects), FPGAs have always required a background in digital hardware design and fluency in a hardware description language such as VHDL or Verilog. This means that getting your algorithm running on an FPGA has historically required adding a hard-to-find hardware/FPGA guru to your team and a few months to your schedule, and those are two luxuries that many teams do not have.

Altera’s solution to the programming challenge is an elegant one. Since the emergence of GPUs as high-performance computing platforms and the explosion of languages like Nvidia’s CUDA or the Apple-developed (but now open) OpenCL, software engineers have been moving closer to the task of defining explicit parallelism in their code. Altera met those OpenCL programmers more than halfway by providing a design flow that maps OpenCL directly to hardware on Altera FPGAs. If you’re already writing OpenCL implementations of your algorithm to run on GPUs, you can take that same code and target it to FPGAs – with reportedly outstanding results.

The caveat on that OpenCL flow (until now) has been floating-point math. Since the DSP blocks on FPGAs have always been fixed-point, floating point arithmetic required going outside the DSP blocks and implementing the logic in FPGA LUT fabric. While this was still a “hardware” implementation, it was much less power- and logic-efficient than a custom-designed hardware floating-point unit. With this announcement, Altera has plugged that gap – bringing fully-optimized hardened single-precision floating point to their DSP blocks.
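To get a feel for why floating point was so expensive to build on top of fixed-point DSP blocks, it helps to see how much bookkeeping surrounds the one integer multiply at the heart of an IEEE 754 single-precision multiplication. The sketch below is a software illustration only, not Altera's implementation. It assumes normal (non-zero, non-denormal, finite) inputs with no overflow or underflow, and it truncates instead of rounding to nearest. The function names are hypothetical.

```python
import struct

EXPONENT_BIAS = 127          # IEEE 754 single-precision exponent bias
FRACTION_BITS = 23           # stored fraction width
FRACTION_MASK = (1 << FRACTION_BITS) - 1


def float32_fields(x):
    """Split a value into its single-precision sign, exponent, and fraction."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> FRACTION_BITS) & 0xFF
    fraction = bits & FRACTION_MASK
    return sign, exponent, fraction


def fp32_multiply(a, b):
    """Multiply two normal float32 values using only integer operations.

    Simplified on purpose: no zeros, denormals, infinities, NaNs,
    overflow/underflow checks, or round-to-nearest (result is truncated).
    """
    sign_a, exp_a, frac_a = float32_fields(a)
    sign_b, exp_b, frac_b = float32_fields(b)

    sign = sign_a ^ sign_b                      # sign logic
    mant_a = frac_a | (1 << FRACTION_BITS)      # restore the hidden leading 1
    mant_b = frac_b | (1 << FRACTION_BITS)

    product = mant_a * mant_b                   # the 24x24-bit fixed-point multiply
    exponent = exp_a + exp_b - EXPONENT_BIAS    # exponent adder

    # Normalize: the product of two values in [1, 2) lies in [1, 4).
    if product >> (2 * FRACTION_BITS + 1):
        product >>= 1
        exponent += 1

    fraction = (product >> FRACTION_BITS) & FRACTION_MASK   # truncate
    bits = (sign << 31) | (exponent << FRACTION_BITS) | fraction
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(fp32_multiply(1.5, 2.5))    # exactly representable product
print(fp32_multiply(-3.0, 0.5))
```

Only the `mant_a * mant_b` line maps onto a classic fixed-point multiplier. The field unpacking, exponent arithmetic, normalization shift, and (in a real unit) rounding and special-case handling are the parts that previously had to be built from LUT fabric.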

Apparently, these nifty hardened floating point units have already been hiding in Altera’s Arria 10 FPGAs – just waiting for support in the design tools. Now, when design tool support is turned on, Altera’s 20nm, TSMC-fabbed Arria 10 FPGAs will suddenly be capable of up to 1,500 GFLOPS. This performance can be tapped via the OpenCL flow, the DSPBuilder flow, or even old-school with “FP Megafunctions” instantiated in your HDL code. 

Where this gets really interesting, however, is with Altera’s upcoming Stratix 10 family – based on Intel’s 14nm Tri-Gate (FinFET) process. With Stratix 10, Altera claims they’ll have up to ten teraFLOPS performance in a single FPGA. That’s staggering by any standard, and we should have it sometime in 2015.

It is perhaps appropriate at this point to debunk some of the derisive rumors being manufactured and spread by one of the industry’s less-reputable pay-for-play shill blogs. There is absolutely no evidence to support rumors of Altera leaving Intel and going back to TSMC for Stratix 10. On the contrary, at this moment, Altera has working test chips in house that were fabricated with Intel’s 14nm Tri-Gate process. Altera is using these test chips to validate high-speed transceivers, digital logic, and hard-IP blocks (perhaps, even hardened floating-point DSP blocks, although the company hasn’t shared that specifically). Now, maybe this is all innocent and the bloggers in question were simply “confused” because Altera is still very actively partnering with TSMC as well – on the aforementioned 20nm Arria 10 line. Or, perhaps, Altera and Intel didn’t pony up for protection from the blogger mob, so they got kneecapped with some vicious and baseless rumors. As of this writing, however, Altera and Intel are still working hard together on Stratix 10 with 14nm Tri-Gate technology – and apparently it is coming along quite nicely.

Hardening the floating point processing has the obvious advantages one would expect, plus some less-obvious ones. Of course, optimized floating-point hardware is much faster than floating-point processors built from FPGA LUT fabric. Also of course, power consumption is greatly reduced. Less obvious is the fact that, since Altera has just freed up all those FPGA logic cells that were doing floating point before (a great number of them, it turns out), we are suddenly gifted with a huge helping of extra FPGA fabric. In other words, if you were using your old FPGA for floating point, that FPGA just got a whole lot bigger.

Following onto that advantage, the old floating-point modules were some of the most difficult parts of many designs to successfully route and bring to timing closure. Now, with these hardened floating point blocks, those routes no longer need to be routed and those paths no longer need to suffer the agony of timing closure. Your design tool drama and runtimes just took a big turn in the right direction.

There is an industry significance to this announcement that is also not obvious. For decades now, FPGA companies have dueled it out for their slices of the lucrative communications infrastructure pie. While that market has always been the leading revenue generator for FPGAs, the technology is clearly applicable in many other markets and application areas. However, the requirement to have an FPGA expert on the team has thrown a wet blanket on many of those new-market opportunities. High-performance computing is clearly one of those under-served, high-potential applications for FPGAs. If FPGAs can get past a critical proof-point, a whole new market opens up. When a software engineer can write code in a programming language like OpenCL, target that code to an FPGA with equal ease to targeting that same code to something like a GPU, and get some combination of faster performance, lower cost, and lower power consumption, then we have reached our proof-point, FPGAs have a new market, and Altera is then competing with companies like Nvidia rather than their traditional rivals.

You can get started designing now with Arria 10 using any of Altera’s supported design flows. Today, your code will map to soft-core floating-point units implemented in the FPGA fabric. In the second half of this year, when Altera turns on hardened floating point support, your same design should automatically re-map to take advantage of the new hardware. Then, when Stratix 10 comes out next year, you’ll be ready to really turn up the boost. Altera says they have pin-compatible versions of Arria 10 and Stratix 10, so that migration step should be pretty seamless as well.

21 thoughts on “Toward Ten TeraFLOPS”

  1. Altera has added hardened floating-point to their DSP blocks in both the current Arria 10 and upcoming Stratix 10 FPGAs and SoCs. They claim that brings a Stratix 10 device up to 10 teraFLOPS territory. Do you think this will finally break FPGAs into HPC in a major way?

  2. It certainly will, especially if they fine-tune C and SystemC synthesis to include OpenMP optimizations around the DSP blocks and memory blocks. If the application is developed on a conventional processor with a coding style targeting FPGA C synthesis, then real application profile data becomes a valuable input to OpenMP synthesis to allocate/prioritize logic to balance and improve overall performance.

    The really cool part of using FPGAs for HPC applications is the distributed memory blocks that FPGAs offer. In conventional processors the memory channel is often the dominant performance bottleneck, and it certainly isn’t breaking Moore’s Law at the same pace. Having many independent memories can scale system-level performance significantly.

    On a completely different approach, doing pipelined bit serial floating point can solve some problems significantly better than even floating point DSP blocks … and frequently at a lower power and higher computational density without needing a lot of memory for intermediate terms. What would be really cool, would be to provide a huge number of floating point bit serial DSP blocks in an FPGA.

    The downside to using FPGAs for HPC is a significant sensitivity to Single Event Upsets, which Altera has taken significant steps to mitigate with their dynamic configuration RAM scrubbing, although other static memory cells in the FPGA are still at risk of data corruption. While for a single FPGA the failure rates are relatively small, once you start putting several thousand of these into a system, the system-level mean time between failures drops to days/weeks, especially above sea level in places like Colorado or Sandia National Labs. However, a very large aquarium as a system enclosure can provide a very nice visual presentation for the system, along with necessary shielding and thermal sink for unintended shutdowns of cooling.

  3. This is good news and has been a long time coming. Having viable floating point processing on FPGA will open up new markets for HPC applications. To really compete in those markets Altera will need to increase the domain-specific content in their OpenCL programming flow. But most importantly, there needs to be an economic driver to get FPGAs into computing hardware. In that respect, Altera is competing with GPUs, which, driven by gaming and video, are already inside the hardware. That pre-existing volume gives GPUs an order of magnitude price advantage over the big FPGA devices. Once Altera (and Intel) find an application with the necessary volume, then every Cloud server will include FPGAs to deliver super-computing processing speeds to the masses. It will be fun to see what applications pop up then!

  4. If this article is supposed to describe the mid to far future, then in my mind it fulfills its target.
    This article DOESN’T describe the present. In the present it isn’t hard to find a hardware engineer who can write the code needed. It is, though, very hard to write optimized code for the Stratix in OpenCL. You have to use the libraries from Altera, and the programmer has to optimize the code (here is the fun part) by understanding the structure of the Stratix on one hand and obeying the optimization instructions from Altera on the other hand. Good luck finding those programmers. I’ve seen a demo given by Altera when they introduced their new family of devices; in this demo they wanted to show that using OpenCL you get better results utilizing the devices than using HDL. You do get better results, but the programming isn’t trivial. I agree that the future (not the near future) will be using a high-level language to program FPGAs. This future isn’t around the corner, unlike the opinion expressed in the article.

  5. Is there some more detailed info on architecture and interface of such blocks?

    Shall we take a conservative factor of 1 over 100 or more in real use? That is to say, Altera claims 10 TF/s, but in a real application maybe 10-50 GF/s, if embedded processors must be mapped.

    The real question is how to handle TF/s and I/O.

    Next steps will be to embed ADCs, then integer-to-float conversions, to arrive at … an FP multicore ASIC-like!

  6. @TotallyLost, I agree. I should have gone into some detail on the memory bandwidth issue in the article. Memory access is not only a dominant performance bottleneck, it is a dominant factor in power consumption as well. For many HPC applications, power is the ultimate limitation. You can always stack more processors in a rack/room – until you can’t get the heat out anymore.

    I don’t agree that SEUs are a major issue with current FPGAs. Yes, there is a finite risk of an errant neutron flipping a bit, but any system with any non-error-correcting storage element has that risk (although not at the same probability as the configuration logic in an FPGA). Also, I don’t think any practical amount of shielding (even though the aquarium idea sounds cool) can mitigate the SEU risk. Design techniques like TMR, safe state machines, etc. can mitigate the risk in FPGA-based systems that are highly vulnerable (in orbit, for example).

    @jjussel, Also agreed, and it would be intriguing to see Intel server blades with FPGAs on board as compute accelerators.

    @eorenstain, It is difficult to write “optimized” code for anything. Trying to optimize OpenCL for a GPU also requires knowledge of the target hardware, and goes well beyond just describing your algorithm in a high-level language. In fact, I’ll wager that the more “optimized” the code is for a particular GPU, the worse it will perform mapped to an FPGA. Any time you have to explicitly specify parallelism in your programming language, you are faced with the task of scheduling and resource allocation – for a known, fixed set of resources (number and type of processors in a GPU, for example). The good thing about an FPGA is that the architecture can be altered to adapt to the software, rather than the software needing to be adapted to fixed hardware.

    Regarding the future/present nature of the technology – there are already a number of different flows that can produce excellent results using FPGAs in this manner – custom RTL (of course), model-based design starting from tools like Matlab and Simulink, high-level synthesis from C/C++/SystemC, and mapped languages like OpenCL. All of these approaches have strengths and weaknesses, and most of them require some knowledge of the underlying hardware and of hardware design in general.

    @70billy, Yes, there is more detailed info on the architecture available. I’ll see what I can come up with.

  7. @Kevin,

    The big mistake is to assume you really want to cool a multi-megawatt system with air. At some point the energy cost to cool/move the air, is significantly more than the electronics itself — reflected not only in direct energy costs, but increased building size as well. Add to that repair costs (labor + parts) to maintain many hundreds/thousands of fans.

    Reducing the core computational engine, and memories, into a high density stack with active liquid cooling, also significantly reduces interconnect latencies. Done well, this means the core HPC system is only a few cubic meters. I’ve proposed development of such systems before, where backplane gigabit ethernet is the primary interconnect, rather than using copper/fiber cables with traditional air cooled blade designs.

    High energy neutrons can be easily shielded with a meter or two of water … thus a computational core that is a few cubic meters in size, is easily shielded with a several thousand gal aquarium, with the computational core at the center. A practical extension of the Cray-1 physical design.

    We can “agree to disagree” about the impacts of both SEU and SET errors, with very large, high speed systems that have significant risk points besides the configuration memories. When major simulation models take weeks/months to complete, having to run the model multiple times to validate results are free of SEU/SET errors can be expensive. Especially during periods of high energy solar flares.

  8. @70Billy FPGAs have the ability to implement the data flow graph directly in hardware. This means that the 10 teraFLOPS for the Stratix 10 is very usable. Currently the top FPGAs without hardened floating point are rated at 500 GF and they get 350 GF. The 10 TF you hear about is only from the DSP units; there are millions of logic blocks that could be added to that, so I suspect that you will get 10 TF if the system you put around the FPGA can deliver the data.

    @eorenstain The Altera OpenCL compiler also lets you add Verilog HDL as a kind of “assembly,” although I find I have not needed to do this yet. There is a new flow for the OpenCL compiler that will surprise people in Altera’s upcoming 14.0 release.

    The Stratix 10 will enable some designs to run at up to 1 GHz. That together with the floating point and OpenCL brings the future into the very near future and the present (with Arria 10).

  9. One key subtle point, but absolutely necessary to get performance on applications, is the need for the hard DSP to support the floating point Multiply-Accumulate (MAC) unit…as in how many times fast can you say “dot-product”…
    The diagram I’ve seen in another article seems to indicate that is the case…hopefully so!

    The other subtle point is integrating the mapping and scheduling into the software stack in a way that allows all those pesky libraries to just-be-there for performance and usability, as they are for other device families.

  10. It’s all about the dot. We have a low latency recursive mode embedded in the circuitry that does the job. Just like the old DSP blocks, the new floating point DSP can pass results from one DSP to the next, and this makes floating point DSP blocks run very fast on dot products.

    The high-end frequency for the new DSP blocks is higher than that of the DSP blocks in Stratix V, the current generation of FPGAs, even while doing floating point MACs and dot products.

    The DSP builder tool makes it easy to use these features for the hardware programmer and there will be direct support in the OpenCL compiler so you won’t ever need worry about these features, they will just get used.

  11. @beercandyman

    The HPC uses for this new architecture look awesome.

    The OpenCL optimization for this architecture is totally awesome. Any chance the same optimization effort can be applied to OpenMP C-based design entry too so more applications/algorithms are easily portable to this architecture?

    M20K memories have ECC support in wide mode for adjacent bit flips, which should handle some/many SEU/SET data corruptions in these memories. Memory constructed from MLABs appears at risk. Configuration memory driving mux functions appears protected.

    Any comments about SEU/SET error rates on other logic/memory cells in your Stratix 10 family during peak solar flares? Has there been any significant SET testing?

    Using high performance serial HMC appears to consume the fast serdes resources pretty quickly, while limiting HPC chip to chip bisection bandwidths at the same time. Any comments about this contention for HPC uses?

  12. When it comes to SEU type errors we have done some work but quite frankly TMR is the only way to be sure. Banks currently have three or four processors work on the same problem and vote. In a normal processor they have ECC ram but no one has an ECC multiplier or ALU. There are transistors all over the place that can flip at any time. So in FPGAs we are working on mitigation and constant checking of our soft spots (so to speak).

    I don’t think that HMC will be used for all memory. It has a place in a system architecture along with DDR and QDR memories. The good thing about FPGAs is they can adopt new memory architectures very quickly. OpenCL can address heterogeneous memory subsystems and you can tell the compiler to use certain memories over others (if they are in your system).

    I think it’s totally possible to have a C + OpenMP + hardware MPI compiler. We currently do OpenCL tasks which are basically C routines (NDrange(1,1,1)). So the capability is there but I don’t think it’s currently on the roadmap.

  13. It’s much more of a problem than many people are aware of, as SEU/SET failures have a lot of weird symptoms that are not readily obvious unless you are actually looking for them.

    After the C8.3 flare on 5/14/14 last week I saw several of my customers’ Canopy radios either lock up or watchdog reboot, which is actually pretty common each year with a couple hundred deployed in the Colorado mountains, and easy to monitor with the bigger solar flares.

    http://www.tesis.lebedev.ru/en/sun_flares.html?m=5&d=15&y=2014……

    We also took two hard drive “failures” from this flare, out of about 70 that were spinning at the time – one in a NAS RAID 5 array with 350+ sectors that instantly went “bad”, the other in a mail server with 1200+ sectors that instantly went bad. Both drives actually had the positioner servo go active while the write enable gate was set on a head, causing a regularly spaced arc of corrupted sectors between the start and ending cylinders traversed by the errant positioner motion. This is obvious when you look at the cyl range and corrupted sector spacing.

    After clearing the remapped sector tables, and retesting the drives, both are actually fine, and none of the corrupted sectors actually have media flaws.

    For Linux systems, after fsck -c collects all the corrupted sectors into the bad block list inode, and resolves duplicates by reallocation, the sectors involved become obvious. In the mail server that was luckily almost entirely contained in the journal log in this case. But what really tells the tale is the uniform cyl to cyl corrupted sector spacing, which clearly identifies that a head’s write enable was briefly latched on for a bit while the positioner was active.

    Some might find this fun, and interesting. It’s really worth doing your homework here, when not so random failures occur in a few hours after a major solar flare.

    John

    ————— FYI ———————

    Running additional passes to resolve blocks claimed by more than one inode…
    Pass 1B: Rescanning for multiply-claimed blocks
    Multiply-claimed block(s) in inode

  14. @TotallyLost

    That sounds like some very impressive failure analysis right there. Fascinating!

    In FPGA-based systems, an SEU failure could literally look like anything – since the routing and/or logic functions themselves could be randomly altered. It seems like that would make such failures nearly impossible to diagnose.
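Several comments above point to TMR (triple modular redundancy) and voting as the dependable defense against upsets. As a minimal sketch of the idea, assuming three replicas of the same register and a bitwise two-of-three voter, the snippet below shows a single flipped bit being masked. The names and the example word are hypothetical, and a real FPGA design would implement the voter in hardware rather than software.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Model a single event upset by inverting one bit of a word."""
    return word ^ (1 << bit)


golden = 0b1011_0010                  # value every replica should hold

# One replica takes an upset in bit 5; the voter masks it.
replicas = [golden, flip_bit(golden, 5), golden]
voted = majority_vote(*replicas)
print(f"voted = {voted:#010b}, matches golden: {voted == golden}")

# Upsets in different bits of different replicas are also masked...
spread = majority_vote(flip_bit(golden, 0), golden, flip_bit(golden, 7))
print(f"different bits masked: {spread == golden}")

# ...but the same bit flipped in two replicas defeats the voter.
double_hit = majority_vote(flip_bit(golden, 3), flip_bit(golden, 3), golden)
print(f"same bit in two replicas masked: {double_hit == golden}")
```

The last case is why TMR is usually paired with scrubbing: the goal is to repair a corrupted replica before a second upset lands on the same bit.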
