It’s clear that programmable logic and FPGA technology will capture an increasing share of the value in conventional and cloud data-center deployments. While FPGAs have long been used in connectivity and storage, there is a growing push to have high-end FPGAs take over a crucial role in computation as well. FPGAs pack a potent combination of massive computational throughput, low latency, and power efficiency that no rival technology matches. With data-center demand growing enormously, fueled by IoT, powering the cloud exclusively with conventional processors is simply not feasible. Heterogeneous deployments of conventional processors and FPGAs working together have the potential to boost computational performance many times over and, more importantly, to dramatically cut power consumption.
There is, however, a series of substantial obstacles to widespread use of FPGA technology in computing. The first of these is application development. Programming a heterogeneous processing system with FPGAs is extremely difficult and requires significant hardware design expertise in addition to traditional software skills. Both Xilinx and Intel (as well as a few third parties) are working hard to lower the barrier to developing applications (and porting legacy applications) so that specialized FPGA expertise is not required – or at least matters as little as possible. This has put us in a “battle of the tools” that will be playing out for a long time to come.
In addition to continuously improving the tools, FPGA companies need to jump-start development of accelerated applications. And, most importantly, they want to jump-start that development in a way that favors their architecture over their competitor’s. This month, Intel announced a win with Alibaba, one of the “Super 7” cloud providers, to deploy Intel Arria 10 FPGAs (along with the company’s Xeon processors, of course) in Alibaba Cloud (Aliyun). This will allow Alibaba Cloud customers to take advantage of FPGA-based acceleration in their rent-a-cloud applications.
This announcement parallels Xilinx’s announcement last November that it had won a deal to provide the FPGAs for Amazon’s FPGA Cloud, via so-called “EC2 F1 Instances.” At this point, we know a lot more about Amazon’s (Xilinx) FPGA Cloud than about the new Alibaba (Intel/Altera) Cloud. Amazon’s cloud is already in customer-preview mode, while Alibaba’s has just been announced with few details, other than that there will be a “pilot program.”
It is important to contrast these Amazon/Alibaba-hosted cloud announcements with proprietary, in-house deployments of FPGA-accelerated services such as Microsoft’s Catapult (which uses Intel/Altera FPGAs). Hosted cloud deployments of FPGA-based acceleration will give us exactly the “jump-start” effect that both Intel and Xilinx are after, by lowering the barrier to entry for application development teams creating high-value applications on top of FPGA technology.
Hosted services give application developers and service providers the ability to develop and deploy FPGA-based applications without having to buy (or, more likely, build) FPGA cards for their specific deployment. They also allow developers to scale deployments according to demand, rather than having to build out their own infrastructure to handle peak load. Finally, standardized hosted servers allow software components from different sources to be mixed and matched – combining, for example, video-processing acceleration with neural-network acceleration.
There is an enormous catch to this, unfortunately.
Just as technologies like virtualization seem to be on the cusp of insulating application developers from the vagaries of server and processor architectures, allowing easily portable development of key data center and cloud applications, FPGAs come along and ruin the whole thing. There is no “virtualization layer” equivalent that will allow easy portability of FPGA-accelerated applications. In fact, Xilinx and Intel are each doing everything they can to make porting more difficult. This turns the data center duel into a high-stakes “winner take all” game, where application teams must decide which horse to ride. Do you want to develop your neural network on Amazon or Alibaba? Chances are, it will be almost a complete do-over if you want to support both. Choose the wrong one at your own peril.
Both partnerships will likely begin by wooing developers of important “infrastructure” applications that can be broadly applied. Amazon is specifically looking for genomics research, financial analytics, real-time video processing, big data search and analytics, and security. They are encouraging developers to use their tools (many of which are simply Xilinx tools) to develop accelerated applications and offer them to other customers via the AWS marketplace. Alibaba says they are targeting machine learning, data encryption, and media transcoding.
Numerous differences are already apparent between the Xilinx and Intel approaches to FPGA cloud acceleration. Xilinx/Amazon are targeting Xilinx’s latest and biggest 16nm FinFET Virtex UltraScale+ devices. Intel is at least a year behind Xilinx in delivering FinFET FPGAs, so their platform is the Arria 10 mid-range FPGA, fabricated on a 20nm planar TSMC process. (Yep, that’s right. TSMC is manufacturing BOTH the current Xilinx and Intel FPGAs for data-center applications. Who is the guaranteed winner here?)
Looking at the architecture, Intel seems to be pursuing a fine-grained pairing of processor and FPGA, mating Arria 10 devices with Xeon processors in the same physical package. Xilinx/Amazon, on the other hand, are deploying FPGAs in clusters that are more loosely coupled – presumably to Intel Xeon processors. (There are also initiatives with ARM-based processors in data centers, but those are orthogonal to this discussion.) Obviously it’s to Intel’s advantage to sell a Xeon for every Arria, and they certainly block an ARM incursion with this approach, but this fundamental difference in system architecture profoundly affects application-development strategy. It is too early to know which architecture will perform best for the key data-center applications. Of course, each company makes a case for their approach being superior.
At the chip level, there are key differences between Intel and Xilinx as well. Intel/Altera added hardened floating-point support to their FPGAs a couple of years ago. The idea is that much computation is floating-point, and dedicated floating-point hardware will outperform floating-point built from software or programmable logic by a significant margin. Xilinx, on the other hand, claims that adding floating-point support comes at a cost in fixed-point performance and that, as a result, their devices have a significant advantage in fixed-point workloads. Xilinx claims this difference favors them in applications such as neural-network inference, for example.
Another key capability required for data-center deployments is partial (or rapid) reconfiguration of the FPGA fabric as various accelerators are swapped in and out. In this arena, Xilinx has a long historical lead, as they’ve successfully supported partial reconfiguration for years. Intel/Altera are newer to the partial reconfiguration game, and less is known about the mechanics and efficacy of their solution for data center applications.
The most critical battle, of course, is the slow burn of tool evolution. Intel (Altera) fired the first shot here several years ago, embracing the OpenCL initiative by developing tools that compile OpenCL code (typically used for general-purpose GPU programming and acceleration) onto Altera FPGAs. Xilinx, on the other hand, was leading with their high-level synthesis (HLS) technology, which can create optimized hardware implementations from C/C++ code. Since those early days, each company has both innovated and responded: Xilinx announced OpenCL support, and Intel has now (quietly) announced HLS. Xilinx has released a number of tool/software/IP suites specifically attacking computation, such as their SDAccel development environment, aimed at teams doing FPGA-based compute acceleration, and (more recently) their reVISION “stack,” aimed specifically at vision applications. Thus far, Intel has taken a more generic approach with their tool-suite evolution.
This data-center war is just beginning. For Intel, the stakes are much higher: the discontinuity created by an industry-wide migration to heterogeneous computation with FPGAs has the potential to put their much larger, near-monopoly data-center processor business at risk. In many ways, Xilinx has an early technological lead, and they have chalked up some key victories in the market. Obviously, Intel has the potential to exploit their existing data-center dominance in ways that will benefit them, but so far there is no visible strategy for doing that. It will be interesting to watch.