
Xilinx Puts a Feather in its ACAP

Final block in Xilinx’s 7nm Everest Architecture is Detailed at Hot Chips 30 in Cupertino

He stuck a feather in his cap and called it macaroni – Yankee Doodle

CEO Victor Peng announced Xilinx’s 7nm Everest architecture, dubbed ACAP (Adaptive Compute Acceleration Platform), earlier this year in March. (See Kevin Morris’ “Xilinx Previews Next Generation: What the Heck is ACAP?”) Peng walked through the block diagram in detail, with one exception. That exception was a bright, Xilinx-red block labeled “HW/SW Programmable Engine,” which appears in the block diagram shown in Figure 1 below.

Figure 1: Block Diagram of Xilinx’s 7nm Everest architecture

All of the other major blocks in the above diagram appear in some form throughout the Xilinx 16nm Zynq, Virtex, and Kintex UltraScale+ families and need little elaboration. Those blocks include the Arm application and real-time processors, the programmable logic, HBM (high-bandwidth memory)—a stacked-die DRAM array attached to the Xilinx chip using a silicon interposer and 2.5D assembly techniques, RF ADCs and DACs, and high-speed SerDes ports.

But that big red box was a mystery. Intentionally so. It’s all part of the company’s dance of the seven veils that slowly reveals Everest/ACAP product details to maintain public interest while the device is still being designed. (Who would have guessed that noted writer and playwright Oscar Wilde would develop a marketing technique favored by high-tech marketers and reality TV celebs a century later?)

One or two of the veils hiding the mystery of the HW/SW Programmable Engine fluttered to the stage floor last week at the Hot Chips 30 conference held in Cupertino, California. That’s when Juanjo Noguera, engineering director of the Xilinx Architecture Group, gave a detailed presentation titled “HW/SW Programmable Engine: Domain Specific Architecture for Project Everest.” Noguera’s presentation provided many, many additional hardware details while keeping a few of the most interesting details veiled. Wilde would have been pleased.

The Everest HW/SW Programmable Engine is a tiled array of coarse-grained, software-programmable, VLIW vector processors connected to each other in multiple, hardware-programmable ways. According to Noguera, the VLIW vector processors can handle a variety of fixed- and floating-point data types. The HW/SW Programmable Engine tile array arrangement appears in the upper-left corner of Figure 2 below, while the individual tile detail with the interconnect scheme appears lower and to the right in the figure.

Figure 2: Detail of Xilinx’s tile-based HW/SW Programmable Engine for its Everest architecture

The HW/SW Programmable Engine array communicates independently with the PS (processor system) and the PL (programmable logic) in the Everest design. Each tile in the array consists of a software-programmable, VLIW vector processor coupled to local memory and a data mover (a DMA machine).
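Xilinx hasn’t yet disclosed the tiles’ instruction set, so no real code can be shown, but a sketch helps fix ideas. The C++ fragment below models the kind of fixed-point multiply-accumulate work a single VLIW vector tile might perform in one vector instruction; the 16-lane width, 16-bit samples, 32-bit accumulators, and the vmac name are my assumptions for illustration, not Xilinx specifications.

```cpp
#include <array>
#include <cstdint>

// Illustrative only: Xilinx has not published the tile ISA. This models
// the fixed-point multiply-accumulate work a single VLIW vector tile
// might do per vector instruction. The 16-lane width, int16 samples,
// and int32 accumulators are assumptions, not Xilinx specifications.
constexpr int kLanes = 16;
using VecI16 = std::array<int16_t, kLanes>;
using VecI32 = std::array<int32_t, kLanes>;

// One vector multiply-accumulate: acc[i] += a[i] * b[i] in every lane.
// On a real VLIW machine, an operation like this would issue in one
// instruction slot alongside loads, stores, and data-mover operations
// packed into the same very long instruction word.
void vmac(VecI32& acc, const VecI16& a, const VecI16& b) {
    for (int lane = 0; lane < kLanes; ++lane) {
        acc[lane] += static_cast<int32_t>(a[lane]) * b[lane];
    }
}

int main() {
    VecI16 a{}, b{};
    VecI32 acc{};
    a.fill(3);
    b.fill(7);
    vmac(acc, a, b);
    return acc[0] == 21 ? 0 : 1;  // 3 * 7 accumulated in each lane
}
```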

Three types of interconnect link the tiles in the array. The first type, represented by the small red arrows in Figure 2, is a set of parallel, bidirectional, word-level interfaces that link each tile to its four nearest neighbors in an NSEW (north, south, east, west) arrangement. The second type, represented by the small green arrows in the diagram, is a set of unidirectional cascade interfaces that let one tile pass partial results directly to its adjacent, right-hand neighbor.

These short, local, point-to-point connections are reminiscent of the local interconnects used for short-range LUT-to-LUT communications incorporated into FPGA arrays, and the cascade interfaces resemble the carry bits between DSP slices. Certainly that’s not a coincidence, given that these sorts of communication paths have long been common in Xilinx devices.

The third interconnect type handles longer communications paths within the tile array: a 200Gbytes/sec, non-blocking, deterministic NOC (network on chip). The NOC and the parallel local interconnect together constitute the “hardware-programmable” aspect of the HW/SW Programmable Engine.
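To put that bandwidth in perspective, consider the 2Gsamples/sec figure that appears later in this article and assume (my assumption) 16-bit samples:

2 Gsamples/sec × 2 bytes/sample = 4 Gbytes/sec, or 2% of the NOC’s 200 Gbytes/sec

so a single NOC could, in principle, carry dozens of such streams between distant tiles simultaneously.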

You can connect the vector-processing tiles in a variety of ways to implement varied processing arrays. Noguera discussed five such configurations as examples in his talk. These examples appear in Figure 3 below.

Figure 3: Sample processing configurations for the Everest HW/SW Programmable Engine.

Configuration 1 is a simple one-dimensional, unidirectional dataflow pipeline. Each processing tile partially processes an incoming stream of data and then passes the result through a local memory buffer to the next tile in the pipeline. Configuration 2 implements a dataflow graph, which might be considered a 2D, 3D, or more-D version of a pipeline. Essentially, it’s still a pipeline but with pipes running in multiple dimensions. Configuration 3 takes advantage of the NOC to multicast results from one processing tile to two or more subsequent tiles simultaneously. Configuration 4 uses input and output memory buffers to match variable, differential processing rates for tiles connected over the NOC. Configuration 5 uses the tiles’ cascade interface to pass intermediate results from one tile to the next without consuming other interconnection resources.
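To make Configuration 1 concrete, here’s a toy software model of a one-dimensional dataflow pipeline, again purely illustrative rather than anything Xilinx has published: each function stands in for one tile’s partial processing, and the vector passed between stages stands in for the local memory buffer that decouples adjacent tiles.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy model of Configuration 1: a one-dimensional, unidirectional
// dataflow pipeline. Each stage stands in for one tile; the vector
// passed between stages stands in for the local memory buffer that
// decouples adjacent tiles. The stages themselves (scale, bias,
// clamp) are arbitrary examples, not a Xilinx workload.
static std::vector<int32_t> tile_scale(std::vector<int32_t> v) {
    for (auto& x : v) x *= 3;             // tile 0: partial result
    return v;                             // handed off through a buffer
}
static std::vector<int32_t> tile_bias(std::vector<int32_t> v) {
    for (auto& x : v) x += 100;           // tile 1 consumes tile 0's output
    return v;
}
static std::vector<int32_t> tile_clamp(std::vector<int32_t> v) {
    for (auto& x : v) x = x < 0 ? 0 : x;  // tile 2: last stage in the pipe
    return v;
}

int main() {
    std::vector<int32_t> samples{-50, 0, 7, 42};
    // Data flows one way down the pipeline, tile to tile.
    auto out = tile_clamp(tile_bias(tile_scale(samples)));
    for (auto x : out) std::printf("%d ", x);
    std::printf("\n");
    return 0;
}
```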

These are only five configuration examples. There are certainly more processing configurations to be invented using this new processing resource and, just as certainly, there’s room for some clever automation that can synthesize and optimize configurations to meet high-level performance and power goals. Of course, these interconnect schemes can be used in combination as well.

The HW/SW Programmable Engine’s capabilities can be extended using Everest’s on-chip PL, which is attached to the HW/SW Programmable Engine’s NOC through multiple NOC nodes and CDC (clock-domain-crossing) buffers, as shown in Figure 4. (Clearly, the architecture needs CDC buffers because the HW/SW Programmable Engine runs at a different clock rate than the PL. I’m guessing it doesn’t run slower.)

Figure 4: The HW/SW Programmable Engine’s NOC connects to the device’s programmable logic through multiple NOC nodes and clock-domain-crossing buffers.

As the figure states, the aggregate interconnect bandwidth between Everest’s HW/SW Programmable Engine (abbreviated “PE” in the figure) and its PL is on the order of Tbytes/sec. That figure will cause many system architects to rethink their assumptions about processing architectures.

You can use the connected PL to augment the HW/SW Programmable Engine’s capabilities in multiple ways. For example, you can give the vector-processing tiles low-latency access to the PL’s on-chip SRAM (BRAM and UltraRAM) blocks through the deterministic NOC, effectively enlarging the tiles’ memory. It’s also possible to use the PL to implement hardware accelerators that perform specialized computations faster than the tiles’ vector engines can. Noguera suggested that you can also use the PL to create “ISA extensions” for the tiles, but he did not elaborate.

The performance results for machine-learning inference and 5G wireless signal processing are impressive, as shown in Figure 5 below.

Figure 5: HW/SW Programmable Engine results relative to programmable-logic implementations.

The 20x improvement in ML (machine-learning) inference is particularly noteworthy for its magnitude. FPGAs are already pretty quick at ML inference because the inference calculations involve many, many multiplications and additions, and the thousands of DSP slices in an FPGA can perform those calculations quickly. Even so, the HW/SW Programmable Engine appears to be faster still. The 4x improvement in 5G wireless processing is also significant, said Noguera, because it means that the Everest architecture can directly handle a transmission sample rate of 2Gsamples/sec.

That statement led to some detailed questions from the audience during the Q&A following Noguera’s presentation. The first question concerned the clock rate of the HW/SW Programmable Engine. Noguera answered that he could not yet address that question directly but that he was trying to provide indirect guidance by stating that the engine could handle a 5G transmission sample rate of 2Gsamples/sec. Operating frequencies for the HW/SW Programmable Engine’s processors would be “on the order of Giga,” he stated.
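Put those two hints together and you get a rough lower bound on the tiles’ vector throughput. If “on the order of Giga” means a clock somewhere near 1GHz (my assumption, not Noguera’s statement), then sustaining 2Gsamples/sec requires each processing chain to absorb at least

2 Gsamples/sec ÷ 1 GHz = 2 samples per clock cycle

which is only practical if the vector units consume multiple samples per instruction. That squares with the wide VLIW vector datapaths Noguera described.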

Another question dealt with the number of tiles in the HW/SW Programmable Engine’s array. The answer is that there will be tens to hundreds of tiles in each HW/SW Programmable Engine, depending on the device family member. ACAP devices with hundreds of vector processors in their HW/SW Programmable Engine will be massively parallel. This range is very consistent with the way all FPGA vendors, including Xilinx, place varying quantities of resources into individual members in broad device families, and it telegraphs Xilinx’s intent to develop a family of ACAP devices.

However, the first Everest device has yet to tape out. That milestone is scheduled to happen later this year. Meanwhile, the dance of the seven veils continues. Noguera promised more details would be disclosed at the Xilinx Developers Forums being held later this year in San Jose, Beijing, and Frankfurt. You might consider registering.
