Can HLS Partitioning Speed Up Placement and Routing of FPGA Designs? Yes, Oh Yes!

Centuries-Old Divide-and-Conquer Technique Still Works

FPGA place-and-route software goes too fast, said no one ever. In fact, FPGA vendors have spent considerable effort making their design software run faster on multicore processors. A paper recently presented at the ACM’s FPGA 2022 conference, titled “RapidStream: Parallel Physical Implementation of FPGA HLS Designs,” describes a very interesting way to push HLS designs faster through FPGA design software running on multicore processors. The paper, authored by a large team of researchers at UCLA, AMD-Xilinx, Ghent University, and Cornell University, describes RapidStream, an automated partitioning algorithm that slices a dataflow design into multiple “islands,” inserts registered “anchor regions” between the partitioned islands, and then stitches the entire design together by routing signals from each island through the registers in the anchor regions.

The purpose behind all this partitioning and stitching is to chop up the HLS design into bite-sized chunks that can be handed off to the many cores in a modern server. It’s the age-old divide-and-conquer strategy used by engineers for centuries, now adopted to accelerate FPGA development.

This process has three main HLS-level constraints:

  1.  Non-overlapping partitioning – To parallelize the physical implementations of different islands, each island must contain a unique and non-overlapping partition of the design.
  2.  Pipelined inter-island connections – Each inter-island connection is pipelined through registers, which isolates each island’s timing from that of its neighbors.
  3.  Direct neighbor connections – Each island can have direct connections only with immediately adjacent islands. This constraint turns out to be critical when parallelizing design placement and routing.

(Note: These constraints are nothing like the various constraints you use to control logic synthesis. They’re at a higher level.)
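To make the three constraints concrete, here is a minimal Python sketch that checks a candidate partition against them. It is not code from the paper. The `Fifo` class, the `check_partition` function, the grid coordinates for islands, and the reading of "immediately adjacent" as "sharing an edge" (Manhattan distance of 1) are all my own assumptions for illustration.

```python
from dataclasses import dataclass

# Hypothetical model: an island is identified by its (column, row) grid position.
Island = tuple[int, int]


@dataclass(frozen=True)
class Fifo:
    src: str          # name of the producing PE
    dst: str          # name of the consuming PE
    pipelined: bool   # True if the connection passes through a register


def check_partition(assignment: dict[str, Island],
                    pes: set[str],
                    fifos: list[Fifo]) -> list[str]:
    """Return a list of constraint violations (empty if the partition is legal)."""
    problems = []

    # Constraint 1: a dict maps each PE to exactly one island, so partitions
    # cannot overlap; we only need to confirm that no PE was left out.
    for pe in sorted(pes - set(assignment)):
        problems.append(f"{pe}: not assigned to any island")

    for f in fifos:
        if f.src not in assignment or f.dst not in assignment:
            continue  # already reported above
        (c0, r0), (c1, r1) = assignment[f.src], assignment[f.dst]
        distance = abs(c0 - c1) + abs(r0 - r1)
        if distance == 0:
            continue  # both PEs share an island, so nothing crosses a boundary
        # Constraint 2: every inter-island connection must be pipelined.
        if not f.pipelined:
            problems.append(f"{f.src} -> {f.dst}: crossing is not pipelined")
        # Constraint 3: only edge-sharing neighbors may connect directly
        # (assumption: diagonal islands do not count as adjacent).
        if distance > 1:
            problems.append(
                f"{f.src} -> {f.dst}: islands {(c0, r0)} and {(c1, r1)} are not adjacent"
            )
    return problems
```

A caller would pass the PE-to-island mapping produced by whatever partitioner is in use, and an empty result means all three constraints hold for that mapping.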

RapidStream’s creators define a dataflow design as a collection of parallel processing elements (PEs) and a set of FIFOs that connect the PEs to each other as defined by the design’s dataflow needs. Each PE can be arbitrarily complex internally, but it must communicate data to other PEs only through FIFO interfaces. As mentioned above, RapidStream divides the FPGA fabric into two types of regions: equally sized islands and anchor regions placed in thin columns and rows between the adjacent islands. Interestingly, RapidStream seems to have been specifically built for AMD-Xilinx Virtex UltraScale+ FPGAs, which are 2.5D devices made with FPGA chiplets – Super Logic Regions or “SLRs” in AMD-Xilinx speak – bonded to a silicon substrate. Xilinx pioneered this construction technique for FPGAs and has been using it for several FPGA generations over about a decade. This matters because there’s a natural partition between AMD-Xilinx SLRs, and RapidStream’s algorithms seem to have been designed specifically to take this physical SLR partitioning into account, as the figure below, taken from the FPGA 2022 paper, illustrates.

RapidStream partitions an HLS design into islands containing Processing Elements (PEs) and anchor regions that provide FIFO communications registers between communicating islands. 

When you feed a dataflow design described in HLS to RapidStream, it partitions the design using a three-phase process:

Phase 1: Slice the HLS dataflow design into roughly equal partitions (groups of PEs) that will fit into the predefined islands. RapidStream takes advantage of the dataflow design’s elasticity to make certain that every inter-island connection is pipelined through an anchor register, which provides the timing isolation required for parallel placement and routing.

Phase 2: Place and route the disjoint islands and insert the anchor registers. RapidStream uses a simple distance-driven placement algorithm for this phase, which achieves timing quality similar to that of a standard FPGA placer but runs much faster. Further, each island can be placed and routed by a different processor core because the islands have been made independent of each other during partitioning. A clock management scheme ensures that the clock skew is consistent among the islands when they are routed and later stitched together. This step avoids hold violations after stitching.
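The article does not spell out the distance-driven placement algorithm, so the following Python sketch only illustrates the general idea under my own assumptions: each anchor register is greedily assigned to the free site in an anchor region that minimizes the Manhattan distance from its driver plus the distance to its load. The names `place_anchors` and `manhattan`, the site coordinates, and the greedy ordering are hypothetical and should not be read as RapidStream's actual method.

```python
# Hypothetical model: every cell and register site is an (x, y) fabric coordinate.
Point = tuple[int, int]


def manhattan(a: Point, b: Point) -> int:
    """Manhattan (rectilinear) distance between two fabric coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def place_anchors(nets: dict[str, tuple[Point, Point]],
                  sites: list[Point]) -> dict[str, Point]:
    """Greedily pick one free anchor site per inter-island net.

    nets maps a net name to its (driver, load) coordinates.
    sites lists the register sites available in the anchor region.
    """
    free = set(sites)
    placement = {}
    for name in sorted(nets):  # fixed order keeps the result deterministic
        if not free:
            raise ValueError("more anchor registers than available sites")
        driver, load = nets[name]
        # Cost is the total wire length through the candidate site; the site
        # itself breaks ties so that equal-cost choices are reproducible.
        best = min(free, key=lambda s: (manhattan(driver, s) + manhattan(s, load), s))
        placement[name] = best
        free.remove(best)
    return placement
```

A greedy pass like this is fast because it never revisits a decision, which fits the article's point that a simple distance-driven approach can trade a little optimality for a lot of speed.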

Phase 3: Stitch the placed and routed islands together through the pre-placed anchor registers. Because of constraint 3 above, the inter-island connections are anchored, so the stitcher only needs to route each island to its surrounding anchors and then on to the adjacent island. This scheme greatly simplifies the stitching task, with the exception of the design’s global clock, which is a global net that fans out to all islands.

Of course, I’ve skimmed over a lot of the details. You can get those from the paper. The real question is, “How well does this idea work?” The answer: pretty well.

The paper contains a couple of charts that describe how well RapidStream works. The first chart shows the clock rates achieved by six different dataflow designs after partitioning compared to the pipelined and non-pipelined versions of the same design without partitioning.

RapidStream partitioning improved the clock speed of five out of six RTL designs compared to a pipelined but non-partitioned design.

The RapidStream results are the blue bars, which show faster clock rates than all of the non-partitioned, non-pipelined versions of these designs. You’d expect that result. Pipelining is at the very core of clock-speed improvement in FPGA design. However, the RapidStream results are better than the pipelined RTL versions of the same design in five of the six cases. That result should really get your attention.

Next, here are the place-and-route timing results from the paper:

RapidStream improved place-and-route run times for six out of six RTL designs compared to pipelined or non-pipelined, non-partitioned designs.

RapidStream’s place-and-route runtime results are much better than the results for the unpartitioned design. Again, that’s because RapidStream can hand each partition off to a different processor core for placement and routing. Although FPGA vendors have tried to make their place-and-route algorithms work faster with multicore processors, RapidStream’s developers empirically discovered that there’s not yet much benefit from more than about two processor cores (2.1 processor cores, more precisely) when running the AMD-Xilinx Vivado Design Suite of tools if the FPGA design isn’t partitioned.
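The hand-off of independent islands to separate cores can be sketched in a few lines of Python. This is an illustration only: `implement_island` is a hypothetical stand-in that returns a made-up result name, whereas a real flow would launch a place-and-route tool run for each island. I use a thread pool on the assumption that each worker would spend its time waiting on such an external tool process.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def implement_island(island_name: str) -> str:
    """Hypothetical stand-in for placing and routing a single island.

    A real flow would start an external tool run here and return the
    location of the routed result; this sketch just fabricates a name.
    """
    return f"{island_name}.routed"


def implement_all(islands: list[str],
                  max_workers: Optional[int] = None) -> dict[str, str]:
    """Implement every island concurrently and map each island to its result."""
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with island names.
        results = pool.map(implement_island, islands)
        return dict(zip(islands, results))
```

This only works because partitioning has already made the islands independent of one another, so no worker needs anything from any other worker until the stitching phase.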

By now, RapidStream should have piqued your interest if you’re developing HLS designs with FPGAs, particularly AMD-Xilinx FPGAs. You can find more information about the RapidStream project on its GitHub page.

 

One thought on “Can HLS Partitioning Speed Up Placement and Routing of FPGA Designs? Yes, Oh Yes!”

  1. With a Flowpro parallel computational machine, all of the objects and tasks are already partitioned at the source level of the design. It seems that a true parallel substrate machine such as Flowpro could easily take advantage of multicore synthesis.
