In today’s world, where time to market is a crucial criterion for success, most semiconductor companies depend on automation tools to get results faster. The first step towards making this possible is to use tools to develop virtual platforms for architecture exploration and software/firmware validation. However, this does not reduce the effort of transforming these architectures into register transfer level (RTL) code, which most engineers continue to do manually. Coding the RTL manually also creates a discontinuity between models written in two different languages: a high-level language and a hardware description language. This language gap introduces a high potential for error and requires a great deal of effort to translate one language into the other. To maintain the required level of productivity and innovation, the electrical engineering industry needs a tool that automates the implementation of a hardware design from a high-level description.
High-level Synthesis (HLS), also known as algorithmic or behavioral synthesis, is just such an automated design process. It interprets an algorithmic description of a desired behavior and creates an RTL hardware description. Later, this is synthesized to the gate level using a logic synthesis tool. Catapult® C Synthesis from Mentor Graphics® is the HLS tool we used to reduce the time-to-market and eliminate the C-to-RTL gap.
Specific to our design, the main challenge was to demonstrate a working video analytics algorithm on an FPGA within an extremely aggressive schedule of six weeks. On previous projects, we overcame similar challenges by using Catapult C to produce a hardware design from an existing C algorithm. Evaluation results from these projects have shown a 71 percent reduction in effort, compared to a manual RTL approach. This was due to a significant reduction in the implementation and verification effort.
The VLCD System on FPGA
Virtual Line Crossing Detection (VLCD) is one of the video analytics algorithms used in video surveillance applications. It detects and tracks animate and inanimate objects introduced into a predefined boundary.
The VLCD algorithm supports the following features:
- Recognition and display of peripheral infringement
- Adaptive background modeling
- Efficient foreground extraction
- Object separation
- Input video frame sampling ratio of 4:4:4
- Input video frame size of 352 x 288 (CIF format)
The VLCD algorithm executes the following steps.
- Background Synthesis: An adaptive model that learns and models background scenes statistically and heuristically. The background is determined and continuously maintained by the synthesis of the non-changing areas of video streams, establishing a static part of a camera image.
- Foreground Component Extraction: Separates foreground from background object components.
- Blob Connectivity: Connects foreground object components (blobs) to form meaningful foreground objects.
- Shape Segmentation: Shapes that move in the foreground are separated from the background by image processing algorithms.
- Object Tracking: Object shapes are tracked within the field of view of the camera continuously until they disappear from the scene completely.
- Feature Extraction: Object unique identifiers, such as frequency signatures, can be determined.
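The first two steps above can be illustrated with a minimal sketch. It assumes a per-pixel running-average background model and a fixed luma threshold. The article does not state which statistical model the VLCD algorithm uses, so the function names, `kLearnShift`, and `kThreshold` are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical parameters, not taken from the VLCD design.
constexpr int kLearnShift = 4;   // background adapts by 1/16 of the difference per frame
constexpr int kThreshold  = 30;  // luma difference that marks a pixel as foreground

// Running-average background update.
// bg holds the background luma scaled by 2^kLearnShift (fixed point),
// which avoids divisions and maps well to hardware.
void update_background(std::vector<int>& bg,
                       const std::vector<std::uint8_t>& frame) {
    assert(bg.size() == frame.size());
    for (std::size_t i = 0; i < bg.size(); ++i) {
        bg[i] += static_cast<int>(frame[i]) - (bg[i] >> kLearnShift);
    }
}

// Foreground extraction: a pixel is foreground (1) when it differs from
// the modeled background by more than kThreshold, otherwise background (0).
std::vector<std::uint8_t> extract_foreground(
        const std::vector<int>& bg,
        const std::vector<std::uint8_t>& frame) {
    assert(bg.size() == frame.size());
    std::vector<std::uint8_t> mask(frame.size(), 0);
    for (std::size_t i = 0; i < frame.size(); ++i) {
        const int diff =
            std::abs(static_cast<int>(frame[i]) - (bg[i] >> kLearnShift));
        mask[i] = diff > kThreshold ? 1 : 0;
    }
    return mask;
}
```

The resulting binary mask is the kind of input that the blob connectivity and shape segmentation steps would consume.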
The VLCD system was developed using VLCD RTL, Altera’s NIOS processor, and peripheral IP. The VLCD system was tested on an Altera FPGA evaluation board.
The hardware setup of the VLCD system consists of a PAL camera with an SDI interface, a BT656 Spectrum Digital Video Decoder daughter card, an Altera DSP board with an Altera Stratix II EP2S180 FPGA, and a VGA monitor.
The block diagram of the VLCD system architecture consists of four main modules:
- Frame Grabber
- VLCD Algorithm RTL with Wrapper
- VGA Controller
- NIOS II Processor
Frame Grabber is a module that processes and manipulates the frame data required for the VLCD algorithm. It is responsible for the following:
- Converting BT-656 clocked video format to Avalon-ST video format
- Chroma re-sampling from 4:2:2 to 4:4:4
- Converting interlaced video format to progressive video format
- Scaling the video frames from 768 x 576 to 352 x 288
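As an illustration of the chroma re-sampling step, the sketch below converts one line of 4:2:2 video to 4:4:4 by replicating each chroma pair for two neighboring pixels. It assumes the BT.656 sample order Cb, Y, Cr, Y and uses simple replication. The Altera IP used in the design may filter or interpolate instead, and `Pixel444` and `upsample_422_to_444` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One fully sampled pixel (4:4:4).
struct Pixel444 {
    std::uint8_t y;
    std::uint8_t cb;
    std::uint8_t cr;
};

// Converts one line of 4:2:2 samples, assumed to be ordered Cb Y0 Cr Y1,
// into 4:4:4 pixels by sharing each Cb/Cr pair between two luma samples.
std::vector<Pixel444> upsample_422_to_444(const std::vector<std::uint8_t>& line) {
    assert(line.size() % 4 == 0);  // two pixels per 4-byte group
    std::vector<Pixel444> out;
    out.reserve(line.size() / 2);
    for (std::size_t i = 0; i < line.size(); i += 4) {
        const std::uint8_t cb = line[i];
        const std::uint8_t y0 = line[i + 1];
        const std::uint8_t cr = line[i + 2];
        const std::uint8_t y1 = line[i + 3];
        out.push_back(Pixel444{y0, cb, cr});
        out.push_back(Pixel444{y1, cb, cr});
    }
    return out;
}
```

Replication is the cheapest option in hardware because it needs no multipliers; an interpolating filter would give smoother chroma at the cost of extra logic.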
The VLCD Algorithm RTL with Wrapper instantiates the VLCD algorithm RTL and the on-chip memory as a DPRAM. An Avalon-ST compatible memory controller manages the reading and writing of frame data from and into the DPRAM.
The VGA Controller is a module that processes the frame data for VGA display. It is responsible for the following:
- Converting the frame data from the YCbCr to the RGB color space
- Scaling the video streams from 352 x 288 to 640 x 480
- Buffering the video frames into on-board SDRAM
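The color space conversion in the first item can be sketched in fixed-point arithmetic. The example assumes studio-range BT.601 YCbCr (Y from 16 to 235) and the widely used 8-bit integer approximation of its coefficients. The article does not say which coefficients or range the VGA Controller uses, so treat `ycbcr_to_rgb` and its constants as assumptions.

```cpp
#include <cassert>
#include <cstdint>

struct Rgb {
    std::uint8_t r;
    std::uint8_t g;
    std::uint8_t b;
};

// Clamp an intermediate result to the 8-bit range.
static std::uint8_t clip8(int v) {
    return static_cast<std::uint8_t>(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// Studio-range BT.601 YCbCr to RGB, coefficients scaled by 256
// (1.164 -> 298, 1.596 -> 409, 0.392 -> 100, 0.813 -> 208, 2.017 -> 516).
Rgb ycbcr_to_rgb(int y, int cb, int cr) {
    assert(y >= 0 && y <= 255 && cb >= 0 && cb <= 255 && cr >= 0 && cr <= 255);
    const int c = y - 16;
    const int d = cb - 128;
    const int e = cr - 128;
    Rgb out;
    out.r = clip8((298 * c + 409 * e + 128) / 256);
    out.g = clip8((298 * c - 100 * d - 208 * e + 128) / 256);
    out.b = clip8((298 * c + 516 * d + 128) / 256);
    return out;
}
```

With these constants, studio black (16, 128, 128) maps to RGB (0, 0, 0) and studio white (235, 128, 128) maps to RGB (255, 255, 255).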
The NIOS II Processor application software configures the decoder chip (daughter card) through the I2C interface.
The VLCD system was built using Altera IP provided as MegaCore functions in the SOPC builder of the Quartus II tool. A wrapper was developed around the VLCD RTL to support the Avalon-ST compatible interface.
Our HLS Flow
Our high-level synthesis flow begins with the existing video analytics C algorithm, generates production-ready, “correct by functionality” RTL, and then verifies the generated RTL. The following describes the GUI-based, step-by-step flow of Catapult C to generate our hardware.
Set Global Hardware Constraints: By accessing the interface control setting in the Catapult C constraint editor, we easily chose a design frequency of 100 MHz and the Start and Done process-level handshake signal settings. When selected, Start becomes a 1-bit input and Done becomes a 1-bit output in the Catapult C generated RTL. These signals provide a CPU-like handshake. After reset, the Catapult C generated RTL sets Done high and then waits for Start to go high. After Start goes high, the Done signal goes low until the hardware has finished processing. This interface was not part of the original C code, and Catapult C allowed us to add it without modifying that code.
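The Start/Done protocol described above can be modeled behaviorally in a few lines. This is only a sketch of the handshake as the text describes it: the class name `HandshakeModel` and the fixed `busy_cycles` count are hypothetical, and the exact cycle timing of the generated RTL is not given in this article.

```cpp
#include <cassert>
#include <cstdint>

// Behavioral model of a Start/Done process-level handshake.
// Done is high while idle, drops when Start is sampled high,
// and returns high once the (hypothetical) busy period has elapsed.
class HandshakeModel {
public:
    explicit HandshakeModel(std::uint32_t busy_cycles)
        : busy_cycles_(busy_cycles), remaining_(0) {
        assert(busy_cycles > 0);
    }

    // After reset the block is idle, so Done is high.
    void reset() { remaining_ = 0; }

    // Advance one clock edge with the given Start level.
    // Returns the value of Done after the edge.
    bool clock(bool start) {
        if (remaining_ > 0) {
            --remaining_;               // still processing; Start is ignored
        } else if (start) {
            remaining_ = busy_cycles_;  // begin processing, Done goes low
        }
        return done();
    }

    bool done() const { return remaining_ == 0; }

private:
    std::uint32_t busy_cycles_;
    std::uint32_t remaining_;
};
```

A CPU-style driver would wait for Done to be high, pulse Start, and then poll Done until it returns high before reading the results.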
We also used the constraint editor to select Stratix II EP2S180, with a speed grade value of 3, as the target device technology. Catapult C uses a technology library to estimate the area and delay information of the components (such as adders and multipliers) required to build the datapaths. Based on the clock frequency settings, Catapult C schedules the operations in the appropriate clock cycles such that they meet timing in the target technology and are optimally time-shared in the resulting micro-architecture. Generating RTL this way reduces the overall design cycle by avoiding multiple iterations caused by timing problems.
Since our algorithm used arrays, we also selected RAM and ROM libraries. Large arrays should always be mapped to memories. If these technology-specific components are made available for synthesis, Catapult C automatically takes advantage of them whenever required. The user can still override the Catapult C default selection if desired.
Set Architectural Constraints: In Catapult C, the top-level C/C++ function’s arguments become the design’s I/Os. Designers use the interface synthesis constraints to specify properties of each hardware port. Interface synthesis gives control over parameters (such as bandwidth, protocol, timing, and handshake) associated with each I/O. We used the Catapult C architectural constraints editor to select the I/O resource type and memory mapping settings. We mapped the I/O array to a single-port memory and the internal design array variables to ROMs. For this design, latency was an important design goal, so we ran the Catapult C scheduler in Latency mode. In this mode, Catapult C produces the fastest possible implementation optimized for area.
One of our major design tasks was to strike a balance between parallelism and pipelining in the design to achieve the right tradeoff between area and speed. Catapult C allows designers to explore these structures through loop unrolling and pipelining. This is accomplished using constraints. We had some loops with a very small iteration count and a small number of operations. We completely unrolled these loops to exploit parallelism, which enabled a faster implementation with a very small increase in area. By default, Catapult C always tries to merge two sequential loops to minimize the latency of the design with a negligible increase in area. Again, the designer can always change this by applying the appropriate constraints in Catapult C. We kept loop merging on in order to achieve the design’s targeted latency.
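To make the mapping from C++ to hardware ports more concrete, here is a hypothetical top-level function written in the style an HLS tool expects. It is not the VLCD algorithm: `quantize_frame` and `kQuantLevels` are invented for illustration. The array argument stands in for the I/O array that we mapped to a single-port memory, and the constant table stands in for an internal array that would be mapped to a ROM. The directive shown in the comment is an assumption about how the top level might be marked; in our flow the equivalent settings were made in the constraint editor.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kWidth = 352;                     // CIF frame width
constexpr int kHeight = 288;                    // CIF frame height
constexpr int kNumPixels = kWidth * kHeight;    // 101,376 pixels

// Constant lookup table: a candidate for mapping to a ROM.
static const std::uint8_t kQuantLevels[4] = {0, 85, 170, 255};

// Hypothetical top-level function. Its argument `frame` would become the
// design's I/O and could be mapped to a memory interface by interface
// synthesis. A Catapult project might mark it with a directive such as
//   #pragma hls_design top
// (shown as a comment here so the sketch compiles with any C++ compiler).
void quantize_frame(std::uint8_t frame[kNumPixels]) {
    assert(frame != nullptr);
    for (int i = 0; i < kNumPixels; ++i) {
        // The top two bits of each pixel select one of four output levels.
        frame[i] = kQuantLevels[frame[i] >> 6];
    }
}
```

Because the function reads and writes the same array in place, a single memory port per access is sufficient in this sketch, which is the kind of property that makes a single-port memory mapping feasible.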
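The effect of fully unrolling a short loop can be shown with a small, hypothetical example. `dot_rolled` is the kind of loop with a small iteration count that we selected for unrolling; `dot_unrolled` is written by hand to show the structure the tool would effectively build, with four multipliers working in parallel and a balanced adder tree. In Catapult C the unrolling is requested with a constraint (a directive such as `#pragma hls_unroll yes` is assumed here and shown only as a comment), not by rewriting the code.

```cpp
#include <cassert>

constexpr int kTaps = 4;  // hypothetical small iteration count

// Rolled form: one multiply-accumulate per iteration, so the hardware
// can share a single multiplier but needs kTaps iterations.
int dot_rolled(const int a[kTaps], const int b[kTaps]) {
    assert(a != nullptr && b != nullptr);
    int acc = 0;
    // An unroll constraint (for example `#pragma hls_unroll yes`, assumed)
    // would be applied to this loop in the HLS tool.
    for (int i = 0; i < kTaps; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}

// Fully unrolled form: all products are independent and can be computed
// in the same clock cycle, at the cost of more multipliers and adders.
int dot_unrolled(const int a[kTaps], const int b[kTaps]) {
    assert(a != nullptr && b != nullptr);
    return (a[0] * b[0] + a[1] * b[1]) + (a[2] * b[2] + a[3] * b[3]);
}
```

Both functions compute the same result; only the amount of hardware and the number of cycles differ, which is the area versus speed tradeoff discussed above.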
Schedule Analysis: The relative runtime of each loop was analyzed using the Gantt chart in the Catapult Schedule window. In the first iteration, scheduling was done without unrolling any loops. Then we analyzed the loops. The loops that had high runtimes and a smaller iteration count were selected for full unrolling during the second iteration. By doing this, we were able to reduce the latency from 14 ms to 2.18 ms when processing one video frame in the common intermediate format (CIF).
Generate RTL: After meeting the timing goals of the design, the RTL was automatically generated by Catapult C Synthesis.
SystemC Verification: The RTL verification flow provided by Catapult C is also an automated process. The Catapult SC Verify flow generated the required SystemC based infrastructure, leveraging the existing C testbench and applying the same stimulus to both the golden C/C++ model and the generated RTL. Then it compared the output to verify that the generated RTL functions the same as the golden C++ model.
No additional effort was required to develop an RTL verification environment. This significantly reduced the effort and time required for RTL verification compared to the traditional manual RTL design methodology.
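The principle behind this verification flow, applying the same stimulus to a golden model and to the implementation and comparing the outputs, can be sketched in plain C++. This is not the SystemC infrastructure that Catapult generates; `Model` and `count_mismatches` are hypothetical names, and the implementation under test is represented here by an ordinary function rather than simulated RTL.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// A model maps one input sample to one output sample.
using Model = std::function<int(int)>;

// Feeds every stimulus value to both models and counts the cases in which
// the outputs differ. Zero mismatches means the two models agree on this
// stimulus set (it does not prove equivalence for other inputs).
std::size_t count_mismatches(const Model& golden,
                             const Model& dut,
                             const std::vector<int>& stimuli) {
    assert(golden && dut);
    std::size_t mismatches = 0;
    for (int s : stimuli) {
        if (golden(s) != dut(s)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

The value of reusing the existing C testbench is that the stimulus set and the golden reference already exist, so only the comparison harness needs to be generated.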
Performance Statistics and Results
The common intermediate format (CIF) of the YUV sequence was used as the VLCD algorithm’s input video signal. The target design was generated using the global hardware and architectural constraint settings described in the “Our HLS Flow” section.
The generated RTL design had a latency of 218,418 cycles at a 100 MHz clock frequency. In that time the design processes 101,376 pixels (one 352 x 288 CIF frame), which is equivalent to one frame per 2.18 ms.
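The relationship between the cycle count, the clock frequency, and the frame time can be checked with a short calculation. The helper names below are hypothetical; the numbers passed to them in the note that follows are the ones reported in this section.

```cpp
#include <cassert>
#include <cstdint>

// Latency in milliseconds for a given cycle count and clock frequency.
double latency_ms(std::uint64_t cycles, double clock_hz) {
    assert(clock_hz > 0.0);
    return static_cast<double>(cycles) / clock_hz * 1e3;
}

// Average number of clock cycles spent per pixel of one frame.
double cycles_per_pixel(std::uint64_t cycles, std::uint64_t pixels) {
    assert(pixels > 0);
    return static_cast<double>(cycles) / static_cast<double>(pixels);
}
```

With the reported figures, `latency_ms(218418, 100e6)` evaluates to about 2.18 ms, and `cycles_per_pixel(218418, 352 * 288)` to about 2.15 cycles per pixel.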
Quartus fitter resource statistics for the Stratix II EP2S180 device are given in Table 1.
Adopting the HLS design flow for RTL generation and verification reduced our time-to-market significantly while maintaining performance figures close to those of a traditional RTL flow. HLS has consistently increased productivity (in terms of reduced effort) by around 70 percent on all of the projects on which we’ve used it. The automation of the RTL generation process closes the gap between the high-level language and the hardware description language and eliminates the errors introduced by manual coding. Automation of RTL verification and reuse of high-level verification components at the RTL contribute further to productivity and design quality.
The FPGA was used as a prototyping path, and the design is being moved to an ASIC using the Catapult C constraint editor, which supports the selection of different synthesis tool constraints and target technologies.