Which would you prefer for your next embedded project: flexible system elements so that you can easily customize your specific design, or extra performance headroom in case you need more horsepower in the development cycle? Why should embedded engineers put themselves under undue development pressure and settle for one or the other? Soft processing and customizable IP offer the best of both worlds, integrating the concepts of custom design and co-processing performance acceleration into embedded design.
Embedded engineers often struggle with the challenges of improving performance or changing system characteristics after they have already completed the general architecture partition. Discrete processors offer a fixed selection of peripherals and a performance ceiling capped by clock frequency. It’s common to change processors in next-generation projects simply because the design needs an extra peripheral, or a different one than the fixed processor offers. If you want three UARTs in one design, it would be nice to simply configure your system that way rather than search data books for a chip with exactly those features. Software engineers know that they can spend days or weeks re-writing code just to get functions to run a little bit faster. Wouldn’t it be more productive to explore co-processing acceleration methods, which can quickly yield 5x, 20x or greater improvements, than to re-write software for a 1% speed-up on a traditional hard processor core?
Embedded FPGAs, by comparison, offer platforms upon which you can create a system with a choice of multiple customizable processor cores, flexible peripherals, and even co-processing offload engines. You now have the power to architect an uncompromised custom processing system to satisfy the most aggressive project requirements while maximizing system performance by implementing accelerated software instructions in the FPGA hardware. With FPGA fabric acceleration, not only do you start with flexible building blocks, but you can explore numerous ways to optimize performance well into the development cycle.
Soft processors are highly flexible because they can be built out of the logic gates of any Platform FPGA, and embedded engineers can customize the processing IP peripherals to meet their exact requirements. With customizable cores and IP, you create just the system elements you need without wasting silicon resources. When you build a processing system in a programmable device like an FPGA, you will not waste unused resources as you would in a discrete device, nor will you run out of limited peripherals if you require more than are offered (say your design requires three UARTs and your discrete device offers only one or two). Additionally, you are not trapped by your initial architecture assumptions; instead, you can continue to dramatically modify and tune your system architecture and adapt to changes in newly required features or changing standards.
A recently published Xilinx FIR filter design example, created for public workshops, used a 32-bit soft processing system configured with an optional internal IEEE 754-compliant floating-point unit (FPU), which provides a significant performance increase over software-only execution on the processor core. By including optional soft processing components, you can quickly improve the performance of your application.
A side benefit of these optional internal components is that they are fully supported by the accompanying C compiler, so source-code changes are unnecessary. In the FIR filter design example, including the FPU and recompiling the application yielded an immediate performance improvement, because calls to the external C floating-point library functions are automatically replaced with instructions that use the new FPU.
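To make the idea concrete, here is a minimal sketch of the kind of C code involved. The source is ordinary single-precision arithmetic; whether it compiles to software emulation library calls or to FPU instructions is purely a build-time choice. The `-mhard-float` flag named in the comment is an assumption about the MicroBlaze GCC toolchain, so check your compiler documentation.

```c
#include <stddef.h>

/* Straightforward single-precision FIR filter. On a MicroBlaze-class soft
 * processor, this same source compiles either to software floating-point
 * emulation calls or to FPU instructions depending on the build options
 * (e.g. mb-gcc -mhard-float -- flag name assumed, verify with your
 * toolchain). No source changes are needed either way. */
float fir(const float *coeff, const float *sample, size_t taps)
{
    float acc = 0.0f;
    for (size_t i = 0; i < taps; i++)
        acc += coeff[i] * sample[i];   /* one multiply-accumulate per tap */
    return acc;
}
```

The per-tap multiply-add is exactly the operation an FPU executes in a few cycles but a software emulation library expands into hundreds.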
Utilizing specialized hardware-processing components improves processor performance by reducing the number of cycles required to complete certain tasks by orders of magnitude over software recoding methods. The simple block diagram in Figure 1 represents a soft processing system with an internal FPU IP core, local memory, and choice of other IP peripherals such as a UART or a JTAG debugging port. Because the system is customizable, we could very well have implemented multiple UARTs or other IP peripheral cores from the supplied processor IP catalog, including a DMA controller, IIC, CAN, or DDR memory interface.
Figure 1 – Simple MicroBlaze processor block diagram
The IP catalog provides a wide variety of other processing IP – bridges, arbiters, interrupt controllers, GPIO, timers, and memory controllers – as well as customization options for each IP core, such as baud rate and parity, to optimize each element for features, performance, and size/cost. Additionally, you can configure the processing cores with respect to clock frequency, debugging modules, local memory size, cache, and other options. By merely turning on the FPU core option, we built a soft processing system that reduced our FIR implementation from 8.5 million CPU cycles to only 177,000 – a 48x performance improvement with no changes to the C source file.
In the next example, we’ll build an additional design module, implementing an IDCT engine for an MP3 decoder application that accelerates the application module by more than an order of magnitude. You can easily create both processor platform examples referenced here with a development kit like the one depicted in Figure 2. The integrated hardware/software development kit pictured includes a specific hardware reference board that directly supports both hard and soft processor designs. The kit also includes all of the compiler and FPGA design tools required, as well as an IP catalog and pre-verified reference designs.
With the addition of a JTAG probe and system cables, this kit allows you to have a working system up and running right out of the box before you start editing and debugging your own design changes. Development kits for various devices and boards are available from FPGA suppliers, as well as numerous distributors, and other third-party embedded partners.
Locate Bottlenecks and Implement Co-Processing
An intelligent tool suite is included in the pre-configured Embedded Development Kit example pictured in Figure 2. The tool suite is the Integrated Development Environment (IDE) used for creating the embedded HW/SW system. If you have a common reference board or create your own board description file, then intelligent tools can drive a design wizard to quickly configure your initial system.
Figure 2 – Integrated HW/SW Development Kit
Intelligent tools reduce errors and learning curves so you can focus your design time on adding value in the end application. After the basic configuration is created, you can spend your time iterating on the IP to customize your specific system and then develop your software applications.
A true “embedded” tool suite provides a powerful software development IDE based on the Eclipse framework for power coders. This environment is ideal for developing, debugging, and profiling code to identify the performance bottlenecks that hide in otherwise invisible code execution. These inefficiencies are often what cause a design to miss its performance goals, but they are hard to detect and often even harder to optimize.
Using techniques like in-lining code to reduce the overhead of excessive function calls, you can commonly improve application performance on the order of 1% to 5%. But with programmable platforms, more powerful design techniques now exist that can yield performance improvements of one or two orders of magnitude.
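The in-lining technique mentioned above can be sketched in a few lines of C. The function names and Q15 scaling below are illustrative choices, not taken from the article’s designs:

```c
/* Function-call overhead can dominate tiny hot-path routines. Marking
 * them "static inline" lets the compiler substitute the body at each
 * call site, removing the call/return sequence and enabling further
 * optimization across the boundary. */
static inline int scale_q15(int x, int k)
{
    return (x * k) >> 15;   /* fixed-point multiply with Q15 scaling */
}

int scale_block(int *buf, int n, int k)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        buf[i] = scale_q15(buf[i], k);  /* body inlined: no call overhead */
        sum += buf[i];
    }
    return sum;
}
```

Gains from this kind of tweak are real but modest – the few-percent range cited above – which is why the hardware co-processing techniques that follow are so attractive.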
Figure 3 shows a montage of embedded IDE views for performance analysis. This figure displays profiling information in a variety of forms so that you can identify trends or individual offending routines that spike on performance charts. Bar graphs, pie charts, and metric tables make it easy to locate and identify function and program inefficiencies so that you can take action on the routines that yield the most benefit for total system performance.
Figure 3 – Platform Studio embedded tool suite
Soft-Processor Cores with Their Own IP Blocks
For the MP3 decode example that I described earlier, we built a custom system (Figure 4), starting with the instantiation of multiple FPGA soft processors. Because we are using soft-core processor technology, we can easily build a system with more than one processor and balance the performance loading to yield an optimal system.
In Figure 4 you can clearly see the top soft processor block with its own bus and peripheral set separate from the bottom processor block and its own, different peripheral set. The top section of the design runs embedded Linux as an OS with full file system support, enabling access to MP3 bitstreams from a network. We offloaded the decoding and playing of these bitstreams to a second soft processor design, where we added tightly-coupled processor offload engines for the DCT/IMDCT (forward and inverse modified discrete cosine transform) functions and two high-precision MAC units.
Figure 4 – MicroBlaze MP3 decoder example
The IMDCT block handles the transform stage of data compression and decompression, offloading it from the processor to reduce execution time. DCT/IMDCT are two of the most compute-intensive functions in compression applications, so moving the whole function to its own co-processing block greatly improves overall system performance. Whereas the earlier FIR filter example implemented an internal FPU, the MP3 example combines flexible processing customizations with dedicated hardware outside the processor but within the FPGA.
Co-Processing + Customizable IP = Performance
By offloading compute-intensive software functions to co-processing “hard instructions,” an engineer can develop an optimal balance for maximizing system performance. Figure 4 also shows a number of IP peripherals for the Linux file system module, including a UART, an Ethernet MAC, and memory controllers. The coder/decoder application block, by comparison, uses different IP customized for different system capabilities.
The Xilinx MicroBlaze 32-bit soft processor, one of EDN’s Hot 100 Products of 2005, uses the IEC (International Engineering Consortium) award-winning Xilinx Platform Studio (XPS) embedded tool suite for hardware/IP configuration and software development. In this example, the second MicroBlaze soft core is slaved to the first MicroBlaze processor and acts as a task engine decoding the MP3 bitstream. The decoder algorithm is accelerated by specific IP cores implemented in the FPGA fabric and connected directly to the processor through the Xilinx Fast Simplex Link (FSL) interface. This co-processing design technique takes advantage of the parallel, high-speed nature of FPGA hardware compared to the slower, sequential instruction execution of a stand-alone processor.
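From the software side, the FSL offload pattern looks like the sketch below: the processor streams operands to the co-processing engine and reads back a result. On real MicroBlaze hardware these accesses would use the `putfsl()`/`getfsl()` macros from the Xilinx `fsl.h` header (an assumption to verify against your EDK headers); here `fsl_put`/`fsl_get` are software stand-ins that model the engine so the pattern is runnable anywhere.

```c
/* Software stand-in for the co-processor. In hardware, the work happens
 * in parallel FPGA logic on the far side of the FSL FIFO; the processor
 * only sees the port accesses. The accumulator engine modeled here is a
 * deliberately simple placeholder for a block like the IMDCT core. */
static int coproc_acc;

static void fsl_put(int word) { coproc_acc += word; }  /* models putfsl() */
static int  fsl_get(void)                              /* models getfsl() */
{
    int result = coproc_acc;
    coproc_acc = 0;
    return result;
}

/* What used to be a large software function collapses to a short loop of
 * port writes plus one blocking read for the result. */
int offload_block(const int *data, int n)
{
    for (int i = 0; i < n; i++)
        fsl_put(data[i]);   /* stream operands to the engine */
    return fsl_get();       /* single read returns the computed result */
}
```

This is also where the code compaction discussed later comes from: many arithmetic instructions are replaced by a handful of port accesses.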
Direct linkage to the high-performance FPGA fabric introduces fast multiply-accumulate modules (LL_SH MAC1 and LL_SH MAC2 in Figure 4) to complement the dedicated IP for the DCT and IMDCT blocks. The long-long MAC modules provide higher precision while offloading the processor. You will also notice high-speed FSL links used by the AC97 controller core to interface to an external AC97 codec, enabling CD-quality audio input/output for the MP3 player.
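The precision benefit of a “long-long” MAC is easy to show in C. A 32x32-bit product can exceed 32 bits, so a 32-bit accumulator overflows or forces early rounding; a 64-bit accumulator keeps every bit. In the MP3 design this runs as a dedicated FPGA block, but a software model (the function name below is illustrative) captures the arithmetic:

```c
#include <stdint.h>

/* Multiply-accumulate with a 64-bit ("long long") accumulator. Each
 * 32x32-bit product is widened to 64 bits before summing, so no
 * intermediate overflow or precision loss occurs -- the property the
 * LL_SH MAC hardware blocks provide while also offloading the CPU. */
int64_t mac64(const int32_t *a, const int32_t *b, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int64_t)a[i] * b[i];   /* full 64-bit product */
    return acc;
}
```

A single product like 100000 × 100000 already exceeds the 32-bit range, which is exactly the case audio transform inner loops hit constantly.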
The co-processing system depicted in Figure 4 results in a cumulative 41x performance acceleration over the original software application with the additive series of component boosts. Comparing a pure “software-only” implementation (see the top horizontal bar as illustrated in Figure 5) to each subsequent stage of the hardware instruction instantiations, you can see how the performance improvements add up. Moving the IMDCT alone into hardware yields a 1.5x improvement, while adding the DCT as a hardware instruction moves the total to 1.7x. A larger improvement comes from implementing the long-long multiply-accumulate in hardware, reaching 8.2x.
Figure 5 – Co-Processing acceleration results
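The staircase pattern in Figure 5 follows Amdahl’s law: accelerating one module only helps in proportion to that module’s share of total runtime, which is why profiling first matters. A short sketch makes the relationship explicit (the example fraction and speedup in the test are hypothetical, for illustration only):

```c
/* Amdahl's law: overall speedup when a fraction f of the runtime is
 * accelerated by a factor s, and the remaining (1 - f) is unchanged.
 * Even an enormous s cannot beat 1 / (1 - f), so each additional module
 * moved into hardware unlocks headroom the previous steps could not. */
double overall_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

This is why the IMDCT step alone gives only 1.5x, while converting all of the hot modules compounds to the 41x figure.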
Implementing all of the software modules in hardware through co-processing techniques yields a 41x improvement – with the added benefit of reducing the application code size. Because we have removed multiple functions requiring a large number of instructions and replaced them with a single instruction to read or write the FSL port, we have far fewer instructions and have thus achieved some code compaction. In the MP3 application section, for example, we saw a 20% reduction in the code footprint.
Best of all, design changes through intelligent tools like Xilinx Platform Studio are easy, quick, and can still be implemented well into the product development cycle. Software-only methods of performance improvement are time-consuming and usually have a limited return on investment. By balancing the partition among software application, hardware implementation, and co-processing in a programmable platform, you can attain far better results.
Based on the examples described in this article, we were able to easily customize a full embedded processing system, edit the IP for the optimal balance of feature/size/cost, and additionally squeeze out huge performance gains where none appeared possible. The Xilinx Virtex-4 and Spartan-3 Platform FPGAs offer flexible soft-processor solution options that can be designed and refined late into the development cycle. The soft-processor core combined with an intelligent tool suite provides a powerful combination to kick-start your embedded designs.
Co-processing techniques, such as implementing compute-intensive software algorithms as high-performance FPGA hardware instructions, allow you to accelerate module performance by 2x, 10x, or as much as 40x+ in our common industry example. Imagine what this could do for your next design – think about the headroom and flexibility available late in the development cycle, or the ability to proactively plan improvements for the next generation of your product.
For more information on Xilinx embedded processing solutions, visit www.xilinx.com/processor.