DSP to a Different Drummer

Stretch Debuts S6

FPGAs, possibly the most powerful processors in existence today by many measures, were never intended to be processors at all. Conceived as general-purpose programmable logic devices, their simple arrays of logic elements were not designed to accelerate computationally intensive tasks. Instead, they fell into the role in a Rube-Goldbergian fashion – evolving their processing prowess over a decade or more with engineering’s version of emergent behavior rather than ground-up purposeful design.

Today, FPGAs are used in many applications for heavy-lifting chores such as video processing, often paired with conventional processors charged with handling the housekeeping. While the hardware is clearly capable, the ad hoc programming model is anything but straightforward. Typically, FPGAs achieve “accelerator” status only after the algorithm goes through a bizarre series of transformations: from a high-level description to a bit-true, cycle-accurate representation, then to a register-transfer-level description, then through logic synthesis to a netlist, and finally, after mapping and place-and-route, to a bitstream. Does this seem convoluted yet?

A number of companies have been founded in recent years on the proposition that there’s got to be an easier way. Surely we could design something from the ground up to accelerate computationally intense algorithms, avoiding the inherent complexities of FPGA design while maintaining the performance and power benefits of highly parallel processor architectures.

Stretch, one of the best known of those companies, has just launched a new processor, dubbed the S6, that combines Tensilica’s Xtensa core with Stretch’s Instruction Set Extension Fabric (ISEF). At this level, the architecture resembles many modern FPGAs, particularly those with integrated processor cores. The ISEF has the same function in an accelerated processor that the LUT fabric has in an FPGA: to provide parallel processing power for algorithmic hot spots. This part of the S6 architecture is nothing new. Stretch’s previous-generation processor took the same approach, although the new family has significant refinements. The big news with S6 is a selection of dedicated acceleration elements pre-designed for multimedia processing algorithms.

With the new S6 design, Stretch has created something like a multimedia ASSP coloring book, with plenty of blank pages for the specific value-added features of any given multimedia application. In the new dedicated programmable accelerator, highly optimized implementations of several functions commonly used in video and image processing, software-defined wireless protocols, and audio processing come pre-implemented and are available to the programmer through APIs (meaning no hardware design is required). Specifically, there are blocks for motion estimation (for video encoding), entropy encoding (for H.264 video), encryption/decryption (including AES, DES, and 3DES), and audio codecs (AAC, AC3, MP3, etc.).

As an example of the acceleration available, Stretch says that the motion estimation function can process an entire 16X16-pixel macroblock sum-of-absolute-differences (SAD) operation in a single cycle. With the pipelining available in the hardware accelerator, 256 SAD calculations are made on each clock cycle, and all 41 possible H.264 sub-macroblock combinations are returned with corresponding SAD values.
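For readers who want to see what that single cycle replaces, here is a minimal software sketch of the same computation in plain C. It is not Stretch's API, and the function names (`sad_rect`, `sad_all_partitions`) and the flat 16-byte-stride buffer layout are assumptions made for illustration. The 41 figure comes from the H.264 partition shapes: one 16x16, two 16x8, two 8x16, four 8x8, eight 8x4, eight 4x8, and sixteen 4x4.

```c
#include <stdint.h>
#include <stdlib.h>

enum { MB = 16, NUM_PARTITIONS = 41 };

/* SAD over a w x h rectangle whose top-left corner is (x0, y0).
 * cur and ref each point to a 16x16 block stored row by row
 * (illustrative layout, not a Stretch data format). */
uint32_t sad_rect(const uint8_t *cur, const uint8_t *ref,
                  int x0, int y0, int w, int h)
{
    uint32_t sum = 0;
    for (int y = y0; y < y0 + h; y++)
        for (int x = x0; x < x0 + w; x++)
            sum += (uint32_t)abs((int)cur[y * MB + x] - (int)ref[y * MB + x]);
    return sum;
}

/* Fills out[] with one SAD per partition, ordered by shape:
 * 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, 4x4. Returns the number written. */
int sad_all_partitions(const uint8_t *cur, const uint8_t *ref,
                       uint32_t out[NUM_PARTITIONS])
{
    /* {width, height} of each partition shape */
    static const int shapes[7][2] = {
        {16, 16}, {16, 8}, {8, 16}, {8, 8}, {8, 4}, {4, 8}, {4, 4}
    };
    int n = 0;
    for (int s = 0; s < 7; s++) {
        int w = shapes[s][0], h = shapes[s][1];
        for (int y = 0; y < MB; y += h)
            for (int x = 0; x < MB; x += w)
                out[n++] = sad_rect(cur, ref, x, y, w, h);
    }
    return n;
}
```

Run sequentially, this loop nest touches every pixel seven times for each candidate position, which gives a sense of why a block that performs all 256 absolute differences at once is attractive in a motion search.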

With S6, Stretch has also streamlined the process of combining processors into a multiprocessor array. The new architecture employs what Stretch calls a processor array interface that abstracts away inter-chip communications, allowing multiple devices to collaborate. While this in no way trivializes the difficult task of optimizing a multiprocessor environment, it should at least simplify the housekeeping on the hardware side.

While every “alternative” architecture processor can crunch mountains of data with ferocious abandon, the problem of getting that data on and off the chip is often overlooked. Stretch has built a robust and purposeful I/O infrastructure into S6, aimed at the specific class of applications they’re targeting and designed to reduce the need for external I/O devices and glue logic. It includes a “quad data port” made up of four 10-bit data ports that can interface directly to a variety of video devices using standards such as BT.656 (standard definition) and BT.1120 (high definition), or directly with raw video. Also included are a 10/100/1000 Ethernet MAC, DDR/DDR2 interfaces, serial interfaces, and eGIB/GPIO.

The ISEF also warrants discussion, as the basic arrangement of the ISEF and the Xtensa processor is what most differentiates the Stretch approach. The Tensilica Xtensa VLIW processor core is configured to allow complete hardware accelerators for user-supplied algorithms to be called as single instructions by the host processor. This makes high-level invocation of the accelerator a pure programming task with no direct connection to the mechanism of hardware acceleration. Since the parts of programs that need accelerating typically amount to arithmetic problems in large, looped arrays, the Stretch ISEF is built to mirror just that. The S6 ISEF contains 4096 ALUs that can be configured in various groupings to achieve a variety of bit widths. There are also 64 dedicated 8X16 hardware multipliers that can be ganged to create wider operations. A rich set of registers, muxes, priority encoders and shifters allows these datapath elements to be configured in a variety of ways and provides storage for coefficients and intermediate results.
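To make "ganged to create wider operations" concrete, the sketch below builds an unsigned 16X16 multiply from two 8X16 partial products in plain C. This is only the textbook decomposition; how the ISEF actually combines its multipliers is not described here, and the function names `mul8x16` and `mul16x16_ganged` are hypothetical. Signed operands would need extra handling that is omitted.

```c
#include <stdint.h>

/* Stand-in for one 8x16 unsigned hardware multiplier (24-bit result). */
uint32_t mul8x16(uint8_t a, uint16_t b)
{
    return (uint32_t)a * b;
}

/* 16x16 unsigned multiply built from two 8x16 partial products:
 *   a * b = (a_hi * b) * 2^8 + (a_lo * b)
 * The largest result, 65535 * 65535, still fits in 32 bits. */
uint32_t mul16x16_ganged(uint16_t a, uint16_t b)
{
    uint32_t lo = mul8x16((uint8_t)(a & 0xFF), b); /* low byte of a  */
    uint32_t hi = mul8x16((uint8_t)(a >> 8), b);   /* high byte of a */
    return (hi << 8) + lo;
}
```

The two partial products are independent of each other, so in hardware they can be computed side by side and merged with a single shift and add.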

To provide programmatic access to all that hardware, Stretch provides a tool flow that automates the creation of extension instructions in the ISEF. The algorithm is described in C, and the compiler and cycle-accurate simulator allow the programmer to see the performance improvement provided by the ISEF accelerator. Code fragments are tagged for compilation to the ISEF, and the compiler searches for opportunities for parallelism by unrolling loops and analyzing data dependencies. The parallelized structure is sent through a place-and-route algorithm to map onto ISEF resources. At this point, the compiler generates a report showing how much of the ISEF resources are used and how much remains for further acceleration.
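As an illustration of the kind of transformation involved, consider a dot product, the core of most filters. The sketch below is ordinary C, not Stretch's extension syntax, and the function names are invented for the example. The first version carries a single accumulator from one iteration to the next. The second is unrolled by four with separate accumulators, which removes the dependency between neighboring iterations so that the four multiply-accumulates could proceed in parallel.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop: every iteration depends on the previous acc. */
int32_t dot_rolled(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];
    return acc;
}

/* Unrolled by four with independent accumulators. The four products in
 * the loop body have no data dependencies on one another. Assumes the
 * sums do not overflow 32 bits, as in the rolled version. */
int32_t dot_unrolled4(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += (int32_t)x[i]     * h[i];
        a1 += (int32_t)x[i + 1] * h[i + 1];
        a2 += (int32_t)x[i + 2] * h[i + 2];
        a3 += (int32_t)x[i + 3] * h[i + 3];
    }
    for (; i < n; i++)              /* leftover elements when n % 4 != 0 */
        a0 += (int32_t)x[i] * h[i];
    return a0 + a1 + a2 + a3;
}
```

Both functions return the same value for the same inputs. A parallelizing compiler aims to find this rewrite on its own, so the programmer writes only the first form.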

Since the ISEF can be reconfigured very quickly (Stretch claims 27 microseconds), code can be architected so that the ISEF is reconfigured for different tasks at different stages of the process. This effectively multiplies the acceleration available when complex algorithms can be broken down into sequential stages, each with highly parallelizable components.
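A quick way to reason about when a reconfiguration is worthwhile: the accelerated stage plus the reconfiguration time has to beat the unaccelerated stage. The helper below sketches that check in C using the 27-microsecond figure Stretch claims. The function names and the nanosecond units are choices made for this example, and the stage timings would have to come from profiling or from the cycle-accurate simulator.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reconfiguration time claimed by Stretch: 27 microseconds. */
#define RECONFIG_NS 27000u

/* True when running a stage on a freshly reconfigured ISEF is faster
 * than running it unaccelerated, once the reconfiguration is counted. */
bool reconfig_pays_off(uint64_t software_ns, uint64_t accelerated_ns)
{
    return accelerated_ns + RECONFIG_NS < software_ns;
}

/* Reconfiguration cost in clock cycles for a given clock rate in Hz.
 * The clock rate is supplied by the caller; none is assumed here. */
uint64_t reconfig_cycles(uint64_t clock_hz)
{
    return clock_hz * 27u / 1000000u;
}
```

For example, at an illustrative 100 MHz clock, 27 microseconds is 2,700 cycles, so a stage that saves only a few hundred cycles is not worth a reconfiguration, while a stage that runs for a millisecond in software and 100 microseconds accelerated clearly is.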

Data is loaded into the ISEF through 32 128-bit-wide registers. The ISEF also contains 64KB of embedded RAM distributed through the fabric in 32 banks of 2KB. In much the same fashion as FPGA block RAM, this RAM can be used as storage for intermediate data, coefficients, etc. This RAM is mapped into the address space so it can be loaded directly by the processor, and it also has a dedicated DMA channel so it can be loaded without processor intervention, which is a frequent bottleneck in some other custom-instruction-based acceleration schemes.

As with any exotic, alternative-architecture processor, we believe the proof of the pudding for S6 will be in the programming model. If Stretch has created an architecture whose features can be easily harnessed by the average programmer, the performance, price, and power consumption of the device will be compelling enough reasons for adoption in a wide variety of devices. If the programming model proves too cumbersome, however, all that elegant hardware will be waiting for only the few with the wherewithal to dive in and master yet another new high-performance computing programming paradigm.
