
DSP to a Different Drummer

Stretch Debuts S6

FPGAs, possibly the most powerful processors in existence today by many measures, were never intended to be processors at all. Conceived as general-purpose programmable logic devices, their simple arrays of logic elements were not designed to accelerate computationally intensive tasks. Instead, they fell into the role in a Rube-Goldbergian fashion – evolving their processing prowess over a decade or more with engineering’s version of emergent behavior rather than ground-up purposeful design.

Today, FPGAs are used in many applications for heavy-lifting chores such as video processing, often paired with conventional processors charged with handling the housekeeping. While the hardware is clearly capable, the ad-hoc programming model is anything but straightforward. Typically, FPGAs achieve “accelerator” status only after the algorithm goes through a bizarre set of transformations from some high-level description to a bit-true, cycle-accurate representation, then to register-transfer-level descriptions, then through logic synthesis to a netlist, and finally to a bitstream after place-and-route and mapping. Does this seem convoluted yet?

A number of companies have been founded in the past several years on the proposition that there’s got to be an easier way. Surely we could design something from the ground up to accelerate computationally intense algorithms, avoiding the inherent complexities of FPGA design while maintaining the performance and power benefits of highly parallel processor architectures.

Stretch, one of the best known of those companies, has just launched a new processor, dubbed the S6, that incorporates Tensilica’s Xtensa core with Stretch’s Instruction Set Extension Fabric (ISEF). At this level, the architecture resembles many modern FPGAs – particularly those with integrated processor cores. The ISEF has the same function in an accelerated processor that the LUT fabric has in an FPGA – to provide parallel processing power for algorithmic hot spots. This part of the S6 architecture is nothing new – Stretch’s previous-generation processor took the same approach, although the new family has significant refinements. The big news with S6 is a selection of dedicated acceleration elements pre-designed for multimedia processing algorithms.

With the new S6 design, Stretch has created almost a multimedia ASSP coloring book with plenty of blank pages to bring in the specific value-added features of any given multimedia application. With the new dedicated programmable accelerator, highly optimized implementations of several functions commonly used for video and image processing, software-defined wireless protocols, and audio processing are pre-implemented and available to the programmer through APIs (meaning no hardware design is required). Specifically, there are blocks for motion estimation (for video encoding), entropy encoding (for H.264 video), encryption/decryption (including AES, DES, 3DES), and audio CODECs (AAC, AC3, MP3, etc.).

As an example of the acceleration available, Stretch says that the motion estimation function can process an entire 16x16-pixel macroblock sum-of-absolute-differences (SAD) operation in a single cycle. With the pipelining available in the hardware accelerator, 256 SAD calculations are made on each clock cycle, and all 41 possible H.264 sub-macroblock combinations are returned with corresponding SAD values.
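To get a feel for how much work that single cycle replaces, here is a minimal software sketch of the same computation in plain C. It is not Stretch's API: the function names, the flat 256-byte macroblock layout, and the output ordering are all assumptions made for illustration. The figure of 41 is taken to be the usual H.264 partition count for one macroblock (one 16x16, two 16x8, two 8x16, four 8x8, eight 8x4, eight 4x8, and sixteen 4x4).

```c
#include <stdint.h>
#include <stdlib.h>

#define MB_SIZE 16          /* a macroblock is 16x16 pixels */
#define NUM_PARTITIONS 41   /* 1 + 2 + 2 + 4 + 8 + 8 + 16 */

/* SAD over a w x h block whose top-left corner is (x, y).
 * cur and ref are row-major 16x16 macroblocks (hypothetical layout). */
uint32_t block_sad(const uint8_t *cur, const uint8_t *ref,
                   int x, int y, int w, int h)
{
    uint32_t sad = 0;
    for (int r = y; r < y + h; r++)
        for (int c = x; c < x + w; c++)
            sad += (uint32_t)abs((int)cur[r * MB_SIZE + c] -
                                 (int)ref[r * MB_SIZE + c]);
    return sad;
}

/* Fills out[] with one SAD per partition, ordered by shape:
 * 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, 4x4 (width x height).
 * Returns the number of values written. */
int macroblock_sads(const uint8_t *cur, const uint8_t *ref,
                    uint32_t out[NUM_PARTITIONS])
{
    static const int shapes[7][2] = {
        {16, 16}, {16, 8}, {8, 16}, {8, 8}, {8, 4}, {4, 8}, {4, 4}
    };
    int n = 0;
    for (int s = 0; s < 7; s++) {
        int w = shapes[s][0], h = shapes[s][1];
        for (int y = 0; y < MB_SIZE; y += h)
            for (int x = 0; x < MB_SIZE; x += w)
                out[n++] = block_sad(cur, ref, x, y, w, h);
    }
    return n;
}
```

This version revisits every pixel once per shape, so it performs seven passes of 256 absolute differences for each candidate position. A hardware block can instead compute the 256 differences once and sum them along several adder trees at the same time, which is presumably the kind of saving behind the single-cycle claim.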

With S6, Stretch has also streamlined the process of combining processors into a multiprocessor array. The new architecture employs what Stretch calls a processor array interface that abstracts away inter-chip communications, allowing multiple devices to collaborate. While this in no way trivializes the difficult task of optimizing a multiprocessor environment, it should at least simplify the housekeeping on the hardware side.

While every “alternative” architecture processor can crunch mountains of data with ferocious abandon, the problem of getting that data on and off the chip is often overlooked. Stretch has built a robust and purposeful I/O infrastructure into S6, aimed at the specific class of applications they’re targeting and designed to reduce the need for external I/O devices and glue logic. It includes a “quad data port” made up of four 10-bit data ports that can interface directly to a variety of video devices using standards like BT656 (standard definition) and BT1120 (high definition) or directly with raw video. Also included are a 10/100/1000 Ethernet MAC, DDR/DDR2 interfaces, serial interfaces, and eGIB/GPIO.

The ISEF also warrants discussion, as the basic arrangement of the ISEF and the Xtensa processor is what most differentiates the Stretch approach. The Tensilica Xtensa VLIW processor core is configured to allow complete hardware accelerators for user-supplied algorithms to be called as single instructions by the host processor. This makes high-level invocation of the accelerator a pure programming task with no direct connection to the mechanism of hardware acceleration. Since the parts of programs that need accelerating typically amount to arithmetic problems in large, looped arrays, the Stretch ISEF is built to mirror just that. The S6 ISEF contains 4096 ALUs that can be configured in various groupings to achieve a variety of bit widths. There are also 64 dedicated 8x16 hardware multipliers that can be ganged to create wider operations. A rich set of registers, muxes, priority encoders and shifters allows these datapath elements to be configured in a variety of ways and provides storage for coefficients and intermediate results.
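The idea of ganging narrow multipliers into wider ones comes down to simple arithmetic, sketched below in plain C. How the ISEF actually wires its multipliers together is not described here, so treat this purely as an illustration of the identity involved: a 16x16 product is the sum of two 8x16 partial products, one of them shifted left by eight bits. The function names are hypothetical and the sketch handles unsigned operands only.

```c
#include <stdint.h>

/* Model of one narrow multiplier: 8-bit by 16-bit, unsigned.
 * The product fits in 24 bits. */
uint32_t mul8x16(uint8_t a, uint16_t b)
{
    return (uint32_t)a * (uint32_t)b;
}

/* A 16x16 unsigned multiply built from two 8x16 multiplies:
 *   a * b = (a_hi * b) * 256 + (a_lo * b)
 * The result is at most 65535 * 65535, which fits in 32 bits. */
uint32_t mul16x16_ganged(uint16_t a, uint16_t b)
{
    uint32_t lo = mul8x16((uint8_t)(a & 0xFF), b);  /* low byte of a  */
    uint32_t hi = mul8x16((uint8_t)(a >> 8), b);    /* high byte of a */
    return (hi << 8) + lo;
}
```

In hardware the two partial products would be formed in parallel and combined by an adder, so the wider multiply costs two multipliers rather than two cycles.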

To provide programmatic access to all that hardware, Stretch provides a tool flow that automates the creation of extension instructions in the ISEF. The algorithm is described in C, and the compiler and cycle-accurate simulator allow the programmer to see the performance improvement provided by the ISEF accelerator. Code fragments are tagged for compilation to the ISEF, and the compiler searches for opportunities for parallelism by unrolling loops and analyzing data dependencies. The parallelized structure is sent through a place-and-route algorithm to map onto ISEF resources. At this point, the compiler generates a report showing how much of the ISEF’s resources is used and how much remains for further acceleration.
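For readers who have not seen loop unrolling and dependency analysis up close, the plain C sketch below shows the kind of transformation involved, done by hand on a dot product. It does not use Stretch's tagging syntax or tool flow, which are not shown here; the function names are hypothetical, and the inputs are assumed small enough that the 32-bit accumulators do not overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward form: every iteration reads and writes the same
 * accumulator, so each add depends on the one before it. */
int32_t dot_serial(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];
    return acc;
}

/* Unrolled by four with separate accumulators. The four
 * multiply-accumulates in the loop body have no dependencies on
 * each other, so parallel hardware could issue them together. */
int32_t dot_unrolled4(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += (int32_t)x[i]     * h[i];
        a1 += (int32_t)x[i + 1] * h[i + 1];
        a2 += (int32_t)x[i + 2] * h[i + 2];
        a3 += (int32_t)x[i + 3] * h[i + 3];
    }
    for (; i < n; i++)              /* leftover elements when n % 4 != 0 */
        a0 += (int32_t)x[i] * h[i];
    return a0 + a1 + a2 + a3;
}
```

Both functions return the same value. The difference is that the second form exposes four independent chains of work, which is the property a parallelizing compiler looks for before mapping a loop onto an array of ALUs and multipliers.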

Since the ISEF can be reconfigured very quickly (Stretch claims 27 microseconds), code can be architected so that the ISEF is reconfigured for different tasks at different stages of the process. This effectively multiplies the acceleration available when complex algorithms can be broken down into sequential stages, each with highly parallelizable components.
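Whether staged reconfiguration pays off is a matter of simple bookkeeping, sketched below in C. The 27-microsecond figure is Stretch's claim as quoted above; everything else (the function names, the assumption of one reload before every stage, and the stage timings a caller would pass in) is hypothetical and would have to come from profiling a real design.

```c
#define ISEF_RECONFIG_US 27.0   /* Stretch's claimed reconfiguration time */

/* Total time for a sequence of accelerated stages, assuming the ISEF
 * is reloaded once before each stage. stage_us[i] is the accelerated
 * run time of stage i in microseconds. */
double staged_total_us(const double *stage_us, int n_stages)
{
    double total = 0.0;
    for (int i = 0; i < n_stages; i++)
        total += ISEF_RECONFIG_US + stage_us[i];
    return total;
}

/* Reconfiguring per stage is worthwhile only if the accelerated
 * stages plus all reloads still beat the software-only time. */
int reconfig_pays_off(double software_us, const double *stage_us,
                      int n_stages)
{
    return staged_total_us(stage_us, n_stages) < software_us;
}
```

With three stages of 100, 50, and 23 microseconds, the reloads add 81 microseconds for a total of 254. That is an easy win against a software-only run of a millisecond, but a loss against one of 200 microseconds, so stages need to be long enough to amortize each reload.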

The ISEF is fed by 32 registers, each 128 bits wide. It also contains 64KB of embedded RAM distributed through the fabric in 32 banks of 2KB. In much the same fashion as FPGA block RAM, this RAM can be used as storage for intermediate data, coefficients, etc. The RAM is mapped into the address space so it can be loaded directly by the processor, and it also has a dedicated DMA channel so it can be loaded without processor intervention, sidestepping a frequent bottleneck in some other custom-instruction-based acceleration schemes.

As with any exotic, alternative-architecture processor, we believe the proof of the pudding for S6 will be in the programming model. If Stretch has created an architecture whose features can be easily harnessed by the average programmer, the performance, price, and power consumption of the device will be compelling enough reasons for adoption in a wide variety of devices. If the programming model proves too cumbersome, however, all that elegant hardware will be waiting for only the few with the wherewithal to dive in and master yet another new high-performance computing programming paradigm.
