Yet Another Twist on Making Software Faster

CriticalBlue Creates Custom Coprocessors

We seem to have this love/hate relationship with software. We like it because it’s so durn flexible and we can implement changes quickly. Well, unless we really hose things up. But you have to be a real goober to need a change that takes longer than a hardware change*. I mean a real hardware change, like a silicon spin.

But there’s a problem with software: it’s slow. Since the dawn of time, man has labored to find ways to make software go faster. OK, so maybe more like the dawn of Timex, but whatever, for a long time. Some of that effort has been on making software execute more efficiently through better coding and better-tuned processors. But another avenue pursued has been that of using hardware to make software faster, and this has taken a number of forms.

One of the earliest was the dedicated coprocessor – essentially a specialized microprocessor that did a few things very fast, like floating point math. These are still found in computers today for handling graphics, disk access, and other such peripheral tasks, but the coupling has decreased, meaning that the processor can hand things off and assume the job will get done right without much supervision.

Another more recent approach has been to take the tasks that were going to be handled by software and simply turn them into hardware. This is the so-called C-to-RTL problem. It can be useful in a number of scenarios, but for the moment let’s focus in on the situation where software is being replaced wholesale with hardware.

C-to-RTL has had a rocky history and, quite frankly, has left a bitter taste in a lot of mouths. Anyone coming up with a better C-to-RTL mousetrap immediately faces the hurdle of convincing people that it’s not just more of the same old thing. While the focus of this article isn’t on C-to-RTL in this context, the stumbling blocks that became an issue for early adopters apply in some measure to any practical attempt to accelerate software:

• The full ANSI C language wasn’t supported.

• The software had to be annotated to provide guidance to the RTL synthesis tool, especially with respect to timing.

• The software had to be restructured locally to provide a more efficient order of operations – and in some cases had to be completely restructured.

• The software had to be partitioned manually to provide better parallelism.
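To make the first and third bullets a bit more concrete, here is a minimal sketch of the kind of local restructuring involved. It assumes, purely for illustration, that the troublesome constructs are heap allocation and recursion; the source doesn't say which parts of ANSI C the early tools rejected, and the function names and the size `N` are made up for the example.

```c
#include <assert.h>
#include <stdlib.h>

#define N 8  /* illustrative fixed problem size (an assumption) */

/* Software-friendly helper: recursion is natural in C but has no
 * obvious fixed-size hardware equivalent. */
static int sum_recursive(const int *v, int n)
{
    return n == 0 ? 0 : v[n - 1] + sum_recursive(v, n - 1);
}

/* "Before": uses the heap and recursion.
 * Returns -1 if allocation fails (a sum of squares is never negative). */
int sum_of_squares_sw(const int *in)
{
    int *tmp;
    int total;

    assert(in != NULL);
    tmp = malloc(N * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    for (int i = 0; i < N; i++)
        tmp[i] = in[i] * in[i];
    total = sum_recursive(tmp, N);
    free(tmp);
    return total;
}

/* "After": same result, restructured into one loop with a fixed bound,
 * no heap, and no recursion. */
int sum_of_squares_hw(const int in[N])
{
    int total = 0;

    assert(in != NULL);
    for (int i = 0; i < N; i++)
        total += in[i] * in[i];
    return total;
}
```

Both functions compute the same value; the second is simply written in a shape that maps more directly onto a fixed amount of hardware. Doing that by hand across a large code base is exactly the burden the bullets describe.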

Success in any new C-to-RTL solution or, in fact, in any other software-accelerating solution, can partly be predicted by its ability to address these four items. More recent forays into this space – the most visible of which are Mentor’s Catapult C and Synfora’s Pico Express – have taken on full ANSI C support and don’t require annotation. Efficient implementation may still require specific coding styles, and partitioning is manual. Cadence has just announced participation in this market as well; we’ll discuss that in an upcoming article.

The other thing to keep in mind is that hardware, in any practical modern context, is simply not as flexible as software. Automatic synthesis of hardware from C certainly speeds things up as compared to a laborious manual process, but it’s still not as fast and easy as tweaking a few lines of C code and re-running a software build.

Another approach to accelerating software is to create small custom hardware accelerators for critical portions of the software. There are some very specific computer-science-purist distinctions between an accelerator and a coprocessor, but for our purposes here an accelerator can also be thought of as a coprocessor. C-to-RTL tools might be used to create the hardware for the accelerators. In the FPGA space, both Altera and Xilinx have made use of this approach with their soft processor cores. Altera’s Nios processor includes the concept of the “custom instruction,” which boils down to a call to a specialized hardware block. Xilinx’s MicroBlaze processor has what they call an APU, or Auxiliary Processor Unit, which is effectively a place where a hardware coprocessor can be attached, again allowing user-defined instructions. The success of this approach with respect to the four key elements is directly tied to the abilities of the C-to-RTL tool being used to generate the accelerator.
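From the software side, a custom instruction looks like an ordinary function call that happens to be very fast. The sketch below is one hedged way to picture it: `USE_CUSTOM_INSN`, `hw_bitrev32`, and `BITREV32` are hypothetical names invented for this example, not any vendor’s API, and bit reversal is just a stand-in for whatever operation is worth moving into hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain-C reference: reverse the bit order of a 32-bit word.
 * Takes 32 loop iterations in software. */
uint32_t bitrev32_sw(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);  /* move x's lowest bit into r */
        x >>= 1;
    }
    return r;
}

/* HYPOTHETICAL hook: a build targeting a core with the accelerator would
 * define USE_CUSTOM_INSN and supply hw_bitrev32(). Otherwise the same
 * call site falls back to the software routine. */
#ifdef USE_CUSTOM_INSN
uint32_t hw_bitrev32(uint32_t x);      /* assumed hardware-backed wrapper */
#define BITREV32(x) hw_bitrev32(x)
#else
#define BITREV32(x) bitrev32_sw(x)
#endif

/* Application code does not care which implementation is behind the macro. */
void reverse_buffer(uint32_t *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        buf[i] = BITREV32(buf[i]);
}
```

The point of the indirection is that the calling code stays in C and stays editable; only the hot operation is swapped out.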

A further approach extending these methods appeared with Teja Technologies’ ability to generate a custom multicore fabric out of soft FPGA processor cores, each of which could have one or more custom hardware accelerators. This provided more flexibility in partitioning between hardware and software. Software could be somewhat accelerated through the use of a multicore architecture and, where necessary, further accelerated through the use of hardware coprocessors. The benefit was the ability to keep more of the code in software, keeping that flexibility intact. While it succeeded on the first two out of the four criteria, its multicore nature slowed adoption, and the third criterion was never really proven out. Partitioning was manual.

Yet another approach is now on offer from CriticalBlue. They’ve got a different spin on the problem: rather than using a standard processor and generating dedicated hardware to accelerate things, they generate a custom processor and microcode for use as a coprocessor. They actually started this by transforming binary code and generating an FPGA processor but have since shifted their sights towards SoC design. They have also more recently announced the ability to generate multiple coprocessors, further leveraging parallelism in the algorithms being accelerated.

What’s interesting here is that they are, strictly speaking, using C to generate RTL. The RTL they generate, however, isn’t logic implementing the algorithm; it’s logic implementing a processor that is optimized for the algorithm. Alongside the RTL, microcode containing the actual algorithm is also created.
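The split between “processor” and “microcode” can be pictured with a toy model. This is only a sketch under heavy assumptions: the two-operation machine, the `struct uop` format, and the program names are all invented here and say nothing about what CriticalBlue’s generated processors actually look like. It just shows a fixed engine whose behavior is set by a table that can be swapped.

```c
#include <assert.h>
#include <stddef.h>

#define NREGS 4

/* The toy machine supports only the operations "the algorithm" needs. */
enum op { OP_ADD, OP_MUL, OP_HALT };

/* One microcode word: operation, destination register, two sources. */
struct uop { enum op op; int dst, a, b; };

/* The "processor": fixed logic that steps through a microcode table. */
void run(const struct uop *prog, int regs[NREGS])
{
    for (size_t pc = 0; prog[pc].op != OP_HALT; pc++) {
        const struct uop *u = &prog[pc];
        assert(u->dst >= 0 && u->dst < NREGS);
        assert(u->a >= 0 && u->a < NREGS && u->b >= 0 && u->b < NREGS);
        switch (u->op) {
        case OP_ADD: regs[u->dst] = regs[u->a] + regs[u->b]; break;
        case OP_MUL: regs[u->dst] = regs[u->a] * regs[u->b]; break;
        default:     return;
        }
    }
}

/* "Microcode" A: r3 = (r0 + r1) * r2 */
const struct uop add_mul_prog[] = {
    { OP_ADD, 3, 0, 1 },
    { OP_MUL, 3, 3, 2 },
    { OP_HALT, 0, 0, 0 },
};

/* "Microcode" B: r3 = r0 * r1 + r2, same engine, different table. */
const struct uop mul_add_prog[] = {
    { OP_MUL, 3, 0, 1 },
    { OP_ADD, 3, 3, 2 },
    { OP_HALT, 0, 0, 0 },
};
```

Changing the table changes the algorithm without touching `run()`, which is the flexibility argument. The flip side is that an engine built with only ADD and MUL is a poor fit for a new algorithm that wants something else.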

CriticalBlue attempts to exploit two independent levels of parallelism: what they call instruction-level and task-level. Instruction-level parallelism reflects low-granularity opportunities to do more than one thing at the same time. If two sums are being generated, neither of which depends on the other, they can be done in parallel. The fact that the instructions are initially specified as sequential is nothing but an artifact of our programming paradigm, which says that everything must come before or after something else. The Cascade tool analyzes the code to identify these kinds of parallelism and then generates a dedicated processor/microcode set. Because the algorithm is still implemented in software, strictly speaking, it can be changed just like any other software program. But the processor is optimized for specific code, so changing only the code may result in a less efficient implementation, since the processor wasn’t re-optimized to account for the new code.
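The two-sums case is easy to show in C. In the sketch below (function names are illustrative, not taken from any tool), the first loop carries two statements that never touch each other’s data, so hardware with two adders could execute them in the same cycle. The second loop is the counterexample: each step needs the previous result, so there is nothing to overlap.

```c
#include <assert.h>

/* Two independent accumulations written one after the other.
 * The source order is sequential, but there is no data dependency
 * between the two statements in the loop body. */
void two_sums(const int *a, const int *b, int n, int *sum_a, int *sum_b)
{
    int sa = 0, sb = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        sa += a[i];   /* uses only sa and a[i] */
        sb += b[i];   /* uses only sb and b[i] */
    }
    *sum_a = sa;
    *sum_b = sb;
}

/* A dependent chain: every iteration needs the h produced by the
 * previous one, so the multiply-adds cannot be done side by side. */
unsigned chained(const unsigned *a, int n)
{
    unsigned h = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++)
        h = h * 31u + a[i];
    return h;
}
```

An analysis tool looking for instruction-level parallelism is, loosely, sorting code into these two buckets.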

Their Multicore Cascade tool further takes advantage of higher-granularity task parallelism. Loosely speaking, you might think of a task as something that could be assigned to a separate thread since it can be executed in parallel with a different task. Each such task can get its own coprocessor. In this case, CriticalBlue provides analysis tools that allow the designer to play with different solutions, identify and eliminate dependencies, and optimize the task partition. They don’t automatically create the solution, since they claim that their customers don’t want their software automatically partitioned.
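One common textbook way to decide whether two tasks can safely run side by side is to compare what each reads and writes (often called Bernstein’s conditions). The sketch below applies that check; it is not a description of how Multicore Cascade works internally, and the `struct task` layout, buffer names, and example tasks are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative shared buffers, one bit each. */
enum { BUF_RAW = 1u << 0, BUF_DECODED = 1u << 1,
       BUF_FILTERED = 1u << 2, BUF_STATS = 1u << 3 };

struct task {
    const char *name;
    uint32_t reads;    /* bitmask of buffers the task reads  */
    uint32_t writes;   /* bitmask of buffers the task writes */
};

/* Two tasks are independent if neither writes something the other
 * reads or writes. */
bool can_run_in_parallel(const struct task *x, const struct task *y)
{
    assert(x != NULL && y != NULL);
    return (x->writes & y->reads)  == 0 &&
           (x->reads  & y->writes) == 0 &&
           (x->writes & y->writes) == 0;
}

/* Hypothetical pipeline: filter consumes decode's output; stats only
 * looks at the raw input. */
const struct task decode_task = { "decode", BUF_RAW,     BUF_DECODED  };
const struct task filter_task = { "filter", BUF_DECODED, BUF_FILTERED };
const struct task stats_task  = { "stats",  BUF_RAW,     BUF_STATS    };
```

Here `decode` and `stats` could each get their own coprocessor, while `filter` has to wait for `decode`. Removing a dependency, say by giving `filter` its own copy of the data, is the kind of change a designer would explore by hand.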

A big part of the focus of this solution is the ability to take in legacy code. This means they have to, at the very least, be successful at the first three criteria above. Any significant changes required to millions of lines of legacy code will be a non-starter; it ain’t gonna happen. They do handle the full ANSI C language, and no annotation is required. There’s nothing saying that specific coding styles are required to be effective – but then again, none of the solutions openly advertise that, even if it’s necessary. So that requires some history and user testimonials to validate.

As with the other approaches to acceleration, the partitioning remains the guaranteed manual step. This is really one of the huge multicore bugaboos – how to exploit task-level parallelism and assign independent tasks automatically. To date, no adequate solution has been found. At least one very large company with lots of money to throw at the problem spent a lot of effort on a possible solution, only to change its mind at the last minute and abandon it. It is to the point where those “in the know” will sagely counsel that this is not a tractable problem, that no algorithm will ever be able to do as good a job as a human. Which, of course, means that some solution will probably arise, but if it does, it’s not evident today. (Remember when early FPGA adopters insisted on being able to twiddle their own bits since no place-and-route tool would be able to best a human?)

So this gives designers three different ways to try to optimize software (one of the four above is no longer on the market). They all involve some level of compromise between software flexibility and hardware speed, and all have at least some proving, if not improving, to do with respect to the need to structure C code carefully for synthesis into some accelerated form. They all have something of a nichey feel to them, one of those things where either the user has to be open to lots of new ideas or the pain has to be high enough to nudge a recalcitrant conservative designer kicking and screaming into something unorthodox.

Let’s check back in a couple years and see which solutions have prevailed.

* I say this safely from my writer’s keyboard, smug in the assurance that there isn’t, at this very moment, a gaggle of coders amassing to launch a march, pitchforks and torches in hand, to manifest their collective wrath at being called goobers. Please call my mother if you haven’t heard from me in a couple weeks.
