We seem to have this love/hate relationship with software. We like it because it’s so durn flexible and we can implement changes quickly. Well, unless we really hose things up. But you have to be a real goober to need a change that takes longer than a hardware change*. I mean a real hardware change, like a silicon spin.
But there’s a problem with software: it’s slow. Since the dawn of time, man has labored to find ways to make software go faster. OK, so maybe more like the dawn of Timex, but whatever, for a long time. Some of that effort has been on making software execute more efficiently through better coding and better-tuned processors. But another avenue pursued has been that of using hardware to make software faster, and this has taken a number of forms.
One of the earliest was the dedicated coprocessor – essentially a specialized microprocessor that did a few things very fast, like floating point math. These are still found in computers today for handling graphics, disk access, and other such peripheral tasks, but the coupling has decreased, meaning that the processor can hand things off and assume the job will get done right without much supervision.
Another, more recent, approach has been to take the tasks that were going to be handled by software and simply turn them into hardware. This is the so-called C-to-RTL problem. It can be useful in a number of scenarios, but for the moment let's focus on the situation where software is being replaced wholesale with hardware.
C-to-RTL has had a rocky history and, quite frankly, has left a bitter taste in a lot of mouths. Anyone coming up with a better C-to-RTL mousetrap is automatically faced with the immediate hurdle of convincing people why they’re not just more of the same old thing. While the focus of this article isn’t on C-to-RTL in this context, the stumbling blocks that became an issue for early adopters apply in some measure to any practical attempt to accelerate software:
• The full ANSI C language wasn’t supported.
• The software had to be annotated to provide guidance to the RTL synthesis tool, especially with respect to timing.
• The software had to be restructured locally to provide a more efficient order of operations – and in some cases had to be completely restructured.
• The software had to be partitioned manually to provide better parallelism.
Success in any new C-to-RTL solution or, in fact, in any other software-accelerating solution, can partly be predicted by its ability to address these four items. More recent forays into this space – the most visible of which are Mentor’s Catapult C and Synfora’s Pico Express – have taken on full ANSI C support and don’t require annotation. Efficient implementation may still require specific coding styles, and partitioning is manual. Cadence has just announced participation in this market as well; we’ll discuss that in an upcoming article.
The other thing to keep in mind is that hardware, in any practical modern context, is simply not as flexible as software. Automatic synthesis of hardware from C certainly speeds things up as compared to a laborious manual process, but it’s still not as fast and easy as tweaking a few lines of C code and re-running a software build.
Another approach to accelerating software is to create small custom hardware accelerators for critical portions of the software. There are some very specific computer-science-purist distinctions between an accelerator and a coprocessor, but for our purposes here this can also be thought of as a coprocessor. C-to-RTL tools might be used to create the hardware for the accelerators. In the FPGA space, both Altera and Xilinx have made use of this approach with their soft processor cores. Altera’s Nios processor includes the concept of the “custom instruction,” which boils down to a call to a specialized hardware block. Xilinx’s MicroBlaze processor has what they call an APU, or Auxiliary Processor Unit, which is effectively a place where a hardware coprocessor can be attached, again allowing user-defined instructions. The success of this approach with respect to the four key elements is directly tied to the abilities of the C-to-RTL tool being used to generate the accelerator.
A further approach extending these methods appeared with Teja Technologies’ ability to generate a custom multicore fabric out of soft FPGA processor cores, each of which could have one or more custom hardware accelerators. This provided more flexibility in partitioning between hardware and software. Software could be somewhat accelerated through the use of a multicore architecture and, where necessary, further accelerated through the use of hardware coprocessors. The benefit was the ability to keep more of the code in software, keeping that flexibility intact. While it succeeded on the first two out of the four criteria, its multicore nature slowed adoption, and the third criterion was never really proven out. Partitioning was manual.
Yet another approach is now on offer from CriticalBlue. They’ve got a different spin on the problem: rather than using a standard processor and generating dedicated hardware to accelerate things, they generate a custom processor and microcode for use as a coprocessor. They actually started this by transforming binary code and generating an FPGA processor but have since shifted their sights towards SoC design. They have also more recently announced the ability to generate multiple coprocessors, further leveraging parallelism in the algorithms being accelerated.
What’s interesting here is that they are, strictly speaking, using C to generate RTL. But the RTL they generate isn’t logic implementing the algorithm, but rather logic implementing a processor that is optimized for the algorithm; alongside the RTL, microcode is also created containing the actual algorithm.
CriticalBlue attempts to exploit two independent levels of parallelism: what they call instruction-level and task-level. Instruction-level parallelism reflects low-granularity opportunities to do more than one thing at the same time. If two sums are being generated, neither of which depends on the other, they can be done in parallel; the fact that the instructions are initially specified as sequential is nothing but an artifact of our programming paradigm, which says that everything must come before or after something else. The Cascade tool analyzes the code to identify this kind of parallelism and then generates a dedicated processor/microcode set. Because the algorithm is still implemented in software, strictly speaking, it can be changed just like any other software program. But the coprocessor is optimized for specific code, so changing only the code may result in a less efficient implementation, since the processor wasn’t re-optimized to account for it.
Their Multicore Cascade tool further takes advantage of higher-granularity task parallelism. Loosely speaking, you might think of a task as something that could be assigned to a separate thread since it can be executed in parallel with a different task. Each such task can get its own coprocessor. In this case, CriticalBlue provides analysis tools that allow the designer to play with different solutions, identify and eliminate dependencies, and optimize the task partition. They don’t automatically create the solution, since they claim that their customers don’t want their software automatically partitioned.
A big part of the focus of this solution is the ability to take in legacy code. This means they have to, at the very least, be successful at the first three criteria above. Any significant changes required to millions of lines of legacy code will be a non-starter; it ain’t gonna happen. They do handle the full ANSI C language, and no annotation is required. There’s no claim that specific coding styles are required for good results – but then again, none of the solutions openly advertises that, even when it’s necessary. So that requires some history and user testimonials to validate.
As with the other approaches to acceleration, the partitioning remains the guaranteed manual step. This is really one of the huge multicore bugaboos – how to exploit task-level parallelism and assign independent tasks automatically. To date, no adequate solution has been found. At least one very large company with lots of money to throw at the problem spent a lot of effort on a possible solution, only to change its mind at the last minute and abandon it. It is to the point where those “in the know” will sagely counsel that this is not a tractable problem, that no algorithm will ever be able to do as good a job as a human. Which, of course, means that some solution will probably arise, but if it does, it’s not evident today. (Remember when early FPGA adopters insisted on being able to twiddle their own bits since no place-and-route tool would be able to best a human?)
So this gives designers three different ways to try to optimize software (one of the four above is no longer on the market). They all involve some level of compromise between software flexibility and hardware speed, and all have at least some proving, if not improving, to do with respect to the need to structure C code carefully for synthesis into some accelerated form. They all have something of a nichey feel to them, one of those things where either the user has to be open to lots of new ideas or the pain has to be high enough to nudge a recalcitrant conservative designer kicking and screaming into something unorthodox.
Let’s check back in a couple years and see which solutions have prevailed.
* I say this safely from my writer’s keyboard, smug in the assurance that there isn’t, at this very moment, a gaggle of coders amassing to launch a march, pitchforks and torches in hand, to manifest their collective wrath at being called goobers. Please call my mother if you haven’t heard from me in a couple weeks.