It has long been a dream of big-picture systems engineering visionaries to crack the barrier between hardware and software. Traditionally, the effort required to accelerate software features by turning them into dedicated hardware has been high, and it's not something easily done at the last minute when, for example, a performance bottleneck that needs hardware acceleration is uncovered right before release.
This is, of course, the impetus behind the many flavors of C-to-RTL that have arisen over the years. From the early days of Handel-C to the present, much effort has been made, and, unfortunately, less-than-proportionate success achieved, in an attempt to make the implementation of a software algorithm fungible: given a function, it could be realized in software or hardware with no significant extra effort either way. While FPGAs have been the primary targets for the resulting hardware implementations, higher-end tools have also targeted dedicated silicon for functions cast in hardware on an SoC.
The critical issue here is primarily the concurrency that hardware can provide, since the popular ways of expressing software are serial. Software and software-like expressions of functionality are generally thought of as higher level and more abstract, putting all of this activity squarely in the ESL space. But there's nothing inherently abstract about writing serially. A serial description of a problem may be easier for the human brain to process in some cases, but, for the most part, engineers are taught serial programming languages, and the languages that allow abstraction are serial, so engineers have much less of a natural tendency to think of algorithms in a concurrent manner. The engineers who do think concurrently are typically hardware engineers, and there are far fewer of those than software engineers. They're also typically deep in the implementation phases of the development process and less involved in high-level architectural design.
Concurrency has actually moved well out of the hardware space with the arrival of multicore. High-level prognostications in that arena suggest that a simple transition from popular languages like C to something new that could better capture parallelism is nowhere in sight. Pundits talk about the need to create an entirely new programming paradigm that captures concurrency, and to start in the universities so that students are imbued early on with a sense of parallel operation. In other words, this is a long way off and won't be available to help the ESL problem for some time.
So we're left with the reality of today's serial programming languages and parallel hardware implementations. And, like it or not, the software language of choice here is C, or C++ for higher-level descriptions or code that's not quite as speed-critical. And we're back to the C-to-RTL world.
The fundamental weakness of the initial C-to-RTL approaches was that, as a developer, you had to annotate the code to indicate where the parallelism was: which things could be concurrent, and which couldn’t. This amounted to “timing” the program and required syntax that was not part of the standard C language.
This gave way to the ability to handle untimed C programs in later tools, but still, certain aspects of the C language, notably pointers and dynamic memory allocation, couldn't be handled; you had to convert to an alternative implementation, like arrays, which is more deterministic. You don't even have the concept of a heap in a hardware design (I suppose you could, but that's very unlike typical hardware setups), so some fundamental notions of C programming, and especially C++ programming, become troublesome. Alternative structures involving streaming, for example, make possible some convenient hardware implementations but still require potentially significant rework of the original software program.
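As a hypothetical illustration of the kind of rework involved (the function and sizes here are invented, not taken from any particular tool's coding guidelines), a heap-based routine might be restructured for synthesis like this:

```c
#include <string.h>

#define MAX_SAMPLES 256

/* A pure-software version might malloc() its output buffer:
 *     int *histogram(const unsigned char *data, size_t n);
 * With no heap in hardware, the synthesizable rewrite bounds
 * every buffer at compile time; a fixed-size array like bins[]
 * can then map onto deterministic on-chip memory. */
void histogram(const unsigned char data[MAX_SAMPLES], int n, int bins[256]) {
    memset(bins, 0, 256 * sizeof(int));
    for (int i = 0; i < n && i < MAX_SAMPLES; i++) {
        bins[data[i]]++;
    }
}
```

The behavior is unchanged for inputs within the bound, but the compile-time limit is exactly the kind of restructuring decision a software author never had to make before.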
So the “litmus” test, if you will, of any new C-to-RTL tool has become whether it will accept untimed C, whether it requires any annotations with non-C constructs, whether programs need restructuring, and whether the entire ANSI C language is supported. Can I just take my C program and push a button and be done?
There is one simplifying factor in this environment: so far there hasn’t actually been much need for converting entire programs from C to hardware. More typically, there is a function or routine within a larger overall program that needs accelerating in hardware, essentially creating what are variously called accelerators, coprocessors, or custom instructions (with custom hardware to implement them). This simplifies the problem in that the entire computing structure doesn’t have to be created; it can be assumed that there is a microprocessor handling most of the computing and that some interface mechanism allows that processor to hand off a function to an auxiliary hardware unit for acceleration. But this just simplifies the interface; it doesn’t solve the fundamental problem of making C easier to implement.
Given this landscape, one company has taken an approach that is rather counter-intuitive. Mitrionics has targeted high-performance computing as fertile ground for hardware acceleration, but it doesn't work directly from a C program. The company gives a nod to the preference for C by offering a language that resembles C in style and has C in the name, Mitrion-C, but makes no claim that it is a minor variant of C. It is a different language. Obviously, having it resemble C as much as possible simplifies things, but departing from C gives Mitrionics the freedom to create a language that expresses concurrency. The downside is that an engineer has to rewrite ANSI C programs in Mitrion-C.
Clearly, the extra work will make sense only if there is a highly compelling need for the resulting speed-up. The kinds of applications being targeted by Mitrionics include scientific computing (say, genomics) and search/data mining. It is very much text- and integer-oriented; floating point isn’t an area they focus on. They have assembled a Turing-complete set of about 60 functional hardware tiles that are guaranteed to hook together and have been hardware tested. The kinds of tiles include control flow, I/O, and arithmetic elements. The Mitrion-C program is turned into an assembly of these tiles to guarantee a functionally correct implementation of the algorithm.
The Mitrionics tools environment tries to accomplish the so-far unreachable goal: the ability of a software engineer to create hardware without learning how to design hardware. There is no explicit declaration of any hardware in the Mitrion-C program, so all of the hardware is inferred by the compiler. This then generates a VHDL IP core that can be implemented in an FPGA (one of their tools packages actually contains the Xilinx tools for direct implementation into Xilinx FPGAs). So in theory, a complete FPGA design can be done with no hardware knowledge or FPGA experience.
Given the history in this space, potential purchasers of software-to-hardware tools have come away less than impressed with much of what they’ve seen in the past, so there’s a huge amount of skepticism when a new idea like this comes around. If you want any credibility, you pretty much have to be able to do a simple proof that this works, all the way from writing straightforward code to getting it into the FPGA. So demonstration/development boards have been an important part of answering the prevailing “Show me” demands of prospects. Mitrionics and Nallatech recently announced a dev board using a Virtex-4 LX device along with 1 GB of DRAM and 16 MB of SRAM. The board connects to the host via an 8-lane PCI Express connector for a theoretical full duplex bandwidth of 2 GB/s, and it leverages the Intel QuickAssist Accelerator Abstraction Layer to implement the task handoff from the host processor. This provides a means of testing and developing programs from software to hardware in order to exercise the entire flow.
Having demonstrated that this works, there remain two key challenges that may not be hard barriers but can make selling something like this feel somewhat like running in waist-deep water. One is a practical consideration; the other is a cultural issue.
From a practical standpoint, it's simply a matter of good bookkeeping and attention to detail to take a proven algorithm for converting a language to a hardware representation, create the RTL, and compile and load that into an FPGA. A well-oiled tool flow can do all of that at the push of a button, and software guys can push buttons as well as hardware guys. As long as the tools work as advertised, they're handling all the tasks that would otherwise demand serious hardware design experience, so there's no problem with a software guy pushing the button.
There's a gotcha, however, as any FPGA designer can attest. It's work to create an FPGA design by hand, but the real struggle starts when you fill the FPGA too full, with "too" meaning much beyond 60 or 70%. A design may fit, but it may not meet performance. This is where pushing buttons stops and the serious work of constraining and pushing things around begins. Timing closure isn't trivial; if it were, it wouldn't be such a hot buzzword in the EDA world. And it's not just extra work: it requires hardware knowledge, an understanding of what's going on inside an FPGA, and specific tricks and tips that depend on which FPGA is being used.
So if a software guy compiles a function into a device that's overly large and has plenty of space and speed headroom, things will likely go swimmingly. But if cost is an issue, such that the smallest, slowest possible FPGA is being considered, then making it work will definitely take someone deep into hardware territory. This is, as yet, not a well-solved problem. Mitrionics gets involved with those sorts of designs to help the customer out; that's fine as far as it goes, but it is not scalable.
The higher-level cultural issue that can stand in the way of this design model is the unfortunate existence of hardware and software silos. Hardware guys are told to build a platform; software guys are told to write programs for that platform. The software guys know how the platform should work, and code accordingly. If a program doesn’t work, perhaps fingers can be pointed, but if it’s a hardware issue, either the hardware guy fixes the problem or the software guy codes around the problem.
What never happens is that the software guy changes the hardware him- or herself. This may sound silly, but there's an underlying assumption for software writers that the hardware works, or is at least stable. You might think that giving the software guy a way to change or optimize or fix hardware (using only software-like methods) would be a good thing. But, in fact, software folks often find themselves wading into unfamiliar and uncomfortable waters there and are reluctant to take on that power. Writing code to a clean platform works great. Writing software when there is no hardware, when your software will become the hardware, removes a bit of the comfort factor. It's like having a pill that contains all of the nutrients needed for balanced nutrition: in theory it might work, but there's something unsettling about replacing the good ol' proven messy make-dinner-and-then-eat process. You have to have a really good reason to make the big mental shift.
There's nothing suggesting that this mindset has to remain, but it does create some resistance to the uptake of hardware-design-by-software-programmers methodologies. If such methodologies prove their worth in an increasing range of applications, they could become a more standard approach. For now, focused attention on key markets, like the scientific computing apps that Mitrionics is targeting, appears to be the best way to make hardware inroads into the software world.