I still check occasionally on gizmodo.com or engadget.com, but I’ve pretty much given up hope. It’s now 2005. Throughout my childhood, I was convinced that by this year I’d be flying around in my jetpack, or at least driving my flying car. My personal robot is a bit closer to reality, but still not in the cards for the foreseeable future, unless I just want my floors vacuumed. The one-MIPS supercomputer I had visualized in my basement, however, complete with dumb terminal and tape drives, has far exceeded expectations.
Our view of the future is always distorted, even if we have pretty solid trends to extrapolate. Either an anticipated key technology fails to mature, or an unexpected breakthrough occurs, pushing a dark horse into the lead. In 1995, I was certain that RTL design would be dead by now, and that everyone would be designing digital hardware in behavioral VHDL. The enabling technology I was expecting was behavioral synthesis. Like those personal jetpacks, a few first-generation behavioral synthesis efforts got off the ground, but none ever proved sturdy or reliable enough that you’d want to strap in your career and light the fuse. Instead, RTL design has clung to life in the mainstream, bolstered by increasingly elaborate scaffolding that strains under the weight of today’s monstrous designs and their bloated semantics.
We are at one of those interesting junctures where the dominant technology has failed, and no successor has yet been named to fill its shoes. In the vortex created by this vacuum, a number of competing and complementary solutions are swirling around the problem, trying to grasp a stick of the popular consciousness before being sucked forever down the drain of doomed ideas. Contending for the crown are several methodologies ranging from evolution to revolution and from general-purpose to domain-specific.
General or Specific?
There are those who believe that the design methodology must segment as the level of abstraction goes up. The reasoning goes like this: RTL is general-purpose, but very close to the actual hardware implementation because it carries architectural details as part of the specification. As we abstract our specification language away from hardware and more toward design intent, we get closer to the problem that is being solved, and therefore our specification may take on the specialized semantics of the problem space instead of the general characteristics of the implementation technology.
The flagship example cited by this camp is digital signal processing (DSP). It is widely acknowledged that DSP designers like to work in the mathematical abstractions of a tool like Matlab. Most DSP designers would happily embrace a methodology that could go from Matlab’s algorithmic abstraction directly to optimized, parallelized hardware without requiring a deep understanding or detailed analysis of the underlying implementation technology. Companies like AccelChip are betting on this trend and working toward that idealized goal for the DSP crowd.
There are two problems with the domain-specific argument, however. The first of these is the lack of a solid second example. While DSP always comes up quickly in this conversation, it is usually followed by a grand hand-wave with a caption like “…and other domains have their own preferences as well.” Well? If there are, we haven’t found them yet. Please enlighten us.
The second problem is that there is a big, fat counter-example. The world of applications software has been turning out products for years with far more complexity and diversity than we hardware types have managed. Do they have a specialized programming language for each type of application? Initially, yes, but the world has been moving steadily toward plain old C++ as the ubiquitous, general-purpose standard. It makes sense that a workforce of well-trained programmers who are proficient in a robust, general-purpose programming language is preferable to many groups of specialized application-specific experts who are difficult to re-train and re-target as demands change.
Ideal vs. Real
In examining the question of evolution or revolution in design methodology, it pays to look way down the tracks to see where we’re really headed. Why do digital hardware designers even exist? Why is it ever necessary to develop custom, application-specific hardware? The answer probably lies in the inherent inefficiencies in the Von Neumann architecture. If the processing platform were perfect, we digital hardware types would be all finished with our jobs, and we could relax on the beach while programmers populated the world with new products based only on applications software.
However, the Von Neumann architecture is not perfect, requiring us to constantly develop customized hardware compensations. This problem could theoretically be solved by design tools. An ideal applications compiler should be able to take an algorithm described as software and partition it into the perfect mix of efficiently executing sequential code and optimized hardware accelerators with no outside intervention or guidance.
Don’t be buying your tickets to Maui and pre-ordering your mai-tais online just yet, though. These ideal compilers seem to be sitting on the assembly line just behind my flying car. While we’re waiting for them to hit the supermarket shelves, we’ll all still be gainfully employed trying to develop complex systems with new tools and methods that are probably the moral equivalents of those early attempts at flying machines.
The reality today is that we’re at the beginning of a new era. Some of the seminal solutions we describe here will succeed, and others will fail. Over time, there is likely to be a convergence of the techniques and approaches described below into something that comes close to achieving our idealized goals. Let’s take a look at what’s happening now.
What dialect will we all be coding in when the revolution comes? There are as many predictions as there are prognosticators. C, C++, Superlog, SystemVerilog, SystemC, Handel-C, M, and most of the other letters of the alphabet in combination with the word “system” are all struggling for traction. There are two basic approaches at play here. The first is to try to create an essentially new language that meets the needs of a particular design community, and then to get them to adopt it. The second is to adapt or adopt a language already in use by the design community, perhaps in another context.
At this juncture, it looks as if it is easier to get designers to adopt new tools and methods based on a language they already know than it is to get them to try to master a completely new syntax or semantics. It seems that engineers put a high premium on the depth of their experience and expertise with a language, and are reluctant to give up that earned advantage for the promise of greater productivity. If this trend holds true, one could surmise that established languages like C, C++, and M have a decided adoption advantage over the others because of the existing depth and breadth of their usage in other domains.
The difficulty (and also a potential advantage) of C and C++ is that they are high-level languages that carry with them a number of traits that are direct reflections of their intention to target a traditional processor architecture. Concepts like registers, program counters, pointers, and sequential execution are not-so-tacit assumptions in the languages themselves, so re-purposing these languages to target parallelized hardware architectures is an unnatural extension of their original intent. To successfully compile a description in one of these languages to efficient hardware, the sequential nature of the algorithm description must be unraveled into a data flow representation that reflects the possibility of parallelism. A tradeoff must then be made that finds a balance somewhere between a fully sequential implementation and a fully parallel one. The complexity of that tradeoff (which can range into several orders of magnitude of variation in performance and logic area) presents one of the biggest challenges in high-level language synthesis.
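To make the "unraveling" concrete, here is a minimal sketch in Python of how a short run of sequential statements can be reduced to its data dependencies and then scheduled as early as possible. It is an illustration only, not the algorithm used by any tool mentioned in this article. The statement list, the function names, and the assumptions that every operation takes one step and every variable is assigned once are all hypothetical.

```python
from collections import Counter

# Straight-line code in program order: (destination, operands).
# Hypothetical example computing y = (a*b + c*d) * (e + f).
PROGRAM = [
    ("t1", ("a", "b")),    # t1 = a * b
    ("t2", ("c", "d")),    # t2 = c * d
    ("t3", ("t1", "t2")),  # t3 = t1 + t2
    ("t4", ("e", "f")),    # t4 = e + f
    ("y", ("t3", "t4")),   # y  = t3 * t4
]


def asap_levels(program):
    """Earliest step at which each statement can run when only true
    (read-after-write) data dependencies constrain the order.
    Assumes one step per operation and single assignment per variable."""
    ready = {}   # variable name -> step in which its value is produced
    levels = []
    for dest, operands in program:
        # Primary inputs count as available at step 0; an operation runs
        # one step after the latest of its operands becomes available.
        step = 1 + max(ready.get(name, 0) for name in operands)
        ready[dest] = step
        levels.append(step)
    return levels


def summarize(program):
    """Compare the fully sequential schedule with the fully parallel one."""
    levels = asap_levels(program)
    return {
        "sequential_steps": len(program),                # one operator, reused
        "parallel_steps": max(levels),                   # length of the critical path
        "peak_operators": max(Counter(levels).values()), # hardware needed at once
    }
```

Under these assumptions the five statements need five steps on a single shared operator, or three steps if three operators are available at once. A real synthesis tool has to choose a point between those extremes, which is the tradeoff described above.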
The reason this Von Neumann bias might actually be an advantage in the long run is that the majority of applications are fairly straightforward to conceive and describe as software, and they will run efficiently on a conventional processor architecture. The parts that require hardware implementation are those that have inadequate performance in software. It makes sense, then, that the overall application could be written in a “software” language like C/C++, and the parts of the application requiring acceleration could be extracted for compilation/synthesis into hardware. The identification of the parts to be extracted and the creation of efficient hardware to implement those parts is, of course, the tricky bit.
It is so tricky, in fact, that most current emerging methodologies make some major compromise in order to get the job done. Usually this compromise comes in one of two forms. The first and most common form is skipping the high-level language specification entirely and leveraging IP assembly in order to raise the level of abstraction. The second form of compromise is to use modified high-level language source code that includes RTL-like structures to specify hardware concepts like scheduling and interface constraints.
The simplest and most time-proven method for increasing your level of design abstraction is IP integration. Instead of designing at the detailed RTL level, you assemble your design from pre-designed IP blocks, much as in classic board-based system design. The success of this methodology depends on the quantity and quality of the available IP and the tool support for integrating and verifying both the IP and the resulting IP-based design. The availability of IP is strongly dependent upon a robust commercial IP market, and in the FPGA market, commercial IP development has historically been suppressed as a result of free (or almost free) vendor IP and complicated licensing problems. Recently, startups such as OmniWerks and Stelar Tools have begun easing the problem with new offerings like easy-to-license IP blocks and tools for understanding, analyzing, and cleaning an IP-intensive design.
FPGA vendors are also now partnering more aggressively and working to create development environments that are more IP friendly. Altera, for example, has created a standardized platform for IP with their SOPC Builder that allows preview integration of IP on an evaluation basis, easy licensing, and a standardized browser for both their own and partner IP offerings. “SOPC Builder was originally created as a support environment for Nios [Altera’s soft-core RISC processor], but it is now growing beyond that into an IP integration platform,” says Chris Balough, Altera’s director of software and tools marketing. Altera steers IP vendors to support its Avalon switch fabric as a standardized IP interface method. “One of the hardest problems to solve with IP is the ‘dirty work’ at the periphery,” Balough continues. “Making the IP compatible with a standardized switch fabric allows the interface to be configured with a drag-and-drop methodology.”
Among companies espousing high-level-language-based design, the compromise of choice is the embedding of some architecture-specific information in the high-level source code itself. As we mentioned above, the problem of extracting concurrency from a sequential specification such as C or C++ is daunting. One solution to this is to ask the designer to embed constructs in the C code that specify concurrency explicitly. Celoxica has seen considerable success with this approach. Both their Handel-C dialect (for their DK tool suite) and their more recent implementation of SystemC (for their Agility tool) give the designer straightforward structures for specifying concurrency and timing in the C source code. Other offerings such as Impulse C, Forte, and Poseidon take a similar approach.
Going against the grain in high-level synthesis is Mentor’s Catapult C. Catapult takes an ANSI C or C++ specification and creates synthesizable RTL without the use of special constructs in the source to specify concurrency. “Catapult has taken a fundamentally different approach, synthesizing directly from the algorithmic abstraction,” says Shawn McCloud, high-level synthesis product manager at Mentor Graphics. “With over 100 man-years of development, Mentor has invested more time and resources into this problem than any other company.” Catapult was formally announced last year, but it has been in production use at partner/customers like ST and Nokia since 2002.
The difference in Catapult’s approach is that it tackles the issue of concurrency algorithmically instead of requiring the designer to specify it in the C++ source. Does this make Catapult C the idealized system described above? No. Unlike with a typical software compiler, you’ll still have to make some important decisions and tradeoffs interactively in the tool, but these decisions relate more to your design goals (performance versus area) than to specifics of hardware architecture or scheduling.
It’s All in the Loops
Typically, the reason we want to pull a software module into hardware is to resolve a performance bottleneck. If you’ve ever profiled software, you know that the structure always associated with a bottleneck is a loop, usually several of them nested. Loops are fertile and confusing ground for optimization when it comes to concurrency. If you (or your high-level synthesis tool) can identify the pure data dependencies within a loop structure, there are a wide variety of optimization options including unrolling, partial unrolling, pipelining, and a number of variants of these that can give you virtually any throughput performance you desire, depending on how much hardware you’re willing to commit.
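The throughput-for-hardware trade can be sketched with a back-of-the-envelope cycle model. The Python below is a simplified illustration, not a description of how any particular tool estimates performance. It assumes the loop iterations are independent of one another (no loop-carried dependencies), and the function and parameter names are hypothetical.

```python
import math


def unrolled_cycles(iterations, unroll, body_latency=1):
    """Cycles for a loop whose body is replicated `unroll` times, so that
    `unroll` independent iterations run side by side in each pass."""
    return math.ceil(iterations / unroll) * body_latency


def pipelined_cycles(iterations, body_latency, initiation_interval=1):
    """Cycles for a pipelined loop: the first result appears after
    `body_latency` cycles, and a new iteration starts every
    `initiation_interval` cycles after that."""
    return body_latency + (iterations - 1) * initiation_interval


def tradeoff_table(iterations, body_latency, factors):
    """List (copies of the loop body, total cycles) for each unroll factor."""
    return [(u, unrolled_cycles(iterations, u, body_latency)) for u in factors]
```

For a 64-iteration loop with a three-cycle body, this model gives 192 cycles with one copy of the body, 48 cycles with four copies, and 3 cycles when fully unrolled. Pipelining a single copy with a new iteration every cycle takes 66 cycles, which is why it is often the cheapest way to approach full throughput.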
There are also decisions to be made on how many operations can be chained into a single clock cycle depending on the clock period (which is also technically up in the air in high-level synthesis). All this flexibility means that an automatic tool needs some guidance in arriving at a suitable architecture. This guidance could come either in the form of performance requirements (where the high-level synthesis tool should use the minimum hardware required to meet the specification) or in the form of architecture specifications such as desired pipeline stages or the amount of parallelism or resource sharing.
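Operation chaining can be illustrated the same way. The sketch below greedily packs a chain of dependent operations into clock cycles for a given clock period. It is a toy model under stated assumptions: the delays are made-up numbers in nanoseconds, register setup time and routing delay are ignored, and the function name is hypothetical.

```python
def chain_into_cycles(op_delays_ns, clock_period_ns):
    """Greedily pack a chain of dependent operations into clock cycles.

    Returns a list of cycles, each a list of the operation delays chained
    combinationally within that cycle. Raises ValueError if a single
    operation is slower than the clock period.
    """
    cycles = []
    used = 0  # combinational delay already consumed in the current cycle
    for delay in op_delays_ns:
        if delay > clock_period_ns:
            raise ValueError("operation does not fit in one clock period")
        # Start a new cycle when this operation would overrun the period.
        if not cycles or used + delay > clock_period_ns:
            cycles.append([])
            used = 0
        cycles[-1].append(delay)
        used += delay
    return cycles
```

With hypothetical delays of 2, 3, 4, 1, and 5 ns, a 10 ns clock fits the first four operations in one cycle and needs two cycles in total, while a 5 ns clock needs three. The same source code therefore yields different latencies and register counts depending on a clock period that may not even be fixed yet.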
Interfaces are Bottlenecks Too
When creating hardware accelerators, one must choose an interface protocol for getting data into and out of the accelerator module. Often, the performance of the system is limited not by the concurrency that can be achieved in the accelerator itself, but by the bandwidth or scheduling constraints imposed by the choice of interface. Mentor’s Catapult C addresses this problem as well, by providing an interface synthesis capability linked to the scheduling algorithms that control the design of the datapath. “Catapult’s interface synthesis allows the user to do two things,” Mentor’s McCloud continues. “You can do interface exploration which allows you to evaluate the performance impact of various interface choices. Then, you can synthesize the internal hardware to match the performance of the interface you’ve chosen. The tool will create only the amount of parallelism that can be utilized by the interface. It won’t waste resources over-designing the block beyond the capabilities of the I/O.”
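The idea of sizing the datapath to the interface can be sketched in a few lines. This is not Catapult's algorithm, which the article does not describe. It is a simple rate-matching model with hypothetical names, assuming the interface delivers a fixed number of words per cycle and each datapath unit needs a fixed number of cycles per word.

```python
import math
from fractions import Fraction


def useful_units(interface_words_per_cycle, unit_cycles_per_word):
    """Smallest number of parallel datapath units that keeps up with the
    interface. Any units beyond this would sit idle waiting for data."""
    demand = Fraction(interface_words_per_cycle) * unit_cycles_per_word
    return max(1, math.ceil(demand))


def system_throughput(units, interface_words_per_cycle, unit_cycles_per_word):
    """Words per cycle actually sustained: the slower of the interface
    and the combined datapath units."""
    datapath = Fraction(units, unit_cycles_per_word)
    return min(Fraction(interface_words_per_cycle), datapath)
```

If the interface supplies two words per cycle and each unit takes four cycles per word, eight units saturate the interface. Doubling to sixteen units leaves throughput at two words per cycle and only adds area, which is the kind of over-design the interface-aware approach is meant to avoid.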
Mentor’s theory is that most IP is over-designed, and the summation of all that over-design across all the blocks in a typical design is substantial. “We consistently meet design goals with less area and significantly less design time than hand-coded production RTL,” McCloud concludes. “That’s the primary reason for our rapid adoption in systems companies and for the large number of design wins and tape-outs with Catapult.”
When Will it All Happen?
Don Davis, Senior Manager of High-Level tools at Xilinx, summarizes the situation today: “It seems unlikely that RTL design will be going away any time soon. There are many efforts afoot to extend the capabilities of RTL to address its shortcomings, and RTL design is appropriate when the hardware designer has the time, expertise and need for hand-tuned, optimal performance, area, and power.”
“That said, there is certainly room in the market for higher level design methodologies to address developers or design teams that do not have the time or design expertise to succeed in a pure RTL flow. In addition, as the design space expands to include embedded processors, busses and peripherals, RTL does not lend itself to easy movement between hardware and software implementations, design exploration, system profiling and modeling.”
With all the new development effort focused today on creating technologies to raise the level of design abstraction, FPGA technology stands out as one of the key enablers. Because an FPGA is typically rapidly reconfigurable and because of the current explosion in FPGA-based embedded systems, it is clear that FPGAs offer an outstanding platform for the realization of hardware/software systems. The hardware/software partitioning, compiling, and optimization tools that are emerging can leverage the inherent performance and flexibility of FPGAs to create a powerful reconfigurable computing platform that should offer dramatic efficiency and time-to-market improvements over today’s predominant embedded design methodologies. Convinced? Grab an FPGA development board and strap on your jetpack. The future awaits!