
Going for Speed

Picking Processors for Performance

It may be surprising that such careful engineering attention is required to gain the desired performance from this lowest echelon of racing – cars powered only by gravity and driven by kids aged 9 to 16. It might also be counter-intuitive that some of the most careful engineering for processor performance comes in the very lowest echelon of computing systems – devices where many tiny processors may be put on a single chip in a system that might sell for single-digit dollars at the retail level. Both, however, are driven by the same constraints. When power, size, and cost are all at a premium, engineering excellence in the extreme is a mandate.

The quest for performance in embedded computing is multi-dimensional. As with the soapbox racer, a complex tradeoff space exists where conflicting forces create the need for a careful balance of resources in order to find an optimal solution to the problem. Ironically, the desktop processor designers like Intel and AMD have it comparatively easy. Only recently have they needed to resort to architectural solutions beyond those that naturally fall out of process improvement. The embedded market, because of its more constrained environment, has reached architectural limits much sooner.

In embedded computing, performance has many enemies. The most prominent of these is power. Many embedded applications are required to run on batteries, need to stay below a maximum supply current limit, or don't have the form factor that can accommodate extensive cooling provisions. As a result, the desktop processor solution of cranking the clock frequencies into the multiple-gigahertz range isn't practical. Higher clock frequencies mean more power dissipation, and embedded processor architects long ago had to head for greener pastures. Additionally, higher processor frequencies either require much more memory bandwidth or create a system-level imbalance where the processor is faster than the memory.
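
To see why the clock-cranking approach hits a wall so quickly, consider the familiar dynamic-power relation, P = a·C·V²·f. The short sketch below plugs in purely hypothetical numbers (the capacitance, voltage, and frequency values are assumptions for illustration, not figures from any real device) and shows how doubling frequency, together with the voltage bump it typically requires, roughly triples power.

```c
#include <stdio.h>

/* Dynamic CMOS power: P = alpha * C * V^2 * f
 * alpha = activity factor, C = switched capacitance,
 * V = supply voltage, f = clock frequency.
 */
static double dyn_power(double alpha, double cap_farads, double volts, double freq_hz)
{
    return alpha * cap_farads * volts * volts * freq_hz;
}

int main(void)
{
    /* Hypothetical operating points, for illustration only. */
    double p_slow = dyn_power(0.2, 1e-9, 1.0, 400e6);  /* 400 MHz at 1.0 V */
    double p_fast = dyn_power(0.2, 1e-9, 1.2, 800e6);  /* 800 MHz needs ~1.2 V */

    printf("400 MHz @ 1.0 V: %.3f W\n", p_slow);
    printf("800 MHz @ 1.2 V: %.3f W\n", p_fast);
    printf("power ratio: %.2fx for 2x frequency\n", p_fast / p_slow);
    return 0;
}
```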

Another alternative is to increase the processor width. If you've hit the maximum frequency you can tolerate in your system, crunching more bits with each clock cycle is one way to increase throughput. More bits of width mean more transistors toggling, however, so you won't necessarily gain any ground on the power problem. Also, the wider you go, the more you increase the inherent waste in the system architecture. Any operation that requires less than the full width ends up wasting wiggles, and each unneeded flop that flips is more power, area, and cost that you can't afford in a tightly constrained embedded design. On top of that, the software architecture of a system is tightly coupled to the bit width, so for most applications there is a very real optimal width that doesn't offer much room for tuning.

Once we've exhausted the low-hanging fruit of speed and width, we get to pipelining. By overlapping the execution of successive instructions and looking ahead to operations coming down the road, we can gain a significant efficiency improvement in our system. Instead of being surprised by memory access requests, our processor can be working productively on one task while the pipeline phones ahead for reservations on the next. This too has been a desktop computing weapon of choice for many generations now. Here too, however, there are limits to the effectiveness of the technique. The deeper the pipeline we build into our processor, the more logic it requires to do the same task, and our marginal return on extra logic diminishes as we turn up the pipelining knob. In embedded systems, where logic area is at a premium, the use of pipelining is again matched to the rest of the system architecture and doesn't give the one-dimensional control we'd need for tradeoff-free acceleration.
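
The same look-ahead idea can be sketched in software. The fragment below is an illustration only, not anything from a particular processor: it uses the GCC/Clang extension __builtin_prefetch to request a future array element while the current one is being processed, so the memory system can "phone ahead" in parallel with the arithmetic.

```c
#include <stddef.h>

/* Sum an array while hinting the memory system about upcoming loads.
 * __builtin_prefetch is a GCC/Clang extension; on other compilers the
 * hint can simply be omitted.
 */
long sum_with_lookahead(const long *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&data[i + 8]);  /* request a future element early */
        total += data[i];                      /* work on the current element   */
    }
    return total;
}
```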

With frequency, bit width, pipeline depth, and memory bandwidth all balanced and optimized, where do we go for more speed in our system? We have to break the bounds of monolithic processor myopia and multiply our options. Multi-processors, multi-cores, and multi-threading all let us multiply our performance strategically without building an out-of-balance computing system and thus breaking the sweet spot that we've achieved.

The smallest step we can take to multi-fy our system is to use a multithreaded processor or core. Multithreading gives you something akin to a processor and a half, where some resources are shared and other parts of the processor are duplicated for parallel execution. Most embedded applications have a number of tasks that need to be performed simultaneously. In a superscalar architecture, the pipeline frequently stalls waiting for, say, memory access or the UPS truck or something. While it's busy waiting, your processor could context swap to another process thread and make use of the downtime. Multithreading is useful when some parts of your system are inherently out of balance. In many applications, these elements are processor clock frequency and memory bandwidth, which may be constrained by other aspects of the system design. An example of an embedded processor core that uses the multithreading approach is the MIPS 34K architecture, announced earlier this year.
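
Hardware multithreading in a core like the 34K is invisible to most application code, but the software-level version of the same idea is easy to sketch. The POSIX threads example below (a generic illustration, not the 34K's programming model) pairs a load-heavy task with a register-only task so that one can make progress while the other is stalled.

```c
#include <pthread.h>
#include <stdio.h>

#define BUF_WORDS (1 << 16)
static long buffer[BUF_WORDS];

/* A load-heavy task that, on real hardware, spends much of its time
 * waiting on the memory system. */
static void *memory_bound_task(void *arg)
{
    (void)arg;
    long sum = 0;
    for (long i = 0; i < 4000000; i++)
        sum += buffer[(i * 67) % BUF_WORDS];   /* strided loads */
    printf("memory-bound task done (%ld)\n", sum);
    return NULL;
}

/* A register-only task that can keep executing while the other thread waits. */
static void *compute_bound_task(void *arg)
{
    (void)arg;
    double x = 1.0;
    for (long i = 0; i < 4000000; i++)
        x = x * 1.0000001 + 0.5;
    printf("compute-bound task done (%f)\n", x);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, memory_bound_task, NULL);
    pthread_create(&t2, NULL, compute_bound_task, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```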

If your system tends more toward random multitasking, i.e., where the nature of the processes executing in each thread is not well understood at the time the system is designed, a multi-core solution may be more appropriate. While multithreading re-uses some of the processor logic, making for a smaller footprint than two full processors, random multitasking can sometimes cause resource conflicts due to task interactions that take away the advantages of multithreading. In these cases, it's often better to take on the overhead of full multiple cores.

John Goodacre, Multiprocessing Program Manager at ARM, explains, “The inevitability of multithreading is that the software has to understand how to take advantage of the particular multithreading architecture in order to get the most work done. With a multi-core approach, there is more scalability and less requirement for the software to accommodate the scheme.” Compared with multithreading, multiprocessing duplicates more of the processor logic (or all of it) so that separate processes can execute completely independently. “Doubling cores can even more than double performance in some cases due to OS overhead for time slicing plus cache sharing,” Goodacre continues. “With that advantage, you can sometimes use a much smaller processor, saving both area and power.”

Multiprocessing schemes offer considerable advantages when it comes to power management. When multiple processes are executing on multiple processors, each processor can be scaled to just the required performance for its designated task, and processors can even be shut down when their process is not required. With a large, monolithic processor, such optimizations are not possible, and the processor must be sized to match the peak processing demands of the system. When the system is operating off peak, the power penalty of the large processor must still be paid.
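
At the operating system level, this kind of per-core scaling is already familiar. The sketch below assumes a Linux target that exposes the standard cpufreq and CPU-hotplug sysfs files; the exact paths, governor names, and permissions vary by kernel and platform, so treat it purely as an illustration of the idea: slow one core down to match its task and switch another off entirely.

```c
#include <stdio.h>

/* Illustrative only: scale one core down and take another offline
 * using Linux's cpufreq and CPU-hotplug sysfs interfaces.
 * Requires root; paths and available governors vary by platform.
 */
static int write_sysfs(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s", value);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Ask the governor to favor low power on core 1... */
    write_sysfs("/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor",
                "powersave");
    /* ...and shut core 2 down entirely while its task is idle. */
    write_sysfs("/sys/devices/system/cpu/cpu2/online", "0");
    return 0;
}
```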

Once we're comfortable with the idea of multiple processors being scaled in size for different types of tasks in our embedded system, we can make the next logical leap to customizing the processor architectures for particular tasks as well. As soon as we head down this path, of course, we are giving up the advantage we just gained of software-agnostic processing. Developers of an application for heterogeneous multiprocessing must thoroughly understand the architecture of the target computing environment to take advantage of its inherent efficiencies.

One common approach to heterogeneous multiprocessing is to tailor processors to specific tasks by generating optimized processors using a scheme such as Tensilica’s Xtensa. In these architectures, the processor resources are molded to the task, including custom instructions for performance-intensive functions. In this case, the game is flipped. Instead of needing to understand the processor to design the software, knowledge of the software is required when designing the processor. The result, however, is a highly efficient processor that is well matched to a specific task in a multi-processing system.
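
As a hypothetical illustration of what "molding the processor to the task" means in practice (the function below and the idea of mapping it to one instruction are assumptions for this sketch, not Tensilica's actual flow or API), consider a saturating multiply-accumulate at the heart of a dot product. Written as plain C it takes several instructions per tap; a configurable-processor flow could implement it as a single custom instruction and expose it to the compiler.

```c
#include <stdint.h>

/* Reference C for a saturating multiply-accumulate. In a configurable
 * processor flow, a designer could define a custom instruction for this
 * operation and expose it as an intrinsic (the names here are made up).
 */
static int32_t sat_mac(int32_t acc, int16_t a, int16_t b)
{
    int64_t r = (int64_t)acc + (int64_t)a * b;
    if (r > INT32_MAX) return INT32_MAX;
    if (r < INT32_MIN) return INT32_MIN;
    return (int32_t)r;
}

int32_t dot_product(const int16_t *x, const int16_t *y, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = sat_mac(acc, x[i], y[i]);   /* candidate for a custom instruction */
    return acc;
}
```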

Taking the heterogeneous idea to the extreme involves generating datapath accelerators directly in hardware for extreme performance hogs such as signal processing and video processing tasks. In these architectures, even custom instructions put too much demand on a host processor for passing data to the custom processing element. In such cases, the accelerator uses shared memory to directly retrieve and process data, pushing the results back into FIFOs or shared memory locations where they can be accessed by other processing elements. Designing these systems is the most challenging task of all, from both a hardware and software perspective.
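
From the host processor's point of view, such an accelerator might be driven roughly as sketched below. Every register offset and address here is invented for illustration; the pattern is what matters: the CPU hands the accelerator pointers into shared memory, starts it, and then collects results from a FIFO or result buffer instead of streaming the data through itself.

```c
#include <stdint.h>

/* Hypothetical register map for a memory-mapped accelerator.
 * Offsets and semantics are invented for illustration.
 */
#define ACC_SRC_ADDR   0x00  /* physical address of input buffer  */
#define ACC_DST_ADDR   0x04  /* physical address of output buffer */
#define ACC_LENGTH     0x08  /* number of samples to process      */
#define ACC_CTRL       0x0C  /* write 1 to start                  */
#define ACC_STATUS     0x10  /* bit 0 set when done               */

static inline void reg_write(volatile uint32_t *base, uint32_t off, uint32_t v)
{
    base[off / 4] = v;
}

static inline uint32_t reg_read(volatile uint32_t *base, uint32_t off)
{
    return base[off / 4];
}

/* Hand the accelerator a job described entirely by shared-memory
 * addresses, then poll (or, in a real design, take an interrupt).
 */
void run_accelerator(volatile uint32_t *base,
                     uint32_t src_phys, uint32_t dst_phys, uint32_t len)
{
    reg_write(base, ACC_SRC_ADDR, src_phys);
    reg_write(base, ACC_DST_ADDR, dst_phys);
    reg_write(base, ACC_LENGTH,   len);
    reg_write(base, ACC_CTRL,     1);
    while ((reg_read(base, ACC_STATUS) & 1) == 0)
        ;  /* busy-wait for completion */
}
```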

Recently, high-level synthesis tools have been developed that assist with this task by compiling C code directly into highly parallelized hardware for algorithm acceleration. These monster datapath elements may include hundreds of arithmetic units that can operate in parallel, giving orders of magnitude acceleration in compute-intensive algorithms. The design challenge, however, is interfacing these blocks to the rest of the embedded system in a way that doesn’t just move the bottleneck. Careful construction of communications protocols and data access is required to take advantage of the full performance and efficiency potential of a hardware accelerator scheme.
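
The C input to such a flow can be surprisingly ordinary. The fragment below is a generic illustration; the directive comments stand in for whatever pragma syntax a given high-level synthesis tool actually uses to request unrolling and pipelining.

```c
#include <stdint.h>

#define TAPS 32

/* A FIR filter tap loop of the kind high-level synthesis tools turn into
 * parallel hardware. The directive comments are placeholders: each HLS
 * tool has its own syntax for unrolling and pipelining.
 */
int32_t fir(const int16_t sample[TAPS], const int16_t coeff[TAPS])
{
    int32_t acc = 0;
    /* HLS directive (tool-specific): fully unroll this loop so all
     * TAPS multiply-accumulates become parallel hardware. */
    for (int i = 0; i < TAPS; i++)
        acc += (int32_t)sample[i] * coeff[i];
    return acc;
}
```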

None of these solutions is a panacea for performance and power problems in embedded systems. In each case, the particular demands of the application must be carefully weighed against the strengths of each architectural approach. Just as Toby and his dad want to build a race car that will fit the driver, the track, and the rule book while achieving maximum performance, we need to tailor our embedded system design to the particular performance demands of our application. If we don’t, our competitors will.
