
MIPS Goes Multithreaded

Boosting Performance the New-Fashioned Way

Although most designers don’t often consider it, there are different formulas for best overall system performance in embedded and standalone processors, too. Even though there’s no governing league making and changing the racing regulations, parameters like total system cost, power consumption, memory bandwidth, silicon area, and process technology rule the day when choosing a processor for your system design. The tradeoffs that yield the best mix of performance in a standalone processor can be completely different from those that give the best results in an embedded processor core.

When MIPS designed their new 34K core, which was announced this week, they clearly knew they were working under the usually unspoken embedded core racing formula. In an embedded core, cranking up the clock frequency runs up system cost and power consumption for your entire device, not just the processor portion. Heavily pipelined, superscalar, and very-long-instruction-word (VLIW) architectures directly consume more logic and are difficult to optimize, particularly in embedded applications. Multi-core methods come almost by default when you’re dealing with embedded processors, but the licensing fees for multiple cores run up the system cost tab again. Many of the solutions that work well for standalone processors are simply not well suited to the embedded core environment.

MIPS opted for a multi-threaded approach with the 34K, for seemingly sound reasons. Typically, an embedded core has a lot of downtime, waiting around for other parts of the system to finish their jobs. Also typically, there is more than one process needing attention at any given time. Of course, you can use an OS to handle process scheduling for you in software, but multithreading at the processor level can make much more efficient use of logic resources, leading to higher system throughput with lower cost and power consumption.

MIPS’s new 34K is based on their 24KE architecture. The 34K has a nine-stage pipeline “coupled with a small amount of hardware to handle the virtual processors, the thread contexts, and the quality of service (QoS) prioritization,” according to MIPS. Increasing system performance with multi-threading is all about optimizing resource utilization in the execution pipeline. When one thread is stalled waiting for memory, hanging out and killing time, maybe listening to some tunes on its little sub-micron “iPod femto,” another thread can charge ahead, keeping the hardware busy. If the scheduling process is efficient, and if the processor can swap contexts with little or no overhead (this is key), significant performance gains can be made over a single-threaded processor of the same architecture.
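To see why filling stall cycles matters, here’s a back-of-the-envelope model in C. The miss rate, stall length, and fill fraction are invented purely for illustration (they are not MIPS 34K figures); the point is only that letting a second ready thread use otherwise-dead issue slots cuts the effective cycles per instruction.

/* Toy model of why fine-grained multithreading helps: count how many
 * pipeline issue slots are wasted on memory stalls with one thread,
 * versus when a second ready thread can be swapped in each cycle.
 * All numbers below are made up for illustration. */
#include <stdio.h>

int main(void)
{
    const int instructions = 1000;    /* instructions per thread          */
    const double miss_rate = 0.05;    /* cache misses per instruction     */
    const int stall_cycles = 20;      /* cycles lost per miss             */

    /* Single thread: every miss leaves the pipeline idle. */
    double single = instructions * (1.0 + miss_rate * stall_cycles);

    /* Two threads: assume the other thread is usually ready, so most
     * stall cycles are filled with useful work from the second thread. */
    double fill_fraction = 0.8;       /* assumed, for illustration        */
    double dual = 2 * instructions *
                  (1.0 + miss_rate * stall_cycles * (1.0 - fill_fraction));

    printf("cycles per instruction, single thread: %.2f\n",
           single / instructions);
    printf("cycles per instruction, two threads:   %.2f\n",
           dual / (2.0 * instructions));
    return 0;
}

With those made-up numbers, the single-threaded core burns two cycles per instruction while the two-thread version gets back down near 1.2, which is the flavor of gain hardware multithreading is after.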

The 34K can be configured with up to five sets of thread context (TC) hardware. Each TC has its own instruction buffer with pre-fetching, its own register set, and its own program counter. This allows the 34K to switch between threads on a clock-by-clock basis, virtually eliminating context-swap overhead. Each TC shares some resources with other TCs within a larger structure called a “Virtual Processing Element” (VPE). A 34K core can be configured with up to two VPEs.
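A rough way to picture the hierarchy is as a data structure: the per-TC state (program counter, registers, instruction buffer) is replicated for every hardware thread, while the OS-visible CP0 context lives at the VPE level. The type and field names below are invented for clarity; they are not taken from MIPS documentation.

/* Illustrative sketch of the TC/VPE hierarchy described above --
 * names are invented, not MIPS 34K register or structure names. */
#include <stdint.h>

#define TCS_PER_CORE   5   /* thread contexts, per the configuration above */
#define VPES_PER_CORE  2   /* virtual processing elements                  */

struct thread_context {          /* replicated per hardware thread          */
    uint32_t pc;                 /* private program counter                 */
    uint32_t gpr[32];            /* private general-purpose registers       */
    uint32_t instr_buffer[8];    /* private, pre-fetched instructions       */
    int      vpe_id;             /* which VPE this TC belongs to            */
};

struct vpe {                     /* shared by the TCs assigned to it        */
    uint32_t cp0_regs[32];       /* one OS-visible CP0 context per VPE      */
};

struct core_34k {
    struct vpe            vpes[VPES_PER_CORE];
    struct thread_context tcs[TCS_PER_CORE];
};

The key point is that the expensive machinery (the execution pipeline, the caches, and the per-VPE CP0 context) is not replicated per thread, which is why adding TCs is so much cheaper in silicon than adding cores.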

VPEs can be assigned to completely different environments, even running different operating systems, because each VPE has its own copy of the CP0 registers used by OS kernels; within a single VPE, that copy is shared among the TCs. This allows us to partition an application into sections that are completely disparate but that use the same processor core. One might be a DSP- or QoS-critical task and another might be a user-time application running on a complex OS like Linux. The ability to run such disparate tasks on a single processor core saves system cost, power, and die area when compared with a multi-core solution.

The VPE concept could even allow two different embedded operating systems to be used in a single system and on a single 34K processor. One OS might handle user interface duties while the other manages hard-real-time processes such as digital signal processing. Each VPE would be managed differently, and each OS would work as if it were using its own dedicated processor. Because all of this is set through configuration options, a core can be tailored to provide almost exactly the resources your application needs.

Speaking of QoS, the 34K comes with a QoS engine that interleaves instructions from multiple threads for maximum throughput. If some threads have specific QoS requirements, it can allocate dedicated processor time to those threads, ensuring that load from non-QoS-critical tasks doesn’t interfere. This meets the QoS requirements while maintaining maximum overall throughput. The VPE structure also ensures that you don’t subject all of your threads to QoS restrictions just because some threads require them.
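One way to picture such a policy is a slot-reservation scheme: a QoS-critical thread is promised a fixed share of issue slots, and everything else competes round-robin for the remainder. The sketch below is a generic illustration of that idea, not a description of the 34K’s actual QoS engine, and the thread count and reservation period are arbitrary.

/* Rough sketch of slot-based thread interleaving with a QoS guarantee:
 * thread 0 is promised every Nth issue slot, and the remaining slots
 * round-robin among the other ready threads. */
#include <stdio.h>

#define NUM_THREADS 4
#define QOS_THREAD  0
#define QOS_PERIOD  4   /* guarantee thread 0 one slot in every 4 */

int main(void)
{
    int rr = 1;                          /* next non-QoS thread to try   */

    for (int cycle = 0; cycle < 12; cycle++) {
        int pick;
        if (cycle % QOS_PERIOD == 0) {
            pick = QOS_THREAD;           /* reserved slot                */
        } else {
            pick = rr;                   /* best-effort slot             */
            rr = (rr % (NUM_THREADS - 1)) + 1;  /* cycle through 1..3    */
        }
        printf("cycle %2d -> thread %d\n", cycle, pick);
    }
    return 0;
}

Run for a dozen cycles, the loop shows thread 0 claiming every fourth slot while threads 1 through 3 share the rest: the QoS-critical thread gets its guarantee without the other threads being throttled any further than necessary.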

MIPS claims that the 34K shows an application speedup of 60% over their previous-generation 24KE core, with a 14% increase in die size. If you’re looking to improve your application’s performance without adding another core, or if you want to consolidate a design that currently has multiple cores, such as a DSP and a general-purpose processor, the 34K might be a very attractive option. The improved hardware utilization of its multi-threaded architecture should reduce total system cost, power consumption, and die area compared with just about any alternative solution. Who could argue with that?
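Taken at face value, those numbers also imply better use of silicon, not just better raw performance: a 60% speedup for a 14% larger core works out to roughly 1.60 / 1.14 ≈ 1.4 times the throughput per unit of core area of the 24KE. That’s simple arithmetic on MIPS’s published figures rather than an additional MIPS claim, and it assumes the workload actually has enough parallel threads to keep the extra contexts busy.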

Until someone comes along and changes the formula for embedded processors, MIPS seems to be onto something. Remember, the embedded processor race goes not to the swift, but to those who use their logic resources most efficiently. As long as the formula calls for minimum system cost, lowest power consumption, maximum overall performance, and fastest time-to-market, MIPS’s new 34K core is probably a safe bet.
