Utilizing Power Management Techniques in Embedded Multicore Devices

There are many accepted reasons that support a move to multicore design in portable devices: scalability, specialty cores, increased performance, and reduced power consumption are just a few. This article, however, takes the approach that there is only one true reason why multicore makes an attractive platform for portable devices.

Before we explore that one reason, let’s debunk a couple of the more common reasons given for multicore. For example, scalability is often cited as a key reason to move to multicore because if one core is not fast enough, one can add another. If this were true, we’d have to accept that we cannot find a reasonably faster processor within that processor’s architectural family. For most devices, there are faster chips and chips with greater throughput within the same family. Not only that, but Moore’s law isn’t dead yet, despite what the press says. So we do have the technology to double frequency and create more powerful processors within a uni-core chip family (at least a few more times).

Further, scaling from a uni-core design to a multicore design is not nearly as easy as advocates say, since one almost always uncovers things like race conditions, timing problems, and a host of additional issues that don’t normally manifest on uni-core designs.

Another commonly cited reason to move to multicore is the use of specialty cores. Can an application processor perform digital signal processing in a cell phone? Yep, it sure can. It does, however, require tens, if not hundreds or thousands, of cycles to do what a DSP can do in one or a few cycles.

So specialty cores like DSPs can function a lot faster and use less power while achieving the same goal as general purpose application processors. Some could argue that this is increasing performance. And while it may be true, a case can also be made that utilizing specialty cores is just maximizing the output per watt of power that is needed to achieve the desired performance.

Okay, so enough about these common, misplaced reasons. What, then, is the real reason to move to a multicore design? For most applications, the rationale for moving to multicore is reducing overall power consumption. Faster cores require faster clocks and higher voltages, which in turn means more power. If one asks, “Hey, what about the performance increase?” the answer, in most cases, goes back to maximizing output per watt.

So with reducing power consumption as the primary reason to move to a multicore platform, let’s look at some areas, techniques if you will, that allow designers to minimize power consumption.

Selecting the Right Hardware

As previously mentioned, increasing the operating frequency of a processor does indeed lead to increased performance. It also increases power consumption. Of course, the goal for embedded devices is to increase performance without increasing power consumption. One way to increase performance while decreasing power consumption is to select the right hardware. Many, if not most, of today’s mobile devices incorporate multimedia, wireless, or a combination of the two into a platform. Java applications are also successfully utilized. Multimedia, digital signal processing (DSP), and Java are all notoriously expensive in terms of consuming application processor cycles.

Taking advantage of heterogeneous multicore processors can reduce power consumption in several different ways. By taking the System on Chip (SoC) approach and building hardware for a specific purpose, developers can gain several orders of magnitude in performance per watt compared with general-purpose application cores. Heterogeneous multicore SoC or IP offerings include Texas Instruments’ OMAP and Freescale’s MXC, which incorporate DSPs and application processors on the same silicon. ARM offers Java accelerators as well as multimedia engines to reduce power consumption and increase performance. ARM’s IP is incorporated in offerings from a number of vendors, including Texas Instruments, Freescale, Atmel, and Samsung, to name a few.

The method just outlined uses specialty cores or accelerators to do certain functions efficiently. Another method is to employ the “division of labor approach” in which a developer splits a process or set of processes into multiple activities that can then run in parallel on separate cores.

This is one area where intuition can mislead. Although each core does only a portion of the work, the total power consumed can be less than if the same work were performed on a single uni-core device running at a significantly higher frequency. The ability to reduce power consumption in homogeneous multicore designs rests on the fact that power consumption is a function of both voltage and frequency: it increases proportionally with frequency and with the square of voltage. The silicon is therefore divided into multiple processing engines, each running at a reduced clock frequency and voltage, whose combined processing capability is still sufficient to get the job done. This approach is somewhat limited in that the problem domain must be one that can be subdivided.
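The voltage and frequency relationship described above can be made concrete with a little arithmetic. The sketch below assumes the common first-order approximation for dynamic (switching) power, P = C x V^2 x f, and ignores leakage; the function names and all of the numbers in the usage note are illustrative assumptions, not measurements from any real device.

```c
#include <assert.h>

/* First-order dynamic power model: P = C * V^2 * f.
 * Assumption: static (leakage) power is ignored and the result is in
 * arbitrary relative units, useful only for comparing configurations. */
double dynamic_power(double capacitance, double voltage, double frequency)
{
    assert(capacitance >= 0.0 && voltage >= 0.0 && frequency >= 0.0);
    return capacitance * voltage * voltage * frequency;
}

/* Total dynamic power of `cores` identical cores that all run at the
 * same voltage and frequency. */
double multicore_power(int cores, double capacitance,
                       double voltage, double frequency)
{
    assert(cores > 0);
    return cores * dynamic_power(capacitance, voltage, frequency);
}
```

With hypothetical figures, dynamic_power(1.0, 1.0, 1000.0) gives 1000 units for one core at 1.0 V and 1000 MHz, while multicore_power(2, 1.0, 0.75, 500.0) gives 562.5 units for two cores at 0.75 V and 500 MHz each. The aggregate clock rate is the same, but the lower voltage cuts the total. How far voltage can actually drop at a given frequency depends on the silicon.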

Software is Equally Important

While selecting the right hardware is a prerequisite for maximizing performance and reducing power consumption, it is, unfortunately, not enough. Software drives the hardware, and because of this it must be developed to take advantage of the hardware. As Neil Henderson, general manager of the Embedded Systems Division at Mentor Graphics, has been known to say, “Software without hardware is an idea, hardware without software is a toaster.”

Software written for uni-cores is a very different beast than software written for multicore. It’s important to note that perfectly good uni-core software may actually run slower and use more power when deployed on a multicore device than when it runs on a uni-core device. Why? Amdahl’s law is a good place to start. Amdahl’s law states:

“The non-parallel fraction of the code (i.e., overhead) imposes the upper limit on the scalability of the code.” Thus, the maximum parallel speedup S(p) on p processors for a program that has a parallel fraction f is:

S(p) < 1 / (1 - f)

Just to be clear, the “non-parallel (serial) fraction” of the program includes communication, synchronization, load imbalance, and the work of maintaining cache coherency.

Figure 1. Amdahl’s Law
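To see how quickly the serial fraction caps the benefit of extra cores, the bound above can be evaluated alongside the standard finite-core form of Amdahl's law, S(p) = 1 / ((1 - f) + f / p). The following is a minimal sketch; the function names are chosen here for illustration.

```c
#include <assert.h>

/* Amdahl's law: speedup on p cores for a program whose parallel
 * fraction is f (0 <= f <= 1). */
double amdahl_speedup(double f, int p)
{
    assert(f >= 0.0 && f <= 1.0 && p > 0);
    return 1.0 / ((1.0 - f) + f / p);
}

/* Upper bound on speedup as the core count grows without limit:
 * 1 / (1 - f). Undefined for f == 1, so that case is rejected. */
double amdahl_limit(double f)
{
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}
```

For example, a program that is 75 percent parallel (f = 0.75) speeds up by roughly 2.29x on four cores, and can never exceed 4x no matter how many cores are added.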

There are several techniques software developers can use on multicore devices to ensure that their code runs as efficiently as possible. A close look at Amdahl’s law (Figure 1) demonstrates that the most effective utilization of multiple cores is obtained when the code is not serialized. Where does serialization originate? Serialization occurs when code cannot run in parallel due to resource conflicts, synchronization issues, or communications overhead that do not directly contribute to solving the problem. Cache coherency (or the lack of it) can be a leading cause of serialization in multicore systems.

As will be illustrated later in this article, the lack of cache coherency also wastes power above and beyond the fact that the code is serialized. It is, however, a problem that can be easily corrected or avoided altogether.

Addressing Cache and Power Consumption Issues

While some of the problems described below should also be avoided in uni-core software, some are unique to multicore. Further, some of the issues that manifest in uni-core deployments can be avoided with properly architected multicore systems.

Multicore Cache Coherency

Most developers are familiar with cache in uni-core systems and issues such as maintaining cache coherency. Utilizing cache in a multicore design takes the problem of cache coherency to the next level. Now, in addition to maintaining coherency between memory and a single cache, the cache coherency scheme must address simultaneous reads and writes to an individual processor’s cache while maintaining coherency with the shared memory, as well as all the other cache memories used by the other processors.

From a performance perspective, reducing the number of cache misses can have a dramatic impact on speed, but what about the power consumption aspect of cache misses? One effect is that unnecessary cache misses reduce performance, which lowers output per watt. The other is that main memory reads are much more expensive than cache reads in terms of power consumption.

Cache Thrashing

Cache thrashing occurs when two or more cores access shared data in such a way that it causes frequent invalidations of the cores’ caches (Figure 2). This pattern leads to excessive cache misses that require frequent memory fetches to keep the caches coherent. To prevent this, developers need to implement techniques that allow separate cores to process separate data sets, limiting cache contention and coherency problems. Solutions range from turning off coherency or adding delays between reads and writes to segmenting the problem space differently.

Figure 2. Cache thrashing occurs when two or more cores access shared data in such a way that it causes frequent invalidations of each core’s cache.
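One way to segment the problem space so that each core works on its own data set is to split a shared array on cache-line boundaries, so that no line is written by more than one core. The sketch below is illustrative only: the 64-byte line size, the type and function names, and the requirement that the array start on a cache-line boundary are all assumptions to be checked against the target hardware.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64u  /* assumption: confirm the target's line size */

/* Half-open range [begin, end) of array elements assigned to one core. */
struct elem_range {
    size_t begin;
    size_t end;
};

/* Split `count` elements of `elem_size` bytes among `cores` cores so that
 * every boundary between two cores falls on a cache-line boundary.
 * Assumes the array itself starts on a cache-line boundary and that
 * elem_size divides the line size evenly. */
struct elem_range partition_for_core(size_t count, size_t elem_size,
                                     unsigned cores, unsigned core)
{
    assert(cores > 0 && core < cores);
    assert(elem_size > 0 && CACHE_LINE_BYTES % elem_size == 0);

    size_t per_line = CACHE_LINE_BYTES / elem_size;       /* elements per line */
    size_t lines = (count + per_line - 1) / per_line;     /* lines in the array */
    size_t lines_per_core = (lines + cores - 1) / cores;  /* rounded up */

    struct elem_range r;
    r.begin = (size_t)core * lines_per_core * per_line;
    r.end = r.begin + lines_per_core * per_line;
    if (r.begin > count) r.begin = count;  /* trailing cores may get nothing */
    if (r.end > count) r.end = count;
    return r;
}
```

For 100 four-byte elements split across four cores, this assigns elements 0 to 31, 32 to 63, 64 to 95, and 96 to 99. The last core receives less work, which is the price paid for keeping every boundary line-aligned.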

Ping-Pong Effect

One problem that may manifest in poorly designed AMP systems, or that may occur due to load balancing in SMP-based systems, is the “Ping-Pong” effect. This occurs when two or more processes cause massive cache invalidations due to the order or frequency in which they access cache. The example below (Figure 3) is due to SMP load balancing as processes migrate from core to core in a multicore system.

Figure 3. The “Ping Pong Effect”

The ping-pong effect not only decreases performance and increases power consumption, but also impacts real-time determinism, because the memory access needed for each fetch is itself a shared, and therefore contested, resource.

The potential for this problem is easily avoided in a well-designed AMP multicore system (Figure 4). The reason is simple: unlike load-balancing SMP systems, where tasks migrate from core to core, AMP systems have their task loading assigned statically by the system architect. Additionally, the problem can be easily detected by profiling during testing and then corrected. This condition is harder to detect in SMP systems due to the random nature of the load-balancing scheduler, and it may not manifest until after the device has been deployed into real-world conditions.

Figure 4. Using core affinity to avoid the ping pong effect.

Some SMP-based operating systems do employ what is called processor or core affinity where selected tasks are bound to a particular core and do not participate in the load-balancing aspects of the system. The downside to this approach is that most of the SMP-based operating systems that support binding are not commonly used in embedded devices.
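As one illustration of core affinity, the sketch below pins the calling thread to a single core. It assumes a Linux target with glibc, where sched_setaffinity and sched_getaffinity are available when _GNU_SOURCE is defined; other SMP operating systems expose different interfaces, and the two helper function names are invented here for illustration.

```c
#define _GNU_SOURCE  /* must precede all includes to expose the CPU_* macros */
#include <assert.h>
#include <sched.h>

/* Bind the calling thread to one core so the load balancer can no longer
 * migrate it. Returns 0 on success, -1 on failure (errno is set). */
int pin_current_thread_to_core(int core)
{
    assert(core >= 0 && core < CPU_SETSIZE);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling thread */
}

/* Return the lowest-numbered core the calling thread is currently allowed
 * to run on, or -1 if the affinity mask cannot be read. */
int first_allowed_core(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int core = 0; core < CPU_SETSIZE; core++) {
        if (CPU_ISSET(core, &set))
            return core;
    }
    return -1;
}
```

A task would typically call pin_current_thread_to_core once during start-up, before it begins touching the data it will keep in cache. Querying first_allowed_core beforehand avoids requesting a core that the system has already excluded from the task's affinity mask.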

False Sharing

False sharing occurs when different cores use non-shared data items that happen to reside on the same cache line (Figure 5). If one of these data items is updated, the entire cache line is marked dirty by the cache coherency hardware. As a result, the cache coherency hardware performs an update of the dirty cache line(s), even though the cores never actually share the data.

Figure 5. False sharing occurs when non-shared data items reside on a shared cache line.

The solution to false sharing may be as simple as relocating a variable in a data structure so that the variables are located on different cache lines (Figure 6). Another potential solution is to evaluate how and when the data items are used and determine whether changing the frequency or timing with which the variables are accessed would remove the conflict.

Figure 6. False sharing is eliminated when variables are relocated to different cache lines.
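Relocating variables onto separate cache lines can be expressed directly in the data structure. The following sketch uses C11 alignas and assumes a 64-byte cache line; the structure and field names are hypothetical, and the real line size should be taken from the target's documentation.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64  /* assumption: confirm the target's line size */

/* Layout prone to false sharing: both counters fit in one cache line. */
struct counters_packed {
    unsigned long core0_count;  /* written only by core 0 */
    unsigned long core1_count;  /* written only by core 1 */
};

/* Same data, but each counter starts on its own cache line. */
struct counters_padded {
    alignas(CACHE_LINE_BYTES) unsigned long core0_count;
    alignas(CACHE_LINE_BYTES) unsigned long core1_count;
};

/* Compile-time check that the padded layout really spans two lines. */
static_assert(sizeof(struct counters_padded) == 2 * CACHE_LINE_BYTES,
              "each counter should occupy its own cache line");

/* Nonzero if two byte offsets inside a line-aligned object fall on the
 * same cache line. */
int same_cache_line(size_t offset_a, size_t offset_b)
{
    return offset_a / CACHE_LINE_BYTES == offset_b / CACHE_LINE_BYTES;
}
```

The cost is memory: the padded structure occupies two full cache lines instead of a few bytes. Statically allocated and stack instances inherit the 64-byte alignment from the members, but heap allocations need an aligned allocator (for example, C11 aligned_alloc) for the layout to hold.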

Unbalanced Cache Size

Unlike the prior examples, where two cores share or appear to share data, unbalanced cache size thrashing (Figure 7) occurs when two or more tasks running on a single core have very large cache requirements. As a result, every context switch between the two tasks invalidates large amounts of cache.

Figure 7. Unbalanced cache.

This problem may not have an easy solution on a uni-core architecture. On a multicore architecture, however, the simplest solution is to use core affinity to bind the tasks to different cores (Figure 8).

Figure 8. Core affinity reduces potential for unbalanced cache.

Conclusion

Migrating to multicore devices is a wonderful way to increase performance, or better yet, to reduce overall power consumption by using every watt more efficiently. It does, however, require, as does all software development, thorough planning, design, and tuning to maximize performance and reduce the overall power budget of the device.

It also requires that designers choose the best hardware for the problem domain. One of the easiest ways for software to contribute to reducing power consumption is to reduce cache contention. Cache thrashing is easy both to create and to avoid. Careful architecture and profiling can aid in avoiding cache thrashing. Using core affinity is one sure way to avoid a number of the problems discussed above.

To truly understand an application, one must be fully aware of the operating system and hardware needed to reach design goals. From the programming model perspective, SMP-based operating systems have a number of advantages, such as automatic load balancing and built-in inter-processor communications. These advantages can turn into disadvantages if a developer is not fully familiar with both the software system and hardware behavior. As with any endeavor, leaving the details to the OS and hardware designers may have unintended consequences.

About the author: Todd Brian is a product manager for the Embedded Systems Division of Mentor Graphics and is responsible for Nucleus OS and related products. Brian has spent more than 15 years working with embedded systems and software in consumer and office electronics. Prior to joining Mentor, Brian served in a variety of engineering and marketing positions at Konica Minolta, Inc. He holds B.S. and M.S. degrees in Computer Science from the University of South Alabama, as well as an M.B.A. from Spring Hill College.