The IP business is a wild and woolly place. There have been a lot of casualties as companies have tried to figure out how to make money saving other companies time. And time equals money. Should be a simple concept, but it’s amazing how hard it can be to get someone to spend money in order to save money. Somehow it always just feels like spending money.
So the market has shaken out a bit, but, in that process, no IP component has ascended to the same lofty heights as the processor. It becomes the critical first decision, and all other decisions flow from that. Not only does a processor decision drive the other architectural and IP choices for a chip under design, it also drives future versions of that chip because of the software legacy that gets established, making it hard to change later. And it drives a huge ecosystem of tools and accompanying IP providers that must tailor their offerings to the processor.
So there are a few very recognizable names here. Very few. Anyone not immediately coming up with ARM fails outright and must go find a new career in dishwashing. MIPS and Intel also are obvious choices, and ARC (now part of Virage) and Tensilica are familiar names; XMOS and Tilera are wiggling their way in as well. Beyond that, you have to start thinking harder. (I suspect I’ll get an incensed email from someone I didn’t name…) And that’s largely because the big guys control so much of what’s going on; there’s an immense barrier to entry.
Therefore it is surprising to find that there is a new entry on the market. Actually, calling it new isn’t quite right, but let’s say newly commercialized: it’s Imagination Technologies’ (IMG’s – I’m lazy that way) META processor. It’s been around for at least 12 years, but only for internal use by IMG as they put together systems for their clients.
It started as a very basic processor created around the concept of hardware threading. Now, by uttering the phrase hardware threading, we have entered a murky world of tasks and schedulers into which we’ll dive deeper shortly. (Have your snorkels ready; a couple of aspirin at the ready isn’t a bad idea either.)
IMG came up with their own MeOS OS that went along with the core, and, over time, they added MMU and cache coherency capabilities. Having accomplished all this, they stood back, viewed their creation, decided that it was good, and called it an actual product. They’ve come to market with three variations on their core: a simple single-thread bare-bones version (LTP), an area-optimized version (MTP), and a high-end one (HTP). Paying attention to power consumption appears to be an obsession across the board.
Part of their approach to power is to extract as much work as possible out of each clock cycle. So each core (except LTP) has four of what they call threads (an overloaded term which we’ll review shortly), each of which can be configured as a RISC or DSP thread (DSP threads have some extra instructions and registers available). A low-level hardware scheduler manages the threads and can do a swap in a single cycle – more on this shortly.
If multiple threads have some work scheduled that requires non-overlapping resources, both threads can proceed at the same time – they refer to this as super-threading, which shouldn’t be confused with the other kind of super-threading that means something different. More on this also coming up. I feel like such a tease.
Each thread can be independently assigned to a different OS. For example, three threads on a four-thread core could be assigned to Linux SMP, with the fourth being bare-metal or getting assigned to an RTOS. Sounds simple in principle, but it raises some significant thread management – not to mention terminology – questions. In fact, this is the last place I’ll refer to these “threads” as threads. And here is where we enter the murk. Snorkels on.
Threading through threads
A challenge when trying to deal with some of these issues is that broad concepts end up with specific implementations, which may all be different, and they often use the same words for different things. It’s very easy to get confused by overloaded terms like “thread.” So we’ll zoom back out and work our way in, trying to be precise about language along the way.
This will likely be review, but work with me here, since my crusty old brain benefits from something resembling a linear stroll through the murk. I make no claim to be an expert; what follows reflects the efforts of an earnest dilettante to gather the various pieces of evidence from various people and places and paint a coherent picture. Your mileage may vary. Hopefully you emerge less confused, but, if not, I’ll consider myself in good company.
• Once upon a time, in the age of the paleocomputer, only one task would exist at a time. Computers had a single CPU that could do only one thing at a time. There was nothing else. And the world was simple. And boring. And slow.
• Multi-tasking allowed more than one task to be in play at a given time, although the CPU could still do only one thing at a time, and the operating system swapped tasks in and out.
• Multi-processing is another term that tends to mean the same thing as multi-tasking but is overloaded with a hardware connotation. It can describe a system that has more than one CPU and is sometimes conflated with multicore. To avoid this processor/core confusion, we’ll refer here to a CPU without regard to where it’s located relative to any other CPUs.
• Given a multi-processor system (in the general sense), it becomes possible to execute multiple things at the same time, rather than managing multiple things that have to share a single CPU. These things that are executed might be processes, which are, roughly speaking, independent software entities with their own space and context that, in general, don’t overlap with any other processes. Or they might be threads, which are portions of a process that can execute independently of each other. A system that can manage multiple threads is called multi-threading.
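To make the process/thread distinction concrete, here’s a minimal Python sketch (the names and counts are invented for illustration): threads launched within one process all see the same memory, while a child process gets its own copy of it.

```python
# Sketch: threads share an address space; processes get their own.
# Not META-specific -- just the generic process/thread distinction.
import threading
import multiprocessing

def bump(counter):
    counter.append(1)

# Four threads in one process all mutate the SAME list.
shared = []
threads = [threading.Thread(target=bump, args=(shared,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(shared))  # 4: every thread saw the parent's list

# A child process mutates its own COPY; the parent's list is untouched.
plain = []
p = multiprocessing.Process(target=bump, args=(plain,))
p.start()
p.join()
print(len(plain))   # 0: the append happened in the child's memory
```

The same distinction is why threads need locks around shared data and processes need explicit channels (pipes, queues) to communicate at all.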
• Operating systems generally manage processes and threads, deciding which to run when, and, in the context of a multi-processor system, where to run them. Here we enter the realm of scheduling, which a few minutes of research will confirm is a complicated discipline in its own right. From a scheduling standpoint, whether the entity being scheduled is a process or thread doesn’t usually matter, and so the term task is often used generically to indicate process or thread.
If you think it’s easy to decide which of a number of tasks to schedule at any given time on one of several CPUs, you’re wrong. There are numerous algorithms and approaches, each with benefits and drawbacks.
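As a toy illustration of why the choice of algorithm matters, here’s a hedged Python sketch comparing two simple policies over the same made-up task set; real OS schedulers weigh far more factors (deadlines, I/O state, cache warmth) than either of these.

```python
# Sketch: two toy scheduling policies over the same ready queue.
# Task names and priorities are invented; lower number = more urgent.
from collections import deque

tasks = [("net_rx", 1), ("ui", 3), ("logger", 9), ("audio", 2)]

def round_robin(tasks, slices=2):
    """Fair: each task runs a fixed slice, in arrival order, repeatedly."""
    order, queue = [], deque(name for name, _ in tasks)
    for _ in range(slices * len(tasks)):
        t = queue.popleft()
        order.append(t)
        queue.append(t)   # back of the line for the next slice
    return order

def strict_priority(tasks):
    """Urgent-first: always run the most urgent runnable task."""
    return [name for name, _ in sorted(tasks, key=lambda t: t[1])]

print(round_robin(tasks))
# ['net_rx', 'ui', 'logger', 'audio', 'net_rx', 'ui', 'logger', 'audio']
print(strict_priority(tasks))
# ['net_rx', 'audio', 'ui', 'logger']
```

Round-robin guarantees everyone eventually runs; strict priority guarantees the urgent work runs first but can starve the logger forever. Every real policy is some negotiated middle ground between those two failure modes.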
So with a single CPU and multiple tasks, the OS will determine when to swap one task for another. Swapping can be done for efficiency, making use of stalled time in one task to run another task, or for fairness, to ensure that each task gets some time. If the CPU is simple, then it knows about only one task at a time.
• When swapping out a task, the OS takes the information about the task being executed, called the context, and saves it to memory; the task being swapped in has its context retrieved from memory. The contents of the context will vary by architecture but will generally include such things as register contents and, most importantly, the program counter.
Swapping to memory takes time, so there’s always a tradeoff between the amount of time being saved due to some stalled task versus the amount of time it takes to swap the context. All part of the scheduling consideration.
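The save/restore dance can be sketched in a few lines of Python. Everything here (the Context fields, the register count) is invented for illustration; a real context is architecture-specific.

```python
# Sketch: a software context switch saves the outgoing task's registers
# and program counter to memory and restores the incoming task's.
from dataclasses import dataclass, field

@dataclass
class Context:
    pc: int = 0                                       # where to resume
    registers: list = field(default_factory=lambda: [0] * 8)

class CPU:
    def __init__(self):
        self.pc = 0
        self.registers = [0] * 8

    def switch(self, saved, out_task, in_task):
        # Save the outgoing task's state to memory...
        saved[out_task] = Context(self.pc, self.registers[:])
        # ...and restore the incoming task's state from memory.
        ctx = saved[in_task]
        self.pc, self.registers = ctx.pc, ctx.registers[:]

cpu = CPU()
contexts = {"B": Context(pc=100, registers=[7] * 8)}
cpu.pc = 42                          # task A is partway through its code
cpu.switch(contexts, out_task="A", in_task="B")
print(cpu.pc)            # 100: task B resumes where it left off
print(contexts["A"].pc)  # 42: task A's progress is preserved
```

In a real system each of those memory accesses costs cycles, which is exactly the overhead the scheduler is weighing against the stall it’s trying to hide.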
• This kind of multi-threading, where the OS decides which one of several tasks to schedule on a single CPU at a given moment, is called temporal multi-threading. Some kinds of temporal multi-threading are called super-threading. If you have multiple CPUs, however, then you can have true concurrent computing and can therefore schedule more than one task at a time. Because you can have threads actually running at the same time, this is referred to as simultaneous multi-threading.
• The OS has to be specifically designed to handle simultaneity – typically found in SMP (symmetric multi-processing) versions. The scheduling gets more complex, since a task that starts and then stalls on one CPU might actually resume on another. Cache state and TLB contents are among the complicating items that figure into the calculus.
• We can add another degree of complexity with hardware threading. In the simplest instance, this is where a CPU has a single execution unit but can hold the contexts of multiple threads in hardware. This means that context switches are very fast, because there is no swapping in and out of memory.
• This is one instance where language gets confusing. In a CPU with the capacity to hold four threads in hardware, a hardware guy will think of each of those “slots” as a thread. In that way of thinking, there would never be more than four threads (for this example) because that’s all the hardware will hold. But from a software standpoint, the application determines the number of threads; there could be 20 or 1000 software threads.
So, from a scheduling standpoint, you now have some threads in hardware that can swap quickly and the rest that would have to be managed by memory swaps. Presumably the most commonly-run threads would benefit by remaining in hardware as much as possible. Here the OS scheduler would have to decide which threads to keep in hardware and which in memory. More algorithmic complexity.
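That tradeoff can be mocked up with made-up numbers: switches among the four resident hardware threads are cheap, while scheduling a non-resident thread forces a memory swap. The costs and the naive eviction policy below are purely illustrative, not META’s.

```python
# Sketch: four hardware thread slots; switching between resident
# threads is cheap, bringing in a non-resident thread is expensive.
HW_SLOTS = 4
SLOT_SWITCH_COST = 1      # e.g. a single cycle
MEMORY_SWAP_COST = 100    # save one context to memory, load another

def run_schedule(schedule, resident):
    """Tally the total switch cost for a sequence of thread IDs."""
    resident = list(resident)          # the threads currently in slots
    cost = 0
    for tid in schedule:
        if tid in resident:
            cost += SLOT_SWITCH_COST
        else:
            resident.pop(0)            # naive eviction: oldest resident
            resident.append(tid)       # swap the new thread into a slot
            cost += MEMORY_SWAP_COST
    return cost

hot = [0, 1, 2, 3] * 5        # only ever touches resident threads
churn = [0, 1, 2, 3, 4] * 4   # a fifth thread keeps evicting someone
print(run_schedule(hot, [0, 1, 2, 3]))    # 20
print(run_schedule(churn, [0, 1, 2, 3]))  # 1604: nearly all misses
```

Adding just one thread beyond the slot count turns every switch into a miss in this toy model, which is why the OS would want its hottest threads pinned in hardware.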
• If we have a single CPU with multiple hardware thread capability, then the OS can simply think in terms of a single CPU, and you can live with a “standard” version of the OS, although it has to know about the hardware threading. But we can get yet more complex by splitting the difference between multicore and single core by having multiple versions of some parts of the execution hardware.
Exactly what this means could be different for different architectures, but the upshot is that, under the right circumstances, where the right resources aren’t shared, two threads can execute at the same time. Under the wrong circumstances, where the threads are vying for the same execution hardware, only one can execute.
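The resource-conflict test at the heart of this idea is just a disjointness check. Here’s a minimal Python sketch with invented resource names; a real core would be tracking specific ALUs, load/store units, and the like.

```python
# Sketch: two threads can issue in the same cycle only if the
# execution resources they need don't overlap. Names are invented.
def can_co_issue(needs_a, needs_b):
    """True when the two threads use disjoint execution resources."""
    return not (set(needs_a) & set(needs_b))

print(can_co_issue({"alu"}, {"dsp_mac"}))  # True: disjoint units
print(can_co_issue({"alu"}, {"alu"}))      # False: both want the ALU
```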
So it’s not full multicore with completely independent CPUs; it’s somewhere in between, and it still qualifies as simultaneous multi-threading. Intel refers to their version of this as hyper-threading; IMG uses the term super-threading, which should not be confused with the super-threading we just talked about, which is a flavor of temporal multi-threading.
• When you have simultaneous multi-threading (whether of the hyper- or super- variety), even though you have only one complete CPU, the fact that you can sometimes do more than one thing means that now the OS needs to view the CPU as not being a single CPU, but rather as being multiple virtual CPUs (which I’ll abbreviate VCPU here). Now you need an SMP version of the OS.
In theory, at least, the number of duplicated execution units doesn’t have to match the number of hardware threads (you could have four hardware threads with two partially-replicated execution paths). But in the specific case of the META processor, each hardware thread acts like its own VCPU. And so, while we referred to the four slots before as “threads,” I’ll call them VCPUs from now on.
• We can make things yet more complicated by including a hardware scheduler. So far, all the scheduling has been done by the OS. But it takes time for the OS (which is its own process with its own threads) to figure out what to schedule next. A hardware scheduler should be able to do this much more quickly. The META core has a hardware scheduler that can implement a single-cycle swap. But it can swap only between the threads in hardware; presumably the OS has to get involved if other threads need to be scheduled.
This place where the OS scheduler and the hardware scheduler meet is an area where casual research reveals almost no information. But the obvious implication is that the OS has to communicate with the hardware scheduler at some level. As long as the four hardware threads contain four concurrent threads that the OS is managing, the hardware scheduler can swap them around to maximize performance without the OS even knowing about it.
• But – and bear with me here, we’re almost done with our little tour – let’s take things one level further: with the META processor, you can have different hardware threads allocated to different OSes. The VCPUs share many resources, so there has to be some way to manage the resources between the OSes, which don’t know of each other’s existence.
Here the hardware scheduler can help, since it’s very low-level and is outside the scope of either of the OSes. But how can priorities between the two OSes be rationalized? Let’s start with the simple case of all four VCPUs being given to a single OS. If you want to dial around which threads execute on which VCPUs or mess with priorities, the OS can let you use affinity or priority mechanisms to bias scheduling decisions.
This scenario is reasonably easy to manage because both the hardware scheduler and the OS scheduler have visibility over all of the VCPUs. But what happens when we put Linux SMP on only three of the VCPUs and have an RTOS on the fourth VCPU? The fact that an RTOS is being used would suggest that it has some demanding requirements with respect to timing and priority.
The RTOS can effectively manage the activities under its purview, and Linux can manage its threads. But since neither OS knows about the other, how can priorities and load balancing be managed between the two? This is where IMG has another low-level technology called Automatic MIPS Allocation (AMA). This is a knob that can be tweaked in real time to allocate the available cycles amongst the threads.
Generally, AMA would be used to arbitrate between different OSes. In the example above, the three Linux threads would typically all get the same AMA setting (with any balancing there done through the OS) while the RTOS might get a different AMA setting. But of course you could always have a good laugh by playing with both AMA at the low level and OS settings at the high level and causing all kinds of confusion. Good clean family fun.
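To get a feel for how such a cycle-allocation knob might behave, here’s a hypothetical proportional allocator in Python. The real AMA mechanism is a low-level hardware feature whose internals aren’t public; the weights, names, and integer-division rounding below are my own invention.

```python
# Sketch of a weighted cycle allocator in the spirit of AMA.
# All numbers are illustrative, not real AMA settings.
def allocate_cycles(total_cycles, weights):
    """Split a budget of cycles among VCPUs in proportion to weight."""
    total_weight = sum(weights.values())
    return {vcpu: total_cycles * w // total_weight
            for vcpu, w in weights.items()}

# Three Linux VCPUs get equal weight; the RTOS VCPU gets a bigger share.
weights = {"linux0": 1, "linux1": 1, "linux2": 1, "rtos": 3}
print(allocate_cycles(600, weights))
# {'linux0': 100, 'linux1': 100, 'linux2': 100, 'rtos': 300}
```

Balancing among the three equal-weight Linux VCPUs would then be left to the Linux scheduler, one level up.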
There, whew, done that. It bears noting that much of the complexity arises from interactions between the operating system and the hardware. It’s also possible to implement “bare-metal” systems – that is, with no operating system – in which case the hardware threads and scheduler take on duties that would otherwise require an OS.
All in all, there is a surprising level of flexibility, granularity, and control that’s possible with the META processor if needed. Taking full advantage of all those degrees of freedom would be an unusual project that could force you to learn more than you ever wanted to about scheduling and thread control.
Finally, if nothing else from this discussion is clear, the one obvious take-away that even I can understand is that terminology is rather pliable in this space. Critical terms mean different things to different companies. If you accept the fact that, all too often, the meanings of the buzzwords really are less important than the simple fact of having the buzzwords, it becomes immediately obvious that there are a number of powerful buzzwords that have yet to be exploited that really should be addressed.
So, regardless of the behavior they describe, let the record show that these pages spawned the following new buzzwords. The obvious ones take threading into the giant and minuscule ranges via the metric system. So we must exploit the concept of megathreading, gigathreading, tera-, and petathreading on the high end; microthreading, nano-, pico-, femto-, and attothreading on the low end. We will be wowed by gonzothreading, extremeThreading, and the capabilities of the PowerThread, the CottonThread, and the NuclearThread (which will spawn a counter-project by the competition code-named NuclearThread Deterrent).
I have no idea what they will do, but it’s the name that counts. And you saw it here first.
Link: META processors