Trying to Keep Big Things in Little Packages

Embedded has always been something of a mixed blessing on FPGAs. Certainly FPGAs feature prominently in many embedded systems, but they rarely take the central computing stage. Why? One word: performance.

There are two ways to implement a processor on an FPGA. The most prevalent is to use a soft core like Nios (Altera), MicroBlaze (Xilinx), ARM (Actel, Altera), Coldfire (Altera), or Mico32 (Lattice; open). The alternative is to use one of the built-in PowerPC processors on the high-end Virtex devices from Xilinx.

The soft core approach is often more appealing for FPGA vendors because they don’t have to dedicate silicon to a processor that may or may not be used, and they don’t have to offer two versions of their chip – one with and one without. In fact, Altera started to go down the dedicated processor route and changed their minds pretty quickly. You may hear the actual reasons debated (sales? an unholy mix of software and hardware engineering in the user community?). Bottom line, they decided against it – and consider themselves very successful with their Nios offering.

What’s the downside to the soft cores? It’s that one word again: performance. Clearly, you’re not going to be able to make FPGA gates, connected together in the form of a processor, operate as quickly as a hand-crafted processor. Or even one slapped together out of standard cells. The soft cores do very well where they’re not blazing the trail on speed.

The dedicated processors are certainly faster than the soft cores, and yet, here again, if there’s one thing that holds them back, it’s – you guessed it – performance. At 300-500 MHz, they don’t provide nearly the speed that a standalone processor chip can provide. So it’s almost a double-fault: they’re less flexible than a soft core but not fast enough to compete with standard chips. They’re stuck in this tweener land, neither fish nor fowl.

In fact, with the Virtex 4 family from Xilinx, designers often chose the high-end devices for such things as serial signaling; the processor came along for the ride even though it wasn’t needed. With Virtex 5, the high-end features were split out – in particular serial communications – so that they were available without a processor.

So with this as context, some pieces of the FPGA embedded-processor business do better than others, but, relative to the overall embedded market, it’s not huge. FPGAs are much more commonly used as accelerators that receive handoffs from off-chip processors when critical algorithms can’t cut the mustard in software.

At first blush, then, it is certainly surprising to hear John Carbone of Express Logic say that they get some of the best performance of their ThreadX operating system on FPGAs. Until you look a level deeper. Many of the soft core designs are simpler functions that don’t need or want to compete with heavy duty computing. In such a design, using a full-up operating system like Linux, with its roots in telecom, or a real-time OS like VxWorks, with its military heritage, can be a real waste of horsepower, chewing up processor cycles and swallowing memory. And that’s being kind. So it stands to reason that a smaller OS might play well here. And that’s where ThreadX is positioned.

This is even truer for small, more cost-sensitive boxes like printers, cameras, and small routers. If FPGAs are going to be used here at all, it will be the small cheap ones, and, on those, soft cores are the only option. A small, cheap device means not too much memory, so again, we need that reduced-footprint OS. And that means an OS with fewer services than the big guys provide.

The specific set of services offered by ThreadX was admittedly arrived at somewhat by successive approximation. As John tells the story, it is the descendant of the Nucleus OS, which had too few services, and which was succeeded by the Nucleus+ OS, which was too rich, giving rise to ThreadX (by the same guy, William Lamie, but now in a different company, Express Logic), which was just right. Call it the Goldilocks OS.

One of the tradeoffs for using ThreadX is that it supports only one process, although it allows multi-threading. Once you start managing threads, guaranteeing real-time performance can get trickier. Threads can be swapped in and out by the OS; just because a thread has a high priority doesn’t mean it stays in place until done: it can still be pre-empted by a higher-priority thread. Typically, the only way to stop this from happening is to block all pre-emption.

ThreadX has a unique middle way: a pre-emption threshold. If a thread of a high priority is executing and the OS wants to swap in another thread to give it a chance to execute, the new thread must have a priority exceeding a certain threshold (which can be defined) before the pre-emption can take place. In this manner, the highest-priority tasks can be guaranteed deterministic access without completely stopping pre-emption, and time-critical operations can complete without being swapped out. If there are multiple important threads, they can be given the same high priority and then be time-shared.
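To make the threshold idea concrete, here is a minimal sketch of the decision the scheduler has to make. It is a toy model, not the ThreadX API: the Thread class, the can_preempt function, and the thread names are all hypothetical, and it assumes the common RTOS convention that a lower number means a more urgent priority.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    priority: int           # assumption: lower number = more urgent
    preempt_threshold: int  # only priorities more urgent than this may pre-empt


def can_preempt(running: Thread, candidate: Thread) -> bool:
    """Return True if `candidate` is allowed to pre-empt `running`.

    Without a threshold, any more-urgent thread could pre-empt. With one,
    the candidate has to beat the running thread's threshold, not merely
    its priority.
    """
    return candidate.priority < running.preempt_threshold


# A time-critical thread that tolerates interruption only by truly urgent work.
control = Thread("control_loop", priority=20, preempt_threshold=15)
# The same thread with the threshold set equal to its priority (no protection).
plain = Thread("control_loop_plain", priority=20, preempt_threshold=20)

logger = Thread("logger", priority=18, preempt_threshold=18)
watchdog = Thread("watchdog", priority=5, preempt_threshold=5)

logger_blocked = not can_preempt(control, logger)   # 18 does not beat 15
watchdog_allowed = can_preempt(control, watchdog)   # 5 beats 15
logger_unblocked = can_preempt(plain, logger)       # 18 beats 20
```

In this model the logger is more urgent than the control loop, yet it still has to wait, while the watchdog gets through. Setting the threshold equal to the thread's own priority gives back ordinary priority-based pre-emption.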

Another way of providing both higher performance and greater access to processing by threads is to use multiple processors. ThreadX supports an SMP (symmetric multi-processing) configuration, again, with a single multi-threaded process. Threads can be pinned to cores so that, for example, a critical time-sensitive compute-intensive function can be given exclusive access to one core while the rest of the threads share the remaining core or cores.
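The pinning arrangement can be sketched in a few lines. Again, this is only an illustrative model under stated assumptions, not how ThreadX SMP is implemented: PinnedThread, schedule, and the thread names are hypothetical, lower numbers are assumed to mean more urgent priorities, and the greedy assignment is deliberately simplistic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PinnedThread:
    name: str
    priority: int             # assumption: lower number = more urgent
    allowed_cores: frozenset  # cores this thread is permitted to run on


def schedule(ready, num_cores):
    """Greedily give each core at most one ready thread, most urgent first.

    Returns a dict mapping core index -> thread name. Threads that find no
    free permitted core simply wait for the next scheduling pass.
    """
    assignment = {}
    for thread in sorted(ready, key=lambda t: t.priority):
        for core in range(num_cores):
            if core in thread.allowed_cores and core not in assignment:
                assignment[core] = thread.name
                break
    return assignment


# Core 1 is reserved for the compute-intensive thread; everything else
# is restricted to core 0.
dsp = PinnedThread("dsp_filter", priority=8, allowed_cores=frozenset({1}))
net = PinnedThread("network", priority=5, allowed_cores=frozenset({0}))
ui = PinnedThread("ui", priority=10, allowed_cores=frozenset({0}))

busy = schedule([ui, net, dsp], num_cores=2)
idle_dsp = schedule([ui, net], num_cores=2)
```

With all three threads ready, the network thread takes core 0, the filter takes core 1, and the UI thread waits. When the filter has nothing to do, core 1 stays idle and is not lent out, which is the price of exclusive access.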

SMP is generally the simplest way to manage a multicore system, but it requires that each core look identical – same memory (or one shared memory), same everything. The OS has to be able to assign computation to cores without worrying about which cores do what; they should all act alike. In applications where this doesn’t make sense, an AMP (asymmetric multi-processing) setup can be used. But a single OS can’t manage all that: with AMP, each core has its own OS and operates more or less autonomously.

Which doesn’t sound very useful if you’re trying to get these things to act like they’re all on the same team and collaborating in the furtherance of some common good. This is where a messaging system is needed so that the cores can talk to each other. You can roll your own such setup, but that’s a fair bit of infrastructure to have to build – especially when you no longer have to. The MCAPI (Multicore Communications API) standard, approved last spring, provides a messaging paradigm for just this kind of situation. Polycore Software has provided the first (and, to date, only) implementation of the MCAPI standard with their PolyMessenger and PolyGenerator products, which now support ThreadX. It works anywhere ThreadX works, meaning it works on processor cores in FPGAs.
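For a feel of what message passing between autonomous cores looks like, here is a small sketch. It is emphatically not the MCAPI API or anything from Polycore: the Fabric class, its methods, and the (node, port) addressing are hypothetical stand-ins, and a Python thread plays the part of a second core running its own OS. The point it illustrates is that the two sides share no state and cooperate only by exchanging copied messages.

```python
import queue
import threading


class Fabric:
    """Hypothetical stand-in for the inter-core transport (shared memory, FIFO, etc.)."""

    def __init__(self):
        self._inboxes = {}

    def create_endpoint(self, node, port):
        """Register a mailbox addressed by (node, port) and return its address."""
        address = (node, port)
        self._inboxes[address] = queue.Queue()
        return address

    def send(self, dest, payload):
        """Deliver a copy of the payload to the destination mailbox."""
        self._inboxes[dest].put(bytes(payload))

    def recv(self, endpoint, timeout=1.0):
        """Block until a message arrives at the endpoint, or raise queue.Empty."""
        return self._inboxes[endpoint].get(timeout=timeout)


def run_demo():
    fabric = Fabric()
    control_ep = fabric.create_endpoint(node=0, port=1)
    worker_ep = fabric.create_endpoint(node=1, port=1)

    def worker_core():
        # The "other core": wait for a request, do some work, send a reply.
        request = fabric.recv(worker_ep)
        fabric.send(control_ep, request.upper())

    worker = threading.Thread(target=worker_core)
    worker.start()
    fabric.send(worker_ep, b"filter block 7")
    reply = fabric.recv(control_ep)
    worker.join()
    return reply
```

Calling run_demo() sends one request from node 0 to node 1 and returns the reply. Rolling even this much by hand for real hardware means dealing with buffers, addressing, and synchronization, which is the infrastructure a standard messaging layer is meant to take off your plate.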

So despite its simplicity, ThreadX supports many FPGA soft cores all the way from simple single-threaded designs up to AMP multicore – for a one-process design. Given the small-footprint nature of the RTOS and the fact that soft cores are easy to instantiate multiple times into a multicore fabric in an FPGA, it makes sense that Express Logic would see FPGAs as an important part of their business. Their sweet spot overlaps nicely with the space where soft cores play well.