Do-It-Yourself Linux Machine

It’s a good month for microprocessor aficionados, what with the new Cortus twins, the MIPS I6400, AMD’s Hierofalcon, and now Synopsys’s ARC HS38. There’s still some differentiation to be had in this market.

Followers of Synopsys know that the EDA company acquired ARC, the CPU-design firm, several years ago and folded the CPU IP into its DesignWare library system. Indeed, the processor cores are branded as DesignWare, reflecting the reality that ARC processors are more like a design tool than a traditional CPU core. That’s because ARC processors are user-defined. You can add and subtract registers, create your own instructions, invent new condition codes, bolt on in-house coprocessors, and more. Every ARC processor has the capability to be unique and oh-so-finely tuned to its intended application, a feature that many developers really like. It must be working: ARC cores have appeared in 1.5 billion chips just in this year alone.

What ARC-based chips typically don’t have is a big backlog of third-party software, a necessary side effect of their configurability (and the minor detail that they’re not ARM, MIPS, or x86). Like most second-tier CPU architectures, ARC processors are used in deeply embedded applications where suitability for purpose, small size, and low cost are more important than a thriving app store.

What ARC does have is Linux support. In fact, Synopsys’s brand new ARC HS38 processor supports both “standard” single-core and SMP multicore implementations of Linux, something a bit new and unusual in the DIY processor arena. So just because you’ve rolled your own processor hardware doesn’t mean you have to give up on familiar operating systems.

The new HS38 represents the new high end of the ARC processor lineup, essentially replacing the previous flagship ARC 770. Where the 770 had (actually, still has) a limited MMU, smaller micro-TLBs, and a restricted physical address range, the HS38 blows out all of those limitations, giving designers control over their MMU page sizes and a 40-bit address space. The HS38 also gains L1 cache coherence and the option for L2 and/or L3 caches, if you’re so inclined.

Ten years of progress has also benefitted the HS38’s instruction set. The default ISA is now ARC v2, a modern compressed instruction set that’s an average of 18% more thrifty with memory compared to the older ARCcompact ISA, according to the company. And, at up to 2.2 GHz clock speed, the HS38 is way faster.

The HS38 has a ten-stage pipeline, which is longish by embedded-CPU standards. Long pipelines are mandatory for fast clock speeds and high performance, but they exact a penalty every time program flow changes, branches are mis-predicted, or data is loaded from memory and immediately used in an operation. The longer the pipeline, the longer the freight train you have to back up and reroute down the new track.

Branches are somewhat mitigated by a new branch-prediction hardware in the first pipeline stage. The HS38 implements dynamic branch prediction, meaning it guesses on the fly based on recent activity, as opposed to static branch prediction, which relies entirely on hard-coded guidance from the programmer.

There isn’t much Synopsys can do about changes in program flow – that’s up to the programmer – but the HS38 does handle load/use penalties cleverly. Arithmetic and logic operations are typically committed to the register file in stage 6, but they can be pushed back to stage 9 when operands are loaded just before they’re used. The late-commit stage can completely mask the load/use penalty typical of longer pipelines.

Finally, the HS38 is a bit more tolerant of slower memories, something that’s necessary when clock frequencies reach the UHF range. The CPU really has one and a half pipelines, with the second half (stages six through ten) split in two. One half handles arithmetic and logic operations, while the other is dedicated to memory accesses. The Y-shaped pipeline allows the HS38 to take its time (relatively speaking) dealing with operand routing.

New to the HS38 is the option for multiple register files, up to a maximum of eight. This is a bit like what ARM processors (and many microcontrollers) allow, and it enables fast context switching among register sets. Your operating system or scheduler will need to understand how that works, but for fast real-time response, it’s a lot quicker and cleaner than the usual push/pop, call/return stack.

Since it’s technically a configuration tool and not a canned CPU core, the HS38 naturally comes with a lot of design-time options. Don’t want caches? No problem. Don’t need an MMU? That’s doable. In fact, Synopsys is offering three different versions of the new CPU, called HS34, HS36, and HS38. They’re technically the same CPU, just with different options turned on or off. You can save money by licensing the lightweight HS34 version, but you won’t be able to enable the caches, MMU, or SMP Linux support. On the other hand, you can opt for the deluxe HS38 version and later decide to downgrade it to an HS34 or ’36. It’s the same CPU either way; only the financial terms change.

In its ’38 configuration, the CPU supports single, dual, and quad-core configurations. (You can do three cores, too, if you really want.) And, of course, since it’s from Synopsys, there is a wealth of peripheral I/O you can add on. Yes, it’s a good time for processor designers.