
Centaur Produces Nine-Headed, Two-Species Monster

New Server Chips Combine AI Acceleration with x86 Familiarity

“A great man is… an inspiration and a prophecy.” – Robert Green Ingersoll 

If a centaur is supposed to be a combination of two separate beasts, then nothing could make a more appropriate mascot for Centaur Technology’s latest microprocessor. The tiny Austin-based company has created a dual-DNA, nine-headed creature that runs both x86 server code and hugely accelerated machine-learning (ML) processes. It’s a match made in mythology. 

The new chip is called simply CHA (which doesn’t stand for anything), and it’s probably due out sometime late this year. We’re not sure, because sightings of the elusive beast are rare, although confirmed. It’s real; it’s just not reproducing yet in any great numbers. 

What CHA does, exactly, would confuse even Janus. On one side, it’s a fairly conventional x86 server processor with eight CPU cores running at 2.5 GHz. But on the other, it’s a massive 20-trillion-operations-per-second (20 TOPS) machine-learning processor that can outrun far more expensive hardware on MLPerf benchmarks. Creaky old CISC architecture, meet your new workload. 

For those still brushing up on their microprocessor mythology, Centaur Technology has been around longer than many of us. The company was formed back in 1995 during the x86 clone wars. Like NexGen, Rise, Cyrix, Transmeta, and others too numerous to mention, Centaur thought it could produce clean-room x86 designs that were fully compatible with Intel’s (and AMD’s) processors, hoping it could chip off a tiny sliver of the obscenely profitable PC processor market. The difference is, Centaur was right. Its chips worked as advertised, and the company has been cranking out newer and better x86 chips for 25 years. For most of that time, the company has been wholly owned by VIA Technologies in Taiwan, but it effectively still runs as a small and independent startup. 

Now the company has taken on a new challenge: accelerating ML inference code. It’s still early days for ML hardware – not unlike the x86 market in the mid-90s – so there’s no consensus on how (or why) to build ML processors. Centaur chose a dual-pronged approach by combining the familiar x86 architecture with a completely new and original SIMD architecture boasting 128-bit instructions and massive 4096-byte (not bits) data paths. It’s two processors in one, intended for enterprise-level installations that are space- and cost-constrained. It’s neither the fastest x86 processor you can buy nor the fastest ML accelerator, but it is apparently the only device to combine the two. 
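The headline 20 TOPS number lines up neatly with that 4096-byte data path. The arithmetic below is a back-of-envelope sketch, not Centaur's published derivation. It assumes one 8-bit multiply-accumulate (MAC) per byte lane per clock, each MAC counted as two operations, and the accelerator clocked at the same 2.5 GHz as the x86 cores.

```python
# Back-of-envelope check of the 20 TOPS figure.
# Assumptions (not stated in the article): one 8-bit MAC per byte lane per
# clock, each MAC counted as two operations, accelerator clocked at 2.5 GHz.
DATA_PATH_BYTES = 4096   # width of the SIMD data path, from the article
CLOCK_HZ = 2.5e9         # clock rate quoted for the chip
OPS_PER_MAC = 2          # one multiply plus one add (a common TOPS convention)


def peak_tops(lanes: int, clock_hz: float, ops_per_mac: int = OPS_PER_MAC) -> float:
    """Peak throughput in trillions of operations per second."""
    return lanes * clock_hz * ops_per_mac / 1e12


tops = peak_tops(DATA_PATH_BYTES, CLOCK_HZ)
print(f"{tops:.2f} TOPS")
```

Under those assumptions the peak works out to a shade over 20 TOPS, which is consistent with the rounded figure quoted above.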

First, the x86 part. This half of the chip (which, in reality, consumes about two-thirds of the CHA chip’s silicon die area) sports eight identical x86 processor cores, each with its own private L1 and L2 caches, plus 16MB of shared L3 cache. All eight cores support Intel’s AVX-512 extensions for vector operations. The chip also includes north/southbridge features like four DDR4 channels, 44 PCIe 3.0 lanes, the usual assortment of miscellaneous PC I/O, and a port to a second CHA chip socket in case you want to build a two-chip system. 

Next, the ML accelerator. Centaur calls it Ncore, and it’s a clean-sheet design organized as a long SIMD pipeline with a limited instruction set and very wide registers, data buses, and memories. Designers may argue over the best way to design an inference engine, but everyone agrees on one thing: machine learning consumes huge gobs of data. 

ML workloads are broadly like those in signal processing or graphics, in that they’re (a) repetitive, (b) highly parallel, and (c) require lots of data transfers. In fact, feeding data to an inference engine is one of the toughest parts, just as it is with many DSP or GPU tasks. 

Unlike with a “normal” RISC or CISC processor, feeding the Ncore pipeline with instructions isn’t all that hard because the code is highly iterative (i.e., it’s mostly loops). In fact, Ncore’s code memory is only 512 words long, and even that is divided into halves, with Ncore executing out of one half while the programmer (that’s you) loads the other half. Its data memory, on the other hand, spans 16MB and consumes about two-thirds of Ncore’s total area. 
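That split code memory is a classic double-buffering (or “ping-pong”) arrangement. Here is a minimal model of the idea, assuming the two halves are 256 words each. The class and method names are hypothetical and say nothing about Centaur’s actual programming interface.

```python
# Illustrative model of a double-buffered ("ping-pong") instruction memory.
# Sizes follow the article (512 words, two halves); the class and method
# names are hypothetical and do not reflect Centaur's real interface.
HALF_WORDS = 256  # 512-word instruction memory split into two halves


class PingPongInstructionMemory:
    def __init__(self):
        self.banks = [[0] * HALF_WORDS, [0] * HALF_WORDS]
        self.active = 0  # index of the bank currently being executed

    def load_inactive(self, program):
        """Host side: fill the idle bank while the other one executes."""
        if len(program) > HALF_WORDS:
            raise ValueError("program does not fit in one half")
        idle = 1 - self.active
        # Pad with zeros so the bank always holds exactly HALF_WORDS words.
        self.banks[idle] = list(program) + [0] * (HALF_WORDS - len(program))

    def swap(self):
        """Switch execution to the freshly loaded bank."""
        self.active = 1 - self.active

    def executing(self):
        """Return the bank the accelerator is currently running from."""
        return self.banks[self.active]


mem = PingPongInstructionMemory()
mem.load_inactive([0xA1, 0xA2])  # stage the first kernel in the idle bank
mem.swap()                       # start executing it
mem.load_inactive([0xB1])        # meanwhile, stage the next kernel
first = mem.executing()[:2]
mem.swap()                       # hand over to the second kernel
second = mem.executing()[:1]
```

The point of the arrangement is that loading never stalls execution: the host only ever writes to the half that isn’t running.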

That data memory is mapped onto the chip’s internal PCI bus during enumeration, so it appears to the x86 processors as a single 16MB block of RAM. All Ncore programming is done through this memory-mapped window; you never access Ncore directly. 
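From the software side, a memory-mapped window just looks like a big byte array. The sketch below uses an anonymous mapping from Python’s mmap module as a stand-in for the 16MB block. It does not touch real hardware, and the helper names and the 4096-byte offset are made up for illustration. On an actual system, a driver would map the PCI region instead.

```python
import mmap
import struct

# Stand-in for the 16MB window: an anonymous mapping, NOT a real device.
# On real hardware a driver would map the PCI region instead (not shown).
WINDOW_BYTES = 16 * 1024 * 1024

window = mmap.mmap(-1, WINDOW_BYTES)


def write_int8_vector(win, offset, values):
    """Copy signed 8-bit values into the window at a byte offset."""
    win[offset:offset + len(values)] = struct.pack(f"{len(values)}b", *values)


def read_int8_vector(win, offset, count):
    """Read signed 8-bit values back from the window."""
    return list(struct.unpack(f"{count}b", win[offset:offset + count]))


# Hypothetical offset, chosen only for the demonstration.
write_int8_vector(window, 4096, [1, -2, 3, -4])
result = read_int8_vector(window, 4096, 4)
print(result)
```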

Like many DSPs and GPUs, Ncore is highly parallel and optimized for repetitive MAC (multiply-accumulate) operations working on a steady flow of new data. It can work on 8-bit integers, 16-bit integers, or BFloat16, a popular format for machine learning that has a wider dynamic range than the “official” IEEE 754 half-precision (FP16) format, because it keeps the 8-bit exponent of a 32-bit float and gives up mantissa bits instead. 
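To see what BFloat16 trades away, it helps to build one by hand. A BFloat16 value is simply the top 16 bits of a 32-bit IEEE 754 float. The sketch below converts by plain truncation, which is an assumption made for simplicity (real hardware may round instead), and the function names are illustrative.

```python
import struct


def float_to_bfloat16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for x by truncating a float32."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16  # keep sign, 8 exponent bits, top 7 mantissa bits


def bfloat16_bits_to_float(bits16: int) -> float:
    """Expand a bfloat16 bit pattern back to a Python float."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return value


def to_bfloat16(x: float) -> float:
    """Round-trip x through bfloat16 (truncating, not rounding)."""
    return bfloat16_bits_to_float(float_to_bfloat16_bits(x))


FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

small = to_bfloat16(3.14159)  # precision suffers: only about 3 decimal digits
large = to_bfloat16(1e30)     # range survives: far beyond what FP16 can hold
print(small, large, large > FP16_MAX)
```

The precision loss is obvious in the first value, but the second shows the payoff: numbers many orders of magnitude beyond FP16’s limit still fit.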

Because there’s so much data manipulation in ML workloads, Centaur built special-purpose hardware to rotate, shift, and otherwise massage data as it flows into and out of Ncore’s memory. Certain functions, like bit-wise rotation or replicating bytes to fill out a vector, are done on the fly without explicit coding. Output activation functions like rectified linear unit (ReLU), sigmoid, and hyperbolic tangent are also handled by dedicated hardware, so they add no extra code and effectively no extra time. 
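For reference, here is what those operations compute, written as plain software. These are the textbook definitions only, with illustrative function names. They show the math that such hardware would perform, not how Ncore implements it.

```python
import math


def relu(x: float) -> float:
    """Rectified linear unit: pass positives, clamp negatives to zero."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Logistic sigmoid: squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x: float) -> float:
    """Hyperbolic tangent: squashes any input into the range (-1, 1)."""
    return math.tanh(x)


def broadcast_byte(value: int, width: int) -> bytes:
    """Replicate one byte to fill out a vector of the given width."""
    return bytes([value]) * width


def rotl8(x: int, n: int) -> int:
    """Bit-wise rotate an 8-bit value left by n positions."""
    n %= 8
    return ((x << n) | (x >> (8 - n))) & 0xFF


activations = [relu(-3.0), relu(2.5), sigmoid(0.0), tanh(0.0)]
vector = broadcast_byte(7, 4)
rotated = rotl8(0b10000001, 1)
```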

Based on published MLPerf benchmark results, Centaur’s CHA kicks some AI butt. It runs circles around Intel’s best Xeon processors, and it beats out chips from Nvidia and Qualcomm, as well as cloud-based services from Google and Alibaba. It’s not the fastest overall, but Centaur claims it’ll offer the best price/performance. A CHA-based system will also be smaller, since other x86 solutions require external accelerator cards and/or FPGA add-ons. Nobody else offers an x86 processor and ML accelerator all in one package. That might be a big deal to customers creating space-constrained 1U rackmount systems. 

Besides, there’s nothing to prevent you from adding off-chip hardware alongside CHA, since the chip behaves like any other x86 processor. The Ncore coprocessor inside doesn’t preclude the use of additional hardware acceleration outside. 

Downsides? There may be a few. If you look at it as just an x86 server processor, CHA is woefully out of date. Its eight cores are all single-threaded, compared to the dual-threaded processors from Intel and AMD. Its 2.5 GHz clock rate is second-rate, too. Centaur describes its x86 design as a cross between Haswell and Skylake, which is pretty accurate in terms of ISA features, but Haswell is seven years old and Skylake is five years old. By the time CHA starts shipping late this year, Intel will have moved on to Sunny Cove, putting CHA yet another generation behind the state of the art. Designing a multicore x86 processor from scratch with a small team of engineers is a remarkable achievement, but even so, this one’s not close to current performance norms. CHA’s value is in the ML accelerator, not the eight-core x86 chip.

Ancient centaurs were supposed to be gifted in prophecy. Maybe the modern Centaur, with its hybrid CHA, sees something coming that its competitors don’t – yet. 
