Centaur Produces Nine-Headed, Two-Species Monster

New Server Chips Combine AI Acceleration with x86 Familiarity

“A great man is… an inspiration and a prophecy.” – Robert Green Ingersoll 

If a centaur is supposed to be a combination of two separate beasts, then nothing could make a more appropriate mascot for Centaur Technology’s latest microprocessor. The tiny Austin-based company has created a dual-DNA, nine-headed creature that runs both x86 server code and hugely accelerated machine-learning (ML) processes. It’s a match made in mythology. 

The new chip is called simply CHA (which doesn’t stand for anything), and it’s probably due out sometime late this year. We’re not sure, because sightings of the elusive beast are rare, although confirmed. It’s real; it’s just not reproducing yet in any great numbers. 

What CHA does, exactly, would confuse even Janus. On one side, it’s a fairly conventional x86 server processor with eight CPU cores running at 2.5 GHz. But on the other, it’s a massive 20-trillion-operations-per-second (20 TOPS) machine-learning processor that can outrun far more expensive hardware on MLPerf benchmarks. Creaky old CISC architecture, meet your new workload. 

For those still brushing up on their microprocessor mythology, Centaur Technology has been around longer than many of us. The company was formed back in 1995 during the x86 clone wars. Like NexGen, Rise, Cyrix, Transmeta, and others too numerous to mention, Centaur thought it could produce clean-room x86 designs that were fully compatible with Intel’s (and AMD’s) processors, hoping it could chip off a tiny sliver of the obscenely profitable PC processor market. The difference is, Centaur was right. Its chips worked as advertised, and the company has been cranking out newer and better x86 chips for 25 years. For most of that time, the company has been wholly owned by VIA Technologies in Taiwan, but it effectively still runs as a small and independent startup. 

Now the company has taken on a new challenge: accelerating ML inference code. It’s still early days for ML hardware – not unlike the x86 market in the mid-90s – so there’s no consensus on how (or why) to build ML processors. Centaur chose a dual-pronged approach by combining the familiar x86 architecture with a completely new and original SIMD architecture boasting 128-bit instructions and massive 4096-byte (not bits) data paths. It’s two processors in one, intended for enterprise-level installations that are space- and cost-constrained. It’s neither the fastest x86 processor you can buy nor the fastest ML accelerator, but it is apparently the only device to combine the two. 
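Where might the 20 TOPS figure quoted earlier come from? Here is a rough back-of-the-envelope sketch. It assumes one 8-bit multiply-accumulate per byte lane per clock, that each MAC counts as two operations, and that Ncore runs at the same 2.5 GHz as the x86 cores. None of those assumptions is spelled out above, and the function name is purely illustrative, but the result lands close to the headline number.

```python
# Back-of-the-envelope peak throughput for a very wide SIMD MAC engine.
# Assumptions (not stated in the article): one 8-bit MAC per byte lane per
# clock, each MAC counted as two operations (a multiply and an add), and an
# accelerator clock equal to the 2.5 GHz quoted for the x86 cores.

LANES = 4096        # byte-wide lanes in the data path (from the article)
CLOCK_HZ = 2.5e9    # assumed accelerator clock
OPS_PER_MAC = 2     # assumed counting convention


def peak_tops(lanes: int, clock_hz: float, ops_per_mac: int = OPS_PER_MAC) -> float:
    """Return peak tera-operations per second for the given configuration."""
    return lanes * clock_hz * ops_per_mac / 1e12


if __name__ == "__main__":
    print(f"{peak_tops(LANES, CLOCK_HZ):.2f} TOPS")
```

Under those assumptions the peak works out to 4096 × 2.5 billion × 2, or roughly 20.5 trillion operations per second. As with any peak number, sustained throughput on real models will depend on how well the data paths stay fed.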

First, the x86 part. This half of the chip (which, in reality, consumes about two-thirds of the CHA chip’s silicon die area) sports eight identical x86 processor cores, each with its own private L1 and L2 caches, plus 16MB of L3 cache shared among all eight. All eight cores support Intel’s AVX-512 extensions for vector operations. The chip also includes north/southbridge features like four DDR4 channels, 44 PCIe 3.0 lanes, the usual assortment of miscellaneous PC I/O, and a port to a second CHA chip socket in case you want to build a two-chip system. 

Next, the ML accelerator. Centaur calls it Ncore, and it’s a clean-sheet design organized as a long SIMD pipeline with a limited instruction set and very wide registers, data buses, and memories. Designers may argue over the best way to design an inference engine, but everyone agrees on one thing: machine learning consumes huge gobs of data. 

ML workloads are broadly like those in signal processing or graphics, in that they’re (a) repetitive, (b) highly parallel, and (c) require lots of data transfers. In fact, feeding data to an inference engine is one of the toughest parts, just as it is with many DSP or GPU tasks. 

Unlike with a “normal” RISC or CISC processor, feeding the Ncore pipeline with instructions isn’t all that hard because the code is highly iterative (i.e., it’s mostly loops). In fact, Ncore’s code memory is only 512 words long, and even that is divided into halves, with Ncore executing out of one half while the programmer (that’s you) loads the other half. Its data memory, on the other hand, spans 16MB and consumes about two-thirds of Ncore’s total area. 
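That split code memory is a classic double-buffering (or “ping-pong”) arrangement. The toy model below is only a sketch of the idea: the class, its methods, and run_kernels are all hypothetical names, not Centaur’s programming interface, and the “instructions” are just placeholder strings.

```python
# Toy model of a double-buffered ("ping-pong") instruction memory.
# Every name here is hypothetical; this is not Centaur's actual interface.

WORDS_TOTAL = 512            # instruction words (from the article)
HALF = WORDS_TOTAL // 2      # each half holds 256 words


class PingPongProgramMemory:
    def __init__(self):
        self.banks = [[None] * HALF, [None] * HALF]
        self.active = 0      # index of the bank the accelerator executes from

    def load_inactive(self, words):
        """Host side: fill the bank that is NOT currently executing."""
        if len(words) > HALF:
            raise ValueError(f"kernel needs {len(words)} words, bank holds {HALF}")
        bank = self.banks[1 - self.active]
        bank[:len(words)] = words
        bank[len(words):] = [None] * (HALF - len(words))

    def swap(self):
        """Switch execution to the freshly loaded bank."""
        self.active = 1 - self.active

    def running(self):
        """Return the instruction words held in the active bank."""
        return [w for w in self.banks[self.active] if w is not None]


def run_kernels(kernels):
    """'Execute' kernels in order, loading kernel i+1 while kernel i runs."""
    if not kernels:
        return []
    mem = PingPongProgramMemory()
    executed = []
    mem.load_inactive(kernels[0])
    mem.swap()
    for i in range(len(kernels)):
        if i + 1 < len(kernels):
            # In hardware this load would overlap with execution of kernel i.
            mem.load_inactive(kernels[i + 1])
        executed.append(list(mem.running()))
        mem.swap()
    return executed
```

The point of the arrangement is that the host never has to stall the pipeline to load new code: as long as each kernel fits in one 256-word half, the next one can be staged while the current one runs.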

That data memory is mapped onto the chip’s internal PCI bus during enumeration, so it appears to the x86 processors as a single 16MB block of RAM. All Ncore programming is done through this memory-mapped window; you never access Ncore directly. 

Like many DSPs and GPUs, Ncore is highly parallel and optimized for repetitive MAC (multiply-accumulate) operations working on a steady flow of new data. It can work on 8-bit integers, 16-bit integers, or BFloat16, a popular format for machine learning that trades precision for a wider dynamic range than the “official” IEEE-754 16-bit (half-precision) FP format. 
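To see why BFloat16’s range matters, it helps to remember that the format is simply the top 16 bits of a 32-bit float: one sign bit, eight exponent bits, and seven mantissa bits. The sketch below uses only Python’s standard struct module. The helper names are made up for illustration, and the conversion truncates rather than rounds, which is a simplification; real hardware may well round to nearest.

```python
import struct


def float_to_bfloat16_bits(x: float) -> int:
    """Truncate a float32 to its top 16 bits (1 sign, 8 exponent, 7 mantissa)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def bfloat16_bits_to_float(bits16: int) -> float:
    """Expand a bfloat16 bit pattern by zero-filling the low 16 bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return x


def to_float16(x: float) -> float:
    """Round-trip through IEEE 754 half precision.

    Raises OverflowError when x is outside the half-precision range.
    """
    (y,) = struct.unpack("<e", struct.pack("<e", x))
    return y


if __name__ == "__main__":
    big = 1e10
    print("bfloat16:", bfloat16_bits_to_float(float_to_bfloat16_bits(big)))
    try:
        print("float16: ", to_float16(big))
    except OverflowError as err:
        print("float16:  overflow:", err)
```

A value like ten billion survives the trip through BFloat16 with only a small relative error, while half precision, which tops out at 65,504, cannot represent it at all. The price is resolution: seven mantissa bits give only two or three significant decimal digits.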

Because there’s so much data manipulation in ML workloads, Centaur built special-purpose hardware to rotate, shift, and otherwise massage data as it flows into and out of Ncore’s memory. Certain functions, like bit-wise rotation or replicating bytes to fill out a vector, are done on the fly without explicit coding. Output activation functions like rectified linear unit (ReLU), sigmoid, and hyperbolic tangent are also done by dedicated hardware and don’t add any execution time. 
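For readers who haven’t met these operations, here is what they compute, written out as a plain software reference. This is only a sketch: the function names are illustrative, the rotation moves whole bytes within a vector (one plausible reading of the description above), and nothing here reflects how Ncore’s hardware actually implements them.

```python
import math


def broadcast_byte(value: int, width: int) -> bytes:
    """Replicate one byte across a vector of `width` bytes."""
    return bytes([value & 0xFF]) * width


def rotate_left(vec: bytes, n: int) -> bytes:
    """Rotate a byte vector left by n element positions."""
    if not vec:
        return vec
    n %= len(vec)
    return vec[n:] + vec[:n]


def relu(x: float) -> float:
    """Rectified linear unit: pass positives through, clamp negatives to zero."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x: float) -> float:
    """Hyperbolic tangent, squashing any input into the range -1 to 1."""
    return math.tanh(x)
```

In software, each of these is an extra pass over the data. Folding them into the data path means the results come out of the pipeline already shaped and activated.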

Based on published MLPerf benchmark results, Centaur’s CHA kicks some AI butt. It runs circles around Intel’s best Xeon processors, and it beats out chips from Nvidia and Qualcomm, as well as cloud-based services from Google and Alibaba. It’s not the fastest overall, but Centaur claims it’ll offer the best price/performance. A CHA-based system will also be smaller, since other x86 solutions require external accelerator cards and/or FPGA add-ons. Nobody else offers an x86 processor and ML accelerator all in one package. That might be a big deal to customers creating space-constrained 1U rackmount systems. 

Besides, there’s nothing to prevent you from adding off-chip hardware alongside CHA, since the chip behaves like any other x86 processor. The Ncore coprocessor inside doesn’t preclude the use of additional hardware acceleration outside. 

Downsides? There may be a few. If you look at it as just an x86 server processor, CHA is woefully out of date. Its eight cores are all single-threaded, compared to the dual-threaded processors from Intel and AMD. Its 2.5 GHz clock rate is second-rate, too. Centaur describes its x86 design as a cross between Haswell and Skylake, which is pretty accurate in terms of ISA features, but Haswell is seven years old and Skylake is five years old. By the time CHA starts shipping late this year, Intel will have moved on to Sunny Cove, putting CHA yet another generation behind the state of the art. Designing a multicore x86 processor from scratch with a small team of engineers is a remarkable achievement, but even so, this one’s not close to current performance norms. CHA’s value is in the ML accelerator, not the eight-core x86 chip.

Ancient centaurs were supposed to be gifted in prophecy. Maybe the modern Centaur, with its hybrid CHA, sees something coming that its competitors don’t – yet. 
