Centaur Produces Nine-Headed, Two-Species Monster

New Server Chips Combine AI Acceleration with x86 Familiarity

“A great man is… an inspiration and a prophecy.” – Robert Green Ingersoll 

If a centaur is supposed to be a combination of two separate beasts, then nothing could make a more appropriate mascot for Centaur Technology’s latest microprocessor. The tiny Austin-based company has created a dual-DNA, nine-headed creature that runs both x86 server code and hugely accelerated machine-learning (ML) processes. It’s a match made in mythology. 

The new chip is called simply CHA (which doesn’t stand for anything), and it’s probably due out sometime late this year. We’re not sure, because sightings of the elusive beast are rare, although confirmed. It’s real; it’s just not reproducing yet in any great numbers. 

What CHA does, exactly, would confuse even Janus. On one side, it’s a fairly conventional x86 server processor with eight CPU cores running at 2.5 GHz. But on the other, it’s a massive 20-trillion-operations-per-second (20 TOPS) machine-learning processor that can outrun far more expensive hardware on MLPerf benchmarks. Creaky old CISC architecture, meet your new workload. 

For those still brushing up on their microprocessor mythology, Centaur Technology has been around longer than many of us. The company was formed back in 1995 during the x86 clone wars. Like NexGen, Rise, Cyrix, Transmeta, and others too numerous to mention, Centaur thought it could produce clean-room x86 designs that were fully compatible with Intel’s (and AMD’s) processors, hoping to chip off a tiny sliver of the obscenely profitable PC processor market. The difference is, Centaur was right. Its chips worked as advertised, and the company has been cranking out newer and better x86 chips for 25 years. For most of that time, the company has been wholly owned by VIA Technologies in Taiwan, but it effectively still runs as a small, independent startup.

Now the company has taken on a new challenge: accelerating ML inference code. It’s still early days for ML hardware – not unlike the x86 market in the mid-’90s – so there’s no consensus on how (or why) to build ML processors. Centaur chose a two-pronged approach, combining the familiar x86 architecture with a completely new SIMD architecture boasting 128-bit instructions and massive 4096-byte (not -bit) data paths. It’s two processors in one, intended for enterprise-level installations that are space- and cost-constrained. It’s neither the fastest x86 processor you can buy nor the fastest ML accelerator, but it is apparently the only device to combine the two.
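As a back-of-the-envelope check on that 20 TOPS figure: if the MAC array is as wide as the 4096-byte data path and runs at the same 2.5 GHz as the CPU cores (our assumption, not Centaur’s statement), then 4,096 8-bit multiply-accumulates per cycle × 2 operations per MAC × 2.5 GHz works out to about 20.5 trillion operations per second. The arithmetic fits.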

First, the x86 part. This half of the chip (which, in reality, consumes about two-thirds of CHA’s silicon die area) sports eight identical x86 processor cores, each with its own private L1 and L2 caches, plus 16MB of shared L3 cache. All eight cores support Intel’s AVX-512 extensions for vector operations. The chip also includes north/southbridge features: four DDR4 channels, 44 PCIe 3.0 lanes, the usual assortment of miscellaneous PC I/O, and a port to a second CHA socket in case you want to build a two-chip system.
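For a feel of what those AVX-512 extensions buy you, here’s a minimal sketch (ours, not Centaur’s) of a fused multiply-add loop that processes 16 single-precision floats per iteration; any AVX-512-capable x86 core should run it:

```c
#include <immintrin.h>  // AVX-512 intrinsics

// y[i] += a * x[i], 16 floats at a time.
// Assumes n is a multiple of 16 and both pointers are 64-byte aligned.
void saxpy_avx512(float a, const float *x, float *y, int n)
{
    __m512 va = _mm512_set1_ps(a);          // broadcast the scalar a
    for (int i = 0; i < n; i += 16) {
        __m512 vx = _mm512_load_ps(x + i);  // load 16 floats from x
        __m512 vy = _mm512_load_ps(y + i);  // load 16 floats from y
        vy = _mm512_fmadd_ps(va, vx, vy);   // vy = va * vx + vy
        _mm512_store_ps(y + i, vy);         // write back
    }
}
```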

Next, the ML accelerator. Centaur calls it Ncore, and it’s a clean-sheet design organized as a long SIMD pipeline with a limited instruction set and very wide registers, data buses, and memories. Designers may argue over the best way to design an inference engine, but everyone agrees on one thing: machine learning consumes huge gobs of data. 

ML workloads are broadly like those in signal processing or graphics, in that they’re (a) repetitive, (b) highly parallel, and (c) data-hungry. In fact, feeding data to an inference engine is one of the toughest parts of the job, just as it is with many DSP or GPU tasks.

Unlike with a “normal” RISC or CISC processor, feeding the Ncore pipeline with instructions isn’t all that hard, because the code is highly iterative (i.e., it’s mostly loops). In fact, Ncore’s code memory is only 512 words long, and even that is divided into halves, with Ncore executing out of one half while the programmer (that’s you) loads the other. Its data memory, on the other hand, spans 16MB and consumes about two-thirds of Ncore’s total area.
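That split suggests a simple ping-pong discipline: refill one half of the code RAM while the other half executes. The sketch below shows the idea in C; the ncore_* driver calls are hypothetical names we invented for illustration, not Centaur’s actual interface:

```c
#include <stdint.h>

enum { HALF_WORDS = 256 };  // 512-word code RAM, split into two halves

// Hypothetical driver hooks, invented for illustration only.
void ncore_wait(int half);                                    // block until half is idle
void ncore_load_half(int half, const uint32_t *code, int n);  // copy n words in
void ncore_start(int half);                                   // begin executing half

void run_kernel(const uint32_t *code, int total_words)
{
    int half = 0;
    for (int off = 0; off < total_words; off += HALF_WORDS) {
        int n = total_words - off;
        if (n > HALF_WORDS) n = HALF_WORDS;
        ncore_wait(half);                      // this half must be idle
        ncore_load_half(half, code + off, n);  // refill it...
        ncore_start(half);                     // ...and run it while the
        half ^= 1;                             // other half gets loaded next
    }
}
```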

That data memory is mapped onto the chip’s internal PCI bus during enumeration, so it appears to the x86 processors as a single 16MB block of RAM. All Ncore programming is done through this memory-mapped window; you never access Ncore directly. 
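On Linux, reaching that window looks like mapping any other PCI BAR. A minimal sketch, with a made-up device address (the real one depends on how your system enumerates the chip):

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define NCORE_WIN_SIZE (16u << 20)  // the 16MB memory-mapped window

// Map the window via sysfs. The PCI address below is hypothetical.
volatile uint8_t *map_ncore(void)
{
    int fd = open("/sys/bus/pci/devices/0000:00:0a.0/resource0", O_RDWR);
    if (fd < 0)
        return NULL;
    void *win = mmap(NULL, NCORE_WIN_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);  // the mapping outlives the file descriptor
    return (win == MAP_FAILED) ? NULL : (volatile uint8_t *)win;
}
```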

Like many DSPs and GPUs, Ncore is highly parallel and optimized for repetitive MAC (multiply-accumulate) operations working on a steady flow of new data. It can work on 8-bit integers, 16-bit integers, or BFloat16, a popular machine-learning format that trades mantissa precision for a far wider dynamic range than IEEE-754’s same-size half-precision (FP16) format.
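BFloat16 is nothing more than an IEEE-754 float32 with the bottom 16 mantissa bits dropped: 1 sign bit, 8 exponent bits (the same range as float32), and 7 mantissa bits, versus only 5 exponent bits in IEEE half precision. That makes conversion almost free, as this sketch shows (truncating rather than rounding, for brevity; real hardware usually rounds to nearest even):

```c
#include <stdint.h>
#include <string.h>

// float32 -> bfloat16: keep the sign, exponent, and top 7 mantissa bits.
static uint16_t float_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  // type-pun without aliasing trouble
    return (uint16_t)(bits >> 16);
}

// bfloat16 -> float32: pad the mantissa back out with zeros.
static float bf16_to_float(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```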

Because there’s so much data manipulation in ML workloads, Centaur built special-purpose hardware to rotate, shift, and otherwise massage data as it flows into and out of Ncore’s memory. Certain functions, like bit-wise rotation or replicating bytes to fill out a vector, happen on the fly without explicit coding. Output activation functions like the rectified linear unit (ReLU), sigmoid, and hyperbolic tangent are likewise handled by dedicated hardware and add no execution time.
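For reference, here’s what those “free” activation functions cost in software: one function evaluation per output element on a conventional CPU, whereas Ncore folds them into the pipeline:

```c
#include <math.h>

// Software reference for the activations Ncore applies in hardware.
static float relu(float x)     { return x > 0.0f ? x : 0.0f; }       // max(0, x)
static float sigmoid(float x)  { return 1.0f / (1.0f + expf(-x)); }  // squashes to (0, 1)
static float hypertan(float x) { return tanhf(x); }                  // squashes to (-1, 1)
```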

Based on published MLPerf benchmark results, Centaur’s CHA kicks some AI butt. It runs circles around Intel’s best Xeon processors, and it beats out chips from Nvidia and Qualcomm as well as cloud-based services from Google and Alibaba. It’s not the fastest overall, but Centaur claims it’ll offer the best price/performance. A CHA-based system will also be smaller, since other x86 solutions require external accelerator cards and/or FPGA add-ons. Nobody else offers an x86 processor and ML accelerator all in one package. That might be a big deal to customers creating space-constrained 1U rackmount systems.

Besides, there’s nothing to prevent you from adding off-chip hardware alongside CHA, since the chip behaves like any other x86 processor. The Ncore coprocessor inside doesn’t preclude the use of additional hardware acceleration outside. 

Downsides? There may be a few. If you look at it as just an x86 server processor, CHA is woefully out of date. Its eight cores are all single-threaded, compared with the dual-threaded cores of Intel’s and AMD’s processors. Its 2.5 GHz clock rate is second-rate, too. Centaur describes its x86 design as a cross between Haswell and Skylake, which is pretty accurate in terms of ISA features, but Haswell is seven years old and Skylake is five. By the time CHA starts shipping late this year, Intel will have moved on to Sunny Cove, putting CHA yet another generation behind the state of the art. Designing a multicore x86 processor from scratch with a small team of engineers is a remarkable achievement, but even so, this one’s not close to current performance norms. CHA’s value is in the ML accelerator, not the eight-core x86 chip.

Ancient centaurs were supposed to be gifted in prophecy. Maybe the modern Centaur, with its hybrid CHA, sees something coming that its competitors don’t – yet. 
