
Centaur Produces Nine-Headed, Two-Species Monster

New Server Chips Combine AI Acceleration with x86 Familiarity

“A great man is… an inspiration and a prophecy.” – Robert Green Ingersoll 

If a centaur is supposed to be a combination of two separate beasts, then nothing could make a more appropriate mascot for Centaur Technology’s latest microprocessor. The tiny Austin-based company has created a dual-DNA, nine-headed creature that runs both x86 server code and hugely accelerated machine-learning (ML) processes. It’s a match made in mythology. 

The new chip is called simply CHA (which doesn’t stand for anything), and it’s probably due out sometime late this year. We’re not sure, because sightings of the elusive beast are rare, although confirmed. It’s real; it’s just not reproducing yet in any great numbers. 

What CHA does, exactly, would confuse even Janus. On one side, it’s a fairly conventional x86 server processor with eight CPU cores running at 2.5 GHz. But on the other, it’s a massive 20-trillion-operations-per-second (20 TOPS) machine-learning processor that can outrun far more expensive hardware on MLPerf benchmarks. Creaky old CISC architecture, meet your new workload. 

For those still brushing up on their microprocessor mythology, Centaur Technology has been around longer than many of us. The company was formed back in 1995 during the x86 clone wars. Like NexGen, Rise, Cyrix, Transmeta, and others too numerous to mention, Centaur thought it could produce clean-room x86 designs that were fully compatible with Intel’s (and AMD’s) processors, hoping to chip off a tiny sliver of the obscenely profitable PC processor market. The difference is, Centaur was right. Its chips worked as advertised, and the company has been cranking out newer and better x86 chips for 25 years. For most of that time, the company has been wholly owned by VIA Technologies in Taiwan, but it effectively still runs as a small and independent startup. 

Now the company has taken on a new challenge: accelerating ML inference code. It’s still early days for ML hardware – not unlike the x86 market in the mid-90s – so there’s no consensus on how (or why) to build ML processors. Centaur chose a dual-pronged approach by combining the familiar x86 architecture with a completely new and original SIMD architecture boasting 128-bit instructions and massive 4096-byte (not bits) data paths. It’s two processors in one, intended for enterprise-level installations that are space- and cost-constrained. It’s neither the fastest x86 processor you can buy nor the fastest ML accelerator, but it is apparently the only device to combine the two. 
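Those two numbers are enough to sanity-check the headline figure. A back-of-the-envelope sketch, assuming Ncore retires one full 4096-lane vector of 8-bit MACs per cycle and shares the CPU cores' 2.5 GHz clock (both are assumptions, not published specs):

#include <stdio.h>

int main(void) {
    /* Assumptions (not confirmed by Centaur): one 4096-byte vector of
       8-bit MACs retired per cycle, at the same 2.5 GHz as the x86 cores. */
    const double lanes = 4096;       /* 4096-byte data path, 1 byte/lane */
    const double ops_per_mac = 2;    /* a MAC counts as multiply + add */
    const double clock_hz = 2.5e9;

    double tops = lanes * ops_per_mac * clock_hz / 1e12;
    printf("Peak throughput: %.1f TOPS\n", tops);   /* ~20.5 TOPS */
    return 0;
}

Under those assumptions the arithmetic lands right on Centaur's 20-TOPS claim, which suggests the number is a straightforward peak rate rather than a measured benchmark figure.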

First, the x86 part. This half of the chip (which, in reality, consumes about two-thirds of the CHA chip’s silicon die area) sports eight identical x86 processor cores, each with its own private L1 and L2 caches, plus 16MB of shared L3 cache. All eight cores support Intel’s AVX-512 extensions for vector operations. The chip also includes north/southbridge features like four DDR4 channels, 44 PCIe 3.0 lanes, the usual assortment of miscellaneous PC I/O, and a port to a second CHA chip socket in case you want to build a two-chip system. 
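Because CHA reports its capabilities through the usual CPUID leaves, standard feature-detection code should treat it like any other AVX-512-capable x86 part. A minimal sketch using GCC's runtime builtin (that CHA enumerates the AVX-512 Foundation subset exactly as Intel parts do is an assumption here):

#include <stdio.h>

int main(void) {
    /* GCC/Clang builtin that inspects CPUID feature flags at runtime.
       On a CHA system this should report AVX-512 Foundation support,
       assuming the chip enumerates it the same way Intel parts do. */
    if (__builtin_cpu_supports("avx512f"))
        printf("AVX-512F available: wide vector path enabled\n");
    else
        printf("AVX-512F not reported; falling back to AVX2/SSE\n");
    return 0;
}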

Next, the ML accelerator. Centaur calls it Ncore, and it’s a clean-sheet design organized as a long SIMD pipeline with a limited instruction set and very wide registers, data buses, and memories. Designers may argue over the best way to design an inference engine, but everyone agrees on one thing: machine learning consumes huge gobs of data. 

ML workloads are broadly like those in signal processing or graphics, in that they’re (a) repetitive, (b) highly parallel, and (c) require lots of data transfers. In fact, feeding data to an inference engine is one of the toughest parts, just as it is with many DSP or GPU tasks. 

Unlike a “normal” RISC or CISC processor, feeding the Ncore pipeline with instructions isn’t all that hard because the code is highly iterative (i.e., it’s mostly loops). In fact, Ncore’s code memory is only 512 words long, and even that is divided into halves, with Ncore executing out of one half while the programmer (that’s you) loads the other half. Its data memory, on the other hand, spans 16MB and consumes about two-thirds of Ncore’s total area. 
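Centaur hasn't published its programming interface, but the split code memory implies a classic ping-pong scheme: execute out of one 256-word bank while the host fills the other. A hypothetical sketch of what that loop might look like (every ncore_* name below is invented for illustration, not Centaur's actual API):

#include <stddef.h>
#include <stdint.h>

#define NCORE_CODE_WORDS 512
#define HALF             (NCORE_CODE_WORDS / 2)

/* Hypothetical driver hooks -- Centaur's real API is not public. */
extern void ncore_load_code(size_t offset, const uint64_t *words, size_t n);
extern void ncore_run(size_t start_offset);   /* asynchronous kick-off */
extern void ncore_wait_idle(void);

/* Stream a program larger than 512 words through the split code RAM:
   while bank N executes, the host preloads bank N^1 with the next chunk. */
void ncore_stream_program(const uint64_t *prog, size_t n_words) {
    size_t done = 0, bank = 0;
    size_t chunk = (n_words < HALF) ? n_words : HALF;
    ncore_load_code(bank * HALF, prog, chunk);      /* prime first bank */
    while (done < n_words) {
        ncore_run(bank * HALF);
        done += chunk;
        if (done < n_words) {                       /* preload idle bank */
            chunk = (n_words - done < HALF) ? n_words - done : HALF;
            ncore_load_code((bank ^ 1) * HALF, prog + done, chunk);
        }
        ncore_wait_idle();                          /* drain running bank */
        bank ^= 1;
    }
}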

That data memory is mapped onto the chip’s internal PCI bus during enumeration, so it appears to the x86 processors as a single 16MB block of RAM. All Ncore programming is done through this memory-mapped window; you never access Ncore directly. 
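From the host's point of view, then, driving Ncore looks like writing into an ordinary PCI memory BAR. On Linux, that window could plausibly be reached through sysfs; a hedged sketch (the device address 0000:02:00.0 and the idea that resource0 is Ncore's 16MB window are assumptions, not documented facts):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NCORE_WINDOW (16 * 1024 * 1024)   /* 16MB block, per Centaur */

int main(void) {
    /* Hypothetical PCI address; the real one would come from lspci. */
    const char *bar = "/sys/bus/pci/devices/0000:02:00.0/resource0";
    int fd = open(bar, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint8_t *ncore = mmap(NULL, NCORE_WINDOW,
                                   PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (ncore == MAP_FAILED) { perror("mmap"); return 1; }

    /* Weights, activations, even Ncore's code: all of it is just a
       store into this window at some device-defined offset. */
    ncore[0] = 0x42;

    munmap((void *)ncore, NCORE_WINDOW);
    close(fd);
    return 0;
}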

Like many DSPs and GPUs, Ncore is highly parallel and optimized for repetitive MAC (multiply-accumulate) operations working on a steady flow of new data. It can work on 8-bit integers, 16-bit integers, or BFloat16, a popular format for machine learning that has a wider dynamic range than the “official” IEEE-754 half-precision (FP16) format. 
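The trick behind BFloat16 is that it's simply the top 16 bits of an IEEE FP32 value: it keeps FP32's full 8-bit exponent (hence the dynamic range) and pays for it with a stubby 7-bit mantissa. A minimal sketch of the conversion, using the standard round-to-nearest-even technique:

#include <stdint.h>
#include <string.h>

/* BFloat16 = the upper 16 bits of an IEEE-754 float: same sign bit
   and 8-bit exponent, but only 7 mantissa bits instead of 23.
   (NaN special-casing omitted for brevity.) */
uint16_t float_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    /* Round to nearest even before truncating the low 16 bits. */
    uint32_t rounding = 0x7FFF + ((bits >> 16) & 1);
    return (uint16_t)((bits + rounding) >> 16);
}

float bf16_to_float(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;   /* low mantissa bits become 0 */
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}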

Because there’s so much data manipulation in ML workloads, Centaur built special-purpose hardware to rotate, shift, and otherwise massage data as it flows into and out of Ncore’s memory. Certain functions, like bit-wise rotation or replicating bytes to fill out a vector, are done on the fly without explicit coding. Output activation functions like rectified linear unit (ReLU), sigmoid, and hyperbolic tangent are also handled by dedicated hardware, adding no extra execution time. 
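For reference, here's what those three activations compute; the point is that Ncore evaluates the equivalents in fixed-function hardware rather than burning pipeline cycles on them. Plain-C reference versions, one element at a time (compile with -lm):

#include <math.h>

/* Software reference for the activations CHA's output hardware
   evaluates inline. */
float relu_f(float x)    { return x > 0.0f ? x : 0.0f; }
float sigmoid_f(float x) { return 1.0f / (1.0f + expf(-x)); }
float tanh_act_f(float x){ return tanhf(x); }   /* straight from libm */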

Based on published MLPerf benchmark results, Centaur’s CHA kicks some AI butt. It runs circles around Intel’s best Xeon processors, and it beats out chips from Nvidia, Qualcomm, and cloud-based services from Google and Alibaba. It’s not the fastest overall, but Centaur claims it’ll offer the best price/performance. A CHA-based system will also be smaller, since other x86 solutions require external accelerator cards and/or FPGA add-ons. Nobody else offers an x86 processor and ML accelerator all in one package. That might be a big deal to customers creating space-constrained 1U rackmount systems. 

Besides, there’s nothing to prevent you from adding off-chip hardware alongside CHA, since the chip behaves like any other x86 processor. The Ncore coprocessor inside doesn’t preclude the use of additional hardware acceleration outside. 

Downsides? There may be a few. If you look at it as just an x86 server processor, CHA is woefully out of date. Its eight cores are all single-threaded, compared to the dual-threaded processors from Intel and AMD. Its 2.5 GHz clock rate is second-rate, too. Centaur describes its x86 design as a cross between Haswell and Skylake, which is pretty accurate in terms of ISA features, but Haswell is seven years old and Skylake is five years old. By the time CHA starts shipping late this year, Intel will have moved on to Sunny Cove, putting CHA yet another generation behind the state of the art. Designing a multicore x86 processor from scratch with a small team of engineers is a remarkable achievement, but even so, this one’s not close to current performance norms. CHA’s value is in the ML accelerator, not the eight-core x86 chip.

Ancient centaurs were supposed to be gifted in prophecy. Maybe the modern Centaur, with its hybrid CHA, sees something coming that its competitors don’t – yet. 

