
Quadric CPU Combines AI with Conventional Code

Parallel 256-core Machine Handles Both C and TensorFlow

Another day, another new microprocessor architecture. There was a time in the Nineties when everyone and his dog was designing a new processor. They were all going to revolutionize the world, crush Intel, enable new cutting-edge devices, and show us how it’s really supposed to be done. The nerd journals were filled with new CPU acronyms like RISC, VLIW, IPC, EPIC, ROB, ILP, SSE2, TLB, BTB, AES-NI, SIMD, and more. And, of course, IPO. 

Fortunately for us, those days of revolution are over. The majority of those new processor families faded away, the x86 retained its dominance (in some markets, anyway), and many of the innovations that defined the new upstarts were eventually absorbed into the incumbent CPUs. The mad rush to topple the CPU status quo has settled down. 

Well, almost. 

Some of today’s workloads aren’t like yesterday’s workloads. New problems require new solutions. AI and ML have changed the rules of the game. We need “…a unified hardware and software platform that can unlock the power of on-device AI for a wide range of applications at the network edge.” 

So saith Quadric, a 30-person Silicon Valley startup that’s tackling one of the biggest problems of all: getting an entirely new processor architecture off the ground. The company’s unique q16 chip is currently shipping, and its small M.2 development boards are available as well. So, it’s real. But is it really different? 

Quadric’s processor is hard to pigeonhole. It’s touted as an AI accelerator but also as a general-purpose processor. It bears some of the hallmarks of a VLIW machine as well as being nouveau RISC. Quadric calls it a “code-driven architecture” and says it’s “built from the ground up with software in mind.” 

By that, they mean they arrived at a set of problems first, then designed a CPU to solve those problems. Quadric’s founders are all experienced EEs and startup grognards, and they had been working on a machine-vision system that just didn’t have enough throughput. The solution then was to throw more hardware at the problem, but they decided there must be a better way… 

Current CPUs — even relatively modern ones — are designed to do anything, says Quadric co-founder and Chief Product Officer Daniel Firu. “But software has changed a lot [since those CPUs were created]. There are neural nets all over: recommendation engines, self-driving cars, robotics, cameras, etc. The only existing thing that comes close is a GPU. We set out to generalize the data-parallelism problem in compute- and power-constrained environments. We run algorithms that a neural net processor can run, or that an app builder or roboticist would run.” 

Quadric’s CPU designers set out to simplify, not complicate, their processor. They believe in offloading to software anything that doesn’t need to be done in hardware. The q16 chip has no caches, no branch prediction, no speculative execution, and none of the frippery common to today’s CPUs. Instead, the compiler schedules load/store operations between the CPU and memory, as well as data transfers within the chip. Work is allocated among the 256 processor cores at compile time, with only minimal run-time tweaking.
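To see what compile-time work allocation means in practice, here is a minimal Python sketch of static partitioning: every core's slice of a data-parallel job is fixed before the program runs, so no run-time arbitration hardware is needed. This toy model is illustrative only; it is not Quadric's actual toolchain, and the chunking policy is an assumption.

```python
# Toy model of compile-time work partitioning (illustrative only --
# not Quadric's actual scheduler). Given a data-parallel job of N
# elements, assign each of the 256 cores a fixed, statically known
# slice before the program ever runs.

NUM_CORES = 256  # q16 core count, per the article

def static_schedule(num_elements: int, num_cores: int = NUM_CORES):
    """Return one (start, end) half-open range per core, decided
    entirely ahead of time -- no run-time load balancing."""
    base, extra = divmod(num_elements, num_cores)
    schedule = []
    start = 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)
        schedule.append((start, start + size))
        start += size
    return schedule

plan = static_schedule(1000)
print(len(plan))           # one slice per core
print(plan[0], plan[-1])   # first and last core's ranges
```

The point of the sketch is the absence of any run-time decision: the "compiler" has already answered every question the caching and speculation hardware of a conventional CPU exists to answer on the fly.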

“What’s starkly different is the amount of hardware speculation and complex caching algorithms in current CPUs,” says Firu. “There’s all this hardware trying to figure out what the program might want to do. A general CPU can do anything, but if you constrain usage to fewer things, you can get rid of a lot of that stuff.” 

Sounds like RISC all over again. Eliminate hardware bottlenecks and push the responsibility onto the compiler. Firu doesn’t disagree, but he points out that compilers are a lot better now than they were 30 years ago. “LLVM allows us to leverage big-company compiler technology” without the big compiler company. 

Although q16 is designed to excel at parallel algorithms, it’s also a general-purpose CPU. It’s intended to replace both the host processor (x86, ARM, etc.) and the AI accelerator (Tensor, Tegra, Graphcore, etc.) in a system. It’s programmable in C, but it’s equally comfortable with PyTorch and TensorFlow.

That said, Quadric isn’t out to unseat Intel, AMD, ARM, or Nvidia. “We’re ambitious but pragmatic,” says Firu. “We fancy ourselves as a high-performance general-purpose architecture. It’s a bridge between ‘embarrassingly parallel’ and ‘single-threaded.’”

With 256 identical cores arranged in a grid, the q16 chip has ample resources for parallel problems. Each core has its own local memory, plus single-cycle access to its neighbors’ local memories. Cores farther away can be accessed with a time penalty for each hop. Quadric’s compiler statically schedules all 256 cores, as well as their data transactions and interactions. Users can theoretically intervene in this process manually, though there’s little reason to ever do so. 
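The access-cost model described above, single-cycle reach into a neighbor's memory plus a penalty for each additional hop, amounts to Manhattan distance on the core grid. A short sketch, with the caveat that the square 16×16 layout and the one-cycle-per-hop figure are assumptions for illustration, not published q16 numbers:

```python
# Illustrative latency model for the 256-core grid: single-cycle
# access to a neighbor's local memory, plus a penalty for each extra
# hop. The square 16x16 layout and one-cycle-per-hop figure are
# assumptions for illustration, not published q16 specs.

GRID = 16  # 16 x 16 = 256 cores, assuming a square mesh

def core_xy(core_id: int):
    """Map a linear core ID to (x, y) coordinates on the mesh."""
    return core_id % GRID, core_id // GRID

def access_cycles(src: int, dst: int, hop_penalty: int = 1) -> int:
    """Cycles for core `src` to reach core `dst`'s local memory:
    one cycle for itself or an immediate neighbor, plus a penalty
    for each hop beyond that."""
    sx, sy = core_xy(src)
    dx, dy = core_xy(dst)
    hops = abs(sx - dx) + abs(sy - dy)  # Manhattan distance on the mesh
    return 1 + max(hops - 1, 0) * hop_penalty

print(access_cycles(0, 1))    # immediate neighbor: single cycle
print(access_cycles(0, 255))  # opposite corner: 30 hops away
```

A cost model this predictable is what makes static scheduling feasible in the first place: the compiler can place data so that most accesses land in a core's own or a neighbor's memory.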

Each core is Turing-complete, meaning it’s a fully fledged computer that can run any possible program, not just a specialized accelerator. The ISA comprises a hundred or so instructions, including logic functions, multiply-accumulate, and integer arithmetic. Quadric expects its customers will never program the chip in assembly language, relying instead on Quadric’s SDK, which, it says, “…allows the developer to express graph-based and non-graph-based algorithms in unison.”
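Multiply-accumulate earns its place in that instruction list because it is the workhorse of neural-net inference: a dot product, the inner loop of every matrix multiply and convolution, is just a chain of MACs. A sketch in Python, where the `mac` helper is a hypothetical stand-in for a single hardware instruction:

```python
def mac(acc, a, b):
    """One multiply-accumulate step: acc + a * b.
    (Hypothetical stand-in for a single MAC instruction.)"""
    return acc + a * b

def dot(xs, ys):
    """A dot product is a chain of MACs -- the inner loop of every
    matrix multiply and convolution a neural net executes."""
    acc = 0
    for a, b in zip(xs, ys):
        acc = mac(acc, a, b)
    return acc

print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```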

Historically, the biggest problem with creating a new processor is not building the chip; it’s building the software ecosystem around it. There’s zero installed base of software, and potential users have zero experience with it. Firu says that’s less of a problem for the q16, for two reasons. First, customers’ unique IP is no longer tied up in C code; it lives in ML graphs, and those are portable via PyTorch or TensorFlow. Second, Quadric has developed a C++ API and libraries to ease “normal” code development.
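The portability argument rests on the model being data rather than code: an ML graph is a list of operators and tensors, and any backend that understands the operators can execute it. A toy illustration of the idea; this mini-format is invented for the example, whereas a real flow would export the graph from PyTorch or TensorFlow:

```python
# Toy illustration of why IP-as-a-graph is portable: the model is
# data (a list of ops), not target-specific code, so any backend
# that understands the ops can execute it unchanged. This format is
# invented for the example; real flows export from PyTorch or
# TensorFlow.

graph = [
    ("mul",  "x", "w",  "t"),  # t = x * w (elementwise)
    ("relu", "t", None, "y"),  # y = relu(t)
]

def run(graph, tensors):
    """A trivial interpreter -- one possible 'backend' for the graph.
    A different backend (say, a compiler for an accelerator) could
    consume the very same graph."""
    for op, a, b, out in graph:
        if op == "mul":
            tensors[out] = [p * q for p, q in zip(tensors[a], tensors[b])]
        elif op == "relu":
            tensors[out] = [max(v, 0) for v in tensors[a]]
    return tensors[graph[-1][-1]]  # output of the final op

print(run(graph, {"x": [1, -2, 3], "w": [2, 2, 2]}))  # [2, 0, 6]
```

Porting the "program" to a new chip means writing a new backend, not rewriting the customer's model.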

Right now, Quadric is a fabless chip company selling development boards and silicon. But in the future, the company may make the switch to licensing its IP. Firu says that fully 60% of their current customers (a small sample size, admittedly) are interested in licensing the processor IP; the other 40% want silicon. “We may evolve into a pure IP play,” he says. If so, where does that leave chip customers? Will the q16 be an only child? 

Not a chance, says Quadric. Regardless of business model, the company will always produce at least one chip per CPU generation as a “showpiece” demonstrator for the architecture. Like SiFive, Quadric may get most of its revenue from licensing, but with chips on the side.

Processor innovation isn’t dead; it just took a pause. If Quadric is right that workloads have changed but the need for a do-it-all CPU hasn’t, the company may be at the start of something big.

Note: This is my last article for Electronic Engineering Journal. After 15 years and 700 articles, it’s time for me to bow out and retire before anyone realizes I don’t know what I’m doing. My thanks go out to the entire crew at Techfocus Media, and to the readers of EEJ for keeping us honest and involved. It’s been a great ride.
