
Skew This!

Azuro Touts Clock Concurrent Optimization for Aggressive Nodes

The concept of clocking a register is pretty simple. It’s Logic Design 101 stuff. Having an entire system controlled by a uniform clock makes tractable what would otherwise be an intractable problem. It’s like adding traffic lights downtown to keep traffic from getting completely chaotic.

A whole discipline has grown out of this very basic concept: that of synchronous design. An entire ecosystem of tools and techniques has been built around some very fundamental assumptions of how to design such circuits. And as the circuits have gotten bigger, clock tree synthesis (CTS) has become an art in its own right. But, according to the team at Azuro, the foundation on which it’s built has some cracks in it, and that foundation needs to be replaced.

The nitty-gritty details of clock timing can get a bit cumbersome if you want to spell things out in a manner that will withstand the objections of a clock specialist, but for the rest of us, things come down to some relatively straightforward concepts. The traditional approach involves 1) designing a logic path that will provide the timing needed to achieve performance and then 2) creating a clock network. We rely on the elimination of clock skew to make this possible.

What that means is that we try to create a clock tree such that the paths from the clock source to all registers have the same delay – that is, they have zero skew. That sometimes means adding delays here or there to balance the network, which is OK. As long as the paths are equal, the circuit will work.
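To make the balancing idea concrete, here is a minimal sketch in Python. The register names and delay numbers are invented for illustration, and the model assumes each clock branch can be padded with an arbitrary amount of extra delay; real clock tree synthesis works with discrete buffers and wire delays.

```python
# Hypothetical clock arrival times, in picoseconds, from the clock source
# to each register's clock pin. Names and numbers are made up.
clock_arrival_ps = {"reg_a": 1200, "reg_b": 1200, "reg_c": 1350}


def global_skew(arrivals):
    """Largest difference in clock arrival time between any two registers."""
    return max(arrivals.values()) - min(arrivals.values())


def padding_to_balance(arrivals):
    """Extra delay to add on each branch so that every register sees the
    clock at the same time as the slowest branch (zero skew)."""
    target = max(arrivals.values())
    return {reg: target - t for reg, t in arrivals.items()}


if __name__ == "__main__":
    print("skew before balancing:", global_skew(clock_arrival_ps), "ps")
    padding = padding_to_balance(clock_arrival_ps)
    balanced = {reg: t + padding[reg] for reg, t in clock_arrival_ps.items()}
    print("padding per branch:", padding)
    print("skew after balancing:", global_skew(balanced), "ps")
```

Note that balancing only ever adds delay: the fast branches are slowed down to match the slowest one.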

But Azuro argues that as we approach small dimensions (starting around the 65-nm node), this idealization of zero skew must necessarily fail for three reasons.

One reason is the increased use of clock gating. A gate adds a delay in the clock path, and a gated path will be different from an ungated path. One way of addressing this is to push all clock gates to the bottom of the tree so that the clock paths are equal for as long as possible, diverging only at the very end, thereby keeping the skews minimal. But this means creating many more gates than are actually necessary, with accompanying complicated logic to control them. The result is greater area used and, ironically, greater power consumption – ironic because the whole purpose of clock gating is to reduce power. So not pushing the gates to the bottom of the network means that you’re introducing skew.

A second reason is simply the complexity of advanced SoCs. Start with the fact that IP blocks are inserted into the chips intact, with their pre-designed clock networks attached to the SoC clock network. Add test clocking schemes for accelerating or simplifying test sequences: these require clock muxes here and there that interfere with the normal clock networks. Add such further complicating factors as multiple clock domains and adaptive capabilities like dynamic voltage and/or frequency scaling (DVFS), and you’ve pretty much eliminated any pretense at creating a balanced clock network.

But overlaying both of these is one much more fundamental obstacle to ever having a balanced network: manufacturing variations that become more significant with each process node. Even if you have what looks like a simple, perfectly designed clock tree, on any given die, on any given day, different paths will have different delays. The variation is unpredictable and unsystematic, and it changes with each die. And it has become significant enough that you can never margin your way out of it.

Add this variation on top of the clock gating and other complexities, each of which will itself have variation, and you find yourself unable to assume anything about skew.

Robbing Peter to pay Paul

With traditional clock design, a given registered path or pipeline will have a critical path – the one logic path along the chain of registers that acts as the rate-limiting step and determines the clock frequency. Designers try to get the critical path short enough that they can guarantee the required performance and then insert a balanced clock tree.

Azuro argues that, since a balanced tree is no longer possible, a different way of designing is required. Instead of doing separate logic and clock path design steps, the logic and clock paths must be optimized concurrently – hence the moniker “clock concurrent optimization.” But here’s the deal: the clock paths can now be tweaked a bit by tweaking the logic paths. This means that along a “chain” the timing of various stages may be different.

But there’s no free lunch here. If the logic path delays are monkeyed with, you’re essentially borrowing time from a prior or later stage, so someone has to pay up at the end: the overall chain must be in balance. No borrowing from the lottery, no borrowing from the education fund, no borrowing from Social Security. If you borrow more time than is available in the overall chain, you lose. The chain must end up with net positive slack overall after all the borrowing and repaying have been done.

Because each stage of a chain may have a different delay, you lose the concept of the critical path. Instead, it’s replaced with the concept of the critical chain: this is the chain with the least overall net positive slack, and it is the chain that determines the max frequency for the clock driving that chain and any others in the same network.
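The borrowing-and-repaying bookkeeping can be sketched in a few lines of Python. This is a simplified textbook-style model, not Azuro's algorithm: it assumes the clock arrival times at the two ends of a chain are fixed, ignores setup and hold times, and places no limit on how far a clock edge can be shifted. All names and numbers are hypothetical.

```python
def stage_slacks(period, logic_delays, clock_offsets=None):
    """Slack of each stage. clock_offsets[i] is how much later than nominal
    the clock reaches register i; a chain of n stages has n + 1 registers."""
    n = len(logic_delays)
    if clock_offsets is None:
        clock_offsets = [0] * (n + 1)  # zero skew: every register in step
    return [
        period + clock_offsets[i + 1] - clock_offsets[i] - logic_delays[i]
        for i in range(n)
    ]


def chain_net_slack(period, logic_delays):
    """Total slack available to the whole chain when its two end registers
    keep their nominal clock timing. Borrowing cannot change this sum."""
    return len(logic_delays) * period - sum(logic_delays)


def offsets_that_even_out_slack(period, logic_delays):
    """Clock offsets (both ends pinned at 0) that give every stage an
    equal share of the chain's net slack."""
    share = chain_net_slack(period, logic_delays) / len(logic_delays)
    offsets = [0]
    for delay in logic_delays:
        offsets.append(offsets[-1] + delay + share - period)
    return offsets


def critical_chain(period, chains):
    """Name of the chain with the least net slack."""
    return min(chains, key=lambda name: chain_net_slack(period, chains[name]))


if __name__ == "__main__":
    period = 1000                # ps, hypothetical target clock period
    delays = [1200, 800, 850]    # ps, hypothetical logic delay per stage
    print(stage_slacks(period, delays))      # first stage fails with zero skew
    offsets = offsets_that_even_out_slack(period, delays)
    print(offsets)
    print(stage_slacks(period, delays, offsets))
```

In this made-up chain, a zero-skew clock would need a 1200 ps period to satisfy the slowest stage. With borrowing, the same chain closes at 1000 ps, because the 200 ps the first stage is short gets repaid by the two stages that follow it, leaving 150 ps of net slack overall.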

Of course, it’s one thing to come up with a clean theoretical new foundation to replace the old cracked foundation; it’s quite something else to implement a commercial tool that really works in the real world. Earlier this year, Azuro launched Rubix, which they claim is the first tool to combine the optimization of the logic and clock paths. They boast results as much as 20% faster than using traditional clock synthesis, but, perhaps more significantly, they also claim much faster design completion. This is due to the elimination of numerous iterations of the optimize-logic/build-clock-tree/hope-they-converge loop. If the predictions that the traditional flow completely breaks down around the 32-nm node are true, then it means not just faster time to market, but, in fact, getting to market versus not.

If accurate, that’s a pretty powerful promise. Maybe even enough to make you stop skewing around with the old way of doing things.

Link: Azuro Rubix
