Skew This!

Azuro Touts Clock Concurrent Optimization for Aggressive Nodes

The concept of clocking a register is pretty simple. It’s Logic Design 101 stuff. Having an entire system controlled by a uniform clock makes tractable what would otherwise be an intractable problem. It’s like adding traffic lights downtown to keep traffic from getting completely chaotic.

A whole discipline has grown out of this very basic concept: that of synchronous design. An entire ecosystem of tools and techniques has been built around some very fundamental assumptions of how to design such circuits. And as the circuits have gotten bigger, clock tree synthesis (CTS) has become an art in its own right. But, according to the team at Azuro, the foundation on which it’s built has some cracks in it, and that foundation needs to be replaced.

The nitty-gritty details of clock timing can get a bit cumbersome if you want to spell things out in a manner that will withstand the objections of a clock specialist, but for the rest of us, things come down to some relatively straightforward concepts. The way we do things traditionally involves 1) designing a logic path that will provide the timing needed to achieve performance and then 2) creating a clock network. We rely on the elimination of clock skew to make this possible.

What that means is that we try to create a clock tree such that the paths from the clock source to all registers have the same delay – they have zero skew. That means sometimes adding delays here or there to balance the network, which is ok. As long as the paths are equal, the circuit will work.
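To make the balancing idea concrete, here is a minimal sketch in Python. The register names and delay figures are invented for illustration, and the functions `clock_skew` and `balance` are hypothetical helpers, not part of any CTS tool. Skew is taken as the gap between the latest and earliest clock arrival, and balancing pads every faster path until it matches the slowest one.

```python
def clock_skew(arrivals):
    """Skew = latest clock arrival minus earliest clock arrival (ns)."""
    return max(arrivals.values()) - min(arrivals.values())


def balance(arrivals):
    """Padding delay (ns) to add to each clock path so all match the slowest."""
    latest = max(arrivals.values())
    return {reg: latest - t for reg, t in arrivals.items()}


# Hypothetical clock-source-to-register delays, in nanoseconds.
arrivals = {"reg_a": 1.25, "reg_b": 1.5, "reg_c": 1.0}

padding = balance(arrivals)
balanced = {reg: arrivals[reg] + padding[reg] for reg in arrivals}

print("skew before:", clock_skew(arrivals))   # 0.5
print("padding:", padding)
print("skew after:", clock_skew(balanced))    # 0.0
```

Note that balancing only ever adds delay: the whole tree is slowed to the pace of its slowest branch.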

But Azuro argues that as we approach small dimensions (starting around the 65-nm node), this idealization of zero skew must necessarily fail for three reasons.

One reason is the increased use of clock gating. A gate adds a delay in the clock path, and a gated path will be different from an ungated path. One way of addressing this is to push all clock gates to the bottom of the tree so that the clock paths are equal for as long as possible, diverging only at the very end, thereby keeping the skews minimal. But this means creating many more gates than are actually necessary, with accompanying complicated logic to control them. The result is greater area used and, ironically, greater power consumption – ironic because the whole purpose of clock gating is to reduce power. So not pushing the gates to the bottom of the network means that you’re introducing skew.

A second reason is simply the complexity of advanced SoCs. Start with the fact that IP blocks are inserted into the chips intact, with their pre-designed clock networks attached to the SoC clock network. Add test clocking schemes for accelerating or simplifying test sequences: these require clock muxes here and there that interfere with the normal clock networks. Add such further complicating factors as multiple clock domains and adaptive capabilities like dynamic voltage and/or frequency scaling (DVFS), and you’ve pretty much eliminated any pretense at creating a balanced clock network.

But overlaying both of these is one much more fundamental obstacle to ever having a balanced network: manufacturing variations that become more significant with each process node. Even if you have what looks like a simple, perfectly-designed clock tree, on any given die, on any given day, different paths will have different delays, and it’s unpredictable and it’s unsystematic and it will change with each die. And the variation has become significant enough that you can never margin your way out of it.

Add this variation on top of the clock gating and other complexities, each of which will itself have variation, and you find yourself unable to assume anything about skew.

Robbing Peter to pay Paul

With traditional clock design, a given registered path or pipeline will have a critical path – the one logic path along the chain of registers that acts as the rate-limiting step and determines the clock frequency. Designers try to get the critical path short enough that they can guarantee the required performance and then insert a balanced clock tree.
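As a rough illustration of the traditional view, the sketch below picks out the critical path from a set of made-up stage delays and derives the clock frequency it allows. It assumes an ideal zero-skew clock and ignores register setup time and clock-to-output delay; the stage names and the helpers `critical_path` and `max_frequency_mhz` are hypothetical.

```python
def critical_path(stage_delays_ns):
    """Return (name, delay) of the slowest register-to-register logic path."""
    name = max(stage_delays_ns, key=stage_delays_ns.get)
    return name, stage_delays_ns[name]


def max_frequency_mhz(stage_delays_ns):
    """With zero skew, the clock period can be no shorter than the critical path."""
    _, worst_ns = critical_path(stage_delays_ns)
    return 1000.0 / worst_ns  # a 1 ns period corresponds to 1000 MHz


# Hypothetical logic delays between successive registers, in nanoseconds.
stages = {"fetch": 2.0, "decode": 4.0, "execute": 3.0}

print(critical_path(stages))       # ('decode', 4.0)
print(max_frequency_mhz(stages))   # 250.0
```

Every stage is clocked at the pace of the slowest one, even though the other two stages finish with time to spare.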

Azuro argues that, since a balanced tree is no longer possible, a different way of designing is required. Instead of doing separate logic and clock path design steps, the logic and clock paths must be optimized concurrently – hence the moniker “clock concurrent optimization.” But here’s the deal: the clock paths and the logic paths can now be tweaked against each other, so the clock no longer has to arrive at every register at the same time. This means that along a “chain” of registers, the time available to the various stages may be different.

But there’s no free lunch here. If the logic path delays are monkeyed with, you’re essentially borrowing time from a prior or later stage, so someone has to pay up at the end: the overall chain must be in balance. No borrowing from the lottery, no borrowing from the education fund, no borrowing from Social Security. If you borrow more time than is available in the overall chain, you lose. The chain must end up with net positive slack overall after all the borrowing and repaying have been done.

Because each stage of a chain may have a different delay, you lose the concept of the critical path. Instead, it’s replaced with the concept of the critical chain: this is the chain with the least overall net positive slack, and it is the chain that determines the max frequency for the clock driving that chain and any others in the same network.
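The borrowing arithmetic can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Azuro's algorithm: it considers setup timing only, ignores hold checks, register setup time, and clock-to-output delay, and assumes the first and last registers of a chain see the same clock arrival. All names (`stage_slacks`, `chain_slack`, `borrowing_offsets`, `critical_chain`) and delay values are hypothetical. Each stage's slack is the period, plus the capturing register's clock offset, minus the launching register's offset, minus the logic delay, so the offsets cancel when the slacks are summed over the chain.

```python
def stage_slacks(delays, period, offsets=None):
    """Setup slack of each stage, given clock offsets at the n+1 registers."""
    if offsets is None:
        offsets = [0.0] * (len(delays) + 1)  # zero skew
    return [period + offsets[i + 1] - offsets[i] - d
            for i, d in enumerate(delays)]


def chain_slack(delays, period):
    """Net slack of the chain; independent of the internal clock offsets."""
    return len(delays) * period - sum(delays)


def borrowing_offsets(delays, period):
    """Clock offsets that spread the chain's net slack evenly over its stages."""
    share = chain_slack(delays, period) / len(delays)
    offsets = [0.0]
    for d in delays:
        offsets.append(offsets[-1] + d - period + share)
    return offsets


def critical_chain(chains, period):
    """Name of the chain with the least net slack."""
    return min(chains, key=lambda name: chain_slack(chains[name], period))


# Hypothetical chains of logic delays (ns) and a 3.5 ns clock period.
chains = {"chain_a": [2.0, 4.0, 3.0], "chain_b": [1.0, 2.0]}
period = 3.5

zero_skew = stage_slacks(chains["chain_a"], period)
offsets = borrowing_offsets(chains["chain_a"], period)
borrowed = stage_slacks(chains["chain_a"], period, offsets)

print(zero_skew)   # [1.5, -0.5, 0.5]: the middle stage fails with zero skew
print(offsets)     # [0.0, -1.0, 0.0, 0.0]
print(borrowed)    # [0.5, 0.5, 0.5]: same total, every stage now passes
print(critical_chain(chains, period))   # chain_a
```

In this toy case the 4 ns stage would fail a 3.5 ns clock under zero skew, but clocking the second register 1 ns early lets it borrow from its neighbor, and the total slack of 1.5 ns is the same before and after. If the chain's net slack were negative, no choice of offsets could rescue it.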

Of course, it’s one thing to come up with a clean theoretical new foundation to replace the old cracked foundation; it’s quite something else to implement a commercial tool that really works in the real world. Earlier this year, Azuro launched Rubix, which they claim is the first tool to combine the optimization of the logic and clock paths. They boast results as much as 20% faster than those from traditional clock synthesis, but, perhaps more significantly, they also claim much faster design completion. This is due to the elimination of numerous iterations of the optimize-logic/build-clock-tree/hope-they-converge loop. If the predictions that the traditional flow completely breaks down around the 32-nm node are true, then it means not just faster time to market, but, in fact, getting to market versus not.

If accurate, that’s a pretty powerful promise. Maybe even enough to make you stop skewing around with the old way of doing things.

Link: Azuro Rubix