feature article
Subscribe Now

Skew This!

Azuro Touts Clock Concurrent Optimization for Aggressive Nodes

The concept of clocking a register is pretty simple. It’s Logic Design 101 stuff. Having an entire system controlled by a uniform clock makes accessible that which would otherwise be an intractable problem. It’s like adding traffic lights downtown to keep traffic from getting completely chaotic.

A whole discipline has grown out of this very basic concept: that of synchronous design. An entire ecosystem of tools and techniques has been built around some very fundamental assumptions of how to design such circuits. And as the circuits have gotten bigger, clock tree synthesis (CTS) has become an art in its own right. But, according to the team at Azuro, the foundation on which it’s built has some cracks in it, and that foundation needs to be replaced.

The nitty gritty details of clock timing can get a bit cumbersome if you want to spell things out in a manner that will withstand the objections of a clock specialist, but for the rest of us, things come down to some relatively straightforward concepts. The way we do things traditionally involves 1) designing a logic path that will provide the timing needed to achieve performance and then 2) creating a clock network. We rely on the elimination of clock skew to make this possible.

What that means is that we try to create a clock tree such that the paths from the clock source to all registers have the same delay – they have zero skew. That means sometimes adding delays here or there to balance the network, which is ok. As long as the paths are equal, the circuit will work.

But Azuro argues that as we approach small dimensions (starting around the 65-nm node), this idealization of zero skew must necessarily fail for three reasons.

One reason is the increased use of clock gating. A gate adds a delay in the clock path, and a gated path will be different from an ungated path. One way of addressing this is to push all clock gates to the bottom of the tree so that the clock paths are equal for as long as possible, diverging only at the very end, thereby keeping the skews minimal. But this means creating many more gates than are actually necessary, with accompanying complicated logic to control them. The result is greater area used and, ironically, greater power consumption – ironic because the whole purpose of clock gating is to reduce power. So not pushing the gates to the bottom of the network means that you’re introducing skew.

A second reason is simply the complexity of advanced SoCs. Start with the fact that IP blocks are inserted into the chips intact, with their pre-designed clock networks attached to the SoC clock network. Add test clocking schemes for accelerating or simplifying test sequences: these require clock muxes here and there that interfere with the normal clock networks. Add such further complicating factors as multiple clock domains and adaptive capabilities like dynamic voltage and/or frequency scaling (DVFS), and you’ve pretty much eliminated any pretense at creating a balanced clock network.

But overlaying both of these is one much more fundamental trump against ever having a balanced network: manufacturing variations that become more significant with each process node. Even if you have what looks like a simple, perfectly-designed clock tree, on any given die, on any given day, different paths will have different delays, and it’s unpredictable and it’s unsystematic and it will change with each die. And the variation has become significant enough that you can never margin your way out of it.

Add this variation on top of the clock gating and other complexities, each of which will itself have variation, and you find yourself unable to assume anything about skew.

Robbing Peter to pay Paul

With traditional clock design, a given registered path or pipeline will have a critical path – the one logic path along the chain of registers that acts as the rate-limiting step and determines the clock frequency. Designers try to get the critical path small enough to where they can guarantee the required performance and then insert a balanced clock tree.

Azuro argues that, since a balanced tree is no longer possible, a different way of designing is required. Instead of doing separate logic and clock path design steps, the logic and clock paths must be optimized concurrently – hence the moniker “concurrent clock optimization.” But here’s the deal: the clock paths can now be tweaked around a bit by tweaking the logic paths. This means that along a “chain” the timing of various stages may be different.

But there’s no free lunch here. If the logic path delays are monkeyed with, you’re essentially borrowing time from a prior or later stage, so someone has to pay up at the end: the overall chain must be in balance. No borrowing from the lottery, no borrowing from the education fund, no borrowing from Social Security. If you borrow more time than is available in the overall chain, you lose. The chain must end up with net positive slack overall after all the borrowing and repaying have been done.

Because each stage of a chain may have a different delay, you lose the concept of the critical path. Instead, it’s replaced with the concept of the critical chain: this is the chain with the least overall net positive slack, and it is the chain that determines the max frequency for the clock driving that chain and any others in the same network.

Of course, it’s one thing to come up with a clean theoretical new foundation to replace the old cracked foundation; it’s quite something else to implement a commercial tool that really works in the real world. Earlier this year, Azuro launched Rubix, which they claim as the first tool to combine the optimization of the logic and clock paths. They boast results as much as 20% faster than using traditional clock synthesis, but, perhaps more significantly, they also claim much faster design completion. This is due to the elimination of numerous iterations of the optimize-logic/build-clock-tree/hope-they-converge loop. If the predictions that the traditional flow completely breaks down around the 32-nm area are true, then it means not just faster time to market, but, in fact, it means getting to market versus not.

If accurate, that’s a pretty powerful promise. Maybe even enough to make you stop skewing around with the old way of doing things.

Link: Azuro Rubix

Leave a Reply

featured blogs
Aug 17, 2018
Samtec’s growing portfolio of high-performance Silicon-to-Silicon'„¢ Applications Solutions answer the design challenges of routing 56 Gbps signals through a system. However, finding the ideal solution in a single-click probably is an obstacle. Samtec last updated the...
Aug 17, 2018
If you read my post Who Put the Silicon in Silicon Valley? then you know my conclusion: Let's go with Shockley. He invented the transistor, came here, hired a bunch of young PhDs, and sent them out (by accident, not design) to create the companies, that created the compa...
Aug 16, 2018
All of the little details were squared up when the check-plots came out for "final" review. Those same preliminary files were shared with the fab and assembly units and, of course, the vendors have c...
Aug 14, 2018
I worked at HP in Ft. Collins, Colorado back in the 1970s. It was a heady experience. We were designing and building early, pre-PC desktop computers and we owned the market back then. The division I worked for eventually migrated to 32-bit workstations, chased from the deskto...
Jul 30, 2018
As discussed in part 1 of this blog post, each instance of an Achronix Speedcore eFPGA in your ASIC or SoC design must be configured after the system powers up because Speedcore eFPGAs employ nonvolatile SRAM technology to store its configuration bits. The time required to pr...