
Municipal Clock Design

Picture yourself living in a big city with lots of traffic. That could be anywhere in the world. Now picture that city with a robust subway/rail system (in other words, not buses that also have to contend with traffic). Admittedly, that narrows things down (to places mostly outside the US, but never mind… work with me here).

In this city, you have a choice. When you want to go from your home to your work (both within the city), you could drive the entire distance. Or, if you were lucky, you could take public transit the entire distance and not use your car at all. Perhaps you wouldn’t even need to own a car.

Now… for the sake of this discussion, let’s assume we’re working with a transit system that has an excellent on-time record. I know, none of them are perfect, but let’s say that the variation in your train trip duration rounds to zero – it always takes exactly the same time. We can then make the following observation: due to the vagaries of traffic and random variations in behind-the-wheel idiocy, driving all the way to work gives the most uncertainty on when you’ll get there – it has the highest arrival time variation or trip-to-trip skew; the train trip has the lowest (zero in this case).

Granted, the logistics of using your car are easiest; for the train, there’s more to do to get there: get a ticket, wait for it to arrive, etc. So the car has the best ease of use, but is the least predictable; the train is more work, but is the most predictable.

Now let’s say that you work in a plant just outside of town, out of the reach of the transit system, and you live in the heart of town. Then you might have a beater car that you park at the train station nearest your work. You take the train that far, and then you drive the remaining distance. You’ve now put some unpredictability back into your arrival time due to the drive. But because the drive is so much shorter than driving all the way from home, the variation in your trip times is likely to be much smaller.

And if, for any reason, you decided not to park at the closest-to-work station, but rather at some farther-away station (perhaps it’s near your favorite pub or grocery store), then your drive would be longer and less predictable – but still shorter and more predictable than driving all the way from home.

The point here is that the shorter the drive portion of the trip, the more predictable your arrival time will be. Which relates rather nicely to clock design on large ICs. Really. Work with me here.
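For the numerically inclined, that intuition can be sketched in a few lines of Python. Every number below – distances, speeds, per-kilometer jitter – is made up purely for illustration; the only point is that the trip-to-trip spread grows with the driven distance:

```python
import random

def trip_time_std(drive_km, km_per_min=1.0, jitter_per_km=0.5, trials=10_000):
    """Simulate many trips where only the driven portion adds jitter.

    Each driven km contributes an independent random delay, so the
    trip-to-trip spread grows with how far you drive. Illustrative
    numbers only -- not real traffic data.
    """
    times = []
    for _ in range(trials):
        t = sum(1.0 / km_per_min + random.gauss(0, jitter_per_km)
                for _ in range(int(drive_km)))
        times.append(t)
    mean = sum(times) / trials
    var = sum((t - mean) ** 2 for t in times) / trials
    return var ** 0.5

# Driving 20 km door-to-door vs. training most of the way and driving 2 km:
print(trip_time_std(20))  # larger spread in arrival time
print(trip_time_std(2))   # much smaller spread
```

Because the per-kilometer delays are independent, the spread grows like the square root of the driven distance – shrink the drive and you shrink the uncertainty.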

The easiest way to get your clocks where you need them with the desired timing is good ol’, tried-and-true, push-button clock tree synthesis (CTS). Each clock load gets a custom-crafted signal path that originates at the clock source. It’s like driving to work: you get the best control, the easiest methodology, and it’s what most people do.

There’s just one issue: on-chip variation at today’s aggressive dimensions is a tremendous consideration, which makes it really hard to manage the fact that on any given die, one side may be faster or slower than the other (to oversimplify things). And you don’t know which side and you don’t know if it’s faster or slower. It’s like trying to drive all the way to work and figuring out what time you should leave so that you’ll never arrive at work late and you’ll never arrive too ridiculously early (which would make you look too eager… no one wants to be that guy…). If the traffic varies too much from day to day, there’s really no way to do that. The only way you would be able to manage your arrival time would be to look at a traffic map before leaving home so you could time your departure (and hope things didn’t change too much after you left). Unfortunately, there are no on-line traffic maps on an IC, so that option isn’t available to clock designers.

The other option, then, is to use transit. The IC version of this has been so-called “clock mesh” design. So-called because it involves a large, fine-grained clock mesh. The idea is that you have many sources of a given clock all around the die, and you hook them all onto the mesh. You are literally shorting the outputs of all of these drivers, so if one is a little slower than another, the faster one starts yanking on the line earlier, which compensates for the guy that’s slow. In other words, by shorting all these drivers, their delay variations are sort of averaged out.
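A toy model (my own sketch, not anything out of a real CTS tool) shows why the shorting helps. If the shared net effectively sees something close to the average of the individual driver delays – the electrical reality, with drivers fighting each other across the RC of the mesh, is messier – then the arrival-time spread shrinks by roughly the square root of the number of drivers:

```python
import random

def mesh_skew(n_drivers, sigma_ps=20.0, trials=5_000):
    """Crude model of driver shorting: the shared mesh 'sees' roughly the
    average of the individual driver delays, so a per-driver variation of
    sigma_ps shrinks by about sqrt(n_drivers). Illustrative numbers only."""
    arrivals = []
    for _ in range(trials):
        delays = [random.gauss(100.0, sigma_ps) for _ in range(n_drivers)]
        arrivals.append(sum(delays) / n_drivers)  # shorted net ~ average
    mean = sum(arrivals) / trials
    return (sum((a - mean) ** 2 for a in arrivals) / trials) ** 0.5

print(mesh_skew(1))   # ~20 ps: a lone driver keeps its full variation
print(mesh_skew(64))  # ~2.5 ps: 64 shorted drivers average it away
```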

In that manner, all of the paths from the clock sources to the mesh end up having pretty much the same effective delay. It’s like the transit system – low skew. And if you’re lucky enough to be able to drive your clock loads directly from the mesh, then you have effectively zero skew, like a door-to-door transit system provides.

But, of course, just as none of us can reasonably expect a complete door-to-door transit system, you also can’t drive the mesh to absolutely everywhere you need a clock signal. So you get it as close as possible – the tight mesh gives you that – and then you have a bit more logic that gets you from the mesh to your actual load. You get one or two levels of logic where you can, for instance, do some gating. This is like taking the train to the outer station and then driving the last little bit.

Because the source-to-mesh skew is roughly zero, you really have to worry only about the skew of those last couple of logic levels, and that’s manageable. So it sounds like an easy solution. But, of course, as with all things engineering, there are tradeoffs.

First off, in the same way that driving is easier than using transit, CTS is easier than designing a clock mesh. It’s push-button. Clock meshes are harder to design and analyze, and they have to have a nice, clean, uninterrupted area. If your chip has a circuit – perhaps it’s some hard IP you purchased whose layout you can’t control – that uses the same metal layers as the mesh, now you’ve blocked the mesh and screwed things up.

The upshot is that, as Synopsys tells it, clock meshes aren’t used very often: they tend to be limited to the processor areas on SoCs, and they’re created by well-trained, dedicated teams. Other companies are reluctant to go that route because of the risk inherent in changing to an approach that’s significantly harder than CTS.

There’s another problem: if a transit system has too many stations – if it tries to deliver you too close to any possible destination – then the train will just get underway from one station and have to start slowing down for the next one. The train practically won’t spend any time traveling; it’s constantly starting and stopping and waiting for folks to get on and off. And we all know that starting and stopping frequently is more work – takes more energy – than simply gliding along the rail. So a train system with closely-packed stations will use more power than one with stations spaced farther apart. To a first approximation, anyway. Work with me here… (Will it use less than if everyone takes their car? Probably not… work with me here…)

Albeit for different reasons, the clock mesh approach also uses much more power than a CTS approach. It’s because of those big metal mesh lines. Unlike CTS, where only the lines going from point A to point B switch, here we have parts of the mesh that aren’t near point A or point B – and they’re still switching. Bottom line, lots of metal, all of it switching: higher power.
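The relationship at work here is the classic dynamic-power equation, P = αCV²f: power scales directly with how much capacitance is switching. The capacitance figures below are invented purely for illustration, but they show how a modest bump in switched capacitance – all that extra mesh metal – translates directly into extra milliwatts:

```python
def dynamic_power_mw(cap_pf, vdd=0.9, freq_ghz=2.0, activity=1.0):
    """Classic switching-power estimate: P = a * C * Vdd^2 * f.
    A clock toggles every cycle, so its activity factor is as high as it
    gets (conventions vary; this sketch just uses 1). The capacitance
    numbers fed in below are made up for illustration."""
    return activity * cap_pf * 1e-12 * vdd**2 * freq_ghz * 1e9 * 1e3

tree_cap_pf = 100.0  # hypothetical total wire+buffer cap of a clock tree
mesh_cap_pf = 130.0  # hypothetical mesh: extra metal everywhere, all toggling
print(dynamic_power_mw(tree_cap_pf))  # 162.0 mW
print(dynamic_power_mw(mesh_cap_pf))  # 210.6 mW -- 30% more
```

A 30% capacitance penalty lands squarely in the 20-40% power premium quoted for a full mesh below – which is the whole point: on a net that switches every cycle, every femtofarad of extra metal costs you.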

The compromise here is to make the mesh less fine – coarsen the granularity. This is like designing a transit system with more space between stations. You consume less power, but you also end up getting dropped off farther from your destination. Synopsys calls this multi-source CTS. You still end up designing a clock tree between the mesh and load – 8-9 levels of logic rather than the 1 or 2 of a full mesh – but now the source for that tree isn’t a single clock source (as it would be in standard CTS); it’s the combination of all of the sources that drive the mesh.

Doing this helps with the power problem while compromising a little on the skew. The thinking is that even nine levels of logic is still manageable. That aside, there’s still the potential problem of ease of use. This is where Synopsys recently announced, as a part of their 20-nm support, a multi-source CTS methodology with IC Compiler that’s more automated to make it more like CTS when it comes to ease of use.
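A back-of-the-envelope way to see why nine levels is “still manageable”: if each buffer level contributes an independent random delay, the path-delay spread grows only as the square root of the depth, not linearly. The per-level sigma below is an invented number; real OCV analysis is far more detailed:

```python
def tree_skew_ps(levels, sigma_per_level_ps=3.0):
    """If each buffer level adds independent random delay with standard
    deviation sigma, the total spread grows roughly as sqrt(levels).
    A toy model with an invented per-level sigma."""
    return sigma_per_level_ps * levels ** 0.5

print(tree_skew_ps(2))  # ~4.2 ps: 1-2 levels below a full mesh
print(tree_skew_ps(9))  # 9.0 ps: ~9 levels in multi-source CTS
```

Going from two levels to nine roughly doubles the spread rather than quadrupling it – a real cost, but a bounded one.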

One of the issues they have to solve is, what’s the best place to tap off of the mesh to drive a local tree? And where do you run the local clock tree from that tap? To automate this design issue, they have a “clustering” algorithm that finds clusters of the circuit that are naturally near each other and need the same clock. Those clusters will often reflect the design hierarchy, and the tool will pay attention to the hierarchy in its deliberations, but it’s not tightly bound to that – it can bring in or leave out pieces of the circuit that cross the block boundaries.
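For flavor, here’s the simplest conceivable stand-in for that kind of clustering – geometry only, nothing more. Synopsys’s actual algorithm also weighs hierarchy and clock domains, and its details aren’t public; the tap and sink coordinates below are made up:

```python
import math

def cluster_sinks(sinks, taps):
    """Assign each clock sink (x, y) to the nearest mesh tap point -- the
    crudest possible proximity-based clustering. Each resulting group
    would get its own local tree driven from that tap."""
    groups = {tap: [] for tap in taps}
    for sink in sinks:
        nearest = min(taps, key=lambda t: math.dist(sink, t))
        groups[nearest].append(sink)
    return groups

taps = [(0, 0), (10, 0)]
sinks = [(1, 1), (2, 0), (9, 1), (8, 2)]
print(cluster_sinks(sinks, taps))
# sinks near (0,0) and near (10,0) fall into separate local trees
```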

The result is that you now have three clock design choices:

  • CTS, which will give you the lowest power and is the easiest to do but has the least ability to deal with on-chip variation;
  • Multi-source CTS, which tolerates on-chip variation better than CTS but uses 10-20% (preliminary numbers) more power than CTS and is a bit more work to implement; and
  • a full clock mesh, which has the best on-chip variation tolerance but uses 20-40% more power than CTS and is harder to implement than either CTS or multi-source CTS.

So don’t sell your car quite yet. You may decide you want it just to make that last leg between the station and your work a little bit more flexible.

Image: Bryon Moyer
