There are some places it seems everyone wants to be. The Oscars. An inaugural ball. Mardi Gras. New Year’s in Times Square. (OK, pre-War on Terror.) Well, there was a new member of this list a while back that might not have sprung to mind immediately: the microprocessor session at ISSCC, packed to the gills. Four new processors were presented, plus one process migration to 45 nm. The bragging rights on such chips are typically all about performance (or performance efficiency), and everyone fusses over clock rates and bus sizes and various and sundry other numbers, but, in this article, we’ll turn away from the distracting lure of feeds and speeds. Anyone can look up the numbers in a datasheet, and the techno-paparazzi have assuredly posted pictures of the most lurid ones long ago. So we’ll focus on things that we found interesting in the new processors, assuming that what interests us will interest you.
The session started with Sun’s CMT-3 processor, featuring sixteen cores. This chip abandons the standard out-of-order processing mechanism, which uses an in-order fetch followed by out-of-order execution and in-order retirement. The issue with this traditional scheme is that all instructions – including those successfully executed – are kept around so they can be retired in the order they were fetched. Sun felt that the hardware required to keep 1000 or so such instructions around, along with the other required storage and the reorder buffer, was getting too big to justify – especially since only a subset of the instructions being stored actually had to be replayed. So what they did instead was to send only unexecuted instructions to a deferred queue, retiring each instruction as it’s completed, out of order if appropriate.
They have also taken on a “scout threading” methodology. Each core supports two standard threads and two scout threads. When a stall occurs due to, say, a cache miss, a scout thread is sent out, and it executes speculatively, beating the bushes to identify where the next stalls are likely to be. These newly-uncovered events can then be launched early so that the data will be ready by the time actual execution catches up. In fact, there are two threads that chase each other: one goes forward speculatively; the other hangs back, doing clean-up work once missing data arrives. At some point the lagging thread can catch up, and the two might even leap-frog each other.
This involves a system of checkpointing, where they capture the state of the system at strategic points, hopefully reducing the amount of replaying required once they resolve the speculative paths. The register files have something akin to shadow registers (a term I use cautiously, since it’s not necessarily a shadow register in the strict scan sense) for capturing the state at checkpoints. While the main register is designed for performance, the capture registers can store checkpoints using lower-power transistors, holding the data until (and if) it needs to re-enter the main flow.
The next leap Sun has taken is the move to a transactional memory access model. A group of up to 32 stores can be committed atomically or rolled back entirely. This is intended to simplify writing programs that use memory shared by multiple threads. Such a program may have a critical set of instructions that modify the shared memory in such a way that all of those instructions need to be executed by one thread before any other thread accesses the memory. Failure to do this could cause invalid data states or results that make no semantic sense from the view of a particular thread. These instruction sequences are referred to as critical regions, and normally only one thread can execute a critical region at a time. In addition, a thread will protect shared memory by acquiring a lock that essentially gives it exclusive rights to write to the memory for as long as it owns the lock. Transactional memory eliminates the critical region restrictions and the need to take out locks by allowing programmers to use transactions instead. All of the steps of the transaction will execute or none of them will, just as with a database transaction.
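To make the all-or-nothing behavior concrete, here is a minimal single-threaded sketch in Python. It is only a software model of the idea, not Sun's hardware mechanism: the `Transaction` class, the `transfer` helper, and the dictionary standing in for shared memory are all hypothetical names of ours, and the sketch leaves out the conflict detection between threads that a real implementation needs. The only number taken from the presentation is the 32-store limit.

```python
MAX_STORES = 32  # per-transaction store limit quoted in the article


class TransactionAborted(Exception):
    """Raised when a transaction cannot be committed."""


class Transaction:
    """Buffers stores privately; memory changes only on commit (illustrative)."""

    def __init__(self, memory):
        self.memory = memory   # dict standing in for shared memory
        self.pending = {}      # buffered stores, invisible to others until commit
        self.store_count = 0

    def load(self, addr):
        # A transaction sees its own buffered stores first.
        return self.pending.get(addr, self.memory.get(addr))

    def store(self, addr, value):
        if self.store_count >= MAX_STORES:
            raise TransactionAborted("store limit exceeded")
        self.store_count += 1
        self.pending[addr] = value

    def commit(self):
        # All buffered stores become visible together.
        self.memory.update(self.pending)
        self.pending = {}

    def rollback(self):
        # Discard everything; shared memory was never touched.
        self.pending = {}


def transfer(memory, src, dst, amount):
    """Move `amount` from src to dst, or leave memory untouched."""
    tx = Transaction(memory)
    try:
        tx.store(src, tx.load(src) - amount)
        tx.store(dst, tx.load(dst) + amount)
        if tx.load(src) < 0:
            raise TransactionAborted("insufficient balance")
        tx.commit()
        return True
    except TransactionAborted:
        tx.rollback()
        return False
```

Calling `transfer` with an amount the source can't cover leaves both locations exactly as they were, which is the property that lets the programmer skip the explicit lock.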
Power and area were reduced by taking advantage of common usage – or low-duty-cycle usage – of critical system blocks. First off, they noted that server code is often shared between cores, and so rather than giving each core its own separate instruction cache, four cores share one instruction cache, accessing it using a round-robin approach. Data caching is handled similarly, except that the data cache is shared between two cores. They also took data that showed that even in floating-point-intensive applications, the utilization of a floating point unit (FPU) was less than fifty percent. So rather than giving each core its own FPU, they share each FPU between two cores.
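Round-robin sharing of a single instruction cache among four cores might look something like the following sketch. The presentation didn't spell out the arbitration details, so the `RoundRobinArbiter` class and its one-grant-per-cycle behavior are our assumptions, meant only to show the fairness idea.

```python
class RoundRobinArbiter:
    """Grants one requester per cycle, rotating priority (illustrative model)."""

    def __init__(self, num_cores=4):
        self.num_cores = num_cores
        self.last = num_cores - 1  # so that core 0 has first priority

    def grant(self, requests):
        """requests: set of core ids wanting the cache this cycle.

        Returns the granted core id, or None if nobody asked.
        """
        # Start the search just past the most recently granted core.
        for offset in range(1, self.num_cores + 1):
            core = (self.last + offset) % self.num_cores
            if core in requests:
                self.last = core
                return core
        return None
```

If cores 0 and 2 both request on every cycle, the grants alternate between them, so neither core can starve the other.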
IBM followed Sun, showing the conversion of their Cell Broadband Engine from 65-nm SOI to 45-nm SOI, which they tried to automate as much as possible. This didn’t involve any change of architecture or design; it was strictly a methodology play. They started with an automated schematic migration step, which took care of device scaling. Then for the physical migration, there were parameterized cells that were tweaked programmatically; some commonly-used cells that were automatically shrunk and reshaped, with manual cleanup; and custom cells, where the swap-out of old for new was automated, again with manual cleanup. The automatic steps resulted in between 85 and 99% clean designs, with the manual steps filling the gaps.
They then addressed the design-for-manufacturing (DFM) issues, using a tool to make DFM corrections up to the last minute. The results were graded by yield-checking software, which was kind of like a DRC for DFM. Each of many items was scored for manufacturability, and in order for the die to pass muster, the overall score had to exceed a pre-determined threshold.
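The pass/fail gate can be pictured with a few lines of Python. IBM didn't describe how the individual scores were combined, so the weighted average below, along with the item names and the function names, is purely an assumption for illustration.

```python
def overall_dfm_score(item_scores, weights=None):
    """Combine per-item manufacturability scores into one number.

    item_scores: dict of item name -> score.
    weights: optional dict of item name -> weight (defaults to equal weights).
    The weighted average is an assumed combining rule, not IBM's.
    """
    if not item_scores:
        raise ValueError("no items to score")
    if weights is None:
        weights = {name: 1.0 for name in item_scores}
    total_weight = sum(weights[name] for name in item_scores)
    weighted = sum(score * weights[name] for name, score in item_scores.items())
    return weighted / total_weight


def passes_dfm(item_scores, threshold, weights=None):
    """The die passes only if the overall score exceeds the threshold."""
    return overall_dfm_score(item_scores, weights) > threshold
```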
Power reductions were implemented by optimizing the power grid, lowering the supply voltage, replacing some of the dynamic circuits with static circuits, and minimizing leakage – high-VT transistors were used throughout except where performance demanded normal-VT. Their simulations showed a 35-40% reduction in power.
They had to deal separately with the SRAM cells, because the lower supply voltage threatened the stability of the SRAM bit. This “VDDmin problem” was a recurring theme at ISSCC; SRAM cells are much less able to tolerate lowered VDD than logic cells are because a read operation can inadvertently flip a bit, or conversely, a write operation might fail. SRAM cells have historically avoided this problem through careful transistor ratioing, but VT variability at the smallest process nodes is now bad enough that you can’t rely on a static guardbanded design; different design approaches are required. IBM’s solution was to use a higher VDD for the SRAM array, with a level shifter in the wordline driver to transition between the power domains.
Overall, they found that the digital logic scaled very well, the memory scaled OK, and the I/O and analog didn’t shrink at all. This is a flip-chip design, using Controlled Collapse Chip Connection (C4) “bumps.” They kept the bump pitch the same, which ended up determining the die size.
Tilera described their TILE64 device, which has 64 cores. The classic problem with multicore beyond a few cores is how to provide effective communication with memory and I/Os, and between cores. So of note here was Tilera’s use of five separate mesh networks interconnecting these elements. Each of these networks has five full-duplex ports at each core – one each for east, west, north, and south, used for routing purposes, and one for the processor to get onto the network.
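To see how the four compass ports get a message across the mesh, here is a small sketch of dimension-ordered ("X first, then Y") routing. Tilera's routing algorithm wasn't covered in this article, so treat the algorithm choice, the `xy_route` function, and the (column, row) coordinate convention as our assumptions.

```python
def xy_route(src, dst):
    """Return the list of output ports a message takes from src to dst.

    src, dst: (column, row) tile coordinates, with row 0 at the top.
    Travels fully along the row first, then along the column.
    """
    (sx, sy), (dx, dy) = src, dst
    hops = []
    # Horizontal leg: east increases the column, west decreases it.
    hops += ["east" if dx > sx else "west"] * abs(dx - sx)
    # Vertical leg: south increases the row, north decreases it.
    hops += ["south" if dy > sy else "north"] * abs(dy - sy)
    return hops
```

On an 8x8 array, the worst case under this scheme is corner to corner: 14 hops.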
Two of the networks are under hardware control: the memory dynamic network (MDN) and the tile dynamic network (TDN). The MDN allows the tiles to access external memory; the TDN allows tiles to talk to each other cache-to-cache. The other three meshes are the I/O dynamic network (IDN), the user dynamic network (UDN), and the static network (STN); they are under software control. The IDN is used to gain access to the I/Os; the UDN is used for such inter-process communication purposes as streaming, message passing, and channels; and the STN appears to be used to share scalars – i.e., constants – between tiles.
Messages on the IDN and UDN can be tagged, and there are four queues that allow incoming messages to be filtered by tag. So you can define which four tags to filter into the dedicated queues, and then there’s an “other” queue for messages that don’t have one of the four specific tags. You can use this to prioritize or otherwise segregate messages, handling them out of order as needed. There’s a message buffer prior to the queues that’s shared between the IDN and UDN, although messages can bypass the buffer and go directly into a queue.
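The tag filtering amounts to a small demultiplexer: four queues, each bound to one tag, plus a catch-all. The sketch below models just that behavior; the `TagDemux` class and its method names are hypothetical, and the shared message buffer ahead of the queues is left out.

```python
from collections import deque


class TagDemux:
    """Sorts incoming messages into four tagged queues plus an 'other' queue."""

    def __init__(self, tags):
        if len(set(tags)) != 4:
            raise ValueError("exactly four distinct tags are required")
        self.queues = {tag: deque() for tag in tags}
        self.other = deque()  # catch-all for every other tag

    def deliver(self, tag, payload):
        # Messages whose tag has no dedicated queue land in 'other'.
        self.queues.get(tag, self.other).append((tag, payload))

    def receive(self, tag=None):
        """Pop from the queue for `tag`, or from 'other' when tag is None."""
        queue = self.other if tag is None else self.queues[tag]
        return queue.popleft() if queue else None
```

Because each queue is drained independently, software can pull an urgent message from its dedicated queue even though a lower-priority message arrived first.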
Renesas presented an eight-MIPS-core processor where each core has independent clocking and power control. With a compiler that takes advantage of this, power can be significantly reduced by shutting down the supply and/or the clock on a core-by-core basis whenever a core isn’t in use. The compiler first schedules the tasks, and then it can go in and determine which cores aren’t being used in each cycle, scheduling power reductions during those times.
Each core has five power modes:
- “normal,” with the power and clock active;
- “light sleep,” with the CPU clock off but the power on;
- “sleep,” with the cache and CPU clocks off but the power on;
- “resume power-off,” with the cache clock, CPU clock, and CPU power off, but with the core’s user RAM still powered (its clock off); and
- “full power-off,” with all clocks and power off.
Software can put the core in any of the states by writing to each core’s power register. Going into and out of the sleep modes is immediate; powering off takes 5 µs; recovery from the resume power-off state takes 30 µs. In the various stages of powering down, the core dissipates 1.4 W, 455 mW, 304 mW, 35 mW, and 0 mW. In an advanced audio coding (AAC) encoding example that used all 8 cores, full power consumed 2.37 W; reducing power where possible brought it down to 710 mW.
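A quick back-of-the-envelope model shows how time spent in each mode translates into average power. The per-mode figures are the ones quoted above; the `average_power_mw` function and the sample schedule are ours, and the sketch ignores the 5 µs and 30 µs transition times.

```python
# Per-core power in each mode, in mW, as quoted in the article.
MODE_POWER_MW = {
    "normal": 1400,
    "light_sleep": 455,
    "sleep": 304,
    "resume_power_off": 35,
    "full_power_off": 0,
}


def average_power_mw(schedule):
    """schedule: list of (mode, duration_us) pairs for one core.

    Returns the time-weighted average power in mW.
    Transition overheads are not modeled.
    """
    total_time = sum(duration for _, duration in schedule)
    energy = sum(MODE_POWER_MW[mode] * duration for mode, duration in schedule)
    return energy / total_time


# Hypothetical schedule: half the time running, half fully powered off.
half_idle = average_power_mw([("normal", 50), ("full_power_off", 50)])

# Savings in the AAC example: 2.37 W down to 710 mW.
aac_saving_pct = (1 - 710 / 2370) * 100
```

For the AAC numbers, the saving works out to roughly 70%.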
Synchronization between blocks was also sped up through the use of hardware barrier registers. Barriers are used to stop execution when multiple parallel calculation flows merge back into a single flow. If one of the “tributary” flows finishes first, it can’t just keep going because the data from the other tributary flows isn’t ready yet. So a barrier is used to hold things up until everyone else is ready. Typically there is one master task that makes sure all of the other tasks are ready before declaring itself ready. This helps avoid deadlock situations where each core is looking at the others trying to figure out who’s in charge.
Barriers are normally handled by software using memory, requiring many cycles to execute. Instead, Renesas has provided each core with barrier flags consisting of one write and four read registers – one for itself and three for the other cores. Coherency of the barrier flags is maintained by a snooping process. One particular benchmark they ran went from 1214 cycles to 66 by switching from conventional to the hardware barriers – a speedup of 18x.
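The flag scheme can be modeled in a few lines: each core writes only its own flag and reads everyone's. This is a behavioral sketch only; the `BarrierFlags` class is hypothetical, it assumes a group of four cores to match the four read registers, and the snooping that keeps the copies coherent is abstracted into a single shared list.

```python
class BarrierFlags:
    """Behavioral model of per-core barrier flags (illustrative)."""

    def __init__(self, num_cores=4):
        # One flag per core; in hardware each core would hold a
        # snooped copy of all of them in its read registers.
        self.flags = [0] * num_cores

    def arrive(self, core):
        # A core writes only its own flag when it reaches the barrier.
        self.flags[core] = 1

    def all_arrived(self):
        # Reading the flags is a local register read, not a memory access.
        return all(self.flags)

    def reset(self):
        self.flags = [0] * len(self.flags)


# Speedup implied by the benchmark cycle counts quoted above.
speedup = 1214 / 66
```

The quoted cycle counts give a ratio of about 18.4, consistent with the 18x figure.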
By the time of the last presentations, there was standing room only, with Intel announcing the world’s first two-billion transistor chip, an Itanium processor with four hyper-threaded cores. Obviously sheer scale is part of the wow of this chip. But some of the interesting new things they’ve done don’t involve the core processing path.
They’ve moved to a high-speed serial chip-to-chip interconnect, apparently rolling their own in what they call the QuickPath Interconnect, which provides a combined peak bandwidth of 96 GB/s. They also support serial channels for fully-buffered DIMMs, giving a maximum memory bandwidth of 34 GB/s.
Power management is partly addressed by providing distinct domains for the core logic, the L3 cache, the system interface, and the I/O logic. They also added a sleep circuit to the RAM cell to reduce leakage current when the cell is idle. More dramatically, in order to manage the combined effects of power and frequency more effectively, they have implemented a complex voltage-frequency scaling system that monitors activity rates on about 120 different internal events. These rates are weighted according to capacitance and accumulated over a 6-7-µs window in order to decide whether to reduce or increase the clock frequency. This decision is made by indexing the accumulated activity into a lookup table to determine the right frequency.
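The weight-accumulate-lookup flow can be sketched as follows. Everything numeric here is invented for illustration: the event names, the weights, the thresholds, and the frequencies are hypothetical, as is the assumption that higher activity maps to a lower clock. Only the overall structure (weighted event counts indexing a table) comes from the presentation.

```python
from bisect import bisect_right

# Hypothetical table: activity thresholds and the frequency chosen
# for each band (one more frequency than thresholds).
THRESHOLDS = [1000, 2000, 3000]
FREQS_MHZ = [2000, 1800, 1600, 1400]


def weighted_activity(event_counts, weights):
    """Sum of event counts over one window, each scaled by its weight.

    The weights stand in for the switched capacitance of each event.
    """
    return sum(weights[event] * count for event, count in event_counts.items())


def pick_frequency(activity):
    """Index the accumulated activity into the lookup table."""
    return FREQS_MHZ[bisect_right(THRESHOLDS, activity)]


# One window's worth of hypothetical counts and weights.
counts = {"fp_issue": 10, "l2_access": 5}
weights = {"fp_issue": 100, "l2_access": 20}
chosen = pick_frequency(weighted_activity(counts, weights))
```

With these made-up numbers the window accumulates an activity of 1100, which lands in the second band of the table.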
They also had a concern about sudden changes in activity resulting in quick changes in power, which can cause VDD to droop temporarily. To help minimize this they put in a tunable di/dt manager that limits how many high-power instructions can be issued in a short period of time when a burst hits. On the flipside, it also issues dummy high-power instructions to ease things down when the activity suddenly goes away.
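One way to picture the di/dt manager is as a slew-rate limiter on high-power instruction issue: activity may rise or fall by only so much from one interval to the next. The sketch below is our own simplification, not Intel's circuit; the `smooth_issue` function and the `max_step` parameter are hypothetical, and instructions throttled in one interval are simply not carried over to the next.

```python
def smooth_issue(demand, max_step=2):
    """Limit how fast high-power activity ramps up or down.

    demand: high-power instructions ready to issue in each interval.
    Returns a list of (real, dummy) counts per interval, where the
    total (real + dummy) changes by at most max_step between intervals.
    """
    prev_total = 0
    schedule = []
    for want in demand:
        # Ramp up: throttle a burst to at most max_step above last interval.
        real = min(want, prev_total + max_step)
        # Ramp down: pad with dummies so the total drops by at most max_step.
        dummy = max(0, prev_total - max_step - real)
        schedule.append((real, dummy))
        prev_total = real + dummy
    return schedule
```

Feeding in a burst of six instructions per interval followed by silence produces a staircase up (2, 4, 6) and then a staircase of dummy instructions down (4, 2, 0).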
Intel also put a lot of energy into hardening their SRAM against soft errors. In addition to the problem of disturbing bits when reading and writing, the continued decrease in supply voltage makes SRAM bits more and more vulnerable to upset by alpha particles. While many of the old sources of alpha particles (primarily packaging) have been controlled, there’s still the pesky matter of incoming cosmic artillery bombarding our electronics, and the bit cells are now small enough that a single hit can affect multiple bits. Intel hardened their SRAM cells by using so-called DICE structures that add redundant transistors with feedback that effectively make the cell harder to write, rendering it less vulnerable to spurious energy. This isn’t a cheap measure: the cell area increased by 34 to 44%, and power went up by 25%.
Full descriptions of all of these papers in gory detail are available in the ISSCC proceedings, and, in April, copies of the actual slide presentations will be available.