At the Dawn of the Universe

We take a lot for granted in this world. When we got here, stuff was more or less just working. We’ve managed to bend a lot of that to our will since then, and we’ve done so by mastering a wide variety of reliable physical laws. We’ve managed to destroy some of it (and threaten to destroy more), often because we have yet to master still more of those laws (or because someone could make a quick buck, but that’s a different topic). We live in a universe that has the deceptive feeling of steady state, using our relatively infinitesimal human tenure here as a window that defines “normal.” It feels like the universe started in some wildly chaotic state and annealed asymptotically into this relatively benign existence.

Scientists try to understand the inchoate universe, pushing back the threshold of our knowledge ever closer to the “beginning,” whatever that is, and if it even exists. And they’ve developed theories that take us back phenomenally close to a putative Big Bang. But there are those first femtoseconds – before the Forces differentiated, before Symmetry was broken, where Relativity and Quantum somehow resolve their differences – that have yet to be plumbed. None of the physical niceties we take for granted today apply back then. We can assume nothing. The rules don’t obtain because most of the rules don’t exist yet. Exploring this realm requires an open mind, the capacity for the bizarre, and, most of all, the ability to accept things that would be nonsensical in today’s world.

Shifting our attention to the vastly more banal, today’s software engineers can take a lot for granted. By the time most of them are writing code, the computer or system for which they’re writing is more or less just working. Software designers continue to push the limits of what can be done, even as some of them end up breaking the system (or turning out crappy code to make a quick buck, but that’s a different topic). They can count on a platform that behaves in a deceptively steady manner, at least as viewed during the time they’ve worked with it, even though they know that at some point it was just a random heap of metal and plastic that somehow got fashioned into this highly-ordered system, with hardware and operating system humming along at their service.

Some software engineers actually work with the system in its early stages, in particular with embedded systems, writing drivers or other low-level code that must operate much closer to the “beginning” of the existence of the system. Some of them help bring up the operating system for the first time. But there are those first few clock cycles that get executed on a new piece of computing silicon, before the operating system boots, before any connection to the internet exists, before printf has any meaning, that someone must navigate. None of the normal development conveniences apply then. You can assume nothing. Orderly execution of code doesn’t apply because the cores aren’t executing in an orderly fashion yet. Exploring this realm requires an open mind, the capacity for the bizarre, and, most of all, the ability to accept things that would be nonsensical for normal software development.

I was fortunate enough to spend some time with a few bring-up road warriors at MIPS. Rick Leatherman brought together a couple of his erstwhile FS2 cohorts, Bruce Ableidinger and Richard Hilbig, as well as Scott McCoy of MIPS extraction. All have been in the bring-up world a long time, and Richard and Scott in particular are on the frontlines of helping customers bring life to their new SoCs. The topic at hand was the state of the multicore SoC before it’s really running and executing steadily. In particular, what kinds of things can go wrong? And how do you figure out what’s happening?

We should start with some critical positioning: MIPS sells processor IP for integration into SoCs. They own and control what’s inside that box; the customer owns and controls what’s outside that box – the “customer system.” These guys are very clear about this and yet appear to be at peace with the fact that, as often as not, they’re going to be helping the customer debug their own problem rather than a MIPS core problem. In those cases, they can’t fix the problem, but they can help lead the customer to a solution.

The other clear thing was that, while the specific cores and solutions discussed were MIPS’s, the concepts apply generally, and good tools and debug solutions are available from all of the big core providers, particularly arch-rival ARM. So this isn’t a MIPS topic; it’s a universal topic as elucidated through MIPS’s experience.

What could possibly go wrong?

It helps to start by asking what kinds of problems are typically encountered during SoC bring-up. Because of the extensive verification done prior to tape-out, it’s extremely unlikely that a chip will come up DOA. While that may sound like good news, the flip side is that the issues that do crop up tend to be subtle and surprising. Even stupid.

Most of them actually deal with the way the busses are connected or implemented. It’s not uncommon, for example, for someone to bridge an AMBA bus to their OCP bus. If any part of that bridge isn’t working properly, you have major issues to deal with. They’ve seen situations where burst mode wasn’t working properly, and therefore code couldn’t be fetched properly. If you can’t fetch code, you can’t execute.

In another example, there was an issue where memory was loading into instruction cache (i-cache) two words at a time, with the words swapped. This meant that each pair of instructions was being executed in reverse order. There’s only so far you can go like that before something starts to get rather confused.

These can be connection issues, logic or protocol issues, or something as “simple” as a clumsy clock domain crossing. Some, but not all, of these things can be caught in verification. The nature of the problem is always obvious once you know what it is; it’s getting there from “this machine will not communicate” that’s the tough part. That’s where a solid bring-up process becomes valuable.

Start from the start

The first thing that happens at power up is reset. How it happens can vary. The core will have its own power-up sequence to get it into a known state from “power good,” which takes 15 clock cycles in MIPS’s case. The rest of the system will have whatever reset is designed by the user, and the generally-accepted practice is that, somewhere, a master reset pin is held for milliseconds, often by something as crude as an RC circuit, so that everything gets a chance to get to full power before the chip is unleashed. There’s that window between when parts of the chip think they’re on and other parts are still asleep that’s particularly dangerous, and, with multiple clock domains being derived on and off chip, it’s essential to ensure that everyone is in line and saluting before giving the order to march.
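To get a feel for what that crude RC circuit buys you, here is a back-of-the-envelope sketch. It assumes an ideal first-order RC node charging from 0 V toward the supply rail and a reset input that releases at a fixed threshold; the component values and threshold are purely illustrative and not taken from any MIPS design. It also ignores supply ramp time and input hysteresis, both of which matter on a real board.

```python
import math

def rc_release_time(r_ohms, c_farads, v_supply, v_threshold):
    """Time for an RC node charging from 0 V toward v_supply to reach v_threshold."""
    if not 0 < v_threshold < v_supply:
        raise ValueError("threshold must lie strictly between 0 V and the supply")
    # v(t) = v_supply * (1 - exp(-t / RC)), solved for t at v(t) = v_threshold
    return -r_ohms * c_farads * math.log(1.0 - v_threshold / v_supply)

# Hypothetical values: 100 kOhm, 100 nF, 3.3 V rail, threshold at half the rail.
t = rc_release_time(100e3, 100e-9, 3.3, 1.65)
print(f"reset released after about {t * 1e3:.1f} ms")
```

With these made-up values the hold time lands in the single-digit millisecond range, which is the order of magnitude the paragraph above describes.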

But, in particular in a multicore chip, what happens when reset is released? Does everyone just go off and start doing… whatever? This is a critical design decision: typically, someone has to be the boss, at least until everyone is up and running.

That means that one core should act as a master, synchronizing and starting up the other cores in a controlled fashion. If all of the cores have their own independent environments – in particular, if they share no resources – it is possible for them all to come up separately and independently, establishing synchrony through software handshakes after startup. But this is a less common scenario; more often than not, cores are more tightly coupled and must be managed together.

Taking that one step further, MIPS sells some cores in clusters. In this case, one core (and, in fact, one “virtual processor” within the core) is designated as the master. There is a shared cluster management block that starts the master when reset is released; the master then brings up the other cores in the cluster. Even with this management block, however, if a chip has multiple clusters, the overall system must figure out how to manage reset globally.
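The boss-first ordering can be pictured with a toy model. This is only a sketch of the idea, not of the MIPS cluster management block: the `Core` and `Cluster` classes, the `healthy` flag, and the notion of a "handshake" are all invented here for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Core:
    name: str
    healthy: bool = True   # hypothetical: does this core answer after being started?
    running: bool = False

    def start(self):
        self.running = self.healthy
        return self.running

@dataclass
class Cluster:
    cores: list
    log: list = field(default_factory=list)

    def release_reset(self):
        """Start the master first, then let it bring up the others one at a time."""
        master, others = self.cores[0], self.cores[1:]
        if not master.start():
            self.log.append(f"{master.name}: master did not start")
            return False
        self.log.append(f"{master.name}: started by management block")
        for core in others:
            if not core.start():
                # Stop here so the remaining cores stay in a known (held) state.
                self.log.append(f"{core.name}: no handshake, stopping bring-up")
                return False
            self.log.append(f"{core.name}: started by {master.name}")
        return True

cluster = Cluster([Core("core0"), Core("core1"),
                   Core("core2", healthy=False), Core("core3")])
ok = cluster.release_reset()
print(ok)
print("\n".join(cluster.log))
```

The point of the sequential loop is that a failure leaves everything after it untouched, so you know exactly how far the system got.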

Cluster coherency can complicate reset since, depending on how the cluster is configured, the coherency manager can suppress reset. This would be useful if the cluster needs to get to a coherent state before stopping. But it suggests a nightmare for certain debug scenarios where you’re trying to understand why reset doesn’t seem to work.

A diversion

If things don’t go right after reset is released, then you need to get into debug mode to figure out what’s going on. On the MIPS cores, there’s a jump vector that can be set for debugging, along with a bit that says to go there after reset is released. There’s a flag that can then be examined to confirm whether or not the core thinks it’s in debug mode.

Of course, doing this requires some way to get in the back door of the chip. And that’s typically done via the JTAG port, even though that’s not really what JTAG was designed to do. JTAG’s primary purpose is testing – specifically, boundary scan testing. It originally had nothing whatsoever to do with what’s on the chip; it was all about whether or not continuity between chips was intact. But, as with all good ideas, once let loose, people found other useful things to do with it. Notably, debug.

It’s possible to wire everything up into a single JTAG chain, but that’s not typical because tests would be incredibly slow due to the excessive length of the chain. Instead, the JTAG circuitry gets muxed and mucked with, creating multiple JTAG chains that are selected before you start actually running tests. How you select the debug mode is not part of any standard and can vary by company and even by chip. It’s one of those things that must be written down and communicated from the hardware designer to the software engineer. But more on that later.

So… we’ve got a JTAG port that we can use to put the chip into debug mode. But… that assumes the core responds correctly to the debug directive. It also assumes that the JTAG engine is working properly. And it assumes that the debug circuitry is working. It also assumes that JTAG has been set to debug and not boundary scan. It assumes that the various JTAG instruction registers have the length promised in the BSDL file that provides the tool with information about the structure of the JTAG chain. It even assumes that the pin connections between the chip and the prototype board or between the board and the debugger are intact. And all of this assumes that the fuzzy jacket you’re wearing in that plastic chair, with your leg close to the cable, isn’t causing an issue.

And one or more of these assumptions could be wrong. In particular, the fuzzy jacket issue actually occurred.

That’s the key to much of this work: not rushing headlong under the assumption that all the things that should normally be working are working. A sequence of steps gradually working in towards the core is necessary to build confidence that assessments of whether or not something is working are based on valid tests.

As an example, MIPS’s debugger probe will go in and start by verifying that the JTAG engine is responding properly. It can go from there to the point where the core is in debug mode and has executed instructions correctly. At each step along the way, if there’s a failure, it can report on what went wrong and make suggestions as to how to fix it.
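The outside-in approach can be sketched as an ordered list of checks that stops at the first failure and reports a hint. This is not how the MIPS probe is implemented; the check names, the hints, and the `lambda` stand-ins for real measurements are all assumptions made up for this illustration.

```python
def run_bringup_checks(checks):
    """Run checks in order; return (passed_names, failure).

    failure is None if everything passed, else a (name, hint) tuple
    for the first check that failed. Later checks are not attempted,
    because their results could not be trusted anyway.
    """
    passed = []
    for name, probe, hint in checks:
        if not probe():
            return passed, (name, hint)
        passed.append(name)
    return passed, None

# Hypothetical probes: each lambda stands in for a real measurement.
checks = [
    ("cable and pin continuity", lambda: True, "reseat the probe cable; check board jumpers"),
    ("JTAG engine responds", lambda: True, "check the JTAG signal wiring"),
    ("debug chain selected", lambda: True, "chain may still be set to boundary scan"),
    ("IR lengths match BSDL", lambda: False, "compare the measured chain with the BSDL file"),
    ("core reports debug mode", lambda: True, "check the debug vector and its enable bit"),
    ("core executes an instruction", lambda: True, "suspect the fetch path or a bus bridge"),
]

passed, failure = run_bringup_checks(checks)
print("passed:", passed)
print("failed:", failure)
```

Ordering the list from the cable inward means that a pass at any step rests only on things already shown to work.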

There’s one more level of potential complexity: security. For the most secure designs, debug technology will be completely ripped out to the extent possible. For other secure designs, the JTAG port may be wrapped in some other technology, with a password required to enter debug mode. This creates yet another layer that cannot be assumed to work – even the password cannot be assumed to be correct.

Wait, exactly who am I talking to here?

In a multicore scenario, you need to start by making sure the master is under control. It can be useful to have a debug tool that can create multiple “virtual connections” to different parts of the chip. In this manner, you can start with a connection to the master and confirm that it’s working. Once you’ve got that under your belt, then, one core at a time, you can wake up the other cores and see if they’re working.

This provides better control over exactly which core is doing what at any given time. It can all be done through the single JTAG port: the debug tool can direct its activities towards any of the cores through the magic of the JTAG “bypass” instruction, allowing more specific targeting of different circuits while the other circuits remain unaffected.
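Here is a simplified model of why bypass lets one port address one core. In the JTAG standard, a device in BYPASS places a single-bit register in the scan path, so the tool only has to pad its data with one bit per bypassed device. The sketch below ignores the TAP state machine and the instruction-register scan that selects BYPASS in the first place; the `Tap` class, the 8-bit register width, and the core names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(eq=False)
class Tap:
    name: str
    dr_width: int           # width of the data register used when not bypassed
    dr_value: int = 0
    bypassed: bool = True

    def bits(self):
        """Bits this TAP puts in the scan path, TDI side first."""
        if self.bypassed:
            return [0]      # the one-bit bypass register captures 0
        # MSB nearest TDI, LSB nearest TDO, so the LSB shifts out first
        return [(self.dr_value >> i) & 1 for i in reversed(range(self.dr_width))]

    def load(self, bits):
        if not self.bypassed:   # a bypassed TAP's data register is left alone
            self.dr_value = int("".join(map(str, bits)), 2)

def scan_dr(chain, tdi_bits):
    """Shift tdi_bits through the whole chain; return the bits seen on TDO."""
    path = [b for tap in chain for b in tap.bits()]
    tdo = []
    for bit in tdi_bits:
        tdo.append(path.pop())   # the bit nearest TDO leaves the chain
        path.insert(0, bit)      # the new bit enters at the TDI end
    pos = 0
    for tap in chain:            # hand the shifted-in bits back to each TAP
        n = 1 if tap.bypassed else tap.dr_width
        tap.load(path[pos:pos + n])
        pos += n
    return tdo

def write_target(chain, target, value):
    """Write value into one TAP's data register with every other TAP bypassed."""
    for tap in chain:
        tap.bypassed = tap is not target
    before = chain.index(target)           # bypass bits between TDI and target
    after = len(chain) - before - 1        # bypass bits between target and TDO
    value_bits = [(value >> i) & 1 for i in range(target.dr_width)]  # LSB first
    return scan_dr(chain, [0] * after + value_bits + [0] * before)

chain = [Tap("core0", 8, dr_value=0x11),
         Tap("core1", 8, dr_value=0xA5),
         Tap("core2", 8, dr_value=0x22)]
tdo = write_target(chain, chain[1], 0x3C)
print([hex(t.dr_value) for t in chain], tdo)
```

Only core1's register changes; the other two see a single padding bit pass through and are otherwise untouched. Note how much the tool has to know about the chain (order, register widths) for the padding to land correctly, which is exactly the kind of assumption that can be wrong at bring-up.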

Things get messier when you have a coherent cluster of cores. You now have memory, two levels of cache – including an i-cache and a data cache (d-cache) – and possibly even a cache at the debug tool. That’s five possible different places where a single piece of data could reside: memory, the second-level cache, the i-cache, the d-cache, and the debugger’s cache. If everything’s working, then whatever coherency mechanism is used will ensure that the data is either correct everywhere or is correctly marked as stale when not yet updated. At this early stage of bring-up, that’s not an assumption that can be made.

Debuggers read and write using the data side of the cache, even when storing instructions. So let’s say the debugger replaces an instruction in memory that’s being used by two cores. If the cache is changed on one core, does it write through immediately to memory? Now you execute and it doesn’t work. What were you reading? The d-cache you changed? Memory where it may or may not have been updated? The other potentially stale d-cache? The debugger cache? And even if those all look right, the instruction is being processed for execution through the i-cache. Perhaps there’s a problem there. Even if not, which version of the instruction did the i-cache pick up: the modified one or the old one?
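A toy model makes the "which copy am I looking at?" question concrete. It assumes two cores sharing one memory, each with a separate i-cache and d-cache and no working coherency between them, and it leaves out the second-level cache and any debugger-side cache. The `ToyCore` class, the address, and the instruction names are invented for illustration.

```python
class ToyCore:
    """One core with separate, non-coherent i-cache and d-cache over shared memory."""

    def __init__(self, memory):
        self.memory = memory
        self.icache = {}
        self.dcache = {}

    def fetch(self, addr):
        """Instruction fetch: fills the i-cache from memory on a miss."""
        if addr not in self.icache:
            self.icache[addr] = self.memory[addr]
        return self.icache[addr]

    def debugger_write(self, addr, value, write_through=False):
        """Debugger store: goes through the data side, as described above."""
        self.dcache[addr] = value
        if write_through:
            self.memory[addr] = value

    def flush_and_invalidate(self, addr):
        """Push the d-cache copy to memory and drop this core's i-cache copy."""
        if addr in self.dcache:
            self.memory[addr] = self.dcache[addr]
        self.icache.pop(addr, None)

memory = {0x100: "old_insn"}
core_a, core_b = ToyCore(memory), ToyCore(memory)
core_a.fetch(0x100)
core_b.fetch(0x100)

def snapshot(addr):
    return {
        "memory": memory[addr],
        "A d-cache": core_a.dcache.get(addr),
        "A i-cache": core_a.icache.get(addr),
        "B i-cache": core_b.icache.get(addr),
    }

core_a.debugger_write(0x100, "new_insn")      # write-back: memory not updated yet
before = snapshot(0x100)
core_a.flush_and_invalidate(0x100)
executed_by_a = core_a.fetch(0x100)
after = snapshot(0x100)
print(before)
print(after, executed_by_a)
```

Even after core A is cleaned up, core B's i-cache still holds the old instruction in this model. Four views of one address, and only some of them agree.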

This is exactly what happened in the instruction-swapping example mentioned above. Everything looked fine; it was only when the instructions were loaded into i-cache (not d-cache) that they got flipped. The only way to detect this was to trace the i-cache itself, comparing it with memory. If that won’t turn your hair gray, well, it probably means you have no hair left.
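The comparison itself is simple once you have both dumps in hand; getting the i-cache dump is the hard part. As a minimal sketch, assuming you already have a line's worth of words from memory and the same line as traced from the i-cache, a check for the pairwise swap might look like this. The function names and the placeholder word values are made up.

```python
def swap_pairs(words):
    """Return words with each adjacent pair exchanged: [a, b, c, d] -> [b, a, d, c]."""
    out = list(words)
    for i in range(0, len(out) - 1, 2):
        out[i], out[i + 1] = out[i + 1], out[i]
    return out

def classify(memory_words, icache_words):
    """Compare an i-cache line against memory and name the kind of difference."""
    if icache_words == memory_words:
        return "match"
    if icache_words == swap_pairs(memory_words):
        return "pairwise swapped"
    return "mismatch"

# Placeholder values standing in for four instruction words.
mem_line = [0x11111111, 0x22222222, 0x33333333, 0x44444444]
icache_line = [0x22222222, 0x11111111, 0x44444444, 0x33333333]
print(classify(mem_line, icache_line))
```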

So what do I do?

These examples but hint at the complexity that can arise when trying to piece together the whys and wherefores of what’s happening during the first nascent clock cycles. But assuming no really dorky preventable cause of a problem is found, what are the main sources of these pitfalls?

One is simply the fact that some things can’t be validated prior to silicon. Emulation is a common way to exercise complex chips, but, as an example, depending on the emulator, complex multiple clock domains might not be directly implemented. They may be created on the emulator as integer multiples of some other clock, which approximates the real-life ratio but doesn’t exhibit the kind of mutual drift and other anomalies that can exist for real unrelated clocks. Interactions between those clock domains therefore remain a potential breeding ground for post-silicon bugs.
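To see why an integer-ratio clock hides things, consider where each edge of a slow clock falls within the period of a fast one. The sketch below uses exact fractions and made-up periods: a slow clock at exactly 4 times the fast period, and one that is off by 250 parts per million. Neither figure comes from a real emulator or chip.

```python
from fractions import Fraction

def edge_phases(slow_period, fast_period, n_edges):
    """Distinct phases (as a fraction of the fast period) at which slow-clock edges land."""
    return {(k * slow_period / fast_period) % 1 for k in range(n_edges)}

fast = Fraction(1)
emulated = edge_phases(Fraction(4), fast, 1000)            # exact integer multiple
drifting = edge_phases(Fraction(4001, 1000), fast, 1000)   # 250 ppm off

print(len(emulated), len(drifting))
```

In the integer-ratio case every edge lands at the same phase, so a domain crossing only ever sees one timing relationship. In the drifting case the edges sweep through many different phases, including whichever one happens to break a clumsy crossing.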

But emulation failure due to the inability to reproduce a real scenario is less common than human frailty. A more common cause of unnecessary debug pain is the failure to plan. Debug scenarios should be thought through ahead of time, during system design, to ensure that problems can be addressed both right after first silicon and in the field after the system is deployed. Including the debug circuitry in the verification process might seem overly cautious until you discover that there was a problem with your debug mode access scheme.

A bigger, broader cause of problems is failure to communicate. Hardware designers put features into silicon; software engineers write and debug software using those features. A prototype board may use jumper or resistor settings to enable debug. There may be a debug password. Reset may defer to the coherency manager. The various connections, instructions, debug features, and other elements may or may not be clearly documented.

And even if they are documented, it may be in the dreaded form of a reference manual, where each instruction or feature is laboriously detailed with no context whatsoever that would tell you when or how to put the feature to use. Good, solid communication is critical during design and after first silicon. The MIPS guys have, on occasion, had to introduce the software guys to the hardware guys, hopefully early in the project, to establish a line of communication that can head off future problems.

Finally, if we were to summarize the ultimate root cause of every possible crepuscular nightmare in one word it would be: assumptions. Under normal circumstances, you can assume Maxwell’s equations, you can assume the sun will rise tomorrow, and you can assume that malloc will correctly return a nice chunk of heap memory. But these aren’t normal circumstances. There is no heap yet; the sun has not yet coalesced out of the primordial celestial gasses, and the forces governed by Maxwell haven’t even yet been differentiated.

Scott McCoy proposed a succinct way of heading off trouble: “Every time you make an assumption, write it down. Because one of them is wrong.”