
Another Excuse Gone

Synopsys Says You Don't Have to Wait for Software

Nobody wants to wait these days.

Unless it’s a surprise moment in a thrilling movie, we want it now.

Didn’t you get my tweet? You haven’t re-tweeted yet? What? You’re busy on your design project and can’t be interrupted? Dude! That’s so 20th century! Look at meeee!

Um, hello… I’m your customer; I pay your salary, OK? So? Did you see my email? No? Why? Sleeping? No excuse, even if it is 2 AM in your time zone. There is no forgiveness for not immediately attending to my needs. Now.

New design? We need it now. The market window is closing as you dawdle.

What? Can’t test the software yet because there’s no hardware? Well, now you can with a virtual platform and an emulator.

What? Can’t test the hardware architecture yet because there’s no software? Well… um… someone get me that damn software! NOW!

Or… there is a new option that proposes to let you evaluate your multicore architecture before you have any software. If you could completely over-design your system, you wouldn’t have to worry about this. But margins rarely allow significant over-design, and under-design is summarily punished by the market. So you have to thread the needle, and you have to do it without much reliable information.

To be clear, almost no SoC project starts with zero software. You probably have a rich supply of legacy and open-source code planned. But that’s just the start; there’s lots more that hasn’t been written yet.

This isn’t so much of an issue on a standard, run-of-the-mill single-core system. But when you start loading up multiple cores, their interaction can have a huge impact on performance. So you want to create an architecture that will be effective for the actual software that will run, using the actual data it’s likely to see when deployed.

This tuning can work both ways: once an architecture is fixed, the challenge becomes configuring the software to run efficiently on that platform through effective partitioning and mapping. But before the architecture is settled, you can play with both hardware and software. Except that there’s lots of missing software that you have to fake. And how you do that isn’t so obvious.

Which is where Synopsys comes in with their just-announced Platform Architect with Multicore Optimization Technology. Kind of like Rufus with Chaka Khan. Except… kind of not.

The idea here is that you can build a “task graph” to model a program as a series of tasks, build an abstract version of the computing architecture, and then model the program on the architecture using SystemC. It’s done at the transaction level using abstract processors called Virtual Processing Units (VPUs).
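To make the task-graph idea concrete, here’s a toy sketch in Python. The names and structure are mine, not the actual Synopsys API; the point is just that a program becomes a set of tasks plus dependency edges, and the dependencies determine what’s allowed to run at any moment:

```python
# Hypothetical task graph: each task lists the tasks it depends on.
# These names and fields are illustrative, not the Synopsys API.
task_graph = {
    "capture":  [],                  # no dependencies; can start immediately
    "filter":   ["capture"],         # waits for capture's output
    "encode":   ["filter"],
    "transmit": ["encode"],
    "log":      ["capture"],         # can run in parallel with filter
}

def ready_tasks(graph, done):
    """Return tasks whose dependencies have all completed."""
    return [t for t, deps in graph.items()
            if t not in done and all(d in done for d in deps)]

print(ready_tasks(task_graph, set()))        # ['capture']
print(ready_tasks(task_graph, {"capture"}))  # ['filter', 'log']
```

Note that once "capture" finishes, two tasks become ready at once, which is exactly the kind of parallelism you want the architecture model to expose.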

It’s all put together via a new SystemC API layer. This API incorporates a number of abstract notions required for modeling multicore systems. Obvious ones are tasks and VPUs, but it also includes interfaces and communication channels. Synopsys provides libraries of parameterized pre-built entities to get things going quickly, accompanied by a facility for creating new entities if what you want isn’t in a library.

The architecture and task graph are each built graphically using a drag-and-drop methodology. Connections between tasks imply dependencies and communication. Tasks are then mapped to VPUs. When simulation starts, the API controls a “task manager,” which coordinates the modeled execution of the program. Tasks are launched on their assigned VPUs and then communicate with other tasks as needed. You can then observe how the architecture performs when running the tasks.
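The task-manager loop described above can be caricatured in a few lines of Python. This is purely conceptual (the real tool does this in SystemC at the transaction level, with timing); the graph, mapping, and round-based execution here are all assumptions for illustration:

```python
# Toy task manager: tasks run once their dependencies complete.
# Each "round" models a batch of tasks executing in parallel on their
# assigned VPUs. Illustrative only -- not the Synopsys API.
graph = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
mapping = {"a": "vpu0", "b": "vpu0", "c": "vpu1", "d": "vpu1"}

done, rounds = set(), []
while len(done) < len(graph):
    ready = [t for t in graph if t not in done and set(graph[t]) <= done]
    rounds.append(sorted((mapping[t], t) for t in ready))
    done |= set(ready)

for i, r in enumerate(rounds):
    print(f"round {i}: {r}")
# round 0: [('vpu0', 'a')]
# round 1: [('vpu0', 'b'), ('vpu1', 'c')]
# round 2: [('vpu1', 'd')]
```

Even this toy shows the payoff of the real thing: in round 1, both VPUs are busy; in rounds 0 and 2, one sits idle, and that kind of imbalance is what you’d go tune.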

Of course, this is all abstracted. You’re not modeling real processors. And you’re not modeling real tasks. You’re essentially building placeholders for both. These placeholders have some key attributes that you can set to tune them to some degree.

In order to get the VPU to mimic a specific processor more closely, you can set up a number of hardware- and software-related characteristics. For example:

  • You can attach a clock generator that supports frequency scaling. This allows performance estimates to be based on the actual clock period, including dynamic changes in clock rate (if there are any).

  • You can configure any number of bus ports for instruction fetches and memory accesses (as initiators) or to allow external access to internal VPU memory (as targets).

  • You can configure any number of interrupt ports.

  • You can define traffic generators, which are key for modeling the real-world data that will heavily affect how your system runs.

  • You can add level-1 caches, specifying such characteristics as line size and miss rate.

  • You can specify the task scheduling algorithm for each VPU, including pre-emption and time-slicing.

  • You can add platform-independent interrupt servicing routines.

  • You can add platform-independent “drivers” that allow the VPU to communicate with other entities (VPUs, memories, etc.). You can select from libraries of canned drivers, or you can write your own (or customize one from a library). The tool can infer communication between VPUs based on the mapping, but the efficiency can be improved by manual tuning.
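To see why a setting like cache miss rate matters to the performance estimate, here’s a back-of-envelope calculation (my own sketch, with made-up numbers; the field names are assumptions, not the tool’s parameters):

```python
# Rough average-memory-access-time model: hit cost plus the expected
# miss penalty. Numbers and names are illustrative assumptions.
def avg_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average cycles per memory access given an L1 miss rate."""
    return hit_cycles + miss_rate * miss_penalty_cycles

# A 5% miss rate with a 40-cycle penalty adds 2 cycles per access
# on average, tripling the 1-cycle hit cost.
print(avg_access_cycles(hit_cycles=1, miss_rate=0.05, miss_penalty_cycles=40))  # 3.0
```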

For the generic tasks in the library, you can set such features as the following:

  • You can give each task a priority, which is used by the scheduling algorithm.

  • You can group tasks into jobs (with each task getting its associated job ID). That way you can issue a single command to start a job, which may end up starting more than one task at the same time.

  • You can estimate how long it takes the task to complete, assuming all memory fetches and other communication take no time (they’re handled separately). Presumably you’ll want to do this in terms of instructions or cycles (or else it would nullify the clock rate information you already entered). Hopefully you have some other implementations or references from which you can derive this estimate.

  • You can specify the probability of instruction fetches and memory loads and stores. The simulator can calculate bus and memory delays and then add the cumulative delay to the processing delay you specified.

  • You can set up the memory address ranges for data and instructions so that the simulation can model access to the appropriate memories.
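Putting the last few task attributes together, a task’s total delay is roughly its pure processing time plus the expected time spent on memory traffic. Here’s a hypothetical sketch of that composition (assuming, for simplicity, one cycle per instruction; the function and its parameters are mine, not the tool’s):

```python
# Illustrative composition of a task's delay: processing cycles plus
# expected memory-access cycles. Assumes one cycle per instruction and
# that load_store_prob is the fraction of instructions touching memory.
def task_delay_cycles(instructions, load_store_prob, mem_delay_cycles):
    """Total cycles: processing plus expected memory-access overhead."""
    mem_accesses = instructions * load_store_prob
    return instructions + mem_accesses * mem_delay_cycles

# 10,000 instructions, 20% of them memory ops, 5 cycles per access:
# the memory traffic doubles the task's delay.
print(task_delay_cycles(10_000, 0.20, 5))  # 20000.0
```

Dividing the result by the VPU’s clock frequency then turns cycles into wall-clock time, which is where the clock-generator setting from earlier comes back into play.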

If you don’t like the generic task in the pre-defined library, you can always define your own, building in any parameters you need.

Once you’ve built all of your models and start simulation, you can view both task-related items, like scheduling, and architecture-related items, like processor loading. You can then go in and refine the settings on the models – or even change the entire platform topology – until you get the performance and efficiency you want.

So… if all goes according to plan, this kills yet another excuse for why you can’t pull in your project schedule by an even more absurd amount.

Putting words in Synopsys’s mouth, “You’re welcome.”


More info: Platform Architect

