The 1980s witnessed a “golden age” of computer architecture. While the commercially successful x86 architecture continued to evolve at the microarchitecture level, completely new architectures and instruction sets innovated rapidly and set the stage for intense competition. RISC concepts were refined and expanded, with great success, in the MIPS, SPARC, POWER, PA-RISC and Alpha architectures (to name a few).
In short, there was a lot of Darwinian action taking place. Interestingly, in retrospect, the vast majority of these architectures focused on workstation CPUs: by and large, they were optimized for compute horsepower, with a focus on integer and floating-point performance. The race was on to build faster and faster compute engines.
Workstations are still with us, but the vast bulk of heavy compute lifting is now performed on server farms. Simulation, place & route, DRC, DFM – as well as such diverse applications as commercial animation – can literally use as much compute power as you can throw at them. These workloads populate racks of servers loaded with the most powerful CPUs available. The compute jobs can easily take tens of hours, so any efficiency gains translate directly into meaningful productivity gains. And this class of compute job wants a workstation CPU: blazing-fast integer and floating-point pipelines coupled with a sophisticated memory hierarchy to keep those pipelines humming.
Build such a high horsepower compute farm and you can credibly get away with the “cloud computing” label. For purposes of the remainder of this series, however, forget all that.
An increasing percentage of cloud computing workloads does not require high horsepower computing. While Facebook consumes an otherworldly quantity of CPU cycles, the workload does not require tons of double-precision floating-point calculations. Indeed, this class of workload is dominated by data movement rather than by sheer computation.
Despite their very different workload characteristics, contemporary cloud computing centers are built around conventional CPU architectures. They tend toward less powerful CPUs in far larger numbers, granted, but that is far from a radical re-thinking. Even the emergence of CPUs based on 64-bit ARM cores can be traced to very conventional architectures, from the internals of the CPU to the memory hierarchy.
I am eagerly awaiting truly novel cloud-compute server architectures: outside-the-box thinking as exemplified in the late 1980s by Danny Hillis’s Connection Machines or the VLIW computers from Cydrome and Multiflow. Well, to be more precise, I am eagerly awaiting truly novel and commercially viable cloud-compute server architectures.
While we are awaiting these breakthrough innovations, let’s design a cloud computing datacenter with contemporary technology, starting with a few observations:
- Contemporary cloud-compute datacenters are massive: check out any number of articles on the facilities recently built by Amazon, Apple, Facebook, Google and Microsoft. The stakes are very high, and even modest degrees of increased efficiency translate into real money.
- All things being equal, these datacenter designers prefer homogeneity over heterogeneity. That is to say that, ideally, the datacenter consists of a colossal number of identical compute nodes. For one thing, the design is simpler and the result more flexible. For another thing, heterogeneity is almost certainly going to give the CIO and CFO heartburn and/or insomnia as they worry about having too many underutilized Node Xs and not enough overloaded Node Ys.
- Virtualization is critical to efficiency as defined on any number of vectors: management, OpEx, CapEx, and flexibility to name a few – to say nothing of presenting a usable programming model for developers.
Now let’s zoom from the multiple-football-field scale down to the compute node. For the reasons posited above and for simplicity’s sake, assume an ocean of identical nodes. Each node has processor(s), memory, storage and networking. Clearly these will be multi-core CPUs as [a] contemporary mobile phones have four+ cores (for no adequately explored reason) and [b] we’re not about to build a state-of-the-art cloud-compute datacenter around CPUs with fewer cores than a mobile phone.
Grossly over-simplifying, we can go the “big iron” route with a server-class x86 CPU or the exceedingly trendy route with a low-power ARM or x86 CPU. Again, for simplicity’s sake, assume that down either route we end up with an eight-core CPU (reasonable, given that we are designing today a datacenter to be built in 2015). Now we face a significant decision: the number of processors on each node.
If we choose the “big iron” CPU route, one could argue that power/cooling will dictate a single CPU per node, as a cursory review of recently announced commercially available nodes bears out. If we choose the trendy low-power CPU route, we have greater flexibility: let’s assume we want to maximize compute horsepower per node, and so we land on four low-power CPUs per node, very likely in a shared-memory configuration. To a first order, power consumption will be roughly equal with one “big iron” CPU or four low-power CPUs.
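The rough power parity can be sanity-checked with back-of-envelope arithmetic. The TDP figures below are illustrative assumptions, not vendor specifications:

```python
# Back-of-envelope node power comparison.
# TDP numbers are illustrative assumptions, not vendor specs.
BIG_IRON_TDP_W = 95    # one server-class eight-core x86 CPU (assumed)
LOW_POWER_TDP_W = 25   # one low-power eight-core ARM/x86 CPU (assumed)

big_iron_node = 1 * BIG_IRON_TDP_W     # 8 cores per node
low_power_node = 4 * LOW_POWER_TDP_W   # 32 cores per node

print(f"big-iron node:  {big_iron_node} W for 8 cores")
print(f"low-power node: {low_power_node} W for 32 cores")
```

Under these assumed numbers the two node designs draw roughly the same power, with the low-power route delivering four times the core count (though not, of course, four times the per-core performance).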
At last, we move beyond the “catalog selection process” to a genuinely vexing challenge: each of our nodes is an efficient compute engine, but what if some of the applications running on our cloud will require more than one node’s compute horsepower? Some workloads benefit from coarse-grained multi-threading and will run efficiently across multiple nodes; other computationally complex workloads will run well inside an eight-core CPU but suffer tremendous inefficiencies across multiple nodes. You could certainly tightly couple more CPUs on each node: 4 server-class x86 CPUs or 16 low-power ARM/x86 CPUs, but, trust me here, the team tasked with powering and cooling our datacenter will go ballistic before you step away from the whiteboard.
Ideally, we want the ability to combine CPUs on multiple nodes into a “super-node” as the application workload demands. The virtualization folks will quip that they’ve got this done-and-dusted, but one needs to remember they’re software people: they will almost certainly brush aside the significant inefficiencies incurred in crossing nodes. What we need is a hardware-assisted mechanism that enables construction of super-nodes that (nearly) perform as if they were purpose-built nodes with a sea of compute power.
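To make the super-node idea concrete, here is a minimal sketch of how a virtualization layer might carve a pool of identical nodes into super-nodes sized to each workload. All of the names and the allocation policy are my own assumptions for illustration, not any real scheduler's API:

```python
# Hypothetical sketch: a pool of identical nodes, carved on demand into
# "super-nodes" sized to each workload. Names/policy are illustrative.
from dataclasses import dataclass, field

@dataclass
class SuperNode:
    job: str
    nodes: list = field(default_factory=list)

class NodePool:
    def __init__(self, node_ids):
        self.free = list(node_ids)

    def allocate(self, job, nodes_needed):
        """Bind `nodes_needed` free nodes into one super-node, or None."""
        if nodes_needed > len(self.free):
            return None  # insufficient capacity; queue or reject the job
        return SuperNode(job, [self.free.pop() for _ in range(nodes_needed)])

    def release(self, sn):
        """Return a super-node's nodes to the free pool."""
        self.free.extend(sn.nodes)

pool = NodePool(range(16))
render = pool.allocate("render-farm", 4)   # a 4-node super-node
web = pool.allocate("web-frontend", 1)     # an ordinary single node
print(len(pool.free))                       # 11 nodes remain free
```

The hard part, of course, is not the bookkeeping above but the hardware-assisted interconnect that would let those four nodes perform as one; that is precisely the gap this series is probing.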
Interestingly enough, this was a VERY hot area of innovation in the late 1980s. The motivation was different: multi-core CPUs were still on the horizon, and the objective was combining multiple single-core CPUs into something that looked and performed like a single multi-core CPU. Wondering what stopped these very hot innovations dead in their tracks? It was the surprisingly early arrival of surprisingly cost-effective multi-core CPUs.
It’s taken me all these paragraphs to lay the groundwork for the aforementioned thought-provoking idea: might some of the concepts demonstrated in the Computing Surface (an awesome name, quite unfortunately inversely proportional to its commercial success) built upon the Inmos Transputer be applicable to our scalable/flexible cloud-computing challenge?
Take the time to review the Wikipedia articles on the Transputer and the Computing Surface. In short, the Transputer was a blank-sheet-of-paper CPU architecture built for scalable/flexible computing. These CPUs could work in tandem as a single processor complex, across completely separate functions within a computer, and indeed across completely separate computers. Each processor was equipped with four high-speed serial communication links for constructing a mesh with other processors, decades before the term ‘SerDes’ was widely used. The architecture was designed around virtualization at a time when that term applied only to mainframes (remember VM/CMS?) and some minicomputers.
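The Transputer's programming model – independent processes talking over point-to-point channels – maps surprisingly well onto modern thread-and-queue primitives. The sketch below is a loose Python analogue of one Transputer link connecting two processes, not actual Transputer or Occam code:

```python
# Loose analogue of a Transputer link: two "processes" (threads here)
# connected by a point-to-point channel, echoing Occam's blocking
# channel send/receive. Illustrative only, not Occam semantics exactly.
import threading
import queue

link = queue.Queue(maxsize=1)  # rendezvous-like: one message in flight

def producer():
    for n in range(3):
        link.put(n * n)        # blocks while the link is full
    link.put(None)             # end-of-stream marker

results = []

def consumer():
    while (msg := link.get()) is not None:  # blocks until a message arrives
        results.append(msg)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 1, 4]
```

Scale that picture up – four links per processor, meshes of thousands of processors – and you have the essence of the Computing Surface.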
The Transputer suffered from a litany of insurmountable challenges, not least of which: [a] Inmos was initially funded by the British Government, and [b] the processor could be programmed only in an extraordinarily esoteric language called Occam (seriously, I wish I could make up stuff that good).
I am not suggesting that a Transputer-like architecture instantly solves our cloud-compute datacenter challenges. I strongly believe, however, that we should re-visit some of these ground-breaking parallel computing concepts and how they could apply to scalable/flexible cloud computing. Imagine a massive computing surface of identical nodes—dynamically managed by a virtualization layer—optimally allocated as single- and super-nodes based on the collection of workloads running on the cloud.
While the arrival of multi-core CPUs all but destroyed the value proposition of these pioneering processing architectures, we now find ourselves faced with a similar set of challenges in the context of cloud computing. And viable solutions may have their roots in tested though dusty architectures.
Among the ideas, observations and opinions above, I mused that “contemporary mobile phones have four+ cores for no adequately explored reason.” Though it sounds tongue-in-cheek, that statement has a reasonable degree of validity. I see plenty of motivation today for dual-core CPUs (real, responsive multi-tasking) but little for more cores. Well, other than gaming – certainly enough justification for a large class of users.
Absolutely no disrespect to mobile gamers or the developers making serious coin, but let’s set gaming aside and focus on performing real work. Back to my soapbox – when it comes to doing real work on a mobile device today, there is precious little justification for more than a dual-core CPU. Aside from the marketing value, of course, for which I have deep appreciation. This is more than just an opinion.
Exhibit A: the latest mobile phones and tablets have at most 2GB of DRAM, and the best-known brand tops out at 1GB of DRAM. While it is possible for applications to pull data and program code from flash memory, flash is far slower than DRAM.
Given that the motivation for more than two CPU cores is greater performance, running from flash memory runs counter to that goal. Performance is all about feeding the cores a constant stream of instructions and data; delays in doing so kill performance. Hence the need for fast DRAM – and, even with a strong cache hierarchy, a quad-core CPU wants more than 1GB of DRAM to keep all those cores happily fed. Adding DRAM has drawbacks in a mobile device – cost and space – hence the dearth of DRAM in state-of-the-art phones and tablets.
It is interesting to note that mobile manufacturers very rarely tout the amount of DRAM; indeed, I relied on teardowns to get the numbers cited above. I suspect this has a lot to do with the reality that your phone has a fraction of the memory of a low-end notebook (virtually all of which come with 4GB, and many with 8GB). Now let’s face it: your phone isn’t exactly crashing every day due to a lack of DRAM, so where is the discontinuity here?
The discontinuity is that today’s apps (minus the awesome games) aren’t doing very much, computationally speaking, and they do not need much DRAM. Here’s an extreme example: if you could run Photoshop on your quad-core tablet, trust me – you would want an awful lot more than 1GB of RAM.
Exhibit B: the overwhelming majority of today’s mobile apps falls into one of two programming models: standalone or client-server. The former run entirely (or almost entirely) within your mobile device; the latter run a front-end on your mobile device and communicate with a datacenter for the bulk of the computation and data. This article started with cloud computing, so we will focus on applications that involve both your mobile and a datacenter.
Client-server is a decades-old, tried-and-true programming model – and this is where the plot thickens (insert “thin client” pun here): the mobile app is a lightweight front-end for the “real work” performed on the server. The locally running client app is super-responsive, so entering data is fully interactive. No waiting for communication to-and-from the server that would make the user experience painful. (Remember that website that drove you nuts yesterday on your mobile browser?) Click ‘go’ and the server does the heavy lifting (searching, calculating, formatting and generating output), and then it ships a relatively modest amount of data back to the locally running client app for an equally responsive experience interacting with the result.
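That division of labor can be sketched in a few lines. The example below compresses the model into two functions – in a real app the boundary would be HTTP and a datacenter, and every name here is a hypothetical stand-in:

```python
# Minimal sketch of the client-server split: the client sends a small
# request, the server does the heavy lifting and ships back a modest
# result. Function names and the JSON shape are illustrative assumptions.
import json

def server_handle(request_json):
    """Server side: the heavy lifting (search, compute, format)."""
    req = json.loads(request_json)
    hits = [n for n in range(req["limit"]) if n % req["divisor"] == 0]
    # Ship back only a modest summary, never the raw working set.
    return json.dumps({"count": len(hits), "first": hits[:5]})

def client_query(divisor, limit=1_000_000):
    """Client side: a lightweight front-end that builds the request
    and renders the reply -- little compute power required."""
    request = json.dumps({"divisor": divisor, "limit": limit})
    reply = json.loads(server_handle(request))
    return f"{reply['count']} matches, starting with {reply['first']}"

print(client_query(7))
```

Note where the cycles go: the client serializes a few dozen bytes and formats a short string, while the server churns through a million candidates. That asymmetry is the whole point of the model.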
So client-server is a good division of labor between your mobile device and the datacenter. Miss the plot thickening? “The mobile app is a LIGHTWEIGHT front-end.” That is a sweeping generalization and, granted, I have not profiled thousands of mobile apps, but I’ll wager that the vast majority of today’s mobile apps built on a client-server model require very little compute power. Which brings me back to my original rhetorical question: why put more than a 1+ GHz dual-core CPU in a phone or tablet? Completely untested assertion: in today’s four- (soon to be eight-) core mobile CPUs, all but two of the cores are effectively unused.
This is not intended as an indictment against quad- (soon to be octal-) core mobile CPUs. I’ve laid the groundwork to address my cavalier comment that “contemporary mobile phones have four+ cores for no adequately explored reason,” and I can think of some excellent ways to put all that compute horsepower to use. I leave you with two key thoughts:
- We should re-visit some of the ground-breaking parallel computing concepts and how they could apply to scalable/flexible cloud computing
- Contemporary mobile CPUs have a tremendous amount of underutilized compute horsepower
I will tie those two thoughts together in part two of this article and present an out-of-the-box idea for the ultimate computing surface.
About the Author:
Bruce Kleinman is a senior technology/business executive and principal at FSVadvisors and blogs on fromsiliconvalley.com