
Itanium Deathwatch Finally Over

Intel’s Itanium Receives its Official Death Warrant

“You miss 100% of the shots you don’t take.” – Wayne Gretzky

It’s not as though we didn’t see this coming, but it’s still a bittersweet moment. Intel’s Itanium, which has been on life support for years, was just given its official end-of-life (EOL) papers. Go ahead and engrave the date of January 30, 2020 on its massive silicon tombstone.

It’s easy with hindsight to poke fun at Intel and Hewlett Packard for creating Itanium in the first place. Armchair quarterbacks from every quarter (myself included) called it the “Itanic” and gleefully reported every hit below the waterline leading to its slow-motion sinking.

But you know what? I don’t want to do that. Sure, Itanium was an enormously expensive and embarrassing failure for both companies, but it was also a big, hairy, audacious endeavor. We need more of those. Good for Intel. Good for Hewlett Packard (now HPE). Bummer that your moonshot project didn’t reach its goal. But it didn’t blow up on the launchpad, either, and Intel, HPE, and we all learned something in the process. As the motivational posters say, you’re not failing so long as you’re learning.

Itanium was created for a lot of reasons, chief among them a desire to leapfrog the x86 in performance and sophistication. Even back in the era of Napster and LiveJournal, the x86 architecture was looking mighty tired. Intel and HP both needed something better to replace it.

Fortunately, there was no shortage of better ideas. RISC, VLIW, massive pipelining, speculative execution, out-of-order dispatch, compiler optimization, big register sets, data preloading, shadow registers, commit buffers, multilevel caches, and all the other tricks of the CPU trade were surfacing around the same time. Pick any three and create your own CPU! It’ll be faster, cooler, and more academically stimulating than anything else out there. In the 1990s, it was hard not to design a new CPU architecture.

One underlying philosophy behind Itanium (and many other CPU children of the ’90s) was that software is smarter than hardware. Seems simple enough. Have you ever seen inside the branch-prediction logic of a modern CPU? We throw hundreds, then thousands, then millions of transistors at the task of flipping a coin. Will this branch be taken or not taken? Circuitry has just a few nanoseconds to decide.

How much simpler it would be to shift that task to the software. Compared to complex hardware in the critical path, a compiler has an infinite amount of time to deliberate. Compilers can see the whole program at once, not just a tiny runtime window. Compilers can take hints provided by the programmer. Compilers and analysis tools can model program flow and locate bottlenecks. And best of all, compilers are easier than hardware to change, improve, and update.
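That hint-taking isn't hypothetical, by the way. GCC and Clang expose __builtin_expect(), which lets the programmer tell the compiler which way a branch usually goes, so the hot path can be laid out as straight-line, fall-through code. A minimal sketch (the likely/unlikely macros and the sum_valid function are my own illustration, not from any particular codebase):

```c
/* likely()/unlikely() wrap GCC/Clang's __builtin_expect, a
   programmer-supplied branch hint to the compiler. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

long sum_valid(const long *v, long n)
{
    long total = 0;
    for (long i = 0; i < n; i++) {
        if (unlikely(v[i] < 0))  /* rare case: compiler can move this
                                    path out of the hot loop body */
            continue;
        total += v[i];
    }
    return total;
}
```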

The same theology applies to parallelism. Hardware struggles to eke out a bit of parallelism where it can. But software can see the whole picture. Software can schedule loads, stores, arithmetic operations, branches, and the whole gamut of instructions for optimal performance. Software can sidestep bottlenecks before they even happen. It’s ludicrous to force runtime hardware to thread that needle when the compiler can do so at its leisure.
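To make that concrete, here's the kind of transformation a scheduler (human or compiler) performs: splitting one serial dependency chain into several independent ones, so a wide machine has work to issue every cycle. The dot4 function below is my own toy example, not Itanium code:

```c
/* One accumulator = one long dependency chain; four accumulators
   = four independent chains the hardware can execute in parallel. */
double dot4(const double *a, const double *b, long n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];      /* these four multiply-adds   */
        s1 += a[i + 1] * b[i + 1];  /* don't depend on each other */
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)              /* mop up the leftovers */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```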

Yup, that settles it: we’re doing our optimization in software from now on. Pull that stuff out of the hardware’s critical path and fire up the compiler tools. Let me know when that new optimizing compiler is ready.

We’re still waiting.

And therein lies the problem. The compilers for Itanium never got good enough to deliver the leap in performance that we all just knew was there for the taking. C’mon, where’s my factor-of-ten performance jump?

It’s not coming from the compilers, that’s for sure, nor is it lurking in VLIW, EPIC, or superscalar hardware tricks. Yes, compilers have all the time in the world (compared to runtime hardware) to tease out hazards like data dependencies, load/use penalties, branch probabilities, and other details. And yes, the compiler can see more of the program than hardware can. The compiler does know more than the hardware, but it doesn’t know much more.

Some things are unknowable, and compilers aren’t omniscient. Even with the entire program to analyze, most branch prediction comes down to an educated coin toss. Even with all the source code at its disposal, finding parallelism is tricky beyond a small window of instructions, and the hardware can already do that.

The plan to throw the tough problems at the compiler guys was doomed from the start. Itanium’s compiler writers weren’t slacking off or underperforming. They didn’t need just a little more time, or just one more release. They were saddled with impossibly high expectations.  

Turns out, that convoluted branch-prediction hardware was already doing about as good a job as it’s possible to do. Sure, you can shift that task from hardware to software, but that’s just an implementation tradeoff. There’s no big gain to be had.

Same goes for wide and deep register sets, or bigger caches, or wide instruction words. Itanium bundled instructions and executed them in parallel where possible, through a combination of compiler directives and runtime hardware. Surely the software-directed parallelism will yield big results? Nope. Itanium’s compilers can format beautifully dense and efficient instruction blocks – but only if the program lends itself to such solutions. Apparently, few real-world programs do.
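For the curious, an IA-64 “bundle” is 128 bits: a 5-bit template field telling the hardware which execution-unit types the three slots target, plus three 41-bit instruction slots. A rough sketch of picking a bundle apart in C (the struct and helper names are mine; the bit positions follow the published IA-64 encoding):

```c
#include <stdint.h>

/* A 128-bit IA-64 bundle, stored as two 64-bit words (lo = bits 0-63). */
typedef struct { uint64_t lo, hi; } ia64_bundle;

/* Extract `len` bits starting at bit `start` of the 128-bit value. */
static uint64_t bundle_bits(ia64_bundle b, int start, int len)
{
    uint64_t v;
    if (start >= 64)
        v = b.hi >> (start - 64);
    else if (start + len <= 64)
        v = b.lo >> start;
    else  /* field straddles the 64-bit boundary */
        v = (b.lo >> start) | (b.hi << (64 - start));
    return v & ((1ULL << len) - 1);
}

/* Template: bits 0-4; slot 0: bits 5-45; slot 1: 46-86; slot 2: 87-127. */
#define BUNDLE_TEMPLATE(b) bundle_bits((b), 0, 5)
#define BUNDLE_SLOT0(b)    bundle_bits((b), 5, 41)
#define BUNDLE_SLOT1(b)    bundle_bits((b), 46, 41)
#define BUNDLE_SLOT2(b)    bundle_bits((b), 87, 41)
```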

Data dependencies and load/use penalties are just as hard to predict in software as they are in hardware. Will the next instruction use the data from that previous one? Dunno; depends on the value, which isn’t known until runtime. Can the CPU “hoist” the load from memory to save time? Dunno; it depends on where the data is stored. Some things aren’t knowable until runtime, where hardware knows more than even the smartest compiler.
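Pointer aliasing is the classic case. In the first function below, the compiler must reload *src on every iteration, because a store through dst might change it; the restrict-qualified version hands the compiler a guarantee that no amount of static analysis could prove on its own. Both functions are my own toy illustration:

```c
/* Cannot hoist: dst and src might point into the same array,
   so each store through dst may invalidate a cached *src. */
void scale(int *dst, const int *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] += *src;          /* reloaded every iteration */
}

/* restrict promises no aliasing, so the load can be hoisted --
   the programmer supplies what the compiler cannot deduce. */
void scale_restrict(int * restrict dst, const int * restrict src, int n)
{
    int s = *src;                /* loaded once, outside the loop */
    for (int i = 0; i < n; i++)
        dst[i] += s;
}
```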

Itanium is like a rocket-powered Hot Wheels car running on its orange plastic track. It had (sorry, still has) awesome power, vast resources, and elaborate control systems. It’s just criminally hampered by its environment. It can do cool loops if it gets a running start but is otherwise stuck on its narrow track.

To borrow a baseball analogy, if you never swing the bat, you’ll never hit the ball. Sometimes you strike out. There’s no shame in that. If you don’t, you’re not trying hard enough, and Intel and HPE were certainly trying hard with Itanium. So long, Itanium, and good luck to its creators.

4 thoughts on “Itanium Deathwatch Finally Over”

  1. Still have over a dozen of these dual-core, quad-processor, huge-cache computational servers in our cluster that are a decade+ old. It was worth updating them all to dual-core a few years back, just because for some problems they beat other computational servers in our cluster hands down. Some are Intel-built; most are Dell-built using the same reference design.

    They are hot though, and do spin the power meter.

  2. I remember having to deal with PA-RISC, and I didn’t like it. Someone said it only survived because the big cache gave it enough performance. I think those folks moved on to Itanium and it suffered similar problems – the base architecture was flawed and irrecoverable (according to a compiler-writer friend).

    Hardware guys can do stuff that is pretty cool and should work, but software engineers writing compilers have a different mindset, and if the two aren’t in sync you are out of luck.

    SPARC & Solaris were infinitely preferable to PA-RISC & HPUX, and I never got close to Itanium in any form – and God knows people like to torture me with stupid processors…

    I think the main takeaway is that Intel will keep drinking their Kool-Aid well past its sell-by date and refuse to acknowledge things are failing. Hard to sell them new solutions if they don’t admit they’re in a hole…

    1. Agreed. Although Itanium lasted ~20 years, I think the outcome was clear after the first ten. The rest was a long, slow glide path. It’s tough to decide whether to “fish or cut bait” when there’s that much money on the line.

  3. Well, the good news is there will be some really cheap 9760 HPE NUMA clusters hitting the surplus fire sales soon that will be a lot faster than our 9300s.
