Programming Dark Matter

One of the many charms of the x86 processor architecture is its fantastically complex memory-management unit. First-time programmers fall to their knees, quailing in fear, at the thought of programming a Core i7 chip’s MMU. Grown men cry. Horses weep. Concrete structures crumble.

But like any tool, the MMU can be used for good or for evil. In this case, it’s both at the same time. Security researcher Jacob Torrey, building on the work of many x86 programmers before him, has worked out a way to make x86 code highly hack-resistant by using the on-chip MMU in a fiendishly clever way.

In essence, his proposed technique encrypts code by using the MMU to hide it from prying eyes. Anyone seeking to reverse-engineer your software will see only scrambled data, not the actual instructions. It’s pretty hard to disassemble code that you can’t see.

First, some background on the x86 MMU. (If you’re a black-belt x86 programmer, you can skip ahead.) Like most memory-management units, the MMU found on Intel and AMD processors serves at least two purposes. It converts, or maps, virtual addresses to physical addresses, and it enforces a simple level of access protection on areas of memory so that you don’t accidentally start executing your stack or pushing parameters onto your code.

The first part – memory translation – is pretty straightforward. Translation lets you fake out your programs, telling them that there’s a 1MB block of memory starting at address 0xFC003700, when it’s really located at some other address entirely. That one weird trick allows you to build different hardware systems with different memory configurations, while still running all the same code. The MMU quietly converts every memory address the software asks for into the actual address of the real, physical memory. Piece of cake. Pretty much any MMU can do this.
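For the curious, the translation step can be sketched in a few lines of Python. This is only a toy model: the 4KB page size, the contents of `page_table`, and the `translate` name are assumptions made for illustration, and a real x86 MMU walks multi-level tables in hardware rather than a dictionary.

```python
PAGE_SIZE = 4096  # assumed 4KB pages

# Hypothetical mapping: virtual page number -> physical frame number.
page_table = {
    0xFC003: 0x00120,
    0xFC004: 0x00121,
}


def translate(virtual_address: int) -> int:
    """Return the physical address behind a virtual address."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    if page_number not in page_table:
        raise LookupError(f"no mapping for {virtual_address:#010x}")
    # Swap the page number for the frame number; the offset is unchanged.
    return page_table[page_number] * PAGE_SIZE + offset


print(hex(translate(0xFC003700)))  # prints 0x120700
```

The software asks for 0xFC003700 and never learns that the bytes really live at 0x120700. Only the upper bits of the address change; the offset within the page passes straight through.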

The second part is a bit more x86-specific. You can tell the MMU that certain areas of memory hold either executable code, or stack, or read/write data space, or read-only data (like a ROM). This step is optional, but it can help catch some simple programming errors and also catch runaway code and prevent it from overwriting itself. (There’s a bit more on this topic in our August 14, 2013 issue.) It can be a bit tedious to set up all the necessary area definitions and bit fields, but it’s good programming practice in x86-land.
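A rough Python sketch of that access checking might look like the following. The region boundaries, the `Access` flags, and the `check_access` function are all hypothetical; the real hardware encodes this in descriptor and page-table bit fields, not in a list of tuples.

```python
from enum import Flag, auto


class Access(Flag):
    READ = auto()
    WRITE = auto()
    EXECUTE = auto()


# Hypothetical memory map: (start, end_exclusive, permitted accesses).
REGIONS = [
    (0x0000, 0x4000, Access.READ | Access.EXECUTE),  # executable code
    (0x4000, 0x8000, Access.READ | Access.WRITE),    # read/write data
    (0x8000, 0x9000, Access.READ | Access.WRITE),    # stack
    (0x9000, 0xA000, Access.READ),                   # read-only data
]


def check_access(address: int, wanted: Access) -> None:
    """Raise PermissionError unless `wanted` is allowed at `address`."""
    for start, end, allowed in REGIONS:
        if start <= address < end:
            if (wanted & allowed) != wanted:
                raise PermissionError(f"{wanted} denied at {address:#06x}")
            return
    raise PermissionError(f"{address:#06x} is outside every defined region")
```

With this map, `check_access(0x0100, Access.EXECUTE)` passes quietly, while `check_access(0x0100, Access.WRITE)` raises an error, which is the model's version of catching runaway code before it overwrites itself.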

So how does Torrey’s security trick work? To start, he defines the same area of memory twice, once telling the MMU that it’s executable code and then again as data space. That’s known as “aliasing” and is about the only way you can get an x86 processor to do self-modifying code. (Otherwise, the MMU would prevent you from writing into any addresses that had been defined as executable code space.) There’s no problem defining the same area of memory twice, at least in development situations.
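Aliasing is easier to see in a small model. In the Python sketch below, two views cover the same bytes of a pretend physical memory, one executable and one writable. The `View` class, its fields, and the addresses are hypothetical stand-ins for the x86 area definitions, not the real data structures.

```python
from dataclasses import dataclass

physical_memory = bytearray(64)  # toy stand-in for RAM


@dataclass(frozen=True)
class View:
    """One definition of a memory area (hypothetical model)."""
    base: int
    size: int
    writable: bool
    executable: bool

    def _locate(self, offset: int) -> int:
        if not 0 <= offset < self.size:
            raise IndexError(f"offset {offset} is outside the area")
        return self.base + offset

    def write(self, offset: int, value: int) -> None:
        if not self.writable:
            raise PermissionError("this view is not writable")
        physical_memory[self._locate(offset)] = value

    def fetch(self, offset: int) -> int:
        if not self.executable:
            raise PermissionError("this view is not executable")
        return physical_memory[self._locate(offset)]


# The same 16 bytes, defined twice with different access rights.
code_view = View(base=16, size=16, writable=False, executable=True)
data_view = View(base=16, size=16, writable=True, executable=False)

data_view.write(0, 0x90)      # patch the code through the data alias
patched = code_view.fetch(0)  # the code view sees the new byte
```

Writing through `code_view` raises an error, but writing through `data_view` lands in the very same bytes, which is the whole point of the alias.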

Here’s where it gets tricky. When an x86 processor goes to access memory, it first has to consult the MMU tables to look up the appropriate address translation. It must do this for absolutely every memory access, whether it’s a read, a write, or a code fetch. As you can imagine, that could take forever. The overhead of each memory reference would slow down the CPU drastically, especially if you’ve defined a lot of different memory areas with different sizes and access types.

So, to speed things along, the chip caches some – but not all – of that MMU translation data into something called a TLB, or translation lookaside buffer. Actually, the chip maintains two separate TLBs, one for code and one for data. And therein lies the crux of the anti-hacker hack.
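A TLB is just a small cache in front of the slow table lookup, which a short Python sketch can imitate. Everything here is an assumption made for illustration: the `CachedMMU` class, the two-entry capacity, and the first-in, first-out eviction rule. Real TLBs are far larger and use their own replacement policies.

```python
class CachedMMU:
    """Toy MMU with one small translation cache in front of the page table."""

    def __init__(self, page_table: dict, tlb_entries: int = 2):
        self.page_table = page_table
        self.tlb_entries = tlb_entries
        self.tlb = {}  # page -> frame; dicts keep insertion order
        self.hits = 0
        self.misses = 0

    def lookup(self, page: int) -> int:
        if page in self.tlb:
            self.hits += 1  # fast path: translation already cached
            return self.tlb[page]
        self.misses += 1
        frame = self.page_table[page]  # slow path: consult the full table
        if len(self.tlb) >= self.tlb_entries:
            oldest = next(iter(self.tlb))  # evict the oldest cached entry
            del self.tlb[oldest]
        self.tlb[page] = frame
        return frame
```

Looking up the same page twice costs one miss and then one hit. Because the cache holds only some of the translations, touching enough other pages in between pushes the entry out and forces another trip to the table.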

Normally, the processor loads, flushes, and reloads both of its TLBs automatically, and you’d never know anything about it. They’re entirely hard-wired and, like any caches, are fully automatic. If the MMU mapping data it needs isn’t in the TLB, the chip will automatically go find the actual MMU data out in memory, and cache it for next time. However, if you’re a hands-on kind of programmer, you can manage the TLBs in software, if you really want to. For instance, you might want to manually flush one or both TLBs if you’re running multiple operating systems that have wholly different memory maps. Or you might flush the data TLB but keep the code TLB intact if you’ve mounted a new storage device or taken a big chunk of memory out of service. Whatever. The point is, you can manipulate the two TLBs at a fairly low level, and this allows you to spoof the processor into thinking the memory has moved around.

Normally, when you alias an area of memory – that is, define it as both code and data – the processor keeps the two views consistent. If you write into that space (i.e., execute self-modifying code), the chip makes sure that you then execute the new version of the code, not an old cached version. But that’s not what you want when you’re trying to thwart hackers.

Torrey’s trick is to manually intervene in the TLB micro-management so that code fetches from a given address range are directed one way, but data references to the same address range are sent somewhere else. That is, the same virtual addresses are mapped to different physical addresses, depending on whether they’re asking for code or data. The x86 MMU makes it fairly straightforward to tell the one from the other. Actually implementing it is a bit more convoluted, of course. The key is to make sure that instruction fetches “hit” the TLB cache for code, while data references to the same space “miss” the data TLB, even though they’re supposed to be the same thing. That’s exactly the opposite of what the MMU is designed to do.
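The desynchronized-TLB idea can be imitated in a toy Python model. To be clear, this is not Torrey’s implementation. The `SplitTlbMMU` class, the frame numbers, and the order of operations are assumptions chosen to show one way that two caches filled at different times can disagree about the same virtual page.

```python
class SplitTlbMMU:
    """Toy MMU with separate instruction and data translation caches."""

    def __init__(self, page_table: dict):
        self.page_table = dict(page_table)
        self.itlb = {}  # used by instruction fetches
        self.dtlb = {}  # used by data reads

    def _lookup(self, tlb: dict, page: int) -> int:
        if page not in tlb:  # miss: consult the table and cache the result
            tlb[page] = self.page_table[page]
        return tlb[page]

    def fetch(self, page: int) -> int:
        return self._lookup(self.itlb, page)

    def read(self, page: int) -> int:
        return self._lookup(self.dtlb, page)


PAGE = 0xFC003        # hypothetical virtual page
CODE_FRAME = 0x120    # where the real instructions live
DECOY_FRAME = 0x7FF   # where data reads should end up

mmu = SplitTlbMMU({PAGE: CODE_FRAME})
mmu.fetch(PAGE)                     # prime the instruction TLB with the real frame
mmu.page_table[PAGE] = DECOY_FRAME  # repoint the table; the cached entry survives

code_frame = mmu.fetch(PAGE)  # hits the instruction TLB: still CODE_FRAME
data_frame = mmu.read(PAGE)   # misses the data TLB: picks up DECOY_FRAME
```

The same virtual page now resolves to two different physical frames, depending only on whether the access is a code fetch or a data read.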

One nice aspect of this technique is that there’s very little performance impact on the processor. The x86 MMU was designed to be fast and transparent. After all, it has to step in to every single memory transaction the processor initiates, so the design team spent a lot of time optimizing that path. A look at an x86 die photo reveals an awful lot of silicon dedicated to the MMU and its TLBs. This is no trivial part of the processor.

Of course, you could just direct data references to/from the subject area to a random range of empty space, but the more fiendish (and practical) approach is to alias the code to an encrypted version of itself. That way, you can decode and debug your own software, assuming you have the decryption key, of course.
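Here is a small Python sketch of the encrypted-alias idea. The XOR routine is only a placeholder for a real cipher, and the key, the sample bytes, and the `frames` dictionary are all hypothetical. It shows the shape of the scheme, not its actual strength.

```python
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Toy reversible scrambler; a placeholder for a real cipher."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


KEY = b"\x5a\xc3"  # hypothetical key
plain_code = bytes([0x55, 0x48, 0x89, 0xE5, 0xC3])  # placeholder "instructions"

# Two physical copies behind one virtual address range (toy model).
frames = {
    "code": plain_code,                  # what instruction fetches reach
    "data": xor_bytes(plain_code, KEY),  # what data reads reach
}

dumped = frames["data"]             # all a snooping data read ever sees
recovered = xor_bytes(dumped, KEY)  # the key holder can still undo it
```

Anyone dumping memory through data reads gets `dumped`, which doesn’t match the real instructions, while the developer holding `KEY` gets the original bytes back in `recovered`.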

Is it foolproof? Not a chance. For starters, this technique only prevents reading code out through the same processor that’s executing it. That is, a given CPU chip won’t rat out its own code, but in a multi-chip system, it might be possible to disassemble another processor’s code. That would require shared memory and software access to the other chip’s code space, but it’s possible.

This approach also leaves the system vulnerable to side-channel attacks, such as monitoring power usage or RF emissions. Those are pretty advanced techniques used only by hardcore (and well-funded) hackers, but they’ve been proven to work. You could also copy the memory elsewhere and then try a brute-force attack on the encryption key. Code obfuscation through MMU manipulation won’t help that.

But for single-CPU systems based around a modern x86 processor (is that an oxymoron?), the TLB-manipulation approach has some real benefits. It uses the processor’s own built-in protection mechanisms, leveraging a kind of aikido philosophy of using the enemy’s own strength against them. And it’s a software-only solution, so you can use it on any system where you have control over the low-level kernel drivers (it’s been demonstrated on tweaked versions of Linux, for example). And it’s free, if you’re willing to overlook the programming time involved. Here’s to sticking it to the hackers.

2 thoughts on “Programming Dark Matter”

bmoyer says:

March 18, 2015 at 10:48 am

Jim – Interesting stuff. Question: you mentioned a multi-chip setup providing a possible second path, so it works for a single CPU. But what about multicore within a single chip? There’s only a single MMU for all cores there, right? (Not sure if you’re equating CPU with actual core or whether a multicore unit is a single CPU in your lexicon… one of those ambiguous terms…)

Jim Turley says:

March 24, 2015 at 10:56 am

Hmm. Tricky. There are separate TLBs for each CPU core in a multicore chip, even though those same cores share some other MMU resources. That could, theoretically, allow a separate path into the “encrypted” memory through a twin core inside the same chip.
