Going With the Flow

You have a stalker.

He or she is watching you right now. Well, maybe not visually, but that slight muffled sibilant sound you hear? That’s someone sniffing your packets. This someone needs your exclusive attention, distributing a denial of service attack so that you can’t pay attention to any of those other skanks that are trying to lead you astray. Consider your emails read; your privacy is dead. You may have been gifted baby bots that need no love and care from you; they just help your True Love Forever monitor everything you do.

It’s an amazing thing to be adored. You must feel so lucky.

Actually, what’s ironic is that, even though all of those things might be happening, some of them are the result of a black hat and some a result of the white hat that’s trying to protect you from the black hat. Although, to some extent, white is just a very pale shade of black.

This is the first of two articles that will deal with issues surrounding the examination of your innermost secrets: deep packet inspection (DPI). While basic networking protocols deal only with how to get payloads from here to there, DPI goes into the payload itself. It’s like driving down the interstate. For the most part, an officer may just monitor an outward abstraction of your trip: your speed. But occasionally you have to pull into one of those inspection stations to ensure that a stray banana won’t wipe out the entire state’s economy. If necessary, they can perform a deep payload inspection on your car. Much more personal.

DPI in the context we’re discussing is part of a bigger topic: intrusion detection or prevention. As such, it’s intended to make sure there are no nefarious piggy-backers in your packets. But, by itself, DPI is simply a means of looking at packet contents and could conceivably be used for snooping and espionage and anything else someone could think of doing to your packets, with or without court approval.

The theme of these articles has been triggered by a couple of unrelated pieces of information. The first was a conversation with Netronome’s Jarrod Siket at ESC earlier this year. Netronome showed the results of a DPI benchmark they ran. Now, you and I both know that, if they’re showing the results at a convention, then they must have won. No surprise. No presses being stopped.

And, typically, such bake-offs end up with close results, and the various contestants bicker and huff about why this or that doesn’t matter or wasn’t fair or whatever.

Not so this time. The Netronome results were roughly 8X better than the best non-Netronome competitor. That’s eye-catching. Either something interesting is going on or massive fraud has been perpetrated. I haven’t heard any cries of foul (not to be confused with cries of fowl, increasingly common in the urban setting, but I digress), so I’m taking this at face value.

The benchmark involved a popular DPI program called Snort. In a test that was run by NSS Labs, the Netronome board was able to pass 80 Gbps, involving 60 million flows, a half million of which were TCP and HTTP-CPS flows, with a mix of packet sizes. A SourceFire board using Netronome’s technology achieved 40 Gbps. The next contender, the McAfee M-8000 system (built using high-end RMI processors), got just over 11 Gbps, and the long tail goes down from there.

So the immediate question I had was, what is it about Netronome’s technology that would seemingly let it jump all over DPI? As it turns out, it’s less a matter of handling DPI well than of handling other things well, which in turn makes DPI go faster. In fact, further focus specifically on DPI could have raised the score even more. But let’s back up and take a bigger-picture look at what’s going on here, according to Netronome’s worldview.

For the most part, packet processors fit into two categories. The first is the basic network processing unit, or NPU, made by companies like EZchip. These are single-minded ICs with one purpose: route packets as fast as superhumanly possible. They operate in a mode that has a dedicated pipeline that can process a typical packet incredibly quickly – the so-called “fast path” or “data plane” – and a separate processor for handling control and the oddball packets that are too rare to burden the fast path with: this is the “slow path” or “control plane.” (It would be pedantic to note that the fast/slow path model and the data/control plane model, strictly speaking, aren’t isomorphic.) These chips are important for getting the throughput needed in high-density 10- or 100-Gb Ethernet ports. They typically don’t operate above layer 4 in the OSI model, and many are optimized for the venerable IPv4 – version 4 of the Internet Protocol, with dwindling IP addresses available, which is only now poised to start losing dominance to IPv6.

At the other end of the spectrum are beefier chips by companies like Cavium and the erstwhile RMI, now part of Netlogic. These are many-core chips that are optimized for network applications, with some hardware accelerators – say, for scheduling or encryption. Unlike the small engines making up an NPU fast-path pipeline, these are full-on processors from companies like MIPS, making them capable of a much wider range of duties. They may work at any of the OSI layers, including layer 7, reaching the most intimate corners of the payload.

Netronome sees room for something in the middle. There are many common protocols like TCP that require session management, or where a given quantity of data has to be split over a large number of packets that ultimately have to be reconciled in the order in which they were transmitted. NPUs can’t do this since they look at only one packet at a time. The larger chips can do it, but not fast enough: they’re overpowered for the job.

So this is where Netronome created the “flow processor.” It’s the next level of abstraction above the packet: a series of related packets constitutes a flow. There’s an order to the packets and the flow has a state. The original payload may have been split across many packets, so any individual packet will have only a portion of the overall payload.
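To make "a flow has a state" a little more concrete, here is a minimal sketch in Python of a flow table. It is not Netronome's design; it simply assumes the common convention of identifying a flow by its 5-tuple (source and destination address and port, plus protocol). The names FlowKey, FlowState, and FlowTable are invented for illustration.

```python
from collections import namedtuple
from dataclasses import dataclass

# Assumption: a flow is identified by the classic 5-tuple.
FlowKey = namedtuple("FlowKey", "src_ip src_port dst_ip dst_port proto")


@dataclass
class FlowState:
    """Per-flow state that outlives any single packet."""
    packets: int = 0
    byte_count: int = 0


class FlowTable:
    """Maps each flow key to its state, creating the state on first sight."""

    def __init__(self):
        self._flows = {}

    def update(self, key, payload_len):
        # Look up (or create) the flow this packet belongs to, then
        # fold the packet into the flow's running state.
        state = self._flows.setdefault(key, FlowState())
        state.packets += 1
        state.byte_count += payload_len
        return state

    def __len__(self):
        return len(self._flows)


table = FlowTable()
web = FlowKey("10.0.0.1", 40000, "10.0.0.2", 80, "tcp")
dns = FlowKey("10.0.0.1", 40001, "10.0.0.3", 53, "udp")
table.update(web, 100)
table.update(dns, 40)
web_state = table.update(web, 60)
```

Two packets with the same key land in the same FlowState, which is the whole point: the packet stops being the unit of work, and the table entry takes over.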

This is particularly relevant to DPI because, if you’re going to inspect the contents of something being transmitted across the wire and it’s been broken up into chunks, a particular tell-tale signature may actually get split across packets. If you check just the individual packets, none of them will raise a red flag. It’s only when you join them back together that you find it. State becomes important here because if, at the end of a packet, you were on the trail of something suspicious, you want to save that as a state so that, when the next packet in the series is encountered (which will likely not be the next packet received, and may not even be the next packet received from that particular flow, since packets can be re-ordered in flight), you know to continue the search from where you left off.
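A small sketch may help here. The Python below is my own illustration, not Snort's or Netronome's code: it searches one flow for a single, non-empty signature using the standard Knuth-Morris-Pratt matcher, keeps the number of signature bytes matched so far as the saved state, and holds early-arriving packets until the gap before them is filled. It assumes sequence numbers are plain byte offsets with no overlaps, retransmissions, or wraparound, and the class name FlowScanner is hypothetical.

```python
def build_failure(sig):
    """KMP failure table: fail[i] is the length of the longest proper
    prefix of sig[:i + 1] that is also a suffix of it."""
    fail = [0] * len(sig)
    k = 0
    for i in range(1, len(sig)):
        while k and sig[i] != sig[k]:
            k = fail[k - 1]
        if sig[i] == sig[k]:
            k += 1
        fail[i] = k
    return fail


class FlowScanner:
    """Searches one flow for one signature across packet boundaries."""

    def __init__(self, signature, first_seq=0):
        self.sig = signature          # assumed non-empty bytes
        self.fail = build_failure(signature)
        self.matched = 0              # saved state: signature bytes matched so far
        self.next_seq = first_seq     # byte offset we expect next
        self.pending = {}             # seq -> payload, for packets that arrived early
        self.alert = False

    def _scan(self, data):
        for byte in data:
            # On a mismatch, fall back to the longest shorter partial match.
            while self.matched and byte != self.sig[self.matched]:
                self.matched = self.fail[self.matched - 1]
            if byte == self.sig[self.matched]:
                self.matched += 1
            if self.matched == len(self.sig):
                self.alert = True
                self.matched = self.fail[self.matched - 1]

    def feed(self, seq, data):
        """Accept a packet in any order; scan whatever is now contiguous."""
        self.pending[seq] = data
        while self.next_seq in self.pending:
            chunk = self.pending.pop(self.next_seq)
            self._scan(chunk)
            self.next_seq += len(chunk)
        return self.alert


# In-order arrival: the signature straddles the packet boundary.
in_order = FlowScanner(b"EVIL")
first_result = in_order.feed(0, b"xxEV")
second_result = in_order.feed(4, b"ILyy")

# Re-ordered arrival: the second packet shows up first and is held back.
reordered = FlowScanner(b"EVIL")
early_result = reordered.feed(4, b"ILyy")
late_result = reordered.feed(0, b"xxEV")
```

Neither b"xxEV" nor b"ILyy" contains the signature on its own, so a per-packet check stays silent. The scanner raises the alert only once both halves have been seen, whichever order they arrive in.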

Netronome acquired their basic processing units, which they call micro-engines, from Intel. Intel was a big player in the NPU space, combining their micro-engines for the fast path with an XScale processor for the slow path in their IXP family. They ultimately sold the technology off, and Netronome now has the micro-engines.

They used these micro-engines – 40 of them to be precise – to build on the basic network processor architecture and create what they call their Network Flow Processor (NFP) technology. Rather than the packet being the basic unit of information, the flow or session is now the basic unit. They’ve added acceleration blocks that include a ring and list, CAM and TCAM, a work queue with reordering, memory locking, and hashing.

One of the challenges with the IXP family was the programming model. It required incredibly tedious micro-coding of the micro-engines. So much so that there was a company called Teja Technologies (for whom, full disclosure, I worked) that made a business out of providing a programming model for IXP devices (and other NPUs) based on the C language instead of microcode. Teja was bought by ARC (which was bought by Virage, which was in turn bought by Synopsys), so Netronome has added their own C front end to allow more conventional programming.

The question is, then, how do these chips accelerate DPI? And the answer is, actually, that they don’t. In the bake-off that they ran, their chip didn’t even handle the DPI stuff: a separate host processor did. But, following Netronome’s logic, without their NFP chips, your choices are either NPUs, which can’t do DPI at all, or the larger chips that can do DPI, but which will also be burdened with the packet processing, making many cycles unavailable for DPI processing.

The NFP chips split the difference between these options: they take on the burden of packet processing at the flow level that NPUs can’t handle, and they send only packets needing DPI (or some other kind of special TLC) to the host processor. This makes the flow management much more efficient, leaving more cycles in the host for DPI. The result? Everything runs faster.
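The division of labor can be sketched in a few lines of Python. This is only a cartoon of the idea: the article doesn't say how the NFP decides which packets need DPI, so the needs_dpi rule below (non-empty payloads headed for a couple of web ports) is a purely hypothetical policy, as are the function names and the packet dictionaries.

```python
from collections import Counter

# Hypothetical policy, not taken from the benchmark: only traffic to
# these destination ports is considered worth a deep inspection.
INSPECTED_PORTS = {80, 443}


def needs_dpi(packet):
    """Stand-in for the front end's classification decision."""
    return packet["dst_port"] in INSPECTED_PORTS and len(packet["payload"]) > 0


def dispatch(packets, host_queue):
    """Forward most packets directly; queue only the DPI candidates for the host."""
    stats = Counter()
    for pkt in packets:
        if needs_dpi(pkt):
            host_queue.append(pkt)      # the host spends its cycles here
            stats["to_host"] += 1
        else:
            stats["forwarded"] += 1     # handled without bothering the host
    return stats


traffic = [
    {"dst_port": 80, "payload": b"GET / HTTP/1.1"},
    {"dst_port": 53, "payload": b"\x12\x34"},
    {"dst_port": 443, "payload": b""},
    {"dst_port": 443, "payload": b"\x16\x03\x01"},
]
host_queue = []
stats = dispatch(traffic, host_queue)
```

The host only ever sees what lands in host_queue, so every packet the front end forwards on its own is host time left over for inspection.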

Specifically, in the processing that goes on during the benchmark testing, there are several important tasks being performed:

packet demultiplexing and classification;
zero-copy big-block transfers;
I/O virtualization (for PCIe);
outgoing traffic management;
offloading TCP;
SSL inspection;
I/O security and DPI.

Of these, the first four were handled by the NFP chips; the rest were done by a multicore x86 host. The next obvious question would be, what if we had accelerated those last three items? Then obviously things could have gone faster.

Which brings us to the next article in this series: one way to accelerate the DPI portion. This will bring us deeper into the DPI guts.

More info: Netronome Network Flow Processors