Embedded applications haven’t historically been known for managing humongous amounts of data. One exception to this is the area of communications, where data packets are inspected, poked, prodded, interrogated, waterboarded, and then sent on. Gee, sounds like a stay at Gitmo. Well, except for the being sent on part.
Because of the amount of work performed on data packets, communications applications have been way ahead of the curve on multicore computing, and they provide a great model for embedded data processing in general. The way the packets are handled creates a significant design challenge to get everything done in time. What happens if you work too slowly and get behind? Well, at some point you start discarding packets much in the same way that Lucy had to stuff her mouth full of chocolates when things got too intense. And dropped packets are a bad thing for one of two reasons: either they’re lost forever on a sloppier protocol like UDP or they get retransmitted on something like TCP. Which doesn’t help – it’s like someone taking the chocolates from Lucy’s mouth and putting them back onto the conveyor belt for her to deal with again (Ew!) – it just adds to the traffic. OK, that image may be a tad disturbing, but think of it this way: it reinforces the notion that having to retransmit is bad, whether it’s a packet or a chocolate.
So the bottom line with such a data-intensive application is that the work has to get done in time (on average, given suitable elastic buffers). And the amount of work isn’t getting easier. Once upon a time it was just packet headers and such that needed to be futzed with. Nowadays with applications like Intrusion Detection and Intrusion Prevention (the first just tells you – picture Mr. Mackey here – “Um, there’s someone that just broke in and might be stealing all yer stuff, mkay?”; the second is the bouncer that pretty much won’t let you in if your packet doesn’t look hot enough) you have to wade through the entire dang packet (politely referred to as “deep packet inspection”) to figure out what’s going on inside. It’s like we’ve gone from quick X-ray inspections to internal cavity searches.
So that’s a lot of sometimes-not-so-pretty work that needs doing. Without getting into the specifics of how you store and access all that data, the fact remains that multiple processors get called into action to handle the onslaught of packets. It’s the archetypical embedded data-oriented multicore application.
Separately, servers are being put to intensive use for things like data mining. Imagine a giant landfill, and someone’s got to hold their breath and go up there to sift through all the crud either to find useful stuff or to organize it somehow into neat little piles. You end up sending lots of people up there because it’s just too much for one person to do on his or her own (OK, and they keep quitting, but that’s beside the point). How you coordinate those people depends on what you’re trying to do with the goodies you find.
Much of this data manipulation in servers is handled by relational databases through complex sequences of queries and operations. Where performance needs are higher, dedicated programs can speed results, but that’s at the expense of the work required to write the program. Why am I bringing up server stuff here? Patience, grasshopper…
Between increased amounts of analog data being digitized and processed and the increased pervasiveness of communications within embedded systems, not to mention the inward migration of erstwhile server applications and growing needs for security, the need for embedded data munging is growing. The code to be written for such apps becomes ever more non-trivial. And even were performance to allow it, relational databases generally aren’t a viable solution for embedded. Couple that with the fact that it’s becoming increasingly hard to avoid multicore platforms, and we have the dual opportunity/challenge of handling all this data in dedicated programs running on multiple processing units.
OK people, break it up
There are two simplistic ways to view how you split up a task. One applies if you have a ton of data, the same manipulation has to be done on all the data, and each datum is independent of all the other data. Here the classic solution is to go wide – rather than having one processor handling all the data, one after the other, you get a bunch of processors, each of which gets some portion of the data, and they all work independently to get the job done. This is called “data decomposition” by computing wonks and amounts to a complex vector operation, resembling SIMD (Single Instruction, Multiple Data) on steroids.
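For the wonks who prefer code to landfill metaphors, here is a minimal sketch of going wide in plain Java. The class name, the transform() operation (squaring a number), and the slice arithmetic are all made up for illustration; the only real point is that each worker gets its own slice and never needs to look at anyone else's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class DataDecompositionSketch {

    // Hypothetical per-datum operation; each element is independent of the others.
    static int transform(int x) {
        return x * x;
    }

    // Splits the input into roughly equal slices and transforms each slice on its own worker.
    static int[] processInParallel(int[] input, int workers) {
        int[] output = new int[input.length];
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<?>> pending = new ArrayList<>();
            int chunk = (input.length + workers - 1) / workers; // ceiling division
            for (int w = 0; w < workers; w++) {
                final int start = Math.min(input.length, w * chunk);
                final int end = Math.min(input.length, start + chunk);
                pending.add(pool.submit(() -> {
                    for (int i = start; i < end; i++) {
                        output[i] = transform(input[i]); // slices never overlap
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for every slice to finish
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
        return output;
    }

    public static void main(String[] args) {
        int[] result = processInParallel(new int[] {1, 2, 3, 4, 5}, 2);
        System.out.println(Arrays.toString(result)); // prints [1, 4, 9, 16, 25]
    }
}
```

Because no slice depends on another, the worker count can change without touching transform(), which is the whole appeal of this style.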
The other way of splitting things up is to go deep by creating an assembly line – breaking up the task into manageable chunks, each of which is done by a different processor (wonkishly referred to as “code decomposition”). Things go faster because the first assembly line station can start on new data as soon as it’s done with its task and sends it on to the next station – it doesn’t have to wait until the entire process is done to start on the new data. Of course, “assembly line” and “station” sound way too industrial-age as terms, so we use “pipeline” and “stage,” respectively. OK, pipeline also sounds industrial age, but hey, think of it as an oil thing; nothing is more au courant than oil.
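Going deep can be sketched just as briefly. What follows is an illustrative two-stage pipeline in plain Java, with one thread per stage and small bounded queues standing in for the conveyor belt between stations. The stage names, the queue size of four, and the END marker are assumptions for the sake of the example, not anyone's production design.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.UnaryOperator;

class PipelineSketch {

    // Unique sentinel object that tells a stage no more data is coming.
    private static final String END = new String("<end>");

    // Starts one stage: take from 'in', apply 'work', hand the result to 'out'.
    static Thread stage(BlockingQueue<String> in, BlockingQueue<String> out,
                        UnaryOperator<String> work) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    String item = in.take();
                    if (item == END) {   // identity check: only the sentinel matches
                        out.put(END);    // pass the shutdown signal downstream
                        return;
                    }
                    out.put(work.apply(item));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    // Pushes the packets through trim -> upper-case and collects what comes out.
    static List<String> run(List<String> packets) {
        BlockingQueue<String> q0 = new ArrayBlockingQueue<>(4); // bounded buffers: a full
        BlockingQueue<String> q1 = new ArrayBlockingQueue<>(4); // queue makes the upstream
        BlockingQueue<String> q2 = new ArrayBlockingQueue<>(4); // stage wait
        stage(q0, q1, String::trim);
        stage(q1, q2, String::toUpperCase);

        // Feed from a separate thread so the caller is free to drain the far end.
        Thread feeder = new Thread(() -> {
            try {
                for (String p : packets) {
                    q0.put(p);
                }
                q0.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        feeder.start();

        List<String> results = new ArrayList<>();
        try {
            for (String item = q2.take(); item != END; item = q2.take()) {
                results.add(item);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(" syn ", "ack", " fin"))); // prints [SYN, ACK, FIN]
    }
}
```

The first stage can start trimming the next packet while the second is still upper-casing the previous one, which is where the speedup comes from. The bounded queues also mean a slow stage backs up its upstream neighbors instead of letting data pile up without limit.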
The challenge with a pipeline is that overall throughput is limited by your slowest stage; if Lucy is all thumbs with the chocolate, then ultimately either you have to slow down the conveyor to match her speed or the chocolates that won’t fit in her mouth will drop on the floor and be lost. (OK, the ones in her mouth are lost too. Ew!) Either way, it limits the number of good chocolates coming off the line.
You can combine both of these architectures in a more general parallel pipeline, giving you – surprise! – multiple pipelines in parallel. And once you have this, you can vary the parallelism so that, for example, if the second stage is too slow, you can have three of that stage in parallel but only two of the other stages (with a load-sharing allocator). Or pipelines can branch so that after a particular decision, different pipelines take over – and if one decision is more common, there may be two pipelines in parallel for the popular decision, with only one for the less-common decision.
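The arithmetic behind that three-versus-two example can be sketched in a few lines. This is an idealized back-of-the-envelope model, assuming each stage's capacity is simply its per-worker rate times its number of copies and that the allocator adds no overhead; the rates (30, 20, and 30 items per second) are invented purely to make the numbers come out even.

```java
class StageBalanceSketch {

    // Idealized steady-state throughput: the stage with the least total
    // capacity (per-worker rate times number of copies) sets the pace.
    static double throughput(double[] ratePerWorker, int[] replicas) {
        double slowest = Double.POSITIVE_INFINITY;
        for (int i = 0; i < ratePerWorker.length; i++) {
            slowest = Math.min(slowest, ratePerWorker[i] * replicas[i]);
        }
        return slowest;
    }

    public static void main(String[] args) {
        // Hypothetical rates in items per second; the second stage is the slow one.
        double[] rates = {30.0, 20.0, 30.0};

        // One copy of each stage: the slow stage caps everything.
        System.out.println(throughput(rates, new int[] {1, 1, 1})); // prints 20.0

        // Three copies of the slow stage, two of the others: all stages balance.
        System.out.println(throughput(rates, new int[] {2, 3, 2})); // prints 60.0
    }
}
```

With these made-up rates, adding a third copy of stage one or three would buy nothing, since the total would still be capped at 60 by the other stages.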
Hacking up the code
So the manipulations and combinations and permutations of the different ways to handle the data processing can get exceedingly complicated. And this all assumes we’ve already broken up the processing code in a suitable manner. The partitioning of code is by no means an obvious task, and the general approach is to find some logical partition that makes sense to a coder. But this is kinda going about it backwards: it assumes the code is already written and now needs to be teased apart to be realized on some multicore fabric.
The real solution, where possible, is to build the code in a modular fashion that’s natural to the processing being done to the data, without worrying about the underlying processors. Here you’re “composing” the program up front rather than “decomposing” it later. And there is a methodology and tool for doing this in Java, although the primary targeted market for this tool has been high-end data manipulations in server farms (your patience has paid off, grasshopper…). It started as an internal tool for a company called Pervasive Software, and they decided to turn it into a product that they’ve called DataRush. It’s in beta (“RC-1”) now, with full release planned by the end of the year.
DataRush allows the abstraction of the processing as a dataflow graph. It’s a Java library that implements the details of the various data objects and methods in a manner that hides much of the messiness involved in adapting to the underlying multicore topology upon which it will run. You work with such high-level constructs as files, tables, streams, and fields and with such operations as reading and writing, sorting, grouping, and joining, just to name a very few.
The idea is to allow users to snap together data elements and processes and let the library handle the grotty bits. Those bits include threading, “autoscaling” (allowing code to run on different multicore platforms without changes), and even race condition avoidance. It’s also possible to create your own dataflow elements for custom operations that aren’t in the off-the-shelf library.
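To give a flavor of what snapping operations together looks like in code, here is a small sketch that deliberately does not use the DataRush API, which isn't shown here. It uses Java's standard stream library as a stand-in: you chain a parse step, a filter step, and a group step, and the runtime decides how to spread the work across cores. The "source,byteCount" record format and every name in it are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class DataflowCompositionSketch {

    // Hypothetical record format: "source,byteCount".
    // Stand-in using java.util.stream, NOT the DataRush library.
    static Map<String, Integer> bytesPerSource(List<String> lines) {
        return lines.parallelStream()                   // the runtime picks the threading
                .map(line -> line.split(","))           // "parse" node
                .filter(fields -> fields.length == 2)   // "filter" node: drop malformed rows
                .collect(Collectors.groupingBy(         // "group" node: total per source
                        fields -> fields[0].trim(),
                        TreeMap::new,                   // sorted keys for stable output
                        Collectors.summingInt(fields -> Integer.parseInt(fields[1].trim()))));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "10.0.0.1,100", "10.0.0.2,50", "10.0.0.1,25", "garbage");
        System.out.println(bytesPerSource(lines)); // prints {10.0.0.1=125, 10.0.0.2=50}
    }
}
```

Nothing in the method mentions threads, locks, or core counts, which is the general idea behind letting a library handle the grotty bits. How a dedicated dataflow library goes about it internally is a separate question that this sketch makes no claims about.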
Now while a dataflow graph has the notion of a graph embedded in its very name, and while the examples shown by Pervasive have very clear, easy-to-use graphic drawings showing the topology of the design, software writers are funny in that they eschew graphic interfaces (even though they probably expend a lot of effort writing them). Who needs a clear, easy-to-use graphic picture of what’s happening when a much less accessible sequence of obscure instructions will do a much better job of preserving employment? As a result of this preference, the DataRush capabilities aren’t implemented by a graphic tool used to snap elements together. You do your drawing somewhere else and then manually render the drawing into Java code. That’s just how software folks roll. [Author contemplates whether a smiley or 😛 is beneath the dignity of the press…]
It may be a bit early for this kind of programming in the embedded world, certain disciplines excepted, and it’s certainly not the direction Pervasive was originally looking as a means of filling their bellies. But they view it as an interesting opportunity, if it materializes, and would be eager to see it grow some legs. The methodology certainly may make it easier to tackle data-oriented problems regardless of the amount of data. The challenge will be to see how efficiently the functions end up being implemented on embedded systems lacking the cushy resources of high-end servers – and, if lacking, how that can be improved over time.
Link: Pervasive DataRush