There is no question that the Internet of Things (IoT) is exploding. There are estimates that, in the near future, trillions of devices will be connected by the IoT. This could equate to hundreds or even thousands of connected devices for every single human being on Earth, all talking to each other over the largest communications infrastructure ever imagined.
But, what makes a good “thing?”
In order for a “thing” to be a useful contributing member of that internet of theirs, it needs to bring something to the party. Nobody cares about a “thing” that just sits around consuming data and doesn’t have anything to contribute. Like us humans, in order for a “thing” to be interesting, it needs perception — a way to sense what’s going on around it. Only then is it qualified to be an interesting participant in the greater conversation.
It is more or less agreed that we humans have five basic senses, and it’s no surprise that those five senses are among the first with which we bestow the machines and the “things” that we create. Just about every mobile phone today can hear and understand selected English phrases. A wide range of devices can smell and sense various gasses in the environment. Haptic sensors give our machines a sense of touch. And of course, cameras endow our creations with eyes. (Apparently, giving machines a sense of taste hasn’t been very high on the priority list yet. We don’t want them developing a penchant for caviar and truffles and running up our bill on Amazon Prime now, do we?)
We could divide each one of these senses into three levels. First is the simple ability to perceive, or gather the data. The camera, for example, simply captures pixels and nothing more. The second is the ability to interpret. The rudimentary embedded vision system can identify the nouns and verbs. “That object is a man.” “That object is a car.” “That man is walking.” “That car is moving.” The third level is to understand in context. “That car is about to hit that man and corrective action needs to be taken.”
Of all these senses, vision is by far the most demanding to automate. First, the sheer amount of data involved is enormous. Second, the algorithms for vision are extremely complex. Third, the amount of computational power required to run those algorithms on such massive amounts of data is beyond the cutting edge of even what today’s technology can accomplish. As a result, the amount of energy needed to perform the needed computation is also prohibitive.
When we look at embedded vision in the context of the IoT, things become even more challenging. The generally accepted architecture for the IoT is a distributed computing model where increasingly heavy computation loads are passed up the chain to more capable processors, ultimately leading to the extreme heavy lifting being done by cloud servers. Embedded vision’s demands are exactly the opposite. In order to avoid enormous data pipes and unacceptable latency, intense computation needs to be performed at the leaf node, as close to the camera as possible. It’s a lot simpler to send a message up the line that says “Hey! A man is about to be hit by a car!” than it is to send a gigabyte or so of unprocessed video up to the cloud where it will wait in line for processing, analysis, and interpretation. By the time all that could occur, it may well be too late. To look at it another way, the IoT is designed around the premise that each node filters out useless data and sends just the good stuff up the chain. In embedded vision however, it’s a lot more difficult to identify which data is important and which isn’t: there’s a lot more data to deal with, and the penalty for failure is a lot higher than with other types of sensory information.
So, here we have an IoT that’s designed to favor pushing processing toward the cloud and embedded vision that demands supercomputing at the leaf node.
Unfortunately, it’s not even THAT simple. Embedded vision is a Big Data problem also, because the only way to understand what’s really going on in a scene is to have a huge database of objects and scenarios to compare against. In order for any meaningful understanding to take place, embedded vision algorithms need direct access to at least a subset of this context information.
Because of these challenges, embedded vision is just in its infancy. Numerous engineering compromises are required to enable today’s very rudimentary embedded vision applications. You and I use the same eyes whether we’re reading a book, driving a car, catching a baseball, or watching a sunset. But at present, our “things” require a specialized vision system from the camera on down for each separate task they have to perform. Probably one of the most widely known and challenging embedded vision applications today is autonomous driving. This application requires a fusion of cameras with an array of other sensors like IR, radar, and sonar, combined with specialized algorithms for doing things like recognizing traffic signs, locating humans, other cars, and other obstacles in the scene, and detecting lane markers. All of this is supported by the biggest, fastest computers that we can manage to jam into an automobile.
If we were trying to build an embedded vision system to inspect bottles on an assembly line, or one to do facial recognition in security cameras, we’d be starting almost from scratch each time. A general-purpose analog for human-like vision is clearly far, far away.
One thing that’s interesting about embedded vision is that it challenges every single aspect of our modern technological capability at the same time. Image sensors struggle to match the dynamic range, resolution, effective frame rate, and gamut of human vision. Every image sensor chosen today is an engineering compromise — a sensor that happens to have the right attributes for the particular problem we’re solving. The connections within a vision system push the limits of even today’s fastest multi-gigabit interconnect. The algorithms for vision analytics are highly specialized and selective and exist at the very edge of modern day computer science. The processing required for embedded vision algorithms is beyond what today’s fastest conventional processors can do and typically would have compute acceleration with faster, more power-efficient hardware such as FPGAs, GPUs, and ASICs than we have today. As we mentioned in discussing IoT, embedded vision also imposes a serious system architecture challenge. And, finally, because of these other challenges, controlling factors like cost, size, power consumption, and reliability are also extremely difficult.
In short, every part of the engineering buffalo — optical, mechanical, software, electronic, reliability, manufacturing, verification — is pushed to the limit by embedded vision. If you’re looking for an engineering field with some legs, it’ll be a long time before embedded vision runs out of interesting and challenging problems to solve, and the potential upside for humanity is enormous. Machines that can truly “see” will have a dramatic impact on just about every aspect of our lives.