I cannot believe how fast things are moving when it comes to things like artificial intelligence (AI) and machine learning (ML), especially in the context of machine vision, which refers to the ability of a computer to use sight to perceive what’s going on in the world around it (see also What the FAQ are AI, ANNs, ML, DL, and DNNs?).
Are you familiar with eYs3D Microelectronics? This is a fabless design house that focuses on end-to-end software and hardware systems for computer vision, including advanced vision-processing System-on-Chip (SoC) devices. Meanwhile, eCapture is a brand of eYs3D that focuses (no pun intended) on the camera business.
I was just chatting with the folks at eCapture, and I must admit that I’m rather excited about their recent announcement regarding the availability for pre-order of their G53 3D Vision Depth-Sensing camera, which — at only 50 x 14.9 x 20 mm — they claim to be the smallest stereo camera in the world.
Meet the G53 (Image source, eCapture)
In addition to the lenses and optical sensors, this 3D camera boasts an eYs3D vision-processing SoC that computes stereo depth data, thereby off-loading this computationally intensive task from the main system’s CPU/GPU host processor.
The G53 is of interest for a wide variety of tasks, including simultaneous localization and mapping (SLAM), which refers to the computational task of constructing or updating a map of an unknown environment while — at the same time — keeping track of the mapping agent’s location within it. In turn, visual SLAM (vSLAM) leverages 3D vision to perform location and mapping functions.
Often enhanced by sensor fusion in the form of odometry (the use of data from motion sensors to estimate change in position and orientation over time) SLAM is used for tasks such as robot navigation and robotic mapping, and also for virtual reality (VR) and augmented reality (AR) applications (see also What the FAQ are VR, MR, AR, DR, AV, and HR?).
The point is that the high-performance eYs3D vision processing chip in the G53 camera system allows it to capture exactly the same frame on both sensors, thereby reducing errors from unsynchronized frames while gathering data for use by vSLAM algorithms.
In the context of digital sensors, various ways of capturing the image may be employed. In the case of a rolling shutter mode, each image or frame of a video is captured, not by taking a snapshot of the entire scene at a single instant in time, but rather by rapidly scanning across the scene, either vertically or horizontally. Although this can increase sensitivity, it also results in distortions when the scene involves moving objects or the camera itself is in motion (when the camera is mounted on a robot, for example).
The alternative is known as global shutter mode in which the entire frame is captured at the same instant. Since one of the deployment scenarios for the G53 is robots in motion, it employs global shutter technology to realize simultaneous exposure of all the pixels associated with both cameras, thereby reducing distortion to provide more accurate images.
Comparison of global shutter (left) and rolling shutter (right)
(Image source, eCapture)
Ooh, Ooh — I just remembered two potential use models the folks from eCapture were talking about. The first that caught my attention was the fact that speech recognition can be greatly enhanced by means of machine vision-based lip-reading, especially in noisy environments as exemplified by the cocktail party effect.
The second use case involves people walking past a shop in a street or past a display in a store. Depth-sensing cameras like the G53 can be used to determine the location of each person. Based on this information, beamforming techniques can be used to precisely transmit an audio message that can be heard only by the intended recipient in a 1-meter diameter circle. Someone else standing two meters away wouldn’t hear a thing. Actually, that’s not strictly true, because they would be hearing a different message targeted just for them (e.g., “Your coat is looking a little bedraggled — did you know the latest fashions just arrived and we have a sale on at the moment?”).
But we digress… An infrared (IR) illuminator emits light in the infrared spectrum. The G53 boasts two built-in active IR illuminators to facilitate visual binocular plus structured light fusion. In addition to enabling vision in completely dark environments, this technology also enhances recognition accuracy for things like white walls and untextured objects.
Using active IR to enhance recognition accuracy for white walls and untextured objects (Image source, eCapture)
Enhancing the ability of machines to perceive and understand the world around them can greatly increase their efficiency and usefulness. 3D computer vision technology is critical for AI applications, enabling autonomous functionality for software and machines, from robotic spatial awareness to scene understanding, object recognition, and better depth and distance sensing for enhanced intelligent vehicles. The small size of the G53 makes it ideal for space and power-constrained applications, like modestly sized household cleaning robots, for example. Other obvious applications include automated guided vehicles (AGVs) and autonomous mobile robots (AMRs), such as those used in factories and warehouses and also for tasks like goods to person (G2P) deliveries.
I believe that another key deployment opportunity for the G53 will be in cobots (collaborative robots), which are robots that are intended to be used for direct human-robot interaction within a shared space, or where humans and robots are in close proximity. As the Wikipedia says:
Cobot applications contrast with traditional industrial robot applications in which robots are isolated from human contact. Cobot safety may rely on lightweight construction materials, rounded edges, and inherent limitation of speed and force, or on sensors and software that ensure safe behavior.
Observe the “sensors and software that ensure safe behavior” portion of this, which — to me — sounds like another way of saying “cameras like the G53 coupled with AI.”
Now, I love robots as much as (probably much more than) the next person, but I think that the depth perception technology flaunted by devices like the G53 has potential far beyond robotic applications. I’ve said it before and I’ll say it again — I believe that the combination of augmented reality and artificial intelligence is going to change the way in which we interface and interact with our systems, the world, and each other. An AR headset will already have two or more outward-facing cameras feeding visual data to the main AI processor. Stereo depth data derived from this imagery could be used to enhance the user experience by providing all sorts of useful information. And the same eYs3D vision-processing chip found in the G53 camera system (or one of its descendants) could be employed to off-load the computationally intensive task of computing stereo depth data from the host processor.
I’m sorry. I know I tend to talk about AR + AI a lot, but I honestly think this is going to be a major game changer. When will this all come to pass? I have no idea, but when I pause to ponder how quickly technology has advanced in the past 5, 10, 15, and 20 years — and on the basis that technological evolution appears to be following an exponential path — I’m reasonably confident that we are going to see wild and wonderful (and wildly, wonderfully unexpected) things in the next 5, 10, 15, and 20 years.
Returning to the present, developers and systems integrators will be happy to hear that eCapture also offers software development tools for its products that allow for ease of integration into the Windows, Linux, and Android operating system (OS) environments.
Last but not least, over to you. What do you think about AI, ML, VR, AR, and machine perception in general and in the context of machine vision in particular? And are you excited or terrified by what is lurking around the next temporal turn?