A few days ago, I was introduced to a tempting taste of the future that had me squirming in my seat with excitement and anticipation. You know what it’s like when you are ambling your way through the world without a thought in your head (at least, that’s the way I usually do it), and then something catches your eye that sparks a cascade of questions. Rather than take a picture or make notes for future follow-up, suppose you could simply articulate your questions aloud and immediately hear the answers tickling your ears. Well, that’s the type of technology I just saw!
Yes, of course this story features artificial intelligence (AI). It’s hard to come up with a story that doesn’t, these days. As an aside, before we plunge into the fray with gusto and abandon, I’m sure you’ve heard about AI hallucinations, which are also known as artificial hallucinations, artificial delusions, and artificial confabulations. All these terms refer to cases in which an AI generates a response containing false or misleading information presented as fact. As reported on Legal Dive, for example, the unfortunate attorney Steven A. Schwartz of Levidow, Levidow & Oberman used an AI to generate a filing that he foolishly assumed to be factually correct prior to presenting it to a judge. Sad to say, the judge spotted that the AI had decided to feature bogus cases in its response, boosting its arguments with bogus quotes, bogus citations, and bogus judicial decisions. Suffice it to say that this eventually led to Steven having a very bad hair day indeed.
How do we prevent AI hallucinations? Well, one solution would be to make all AIs like Goody-2, which is billed as the “World’s Most Responsible AI Chatbot.” According to an article I read recently in Wired, this self-righteous chatbot takes AI guardrails to an illogical extreme: “It refuses every request, responding with an explanation of how doing so might cause harm or breach ethical boundaries.”
But we digress…
To provide a basis for our discussions, it’s probably worth noting that we don’t actually see things as well as many people think we see them. For example, we tend to assume that everything we see is always in focus. However, our vision is foveated, which means the only high-resolution part of our optical sensing is an area called the fovea in the center of the retina. In turn, this means we have only about a 2-degree field-of-view (FOV) that’s in focus. That’s about the size of the end of your thumb when held at arm’s length. Everything outside this area quickly drops off in terms of resolution and color contrast. These outer areas are what we call our peripheral vision.
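For anyone who enjoys checking these things, here’s a quick back-of-the-envelope calculation showing that a thumb held at arm’s length does indeed subtend roughly 2 degrees. The thumb width and arm length below are my own rough assumptions, not measured values:

```python
import math

# Rough assumptions (my own, not measured values): the tip of a thumb is
# about 2 cm wide, and a typical arm's length is about 60 cm.
thumb_width_cm = 2.0
arm_length_cm = 60.0

# Visual angle subtended by an object of a given width at a given distance.
visual_angle_deg = 2 * math.degrees(math.atan(thumb_width_cm / (2 * arm_length_cm)))

print(f"Visual angle: {visual_angle_deg:.1f} degrees")  # ~1.9 degrees
```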
The reason we feel that everything is in focus is that our brain maintains a model of what it thinks it’s seeing, and our eyes dart around filling in the blanks. One thing our peripheral vision is good at is detecting movements and shapes, and one thing all this knowledge is good for is something known as foveated rendering. The idea here is that if we are wearing a mixed reality (MR) headset, where MR encompasses any combination of virtual reality (VR), augmented reality (AR), diminished reality (DR), and augmented virtuality (AV), then we can dramatically reduce the computational workload associated with the rendering algorithms. We use sensors to detect where the user’s eyes are looking, render only that area in high resolution, and gradually diminish the fidelity of the rendering the farther out we go.
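Just for fun, here’s a minimal sketch of the basic idea, assuming a hypothetical renderer that assigns each pixel a resolution scale based on its angular distance from the gaze point. The zone boundaries, quality levels, and function names are my own illustrative inventions; real foveated renderers use their own (much cleverer) schemes:

```python
import math

# Purely illustrative eccentricity zones (degrees from the gaze point) and
# the fraction of full rendering resolution assigned to each zone.
FOVEATION_ZONES = [
    (2.0, 1.00),    # foveal region: render at full resolution
    (10.0, 0.50),   # near periphery: half resolution
    (30.0, 0.25),   # mid periphery: quarter resolution
    (180.0, 0.10),  # far periphery: lowest resolution
]

def render_scale(pixel_angle_deg: tuple[float, float],
                 gaze_angle_deg: tuple[float, float]) -> float:
    """Return the resolution scale for a pixel, given its angular position
    and the user's current gaze direction (both in degrees)."""
    dx = pixel_angle_deg[0] - gaze_angle_deg[0]
    dy = pixel_angle_deg[1] - gaze_angle_deg[1]
    eccentricity = math.hypot(dx, dy)  # angular distance from the gaze point
    for max_eccentricity, scale in FOVEATION_ZONES:
        if eccentricity <= max_eccentricity:
            return scale
    return FOVEATION_ZONES[-1][1]

# Example: the gaze is 5 degrees to the right of center, so a pixel at the
# center of the display falls in the near periphery and gets half resolution.
print(render_scale((0.0, 0.0), (5.0, 0.0)))  # 0.5
```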
“What sort of sensors?” I hear you cry. I’m glad you asked, because Prophesee describe their GenX320 Metavision sensor as being “the world’s smallest and most power-efficient event-based vision sensor.” If you want to learn more about the GenX320, take a look at the column penned by my friend Steve Leibson: Prophesee’s 5th Generation Sensors Detect Motion Instead of Images for Industrial, Robotic, and Consumer Applications.
“What else can these sensors be used for?” I hear you ask. Once again, I’m glad you asked, because I was just chatting with the folks at Zinn Labs. The “Zinn” part is named after the German anatomist and botanist Johann Gottfried Zinn, who provided the first detailed and comprehensive anatomy of the human eye circa the mid-1700s.
I’ve been introduced to some boffin-packed-and-stacked companies over the years, but I think Zinn Labs is the first in which every single member has a PhD. These clever little scamps develop gaze-tracking systems based on event sensors, which enable dramatically lower latency and higher frame rates. In addition, the tailored sensor data allows the tracking to run in limited-compute embedded environments at low power while maintaining high gaze accuracy.
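To give you a flavor of why event sensors can be so quick off the mark, here’s a toy illustration of the difference between chewing on full frames and nibbling on sparse events. The event format, numbers, and estimator below are my own illustrative stand-ins, not the GenX320’s actual output format or Zinn’s actual algorithms:

```python
from dataclasses import dataclass

# Illustrative only: an event-based sensor reports individual pixel brightness
# changes as they occur, rather than streaming full frames at a fixed rate.
# This is NOT the GenX320's real data format, just a toy stand-in.
@dataclass
class Event:
    timestamp_us: int  # microsecond timestamp of the change
    x: int             # pixel column
    y: int             # pixel row
    polarity: int      # +1 brightness increased, -1 brightness decreased

def estimate_pupil_center(events: list[Event]) -> tuple[float, float]:
    """Toy estimator: average the positions of recent events.
    A real gaze tracker would do something far more sophisticated."""
    if not events:
        return (0.0, 0.0)
    x = sum(e.x for e in events) / len(events)
    y = sum(e.y for e in events) / len(events)
    return (x, y)

# A handful of events triggered as the edge of the pupil moves. Nothing else
# on the sensor produces data, which is why the data rate (and hence the
# latency) can be so low compared with streaming full frames.
events = [
    Event(1000, 160, 120, +1),
    Event(1005, 161, 121, -1),
    Event(1012, 162, 119, +1),
]
print(estimate_pupil_center(events))  # (161.0, 120.0)
```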
As an example of the sort of things the little rascals can do with their gaze-tracking systems, the folks at Zinn Labs recently announced an event-based gaze-tracking system for AI-enabled smart frames and MR systems.
Event-based gaze-tracking system for AI-enabled smart frames and MR systems (Source: Zinn Labs)
In addition to an outward-facing 8-megapixel camera in the center, and microphones and speakers at the sides, these frames (lens-less, in this case) boast Zinn’s event-based modules featuring Prophesee’s GenX320 sensors. These are the frames that were employed in the video demo that blew me away.
As we see in this example, the user asks questions while looking at different plants. The front-facing camera captures the scene, while the sensors identify where the user is looking. A curated version of the visual information is passed to an image-detection and recognition AI in the cloud. At the same time, the question is fed through a speech-to-text AI. The combination of the identified image and the question is then fed to a generative AI like ChatGPT. Finally, the response is fed through a text-to-speech AI before being presented to the user.
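For those who like to see the flow spelled out, here’s a highly simplified sketch of that pipeline. The function names and the stub implementations are placeholders of my own invention (the real system’s components and APIs aren’t documented here); the sketch is only meant to show how the pieces hand off to one another:

```python
# A hypothetical sketch of the gaze-plus-voice pipeline described above.
# Every stub below stands in for a real component (gaze-directed cropping,
# cloud vision, speech-to-text, a generative AI, and text-to-speech).

def crop_around(scene_frame, gaze_point):
    """Stub: crop the camera frame around where the user is looking."""
    return scene_frame  # a real system would return just the gazed-at region

def identify_object(image_region):
    """Stub: a cloud vision model would name what's in the region."""
    return "a fiddle-leaf fig"

def speech_to_text(audio_clip):
    """Stub: a speech-to-text model would transcribe the question."""
    return "How often should I water this plant?"

def ask_llm(prompt):
    """Stub: a generative AI would answer the question in context."""
    return f"(answer generated from the prompt: {prompt})"

def text_to_speech(answer_text):
    """Stub: a text-to-speech model would voice the answer."""
    return f"<audio of: {answer_text}>"

def answer_spoken_question(audio_clip, scene_frame, gaze_point):
    """Turn a spoken question about whatever the user is looking at
    into a spoken answer, by chaining the stages described above."""
    region = crop_around(scene_frame, gaze_point)  # 1. gaze-directed crop
    subject = identify_object(region)              # 2. image recognition in the cloud
    question = speech_to_text(audio_clip)          # 3. transcribe the spoken question
    answer = ask_llm(f"The user is looking at {subject}. {question}")  # 4. generate the answer
    return text_to_speech(answer)                  # 5. voice the answer for the speakers

print(answer_spoken_question(audio_clip=None, scene_frame=None, gaze_point=(0, 0)))
```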
Admittedly, there is a bit of a delay between questions and responses, and the format of the replies is a little stilted (we really don’t need the “let me look that up for you” part). However, we must remind ourselves that this is just a tempting teaser for what is to come. AIs are going to get more and more sophisticated while 5G and 6G mmWave cellular communications are going to get faster and faster with lower and lower latencies.
All I can say is that my poor old noggin is jam-packed with potential use cases (personal, professional, and social). How about you? If you do have any interesting ideas, you might be interested in acquiring one of Zinn’s development kits. In the meantime, as always, I would love to hear what you think about all this.