An AI that can make up its mind
Early AI systems developed by luminaries such as MIT’s John McCarthy, Seymour Papert, and Marvin Minsky were based on child development research by Jean Piaget and others. The goal was to imbue computers with symbolic constructs equivalent to those humans use, enabling self-directed learning. As symbolic AI researchers fell short of those goals, the industry’s focus shifted to perception and memory. Taking advantage of increasingly powerful computers, almost all commercial AI systems now rely on training machine learning and deep learning models that require thousands to millions of examples.
A Cambridge, Massachusetts-based startup and MIT AI Lab spinoff called Leela AI has combined the latest deep learning capabilities with a return to symbolic AI algorithms to enable more extensive self-learning. Leela AI’s understand.video software is currently being trialed by several customers, which are using it to improve safety and productivity on manufacturing lines and construction sites.
“Understand.video turns video data into metrics and actionable alerts,” explains Leela AI CEO and co-founder Cyrus Shaoul, who studied at the MIT Media Lab. “Instead of watching or manually monitoring endless hours of video, our users are notified when a specific event occurs. Our software uses its self-learning capabilities to get by with a hundred times less training data than typical ML.”
You may have heard other AI companies promote their self-learning or self-teaching technology, but Shaoul says there is a major difference. “Most AI systems are mostly reactive. They can observe and find correlations with outcomes, which is useful, but they can never get at the true nature of cause and effect because they are not active. Our AI agent can autonomously decide to do an action and then do it. It then evaluates the results and understands the connection to achieve causality learning.”
Unlike typical AI solutions, the technology can also explain itself. “Understand.video’s ability to explain its inferences and allow humans to understand and correct it is very powerful,” says Leela AI co-founder and VP of Products Milan Singh Minsky. “Being able to correct itself on the job site makes this a very robust tool.”
The AI engine underlying understand.video may well be a major breakthrough in artificial intelligence. If nothing else, Leela AI directly addresses several roadblocks that are currently hampering AI deployments, starting with the delays and cost overruns caused by long training time. Most AI systems are also difficult to customize, in part because they are black boxes that cannot explain how they arrived at a decision. This also makes them harder to integrate with existing customer software.
The understand.video AI engine is a hybrid of a symbolic knowledge network system and a more conventional deep learning system constructed of numerous perceptron networks (often called neural networks). The symbolic network is based on research on infant and animal brain development and acquisition of knowledge.
“The question that drives us is how to build an AI that can acquire knowledge in the same way humans do,” explains Shaoul. “No AI can do that now, including ours, but the goal has inspired us to create a more human-like intelligence than is possible with most AI systems.”
The symbolic layer creates an internal mental representation of the world based on symbols, which could be objects or activities. “When a typical ML model decides it is looking at a flower, it possesses no symbolic understanding of flowers,” explains Shaoul. “It is using a purely statistical operation on millions of pixels based on hundreds of thousands of flower pictures. Neural networks are often better than any human at classification. Yet, they are much worse than any human at determining whether it is a real flower or a drawing, let alone grasping the potential uses of either.”
With symbolic AI, “there is more of a continuum that enables the agent to connect the symbol of flowers with related symbols for seeds, sun, or water,” says Shaoul. “Our solution is inspired by the way humans learn, which is by perceiving and acting simultaneously. With mammals, perception and action are so intimately tied together, you cannot separate them. Our software takes an action and perceives the results rather than using a separate process to decide on an action. There is a rich, bidirectional interplay between perception/action and reasoning.”
Understand.video is based on lightweight algorithms that help connect cause and effect, explains Singh Minsky. “The idea is to sandbox the agent and jumpstart it with some learning so it can start connecting cause and effect,” she says. “It can then quickly begin to reason on top of what it perceives. The perceptive part is done with a neural net, which is sort of a black box. The causal reasoning part is layered on top and has a hierarchical structure. The design allows us to use far less data for training.”
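The article does not describe the implementation, but the layering Singh Minsky outlines — a neural perception layer emitting symbols, with hierarchical causal reasoning stacked on top — can be sketched roughly as follows. This is a minimal illustration; the class names, symbol format, and rule shapes are assumptions, not Leela AI’s actual code.

```python
# Illustrative sketch (assumed class names and data shapes, not Leela AI's code) of the
# layering described above: a black-box neural perception layer turns frames into symbolic
# detections, and a separate reasoning layer applies causal/temporal rules over those symbols.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Detection:
    """A symbolic fact emitted by the perception layer for one frame."""
    frame: int
    label: str                                      # e.g. "person", "crane_load"
    attributes: dict = field(default_factory=dict)  # e.g. {"wearing_helmet": False}


class PerceptionLayer:
    """Stand-in for the neural-network 'black box' that labels each frame."""
    def detect(self, frame_index: int, frame) -> List[Detection]:
        raise NotImplementedError("wrap an off-the-shelf detector here")


class ReasoningLayer:
    """Symbolic layer: rules over detections rather than pixels."""
    def __init__(self, rules: List[Callable[[List[Detection]], List[str]]]):
        self.rules = rules  # each rule maps detections to named events

    def infer(self, detections: List[Detection]) -> List[str]:
        events: List[str] = []
        for rule in self.rules:
            events.extend(rule(detections))
        return events
```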
The tightly integrated design also enables the agent to know and communicate why it has taken an action. “Part of our innovation is creating a back-and-forth communication between the perception and symbol-processing sides,” says Singh Minsky. “The reasoning layer is usually in charge, but if it makes a mistake, the perception layer can correct it through brute force learning. It is much like trying to open a sticky lock. If we can’t rationally figure out why the key is not working, we can often simply fiddle with it until it opens.”
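One way to read that back-and-forth is as a simple control flow: the reasoning layer decides by default, and flagged mistakes are queued so the perception layer can be retrained on them. The sketch below continues the hypothetical classes above and is only a guess at the mechanism Singh Minsky describes, not a documented design.

```python
# Continues the hypothetical PerceptionLayer/ReasoningLayer sketch above; the control
# flow here is an assumption about the reasoning/perception interplay, not Leela AI's code.
from typing import List, Tuple


class HybridAgent:
    def __init__(self, perception: "PerceptionLayer", reasoning: "ReasoningLayer"):
        self.perception = perception
        self.reasoning = reasoning
        self.correction_buffer: List[Tuple[object, List[str]]] = []

    def process(self, frame_index: int, frame) -> List[str]:
        # The reasoning layer is "in charge": it interprets whatever perception reports.
        detections = self.perception.detect(frame_index, frame)
        return self.reasoning.infer(detections)

    def correct(self, frame, true_labels: List[str]) -> None:
        # A human (or a downstream outcome) flags a mistake; the example is queued so the
        # perception layer can be fine-tuned on it -- the "brute force" fallback in the analogy.
        self.correction_buffer.append((frame, true_labels))
```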
Leela AI was founded in 2016 to develop technology that emerged primarily from computer vision research conducted at the MIT AI Lab by researchers including Leela AI’s third co-founder, CTO Henry Minsky. Henry has since spearheaded multiple AI and IoT projects at NTT DoCoMo and Google’s Nest Labs.
Leela AI recently secured an undisclosed round of seed funding and joined the MIT Startup Exchange’s STEX25 accelerator. The company has made use of other startup services, such as the MIT Venture Mentoring Service. “It really helps to be plugged into all these MIT programs,” says Shaoul.
After deciding that video analytics provided the best initial target for its platform, the Leela AI team looked for untapped market needs with the help of the NSF’s I-Corps program and the MIT Innovation Initiative. The search led them to focus understand.video on improving safety and productivity at construction sites and manufacturing lines.
One of the company’s early customers operates construction sites where cranes and other heavy equipment pose safety risks. In this application, understand.video gets started by watching videos of people and cranes. These firsthand observations are combined with initial bootstrapping using human knowledge.
“We can explain to the AI that this is a person, that is a crane, that is the crane’s load, and it’s moving around,” explains Shaoul. “After that, the AI learns a lot more by watching and judging the similarities between safe and unsafe situations. It can detect if somebody has been standing under a crane without a helmet for more than a minute and then send a safety alert.”
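The crane example lends itself to a simple temporal rule over the symbolic detections. The sketch below assumes a particular detection format, frame rate, and threshold purely for illustration; none of these values come from Leela AI.

```python
# Toy version of the crane rule described above (assumed detection format and frame rate).
ALERT_AFTER_SECONDS = 60  # "more than a minute"
FPS = 10                  # assumed camera frame rate


def crane_safety_alerts(detections_by_frame):
    """detections_by_frame: one dict per frame, e.g.
    {"person_under_load": True, "wearing_helmet": False}."""
    alerts, streak = [], 0
    for i, d in enumerate(detections_by_frame):
        if d.get("person_under_load") and not d.get("wearing_helmet", True):
            streak += 1
            if streak == ALERT_AFTER_SECONDS * FPS:
                alerts.append(f"frame {i}: unhelmeted person under crane load for over a minute")
        else:
            streak = 0  # condition broken; restart the timer
    return alerts
```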
Another customer is using understand.video to analyze a manufacturing line to provide safety alerts and improve productivity. “It is a complex environment with many different processes involving people working closely with machines,” says Shaoul. “Our software can figure out when and where people are doing manual activities and detect if people are moving too quickly for safe operation. It can notice that a group of people are not working and thereby determine that a particular machine is broken. Our customers can then analyze the data to optimize line operations.”
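The “idle group implies broken machine” inference could similarly be expressed as a rule over per-frame worker states. The fraction and duration thresholds below are invented for illustration, not values from Leela AI.

```python
# Hypothetical rule for the manufacturing example above: flag a machine as likely down
# when most of the workers stationed at it are idle for a sustained period.
def machine_likely_down(worker_activity, idle_fraction=0.8, min_seconds=120, fps=10):
    """worker_activity: one list per frame of booleans, True if that worker is actively working."""
    streak = 0
    for frame_states in worker_activity:
        idle = sum(1 for working in frame_states if not working)
        if frame_states and idle / len(frame_states) >= idle_fraction:
            streak += 1
            if streak >= min_seconds * fps:
                return True
        else:
            streak = 0
    return False
```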
In both the construction and manufacturing scenarios, there might be 30 or 40 activities to track, which would take months or years to learn using a typical ML system, says Shaoul. “Our AI can quickly learn actions such as touching or throwing, and our customers can then add examples of people throwing things in different bins. The AI doesn’t need to retrain each time to understand the action of throwing.”
A GUI enables codeless development. “Customers can supply about five video examples of a specific activity and then point to some similar, but not identical, activities to ensure against false positives and false negatives,” says Shaoul. “They can mark on the video where the activity begins and ends and highlight important actors, such as a tool, a person, and a bin.”
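The underlying data format is not described in the article, but the workflow Shaoul outlines — a handful of positive clips, some near-miss negatives, marked start and end points, and highlighted actors — maps naturally onto a small specification record like the hypothetical one below. Field names and schema are assumptions.

```python
# Hypothetical record for a codeless activity definition as described above;
# the field names and structure are assumptions, not Leela AI's format.
from dataclasses import dataclass, field
from typing import List


@dataclass
class LabeledClip:
    video_uri: str
    start_sec: float      # where the marked activity begins
    end_sec: float        # where it ends
    actors: List[str]     # highlighted actors, e.g. ["person", "tool", "bin"]


@dataclass
class ActivityDefinition:
    name: str                                                           # e.g. "throw_item_into_bin"
    positive_examples: List[LabeledClip] = field(default_factory=list)  # roughly five clips
    near_miss_examples: List[LabeledClip] = field(default_factory=list) # similar-but-different clips
```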
After this initial training by the customer, Leela AI applies some manual and automatic processes and deploys the finished software in a day or two. “Our goal is to further reduce training time by letting the software make more guesses,” says Shaoul. “After a few rounds of feedback, the agent would be able to build its own causal model about a set of activities and the customers could add their own activities. The ability to interactively adjust the software is one of our major advantages.”
Understand.video is deployed as a SaaS offering with fee-based subscriptions. Leela AI also offers APIs that let either customers or Leela AI link the AI to existing software, such as messaging or enterprise resource planning systems.
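Leela AI’s actual API is not documented in the article; the snippet below only illustrates the kind of glue code a customer might write to forward an alert payload to an existing messaging webhook. The URL, payload fields, and timestamp format are all placeholders.

```python
# Hypothetical glue code a customer might write to forward an understand.video-style alert
# to an existing messaging webhook; the URL, payload fields, and format are placeholders,
# not a documented Leela AI API.
import json
import urllib.request


def forward_alert(alert: dict, webhook_url: str) -> None:
    body = json.dumps(
        {"text": f"[{alert['severity']}] {alert['message']} ({alert['timestamp']})"}
    ).encode("utf-8")
    req = urllib.request.Request(webhook_url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:  # POST the alert to the messaging system
        resp.read()


# Example usage with placeholder values:
# forward_alert({"severity": "high", "message": "Unhelmeted worker under crane load",
#                "timestamp": "2024-05-01T14:03:00Z"}, "https://example.com/hooks/alerts")
```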
The software is currently deployed entirely in the cloud. “Right now, a cloud platform is more efficient and affordable than using edge AI devices and makes it easier to work with existing cameras,” says Shaoul.
Yet the company is evaluating future edge AI applications and is talking with prospective customers that want to connect understand.video to mobile robots or autonomous vehicles. “A robotics application would leverage our strengths in the interplay between action, perception, and reasoning,” says Shaoul. “The AI could reason that more information is needed and tell the robot to move to a better spot to gain new input.”
Much of the current R&D is attempting to improve understand.video’s natural language capabilities. “We want our software to acquire language skills so it can talk about what it is seeing and allow people to ask questions about its simulations and models,” says Shaoul. “The ability to communicate with the AI is a big game changer for our users.”