DigitalTech: Common Sense Machines (CSM)

Startup Exchange Video | Duration: 5:09
October 12, 2022

    MAX KLEIMAN-WEINER: Great. Thanks for having me. I'm co-founder and CEO of Common Sense Machines. Our company has deep MIT roots. My co-founder Tejas and I are both PhD students here, and we founded the company with our PhD supervisor, Josh Tenenbaum.

    And what we're doing at Common Sense Machines is scaling and deploying some of the breakthroughs that have happened in generative AI: the ability for AI systems not just to make decisions and discriminate between alternatives, but to learn to create new content, 3D simulations, perceptions, and even intelligent actions.

    So I'll share a little bit about some of the prototypes we've developed and deployed, and you can see if you're interested in scaling 3D simulation with us. We started with one of the key bottlenecks in scaling machine learning more broadly, one that's already been talked about a bit today: you need a lot of annotated data, or in some cases synthetic or simulated data, to train large-scale systems to take intelligent action.

    The challenge in many domains, including this one-- this is a clip from an Nvidia simulator-- is that all of these simulators today are built by people, and it takes hours or even days to create a single 3D model. If you wanted to model a chair in this room, that might take a few hours or hundreds of dollars of a 3D designer's time.

    Annotating even a single frame of video can cost up to $1 for dense annotation. And when you think about the hours of video that would need to be annotated-- at, say, 30 frames per second, a single hour of video is over 100,000 frames-- those numbers become astronomical very quickly. Finally, the tools for creating this data and these simulations are hard to use. You have to train people on them, and they're hard to use efficiently.

    The unending task this presents is that even when you've created that large data set or the simulator for your particular environment, the world almost always presents you with a new edge case. And the limited diversity present in almost any data set means you're always going to have to keep updating that content and creating more.

    So what CSM is doing is building a world model. It lets anybody create scalable and diverse data and content for these simulators on demand, without a human in the loop. We take in multimodal input-- this could be videos, images, or even text descriptions of the kind of content you want-- and our system produces data and content that can load directly into the simulation platforms of today.

    And we've developed new simulation platforms that use learning at their very core. One way to think about this is that CSM's technology lets anybody copy from the real world-- take content from the world-- and paste it directly into a simulated environment.

    So I'm going to walk through a short demo we did in a grocery store context. This is just me playing with a few objects you could find in your typical CVS. Capturing content from the real world takes less than a minute: you just need an iPhone or any other camera, and you record a short video clip of the objects.

    From there, we create 3D world models. On the left, you can see the original inputs, and on the right, the simulated output. These are fully 3D-- they have the geometry of the 3D world. And some of these models are the traditional kind of assets you would find in a game engine like Unity or Unreal, or that could load into Nvidia Omniverse or something of that nature.
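
    For concreteness, a "traditional asset" here is just mesh geometry plus materials in a standard interchange format. Here is a minimal sketch, assuming the open-source trimesh library; the box is a hypothetical stand-in for a reconstructed object, not CSM's actual pipeline:

```python
# Minimal sketch of producing a traditional 3D asset (assumes trimesh).
# The box is a placeholder for a reconstructed object; glTF (.glb) is a
# standard format that engines such as Unity, Unreal, and Omniverse import.
import trimesh

mesh = trimesh.creation.box(extents=(0.5, 0.5, 0.9))  # placeholder geometry
mesh.visual.face_colors = [200, 150, 100, 255]        # simple flat color
mesh.export("captured_object.glb")                    # engine-ready asset file
```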

    But we've also developed something else, and that's what you're seeing on the right. They're indistinguishable, but they're actually not models at all in the traditional sense. They're neural networks, and we're showing neural renders of their outputs. So we're able to capture the real world in an implicit neural representation that can then be used in far more diverse and flexible ways than the traditional simulation modeling workflow allows.
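
    As a rough idea of what an implicit neural representation looks like in code, here is a minimal sketch in the spirit of NeRF, assuming PyTorch. The architecture and sizes are illustrative assumptions, not CSM's actual system: an MLP maps a 3D point to color and volume density, so the "model" is the network's weights rather than a mesh.

```python
# Minimal implicit scene representation sketch (NeRF-style), assumes PyTorch.
import torch
import torch.nn as nn

class ImplicitScene(nn.Module):
    """Map a 3D point to RGB color and density (view dependence omitted)."""
    def __init__(self, hidden=256, n_freqs=10):
        super().__init__()
        self.n_freqs = n_freqs
        in_dim = 3 + 3 * 2 * n_freqs  # raw xyz plus sin/cos Fourier features
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 3 color channels + 1 density
        )

    def encode(self, x):
        # Fourier features let the MLP fit high-frequency geometry and texture.
        feats = [x]
        for i in range(self.n_freqs):
            feats += [torch.sin(2.0**i * x), torch.cos(2.0**i * x)]
        return torch.cat(feats, dim=-1)

    def forward(self, xyz):
        out = self.mlp(self.encode(xyz))
        rgb = torch.sigmoid(out[..., :3])  # colors in [0, 1]
        sigma = torch.relu(out[..., 3:])   # non-negative volume density
        return rgb, sigma

# Rendering would sample points along camera rays and alpha-composite them;
# here we just query the learned field at random points.
rgb, sigma = ImplicitScene()(torch.rand(1024, 3))
```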

    We can use these models to create unlimited data. So here's an auto-labeling application. With those same objects, I'm now showing the system a video it's never seen before, but we can label the content in this video without any humans in the loop. This is the kind of process that Tesla and others are scaling; we're letting everybody do it without much expertise.
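
    One common way auto-labeling from a known 3D asset can work is by projecting the asset into each new frame once the camera pose is estimated. A hedged sketch follows; the pose estimation step is assumed and not shown, and the intrinsics K, pose (R, t), and function names are illustrative, not CSM's method:

```python
# Sketch: derive a 2D bounding-box label by projecting a known 3D asset's
# bounding-box corners into a video frame with an estimated camera pose.
import numpy as np

def project(points_3d, K, R, t):
    """Project Nx3 world points into pixel coordinates (pinhole camera)."""
    cam = points_3d @ R.T + t          # world frame -> camera frame
    uv = cam @ K.T                     # camera frame -> image plane
    return uv[:, :2] / uv[:, 2:3]      # perspective divide

def auto_label_frame(bbox_corners_3d, K, R, t):
    """Return a 2D box (x_min, y_min, x_max, y_max) for one frame."""
    uv = project(bbox_corners_3d, K, R, t)
    return (*uv.min(axis=0), *uv.max(axis=0))

# Illustrative values: a unit-cube asset, identity rotation, simple intrinsics.
corners = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                   dtype=float)
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=float)
R, t = np.eye(3), np.array([0.0, 0.0, 4.0])
print(auto_label_frame(corners, K, R, t))
```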

    Finally, I just want to preview some of our ongoing R&D. You can see that little cheese dish from before-- that's the neural network output, an implicit model of the world captured by a machine learning system. Everything else around it is traditional computer graphics, the kind of thing a 3D designer could create.
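
    One simple way such a composite can work, mechanically, is a per-pixel depth test between the two renderers. This is a hedged sketch under the assumption that both the neural renderer and the rasterizer emit color and depth buffers; the buffer names are illustrative, not CSM's implementation:

```python
# Sketch: merge a neural render into a rasterized scene by keeping whichever
# source is closer to the camera at each pixel (assumes both emit depth).
import numpy as np

def composite(neural_rgb, neural_depth, cg_rgb, cg_depth):
    """Depth-test the neural render against the rasterized scene per pixel."""
    neural_wins = (neural_depth < cg_depth)[..., None]  # HxWx1 boolean mask
    return np.where(neural_wins, neural_rgb, cg_rgb)

# Illustrative 2x2 buffers: the neural object occludes exactly one pixel.
neural_rgb = np.ones((2, 2, 3)) * 0.9
cg_rgb = np.zeros((2, 2, 3))
neural_depth = np.array([[1.0, 9.0], [9.0, 9.0]])
cg_depth = np.full((2, 2), 5.0)
print(composite(neural_rgb, neural_depth, cg_rgb, cg_depth))
```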

    And we call these hybrid worlds. We're able to bring these two technologies together really for the first time, and it's letting us capture far more diverse and complex content in simulation than has ever been possible before. And just to touch on a few pilots we've done: we've worked with robotics companies on developing 3D content for their simulation environments, for both testing and training their systems.

    We've also worked with an auto manufacturer on creating content for a kind of industrial metaverse, where they want to take part of their production facilities and bring them into an AR context for optimization and planning. So we'd love to discuss future partnerships with you. We have a wait list of people looking to get their hands on this technology, whether it's gaming, automation, or the industrial metaverse. We'd be excited to chat and share more over lunch. Thank you.

    SPEAKER: Thank you, Max.