2022 Korea Showcase: CATALOG

Startup Exchange Video | Duration: 8:43
November 4, 2022

    [APPLAUSE]

    SPEAKER: [NON-ENGLISH SPEECH]

    SEAN MIHM: Hello. My name is Sean Mihm. I'm the director of Mechanical Engineering at Catalog. I'm here on behalf of our cofounder and CEO, Hyunjun Park. Hyunjun is an MIT and SNU alum.

    At Catalog, we're building a scalable DNA data storage and computing platform. So data, it's growing exponentially. In the next few years alone, we will more than double the total amount of data ever produced in the world. However, the ability to compute on this data is just not keeping up. Data is increasingly sitting idle, adding little value to companies and leaving potential new findings locked inside it.

    And so at Catalog, we're trying to bridge this gap to allow the massive amounts of data to be analyzed and remain useful. And so we're leveraging DNA to do that. DNA is nature's data structure. For billions of years, DNA has been used to store and compute nature's data.

    And so why DNA? Well, DNA, it's a very dense storage medium. DNA is a million times more dense than solid-state drives. You can fit petabytes of data in just 1 gram of DNA.
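
    As a rough back-of-envelope on that density claim, using textbook numbers rather than anything stated in the talk (about 2 bits per nucleotide and roughly 330 g/mol per single-stranded nucleotide), one gram of DNA works out to hundreds of exabytes of theoretical capacity, so petabytes per gram is a conservative figure.

```python
# Back-of-envelope DNA storage density. Illustrative textbook numbers only,
# not figures from the talk; real systems add error correction and other overhead.
AVOGADRO = 6.022e23        # molecules per mole
NT_MASS_G_PER_MOL = 330.0  # approximate mass of one single-stranded DNA nucleotide
BITS_PER_NT = 2            # A/C/G/T encodes at most 2 bits per base

nucleotides_per_gram = AVOGADRO / NT_MASS_G_PER_MOL
bytes_per_gram = nucleotides_per_gram * BITS_PER_NT / 8
print(f"~{bytes_per_gram / 1e18:.0f} exabytes per gram (theoretical upper bound)")
# prints roughly "~456 exabytes per gram"
```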

    DNA is a very stable storage medium. It can last thousands of years, compared to current technologies that only last decades. And once data is in DNA, it can be very easily replicated using molecular processes we've developed that are relatively fast and low cost. So you can easily go from a single copy of your data to millions.
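
    The talk doesn't describe the specific molecular copying process, but the reason copying scales so quickly is easy to see with any doubling-style amplification: about 20 doubling cycles take a single molecule to roughly a million copies.

```python
# Illustration of doubling-style amplification (not a description of Catalog's
# proprietary process): each cycle roughly doubles the number of molecules.
copies, cycles = 1, 0
while copies < 1_000_000:
    copies *= 2
    cycles += 1
print(cycles, copies)  # 20 cycles -> 1,048,576 copies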

    And DNA, and more importantly, your data, will always be readable. You don't have to worry about new versions, new sequencers, new technologies, because as long as we can read back DNA, we can decode your data. And what we're particularly interested in at Catalog is the ability to compute on this data. Once you put data in DNA molecules, we can use enzymes and biological probes that manipulate these DNA molecules and ultimately act as a parallel computer on your data.

    So we wanted to create a scalable solution to really make this a reality. It's similar to how books got scaled. When books got started, you would handwrite every book. It wasn't fast; it took a long time.

    But then people came out with new technologies, movable type and the printing press, and you were able to scale the production of books. Movable type uses components, the letters. You assemble them, and that builds words, that builds documents. And that allowed production to scale.

    And so at Catalog, we're really trying to build that DNA printing press. We use what we call a combinatorial encoding scheme. We have these mass-manufactured DNA molecules; they're synthetic DNA that we build. And our DNA printing press assembles these to represent the data.

    To put it another way, we've basically created an alphabet of DNA molecules, or components. This alphabet contains a lot of letters. And our DNA writer stitches these together to build words, to build documents. And that's really what's storing your data.
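
    A minimal sketch of that combinatorial idea, with a hypothetical component library and a made-up 3-bit mapping (Catalog's actual encoding scheme, component count, and assembly chemistry aren't described in the talk): data is represented by which pre-made components get assembled, not by synthesizing new sequences base by base.

```python
# Hypothetical combinatorial encoding sketch; the library and mapping are made up.
# A fixed set of pre-manufactured DNA "components" acts as the alphabet, and data
# is represented by the choice and order of components that get assembled.

LIBRARY = ["ACGTAC", "TTGCAA", "GGATCC", "CATGCA", "TCGATC", "AGCTAG", "CCGGTA", "GTACGT"]
BITS_PER_SYMBOL = 3  # 8 components -> 3 bits per selected component

def encode(data: bytes) -> list[str]:
    """Map each 3-bit chunk of the input to one pre-made library component."""
    bits = "".join(f"{byte:08b}" for byte in data)
    bits += "0" * (-len(bits) % BITS_PER_SYMBOL)  # pad to a whole number of symbols
    return [LIBRARY[int(bits[i:i + BITS_PER_SYMBOL], 2)]
            for i in range(0, len(bits), BITS_PER_SYMBOL)]

print(encode(b"Hi"))  # six components standing in for the 16 bits of "Hi"
```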

    And so to walk through a quick example of what this can do, this is a recent image from the James Webb telescope. These space images, they're really big. They take up a lot of room. If we had thousands of these, if we had millions of these, that would be petabytes of data, exabytes of data. That takes a really long time to search.

    And so scientists, though, they want to find things in these images. So one of the questions they would ask is: what stars of a certain intensity or brightness exist in this image? They might want to look at certain features in this image and ask, is this a galaxy, is this a star, and better understand what's in this image.

    And another question they would ask of one of these sets of images is: when did one of these galaxies dim? When did its brightness change over a long period of time? And so what's really cool is if we take these images, take all of these pixels, and put them in DNA molecules, we can then search this data while it's in DNA.

    We can do this query. We can tell you exactly which stars are of a certain brightness, what image they're in, and where they are in that image. We can take these features and classify them. Is it a galaxy or is it not a galaxy?

    We can do anomaly detection. We can look at the brightness of that galaxy over a long period of time, find the DNA molecules that represent the dimming behavior, and tell you when it happened, which image it happened in, and by how much.

    What's really unique about this DNA computation, though, and what really leverages DNA, is the ability to do this search in parallel. So whether this was a petabyte of data or an exabyte of data, it would take right about the same amount of time to do it in DNA. If you did that on a standard computer today, it would scale, right? Your time to process that data scales with the data size. That's just not the scaling behavior we have to deal with in DNA.
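
    To make that scaling contrast concrete, here is what the conventional side of the comparison looks like: a plain brightness query on an electronic computer has to touch every record, so its runtime grows with the data size, whereas the approach described above probes all of the molecules in a sample at once. The records and threshold below are hypothetical.

```python
# Conventional baseline for the brightness query: a linear scan over every record,
# so the runtime grows with the data size. (Records and fields are hypothetical.)
from dataclasses import dataclass

@dataclass
class StarRecord:
    image_id: str
    x: int
    y: int
    brightness: float

def find_bright_stars(records: list[StarRecord], threshold: float) -> list[StarRecord]:
    # O(n) over the whole archive; a petabyte takes proportionally longer than a gigabyte.
    return [r for r in records if r.brightness >= threshold]

archive = [
    StarRecord("jwst_001", 120, 88, 0.91),
    StarRecord("jwst_001", 430, 512, 0.42),
    StarRecord("jwst_002", 75, 300, 0.88),
]
print(find_bright_stars(archive, threshold=0.85))  # bright stars, with image and position
```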

    And so how do we use this? What do our collaborators and users really look like? In the oil and gas field, we're working with these big data sets on modeling and optimization.

    In the defense space, we have these big databases of text, of images, of vectors. And we can do database search. We can do inference and similarity searching. In finance, we're working on signal processing. And in media and entertainment, the storage of media assets, leveraging that intrinsic property of DNA to store data for the long term.

    And so what does this look like to a customer? A customer would give us data, give us a file. We encode that by creating those DNA molecules. Once we've created those, we now have a sample.

    We can do a lot of things to the sample. We can compute, do some of those applications we just talked about. We can preserve it. We can go ahead and store it and say, this is going to be an archival, long-term storage sample.

    Or we can retrieve it. Maybe you want to read back all of your data. Maybe you want to read back a certain file of your data. And so we can do that with just any standard off-the-shelf sequencer, sequence and read the DNA. And then we'll decode it back into binary and ultimately back into your data.
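
    Continuing the hypothetical encoding sketch from earlier, decoding is just the inverse mapping: sequenced components are looked up in the same library, turned back into bits, and regrouped into the original bytes. None of this is Catalog's actual format; it only illustrates the round trip.

```python
# Inverse of the earlier hypothetical sketch: components -> bits -> original bytes.
LIBRARY = ["ACGTAC", "TTGCAA", "GGATCC", "CATGCA", "TCGATC", "AGCTAG", "CCGGTA", "GTACGT"]
INDEX = {seq: i for i, seq in enumerate(LIBRARY)}
BITS_PER_SYMBOL = 3

def decode(molecules: list[str], n_bytes: int) -> bytes:
    """Look up each sequenced component, rebuild the bitstring, regroup into bytes."""
    bits = "".join(f"{INDEX[m]:0{BITS_PER_SYMBOL}b}" for m in molecules)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, n_bytes * 8, 8))

# Round trip with the encode() sketch shown earlier:
# decode(encode(b"Hi"), n_bytes=2) == b"Hi"
```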

    And so why are we here today? Catalog already has a presence in Korea. We have an established subsidiary in Korea, and our latest funding round was actually led by Hanwha, the Korean conglomerate. And there's a Korean government contract expected in the field of DNA storage coming up.

    But what do we need from users, from collaborators, from partners? There are two main types. The first is partners who can help us work towards next-generation technologies, specifically in electronics and fluidics. So how do we manipulate single DNA molecules? How do we interrogate them and read out unique characteristics or features from them, so sequencing and other technologies? And how do we move really small amounts of liquid, so femtoliter and picoliter drops, to continue to scale this technology and really try to make it a reality?

    Now, the other type of user we're really interested in is people who have these large databases and are looking for new findings in them. They may currently take days or months to run their equations or optimization problems across these data sets. And we want to better understand these use cases, better understand the algorithms and techniques that are needed in order to leverage DNA computation for your use case.

    And so thank you again. My name is Sean Mihm, and we're here with Catalog. That's actually a picture of our DNA writer. Our DNA writer today prints out about a megabit per second. And we're continuing to scale, with new technologies and new ideas, to really make this a new industry and a new solution. Thank you.
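
    As a quick sanity check on what that throughput implies (simple arithmetic on the stated rate, not a figure from the talk): at about one megabit per second, writing a full terabyte takes on the order of three months, which is exactly why the continued scaling he mentions matters.

```python
# Rough arithmetic on the stated ~1 megabit/second write rate (illustrative only).
rate_bits_per_s = 1e6
terabyte_bits = 8e12                            # 1 TB = 8 x 10^12 bits
days = terabyte_bits / rate_bits_per_s / 86400  # seconds -> days
print(f"~{days:.0f} days to write 1 TB at 1 Mbit/s")  # ~93 days
```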

    [APPLAUSE]
