Forget mere “Big Data.” Around the globe, we are generating data at an incomprehensible rate. By one estimate, we’ll create 160 zettabytes (trillions of gigabytes) annually by 2025. And this tsunami of data is now raising a previously unthinkable challenge. “That’s a lot more useful data than we will have the ability to store,” says Hyunjun Park, cofounder and chief executive officer of CATALOG, an MIT STEX25 startup company. CATALOG aims to solve this problem with a novel technology that employs the first known form of information storage on this planet: DNA.
In recent years, a number of labs have shown the ability to encode and store digital information in synthetic DNA. As odd as it may seem to use the molecule that captures biology’s genetic code for digital tasks, DNA offers compelling potential advantages. “DNA has incredible information density; you can store about a million times as many bits in the same volume as compared to flash drives or magnetic media such as hard drives and data tape,” Park says. “It’s also got an extremely long shelf life; DNA can last for thousands of years.” The DNA data storage techniques demonstrated in labs, however, have been extremely slow and expensive compared to current storage technologies. One key bottleneck is the time required to synthesize the data-encoding DNA. CATALOG is bringing a distinctive technical approach to speed this process, readying a demonstration system for commercial service this year. Based in Boston, CATALOG is looking for partnerships with large organizations who struggle with extreme data archival needs—and perhaps take an interest in even more radical technologies down the road to perform parallel computing in DNA itself.
Combining prefab DNA to encode data
CATALOG began with a connection in the lab of Timothy Lu, MIT associate professor of biological engineering and of electrical engineering and computer science. Park, who trained as a microbiologist and was working as a postdoctoral researcher, began talking with Nathaniel Roquet, who was finishing up a doctorate in biophysics. Roquet was studying a class of enzymes called recombinases that can recognize and manipulate specific sequences within a longer piece of DNA. “These enzymes offer a way to change the state of a DNA molecule, so if you think about it, it is a way to store arbitrary digital information using those different states of DNA molecules, working in test tubes instead of inside of a cell,” Park says.
In CATALOG’s technology demonstrations, “a computer reads the binary data and generates instructions for our liquid handler to move around our premade short pieces of DNA, and combine them in combinations that represents the ones and zeros that we want to store,” Park says. Another machine then collects the encoded DNA molecules and concentrates them into pellet form. To retrieve the information, the pellets are rehydrated and the DNA molecules are read by a genome sequencer, in a method that is essentially error-free.
By midyear, CATALOG expects to complete a prototype machine that can encode about 125 gigabytes of information into DNA every 24 hours, “at a cost that's about a million times cheaper than what's been possible before with DNA,” Park says. The company will offer storage as a service to organizations interested in examining the technology. It plans another major milestone for a next-generation platform offering 125-terabyte-per-day encoding by 2022, as a fully commercial product.
Formed in 2016, the company has raised $10.5 million in funding to date. It faces competition from very large firms such as Microsoft as well as several other DNA synthesis startups. However, “CATALOG is in a unique position where we're positioned to make this a reality within the next year or two, rather than in five or six years,” Park says.
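The article does not spell out how combinations of premade DNA pieces stand for ones and zeros, so the following is only a toy sketch of the general idea, not CATALOG's actual scheme. It assumes a hypothetical library of components arranged in layers (the names `LAYERS`, `IDENTIFIERS`, `encode`, and `decode` are invented for illustration). Picking one component from each layer yields an identifier, each identifier is tied to one bit position, and an identifier is assembled only where the bit is 1. The point of the sketch is that a small set of premade pieces can address many bit positions.

```python
from itertools import product

# Hypothetical library of premade components: 3 layers with 4 components each.
# These names are illustrative assumptions, not real reagents.
LAYERS = [[f"L{layer}C{c}" for c in range(4)] for layer in range(3)]

# Choosing one component per layer gives one identifier: 4**3 = 64 identifiers
# from only 12 premade pieces. Identifier i stands for bit position i.
IDENTIFIERS = list(product(*LAYERS))


def to_bits(data):
    """Expand bytes into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def encode(data):
    """Return the identifiers to assemble: one for every bit that is 1."""
    bits = to_bits(data)
    if len(bits) > len(IDENTIFIERS):
        raise ValueError("data does not fit in this toy identifier space")
    return [IDENTIFIERS[i] for i, bit in enumerate(bits) if bit]


def decode(present, n_bytes):
    """Rebuild the bytes from the set of identifiers that were 'read back'."""
    present = set(present)
    bits = [1 if IDENTIFIERS[i] in present else 0 for i in range(n_bytes * 8)]
    out = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    message = b"DNA"
    assembled = encode(message)           # the "instructions" for assembly
    print(len(assembled), "identifiers to assemble")
    print(decode(assembled, len(message)))
```

In this toy version the list returned by `encode` plays the role of the liquid-handler instructions, and `decode` plays the role of sequencing: it only needs to know which identifiers are present.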
Partnering for pilots
In January 2019, CATALOG was selected for membership in STEX25, a startup accelerator within the MIT Startup Exchange that includes 25 “industry ready” startups that are prepared for significant growth. “The Startup Exchange has been really valuable for us in getting warm introductions to ILP member corporations that could become partners in the long run,” Park says.
“We’re looking for organizations with lots of data that are interested in long-term partnerships who can pilot our machine with us, to see how this totally new storage medium could fit within their data pipelines,” says Park. Many companies in industries such as entertainment and petroleum production, and numerous government agencies, are faced with the need to archive gigantic amounts of data. “If you’re a large entity like these, you're already looking for a new solution for data archival,” Park says. The two current options are to maintain an in-house tape library or to outsource the archive to a cloud provider. Both options are far from perfect, with limitations in storage capacity, high expense and serious concerns about the reliability of data retrieval over the years.
“Our partners will influence how we develop our software layer as well as the technical features in the final product,” Park says. “We want to have as many technical and business conversations with these partners as possible throughout the processes, so that we can build the right product around their needs.”
CATALOG is particularly interested in partners who are also intrigued by the possibility of taking an even more radical step into digital DNA computation. “Eventually, we want to be able to compute directly on data that's stored in DNA,” Park says. “We want to build an active information storage system, rather than something that just keeps data on the shelf forever.”
“Using other DNA molecules or enzymes, we could do highly parallel computation on a massive dataset in a way that isn't really possible with classical computing,” he says. “That could solve a lot of problems in computing that are difficult to solve right now.” This approach also could potentially save on computing costs, because it avoids the need to move data from storage into memory for computing and then back again, which demands a lot of energy. “Our long-term goal is to bring computation to the data,” Park says. “Organizations that want to explore both DNA-based storage and computation could be an ideal match for us.”