
10.10.23-Showcase-Seoul-Cleanlab

Startup Lightning Talk
JONAS MUELLER: Hi, everybody. I'm super excited to be here, and thank you for hosting. And so my talk is going to focus on everybody in the room who's interested in data science, AI, machine learning, analytics. And really, what you want is better analytics, better machine learning, so higher accuracy. You want it cheaper, and you want it faster. And one way to get two of those is to hire smarter people with more experience and try and improve all the models through years and years of training and experimentation.
You might be able to get some improvement by improving the models, but really, this has plateaued in recent times. And today, most of the improvements you can make in these areas come from actually improving the data itself. And so what we're building at Cleanlab is AI systems that can automatically help you improve the data itself, more systematically and more automatically, to give you better analytics, better machine learning, and basically any better downstream outcome that you would get by having a better version of your data.
And so this is part of this movement towards data-centric AI, and really, the main reason for this is just that most machine learning and data science people today are spending 80% of their time just on data preparation, data cleaning, and this is costing a lot of money and leading to bad results if you don't do it properly.
And so me and my co-founders have been doing research in this space for over a decade at MIT, really thinking about how can we use AI, how can we use algorithms, to actually understand what are the common problems that occur in data and propose fixes for these problems.
And we have developed the most popular open-source library today for data-centric AI, which is a tool that can essentially find and fix all kinds of problems in your machine learning datasets and is used by tens of thousands of data scientists in all kinds of companies around the world, probably in Korea as well.
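The core idea behind this kind of label-error detection is to compare a model's predicted class probabilities against the given labels and surface examples where they confidently disagree. Here is a minimal toy sketch of that idea (a simplified heuristic for illustration, not the library's actual confident-learning algorithm):

```python
import numpy as np

def find_label_issues(labels, pred_probs, threshold=0.5):
    """Flag examples whose given label the model assigns low probability
    while predicting a different class (a toy heuristic)."""
    self_conf = pred_probs[np.arange(len(labels)), labels]
    predicted = pred_probs.argmax(axis=1)
    return np.where((self_conf < threshold) & (predicted != labels))[0]

# Toy cat(0)/dog(1) example: item 2 is labeled "cat" but the model is
# confident it is a dog -- a likely mislabel.
pred_probs = np.array([[0.90, 0.10],
                       [0.20, 0.80],
                       [0.05, 0.95]])
labels = np.array([0, 1, 0])
print(find_label_issues(labels, pred_probs).tolist())  # -> [2]
```

The real library is considerably more careful, for example calibrating per-class confidence thresholds rather than using a fixed cutoff.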
And today, I'm excited to introduce you to our SaaS platform, Cleanlab Studio, which essentially does all this in a much more seamless way. So first of all, it provides an interface for you to actually understand the problems in your data and correct them to actually edit big groups of data as well as individual data points.
And this works for image and text data. The kinds of issues that our AI can automatically detect in your datasets, for example in this cat-dog image dataset, include mislabeled data; nearly duplicated data; really weird data, like outliers, which usually indicate some problem in your data sources; non-independent data; as well as all kinds of low-quality examples, like blurry images and other low-quality data, which naturally tend to appear in most enterprise datasets.
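Near-duplicates and outliers can both be scored from nearest-neighbor distances in an embedding space: a near-zero distance suggests a duplicate, an unusually large one suggests an outlier. A minimal sketch under that assumption (an illustration, not the product's actual method):

```python
import numpy as np

def nearest_neighbor_distance(embeddings):
    """Distance from each point to its closest other point: near-zero
    suggests a near-duplicate; unusually large suggests an outlier."""
    d = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # ignore each point's distance to itself
    return d.min(axis=1)

# Toy 2-D "embeddings": points 1 and 2 are nearly identical, point 3 is far off.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.1, 0.01], [5.0, 5.0]])
nn = nearest_neighbor_distance(emb)
dup = np.where(nn < 0.05)[0]  # suspected near-duplicates
out = np.where(nn > 1.0)[0]   # suspected outliers
print(dup.tolist(), out.tolist())  # -> [1, 2] [3]
```

In practice the embeddings would come from a pretrained model, and thresholds would be set relative to the dataset's distance distribution rather than fixed constants.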
As I mentioned, all of this works for text as well. We have a lot of customers using us for their customer service applications, where there are usually large data science teams involved.
And it all works for structured, tabular datasets as well, which might come from a database or an Excel file. And also, some of our users love using us for multimodal datasets. For example, in e-commerce, you have products that are described by numerical features in a tabular format, images, and text descriptions. And our system can basically find all kinds of problems in your data, like miscategorized products, low-quality product images, et cetera.
And so what you can get from Cleanlab Studio is really three things. The first one is automated validation of your data sources. So data teams, data annotation teams, and data review teams love using this as essentially a quality control system for their data. It allows you to ship high-quality data at much lower cost, with fewer people reviewing it and fewer people labeling it.
The second group of users that love our product are data science and machine learning teams, where you typically already have some kind of data science, data analytics, or machine learning pipeline processing your raw data. You can just use our software to quickly produce an improved version of your dataset, then plug that improved version in where you used to use your raw data, and get more reliable machine learning and more reliable analytics with no change to any of your existing tech stack.
And finally, how we're doing all of this behind the scenes: we have a cutting-edge AutoML system that's trained on your data automatically to learn the statistics and natural properties of your data. And we combine that with pretrained foundation models, like LLMs, that know a lot about the world and are able to contextualize your data with their world understanding. And we combine these two systems together to really detect all the problems in your dataset.
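One common pattern for this kind of model-assisted detection (a sketch under assumptions, not Cleanlab's actual internals) is to obtain out-of-sample predicted probabilities via cross-validation, so that every example is scored by a model that never trained on it, and then flag examples the model confidently disagrees with:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
# Two well-separated clusters: class 0 near (-5, 0), class 1 near (+5, 0).
X = np.vstack([rng.normal([-5, 0], 0.5, (20, 2)),
               rng.normal([5, 0], 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
y[3] = 1  # plant one label error: a class-0 point mislabeled as class 1

# Out-of-sample probabilities: each point is scored by a model fit on
# folds that exclude it, so the planted error cannot "defend" itself.
pred_probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                               cv=5, method="predict_proba")

# Flag points whose given label the model confidently rejects.
self_conf = pred_probs[np.arange(len(y)), y]
issues = np.where((self_conf < 0.5) & (pred_probs.argmax(axis=1) != y))[0]
print(issues.tolist())  # -> [3]
```

The same recipe scales to any classifier that outputs probabilities; swapping in features from a pretrained foundation model is what lets it work on images and text.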
But actually, after you go through and clean up your dataset in our application, you can just click one button and retrain this entire system on your fixed dataset and click one more button and actually deploy the system to be able to serve predictions in your applications. And this is one of the fastest ways today to go from messy, raw data to a deployed and highly accurate machine learning model.
Some of our customers include the three biggest tech companies in the world, which use our software to find and fix different kinds of errors in their voice assistant datasets. They are spending hundreds of millions of dollars curating these datasets, and our software allows them to produce better versions of the data at lower cost.
One of the biggest banks in the world, BBVA, is one of our customers. And they were trying to do categorization of financial transactions in their online banking app. So they used to basically label three million data points, three million of these transactions, and now they only have to label 40,000. And they've improved the modeling performance by 30% and reduced, essentially, the costs by 98%.
And all of this is achieved without any change in any of their existing machine learning. So the machine learning model is the same. Their machine learning code is the same. They just plug in our software and are improving the data itself and then using that better data to retrain the same model they already had before.
Tech consulting firms like Berkeley Research Group really love our product because, essentially, our software is so horizontal it can be used across all industries and sectors. And so they discover new use cases all the time, for example, discovering that our software is really useful for legal proceedings, where you have to determine which documents are relevant evidence for a court case and which are not.
And the people annotating this data, the paralegals and lawyers, make many errors, and it's really expensive to keep reviewing the data with more lawyers. So these firms have been able to save a lot of money on document curation for court cases.
Finally, we have customers in the e-commerce space, as I alluded to before, where really, the data is the product. Having a bad dataset means your website is bad and your shop is bad. And so we are able to find all kinds of problems in such datasets, like toxic text or personally identifiable information in your text, plus all kinds of content moderation functionality, like flagging not-safe-for-work images.
But more importantly, we're also able to identify miscategorized products and outliers and nearly duplicated products that really require understanding the information in the data.
And all of this works with LLMs as well. So you can pretty much improve OpenAI's LLMs by improving their data, just like you can improve any other machine learning and analytics by improving its data. And to summarize, as I mentioned before, we really are a super horizontal platform that can be used by the biggest tech companies in the world.
But we also have a lot of really small startups as customers. We have customers in law and finance. We're partnered with the three big data platforms: Databricks, AWS, and Snowflake. So if you're on one of those and you want to use us, it's really easy to get started. And really, in Asian markets, we're just seeking more partners and more customers across all industries, to discover how you might use this horizontal platform to improve your own data and get better machine learning and analytics.
Thank you.