
10.5.23-Showcase-Tokyo-Cleanlab

Startup Lightning Talk
JONAS MUELLER: Hi, everybody. So my talk is going to be of interest to those of you working on data science or machine learning who are interested in getting better performance, reducing your costs, and deploying models faster and more reliably. One way to do this is by actually fixing your data itself. This approach is orthogonal to all of the fancy modeling improvements you can get by hiring big teams of PhDs, and often these data improvements stack together with the modeling improvements your scientists are making.
And at Cleanlab, which is a company I founded about a year ago, we are using AI to figure out the different kinds of problems in your data that can be automatically corrected, so that you get downstream benefits like better analytics and better machine learning or large language models, along with everything else that comes from having a better version of your data.
And so really the motivation behind this is that most data scientists and AI developers spend about 80% of their time on data preparation and data cleaning tasks. And all your data sets, whether you know it or not, are probably full of all kinds of issues.
And so my two cofounders and I have been doing research in this space for over a decade, exploring how we can use algorithms to discover problems in data that can be fixed automatically to give you better downstream benefits. Years ago, as part of our research, we open-sourced a library for this field, which is called data-centric AI. Today, it is the most popular software library for doing this, used by tens of thousands of data scientists across all industries.
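For a sense of what the open-source workflow looks like, here is a minimal sketch using the cleanlab library; the scikit-learn classifier and random toy data are stand-ins for your own model and data set, not a prescribed setup.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict
    from cleanlab.filter import find_label_issues

    # Toy stand-ins for your own features and (possibly noisy) labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))
    labels = rng.integers(0, 3, size=300)

    # Out-of-sample predicted probabilities from any classifier you like.
    pred_probs = cross_val_predict(
        LogisticRegression(max_iter=1000), X, labels,
        cv=5, method="predict_proba",
    )

    # Indices of the examples most likely to be mislabeled, worst first.
    ranked_issue_indices = find_label_issues(
        labels=labels,
        pred_probs=pred_probs,
        return_indices_ranked_by="self_confidence",
    )
    print(ranked_issue_indices[:10])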
Recently, we launched our SaaS platform called Cleanlab Studio, which is basically doing this on steroids and provides an interface to actually go in and fix all of the issues detected by our AI very efficiently. Shown here is that interface fixing some issues in an image data set, where the given label for this image is speaker. But, clearly, this is some kind of concert, and a better label might be, say, stage.
Also, this works for text data, for example, customer service requests at a bank, which is a common use case for us. The kinds of issues in data that our AI can automatically detect include, as in this cat-dog data set: mislabeled data, nearly duplicated data, strange outliers that are often indicative of a problem in your data sources, non-independent data that's actually linked together, and low-quality images or low-quality text.
And as I mentioned, all of this works for text data; for structured data, where we can find incorrect values, often due to data entry mistakes or mistakes made when merging two different data tables; and also for multimodal data sets like e-commerce platforms, where we can find miscategorized products, low-quality product images, nearly duplicated products, and all kinds of other problems, so you can quickly clean up the e-commerce data set and deliver a better experience for your customers.
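In the open-source library, these checks are exposed through the Datalab interface. Here is a hedged sketch of how such a multi-issue audit might look; the random toy data stands in for your own rows, feature embeddings, and out-of-sample model probabilities.

    import numpy as np
    import pandas as pd
    from cleanlab import Datalab

    # Toy stand-ins: in practice, supply your own rows, feature embeddings,
    # and out-of-sample predicted class probabilities.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"label": rng.integers(0, 3, size=300)})
    features = rng.normal(size=(300, 16))             # e.g. text or image embeddings
    pred_probs = rng.dirichlet(np.ones(3), size=300)  # per-class probabilities

    lab = Datalab(data=df, label_name="label")
    lab.find_issues(features=features, pred_probs=pred_probs)
    lab.report()  # summarizes label, outlier, near-duplicate, and non-IID issues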
And so what you can get from our software includes automated data validation, which you can think of as basically quality control for your data team. A second thing you can get is a better version of your data set, produced quickly with our software. You can then plug that better version in place of your original data set: it has the exact same format, so you get more reliable machine learning, analytics, or whatever you were originally doing in your existing pipeline, without having to change that pipeline at all.
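In the open-source library, that drop-in idea looks roughly like the following minimal sketch; the logistic regression model and toy data are placeholder assumptions, not a recommended configuration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from cleanlab.classification import CleanLearning

    # Toy stand-ins for your own features and possibly noisy labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))
    labels = rng.integers(0, 2, size=300)

    # CleanLearning wraps any scikit-learn-compatible classifier: it finds
    # likely label issues, then fits the model on the cleaned data, while
    # exposing the same fit/predict interface as the wrapped model.
    clf = CleanLearning(LogisticRegression(max_iter=1000))
    clf.fit(X, labels)
    preds = clf.predict(X)

Because the wrapper keeps the same fit/predict interface, the surrounding pipeline does not need to change, which is the same property the cleaned data export has.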
And, finally, how are we detecting all of these issues in your data in the first place? We've developed a really sophisticated automated machine learning system that both trains models specific to your data and uses pretrained foundation models like LLMs, which know a lot about the world and are able to contextualize your data with that knowledge.
And after you've fixed up your data set in our system, you can actually just click one button and retrain the entire system on the clean data set and then click one more button and deploy this machine learning system to be available for prediction on new data. So this is pretty much the fastest way today to go from messy, raw data to a highly accurate and reliable deployed machine learning model.
Our customers include some of the biggest tech companies in the world. They have used our software for their voice assistant data sets, which are humongous; they spend hundreds of millions of dollars cleaning these data sets. Using our software, they're able to drastically reduce those costs while still producing data at the same quality as before.
Another customer is BBVA, one of the biggest banks in the world. Here they used our software for their machine learning problem of categorizing financial transactions in their online banking app. And what they achieved by just improving the data, without changing any of their existing machine learning, was a 98% reduction in how much data they had to label.
So that's cost savings. They were also able to improve the model accuracy by 28%, so that's improved customer experience. And again, this is all with no change to any of their existing stack.
Another kind of customer that really loves using our technology is tech consulting firms like Berkeley Research Group, because our technology is super general purpose and useful across pretty much all industries.
Here is an example in a legal application where our software is used to determine which annotations in legal documents are incorrect. These annotations are often provided by lawyers and paralegals and are used to determine, for example, what evidence is admissible in a court case, or what evidence is privileged information that must be concealed from the case. As you can imagine, they are very expensive to obtain. So by using our software to find errors and reduce how much review such data needs, you achieve great cost savings and improve your machine learning performance.
Another application is e-commerce, where, as I mentioned before, our software helps e-commerce platforms automatically curate their data sets to provide a much better experience for their customers, including content moderation functionalities like detecting toxicity or personally identifiable information in product descriptions and reviews, or low-quality or not-safe-for-work product images.
And the final thing I want to mention is all of this works for pretty much any kind of machine learning, including large language models. So here we show different OpenAI large language models being trained on different versions of data sets that have been automatically corrected using our tool, as well as corrected via the interface I showed before.
And you can see how the performance of any kind of large language model-- here the three latest ones from OpenAI-- just continues going up as you improve the data. And, again, you're never changing anything in the large language model itself here. Obviously, that's very difficult when you're just using an OpenAI API.
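To make that setup concrete, the comparison has roughly the following shape; this is a hedged sketch using the OpenAI Python client, where the file name and base model are illustrative assumptions, and the cleaned file is simply the corrected version of the same training data.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Upload the auto-corrected training data; it keeps the same JSONL
    # format as the raw training file it replaces.
    cleaned_file = client.files.create(
        file=open("train_cleaned.jsonl", "rb"),
        purpose="fine-tune",
    )

    # Launch a fine-tuning job on the cleaned data. Only the data changed;
    # the model and job setup stay exactly the same as for the raw data.
    job = client.fine_tuning.jobs.create(
        training_file=cleaned_file.id,
        model="gpt-3.5-turbo",  # illustrative base model
    )
    print(job.id, job.status)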
And with that, I just want to emphasize our technology is extremely horizontal. So we're pretty much looking for customers across all industries working on any kind of data science and machine learning tasks. We're here to help you discover the problems in your data, improve them, and get better machine learning and analytics as a result. Thank you.