
10.25.23-Digital-Cleanlab

Fixing datasets for improved ML
CHEN LU: Hi, everyone. My name is Chen, and I'm an engineer at Cleanlab. At Cleanlab, we increase the dollar value of your data sets with AI by generating smart metadata that flags issues in those data sets. Cleanlab was founded by three MIT PhDs, and we're used by some of the world's leading organizations.
The problem that we want to solve is that real-world data sets are messy and dirty. Now, why should you care about this? There are two reasons. The first is that your model is only ever as good as the data it is trained on. So if you train your model on dirty, messy data, you'll get models that are unreliable and perform poorly.
OpenAI has talked extensively about how data quality is critical to the performance of their generative models. An example is their image generation AI, DALL-E 2. The point here is that, without carefully curated data sets, these models would not be possible.
The second reason you should care about messy data is that data cleaning takes a very long time, so think very high costs. Here, at the top, I'm showing the pipeline of steps you need in order to deploy AI models in production. You can see that all the steps in the middle have to do with enhancing data quality. You would often need to do them multiple times, and you would need a dedicated team of data scientists doing these steps. So, again, think very high costs.
And all of this is before you even deliver any value with your models. In fact, Andrew Ng observed that 80% of a developer's time is spent on data preparation.
So how can Cleanlab help with this? We help by automating the entire process in the middle. We can identify issues in your data sets automatically, such as label errors, near duplicates, or outliers, which are all difficult and time consuming to identify otherwise. We work with most common data modalities, including images, text, and tabular data. And we make it very easy for you to identify the kinds of errors you have in your data sets and fix them. With one click, you can clean your data set and deploy models on that clean data.
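[Editor's note: for readers who want to try this on their own data, here is a minimal sketch of automated label-error detection using Cleanlab's open-source Python package. The toy dataset, classifier, and simulated label noise are illustrative assumptions, not the pipeline described in the talk.]

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# Toy 3-class dataset with a few labels deliberately flipped to simulate noise.
X, labels = make_classification(n_samples=500, n_classes=3, n_informative=5, random_state=0)
rng = np.random.default_rng(0)
flip = rng.choice(len(labels), size=25, replace=False)
labels[flip] = (labels[flip] + 1) % 3

# Out-of-sample predicted probabilities from any classifier of your choice.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels, cv=5, method="predict_proba"
)

# Boolean mask flagging examples whose given label looks likely to be wrong.
issues = find_label_issues(labels=labels, pred_probs=pred_probs)
print(f"Flagged {issues.sum()} suspected label errors out of {len(labels)} examples")

[The same pred_probs can come from any model; Cleanlab only needs the labels and out-of-sample probabilities, which is what makes it model-agnostic.]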
So what can you use Cleanlab for? Really anything where you want to improve the quality of your data. Maybe you have many large data sets that are high value and you want to ensure there are no errors in them. You can use Cleanlab to identify the issues. Or maybe you want to deploy AI models in production and you have a continuous stream of new data that you want to curate. And so you can embed Cleanlab in the workflow and automatically clean your data and retrain your models on the new data.
And since Cleanlab is built with a data-centric view, it doesn't really matter where your data actually comes from. So you can even use Cleanlab on top of generative AI to make the outputs more robust. For example, Cleanlab recently released our Trustworthy Language Model (TLM) API.
So you can use this on top of any large language model that you like, and in addition to the usual text answer to a question, it will give you a confidence score indicating how confident the model was in its answer. In the example here, for the first question we correctly assign high confidence, and the answer is in fact correct. For the second, we show that the model is unsure, and the answer it gives is actually incorrect.
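[Editor's note: a rough sketch of what calling such an API looks like. The cleanlab_studio client, method names, and response fields below reflect my understanding of Cleanlab's published TLM interface and may differ by version; the API key is a hypothetical placeholder.]

from cleanlab_studio import Studio

studio = Studio("<YOUR_API_KEY>")  # hypothetical placeholder key
tlm = studio.TLM()

out = tlm.prompt("What year did the first moon landing happen?")
print(out["response"])               # the usual text answer
print(out["trustworthiness_score"])  # confidence in that answer, between 0 and 1

[A low trustworthiness score is the signal to escalate the question to a human or a fallback system rather than trusting the generated answer.]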
So here's a specific use case of Cleanlab with the bank BBVA. They wanted to build an ML model to categorize financial transactions in their app. And the key bottleneck for them was obtaining high quality training data. So they used Cleanlab in an active learning loop where they would get new unlabeled transaction data. They would get multiple human annotators to label the data. Then they would use Cleanlab to identify the errors in their labels and to assess the quality of the annotators. And then, with this new batch, they would retrain the model and repeat.
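[Editor's note: a simplified, runnable sketch of the label-consolidation step inside such a loop, using the open-source cleanlab package's multi-annotator analysis. The toy annotations and probabilities stand in for the real annotation tool and model, and the function name and return keys reflect my reading of the cleanlab API.]

import numpy as np
import pandas as pd
from cleanlab.multiannotator import get_label_quality_multiannotator

# One toy batch: 4 transactions labeled by 3 annotators
# (NaN means that annotator did not label that example).
annotations = pd.DataFrame({
    "annotator_1": [0, 1, 1, np.nan],
    "annotator_2": [0, 1, 0, 2],
    "annotator_3": [0, np.nan, 1, 2],
})
# Predicted class probabilities from the current model for the same 4 examples.
pred_probs = np.array([
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
])

results = get_label_quality_multiannotator(annotations, pred_probs)
consensus = results["label_quality"]["consensus_label"]  # cleaned labels to retrain on
print(results["annotator_stats"])                        # per-annotator quality scores

[The consensus labels feed the next retraining round, and the annotator statistics identify which labelers need review, which is the two-sided benefit described in the BBVA loop.]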
So this model is already deployed in many countries. Compared to the previous iteration of the model, they were able to reduce labeling effort by 98%, a big cost saving, while improving accuracy by 28%. And all of this was achieved with data improvements alone; there was no change in the underlying model code. The use of Cleanlab to ensure data quality was critical to this success.
So here are some sectors of companies that partner with or use Cleanlab. Since Cleanlab is such a general-purpose tool, we have users in many different areas. Today we are seeking any company that wants to automatically handle messy, real-world data, whether or not you're in the sectors shown before. We're industry agnostic and globally focused. And we're mainly looking for customers, because we already have a record of delivering value.
So I'll be happy to talk to anybody who is affected by messy data afterwards. Or if you're unsure how to evaluate the impact of messy data on your business, we'd be very happy to chat. There's my contact information again, and thanks very much.