10.3.23-Showcase-Osaka-Cleanlab

Startup Exchange Video | Duration: 5:34
October 3, 2023

    JONAS MUELLER: Hi, everybody. Thanks for having me. Super-excited to be here. So at Cleanlab what we do is we provide AI that we invented that can find and fix problems in data sets. This helps you turn your messy, unreliable data into more reliable analytics, more reliable machine learning, more reliable large language models, and pretty much any downstream thing that you're trying to use your data for.

    So all of your companies are probably collecting a lot of information and storing it as data, as all good companies do nowadays. But there are actually a lot of problems in most of that information you're storing. And when people are creating AI models and doing data science, they typically spend about 80% of their time just preparing the data, cleaning it up, and fixing all these issues.

    So our goal, and what we've been doing research on for the past decade, is automating a lot of that process by figuring out algorithms that can systematically find and fix problems in data. We invented the first library that does this, called Cleanlab, which we open-sourced back in grad school. Today it's the most popular data-centric AI library for algorithmically finding problems in data, and we have thousands and thousands of data scientists using it from all kinds of companies.

    I'm here today to talk about our SaaS platform, Cleanlab Studio. Cleanlab Studio provides an interface that runs all these algorithms behind the scenes, but also provides a way to fix your data as efficiently as possible, in a human-in-the-loop fashion, so that you, as a data scientist, can inject your domain knowledge and triage the problems in your data based on how you're using the data. This works for text data. For example, here we have a customer service request that's been miscategorized in the bank's data system.

    And our software automatically detects that the category chosen for the original example is incorrect. We can detect all kinds of issues in image data. For example, in this cat-dog data set, there are examples that are mislabeled, and near-duplicate data. You often just have junk in your data, outliers that may indicate a problem with the data sources.

    You have non-independent data, and all kinds of other issues, like low-quality images coming from various types of forms. All of this works for text data as well, and for structured tabular data sets, too, where you might have incorrect entries due to data-entry problems. It also works for multimodal data sets, like those on e-commerce platforms, where finding and fixing problems in the data is obviously very valuable for delivering a better experience.
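
Near-duplicate detection of the kind mentioned above is often done by comparing embeddings. A minimal sketch, assuming embeddings have already been computed by some image or text model (the product's actual method is not shown here):

```python
# Hypothetical sketch of near-duplicate detection on precomputed embeddings;
# pairs with cosine similarity above a threshold are flagged for review.
import numpy as np

def near_duplicate_pairs(embeddings, threshold=0.98):
    """Return index pairs whose cosine similarity exceeds the threshold."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T  # pairwise cosine similarities
    n = len(embeddings)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if sims[i, j] >= threshold]

emb = np.array([
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # nearly identical to the first vector
    [0.0, 1.0, 0.0],
])
print(near_duplicate_pairs(emb))  # [(0, 1)]
```

An outlier check can use the inverse signal: an example whose maximum similarity to every other example is unusually low is a candidate outlier.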

    So what can you get from our software? You get automatic validation of your data sources, so think of this as quality control for your data team. You can then use our software to produce a better version of your data set, and use that better data set in place of the original for any ML or analytics work. And you just get more reliable results without changing anything in your existing pipeline.

    Behind the scenes, all of this is based on a really sophisticated AutoML system that combines models trained on your data with pre-trained foundation models, like LLMs and computer vision models, that know a lot about the world and can contextualize your data with their knowledge. And you can deploy that entire system to make predictions on new data, if you would like to.

    Some of our customers include all three of the biggest tech companies in the world, mainly for their voice assistants, where they are spending hundreds of millions of dollars cleaning data, and our software is able to save them massive costs. Another customer is BBVA, one of the biggest banks in the world. They used our system for categorizing financial transactions in their online banking app via machine learning.

    They were able to reduce how much they spend on labeling data by 98% and improve their machine learning performance by 30%, without actually changing any of their machine learning. Tech consulting firms love using us because of the general-purpose nature of our technology.

    One example here is law firms using us to curate data sets that decide what evidence is or is not relevant in a legal case. And e-commerce platforms use us because they can find all kinds of problems on their websites that are hurting the consumer experience. And all of this works for large language models.

    I know there's a huge amount of interest in that today. Here we're showing how you can improve all three of the most famous large language models from OpenAI solely by automatically filtering out bad data from their training sets, or by fixing the data in their training sets, to get better performance. In all three cases, we are not changing anything about the large language model itself. Most of you who use OpenAI are just using it through a black-box API, so you don't really have any control.
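
The filtering step described here can be sketched in a few lines. This is an assumed workflow, not OpenAI's or Cleanlab's actual API: given per-example quality scores (for instance, label-quality estimates), drop low-scoring examples before sending the cleaned data to a black-box fine-tuning endpoint.

```python
# Illustrative sketch (assumed workflow): filter a fine-tuning dataset by
# dropping examples flagged as likely bad, leaving the model itself untouched.

def filter_training_data(examples, quality_scores, min_quality=0.5):
    """Keep only examples whose quality score meets the cutoff.

    examples:       list of (prompt, completion) pairs
    quality_scores: per-example scores in [0, 1], e.g. label-quality estimates
    """
    return [ex for ex, score in zip(examples, quality_scores)
            if score >= min_quality]

examples = [("What is 2+2?", "4"),
            ("Capital of France?", "Berlin"),
            ("Boiling point of water?", "100 C")]
scores = [0.95, 0.10, 0.90]  # the second answer is wrong, so it scores low
print(filter_training_data(examples, scores))
```

The wrong answer is dropped and the surviving pairs form the cleaned training set, which is the only lever you have when the model sits behind an API.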

    So all you can really change is the data itself. And our software helps you do that efficiently. So again, some of our customer profiles include the biggest tech companies in the world, but also really tiny startups, tech consulting firms, finance. It's a really, really horizontal platform. And so I'm interested in speaking with folks from any kind of industry, really, where you're dealing with big data sets, doing AI or analytics, and just trying to ensure that you are getting the most reliable results that you can.

    We also have partnerships with Databricks, Snowflake, and AWS, so if any of your data is stored in one of these, it's essentially a no-work integration to use our technology. And with that, I'm happy to take questions. Thanks for your time.
