Navigating the World of Machine Learning Datasets

Many companies are turning to external datasets to launch AI successfully. Finding datasets is easier than ever before, and they are becoming more and more important for the performance of machine learning models. Many sites host repositories of data covering an incredible range of topics, from images of rare frogs to handwriting samples. Whatever your machine learning (ML) project is, you’re likely to find a relevant dataset to serve as a starting point.

In this article, we’ve collected 40+ links to some of the best ML data repositories and datasets available. We’ve sorted these by project type and by industry for ease of use. Keep in mind that while these datasets are usually good starting points, your use case might require additional labeling on top of what is available off-the-shelf.

What Kind of Data Do I Need?

Before beginning your search for the right dataset(s), ask yourself a few critical questions to guide your efforts:

What am I trying to accomplish with AI?
Do I have sufficient in-house data I can leverage for this project?
What data do I wish I had?
What use cases do I need my data to cover?
What edge cases do I need my data to cover?

These are simply starting questions to help paint a clearer picture of the specific type of data that you’ll need. If you’re working with protected classes (that is, people of a specific race, sex, sexual orientation, or other factors), you’ll need to put in extra effort to ensure your dataset appropriately represents these people. In any case, be intentional in your search for data; a machine learning project can easily be derailed by poor-quality data.

Why Off-the-Shelf Datasets?

Your team may end up deciding that you want to use off-the-shelf datasets to train your model. These options are growing increasingly prevalent in the field of AI for one reason: building AI is hard. Most AI projects fail to reach deployment due to a variety of factors:

Low budgets. Investing in AI often requires a large amount of capital.
Lack of talent. A skills gap persists not only in tech, but in AI and ML specifically. The industry lacks sufficient highly-skilled individuals to launch all of the existing AI initiatives, much less those in the future. This gap may only widen over time as the industry grows.
Early in the AI journey. Organizations must be set up properly to be able to build AI. That means they need to have the right internal processes, strategies, and collaboration in place in order to achieve success.
Low data quality or not enough data. It’s this last piece that proves to be one of the biggest hurdles in AI. ML models typically require lots of data to perform with accuracy. Acquiring this data can be challenging depending on the use case. In addition, transforming poor quality data into high quality, labeled data can be a time-consuming, inefficient process.

Given that reaching deployment continues to prove difficult for many organizations, it’s no surprise that they’re turning to third parties for help. To address the data bottleneck, companies are purchasing or accessing free off-the-shelf datasets. These can prove a useful starting point for building an ML model, or in some cases provide sufficient coverage for all use cases. Let’s talk about their benefits:

Compliance. There are growing data security requirements from customers and authorities that make it more difficult for companies to use in-house data. Some companies naturally have access to a lot of data given the work they do, but that doesn’t mean that data can be used for ML models, especially if it would violate customer privacy.
Reduced bias. The topic of responsible AI is more frequently discussed than ever before, as companies realize the importance of mitigating bias in their models. When companies rely on in-house data, it can be difficult to detect and reduce bias. But with an off-the-shelf dataset, you can research the source of the data to understand if they already incorporated bias checks while creating the data. A trusted provider will provide a diverse, high-quality dataset.
Fast time-to-market. Collecting and preparing data is very time-consuming, and it is where data scientists spend much of their time on a project. With off-the-shelf datasets, the work is mostly done (although you’ll still want to check the dataset’s quality yourself). This leads to a faster time-to-market in an industry where speed matters.
Cost-effective. Aggregating, reviewing, and preparing in-house data can be a costly process. Many off-the-shelf datasets available online are free or inexpensive compared to that alternative. If your AI budget isn’t very high, leveraging off-the-shelf datasets may be the right path.

Many of the benefits of off-the-shelf datasets help overcome the common challenges in AI development, which makes them a useful strategy to consider when implementing ML models.

The Best Places to Find Datasets

The internet is full of quality off-the-shelf datasets. The following lists cover many of the best places to search for and to discover datasets online, in no particular order. We start with data repositories, then list the best datasets for specific use cases.

Data Repositories

Data repositories feature collections of datasets from all over the web.

Kaggle

Kaggle has one of the largest libraries of datasets available online covering a range of topics like sports, medicine, and government. Its platform is community-driven, meaning users can upload their own datasets. Given the varied sources of data, it’s important to thoroughly check the quality of any datasets you use from Kaggle. Kaggle also features discussion about machine learning topics as well as tutorials on key processes.
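As a rough illustration of what a first quality check might look like, the sketch below counts missing values, exact duplicate rows, and label frequencies in a small CSV. The column names and sample rows are invented for the example and do not come from any Kaggle dataset; a real check would depend on your data and use case.

```python
import csv
import io
from collections import Counter


def quality_report(rows, label_column):
    """Summarize missing values, exact duplicates, and label counts."""
    rows = list(rows)

    # Count empty cells per column.
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip() == "":
                missing[column] += 1

    # Count rows that repeat an earlier row exactly.
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    # Count how often each non-empty label appears.
    labels = Counter(row[label_column] for row in rows if row[label_column])

    return {
        "rows": len(rows),
        "missing": dict(missing),
        "duplicates": duplicates,
        "labels": dict(labels),
    }


# Hypothetical sample data standing in for a downloaded CSV file.
sample_csv = """text,label
good food,positive
good food,positive
slow service,negative
,negative
cold soup,
"""

report = quality_report(csv.DictReader(io.StringIO(sample_csv)), "label")
print(report)
```

A skewed label count or a high share of missing values in a report like this is a signal to look more closely before training on the data.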

Google Datasets

Google offers a dataset search engine, where you can search datasets by name. The engine lets you sort datasets by several features, such as file type, theme, last update, and relevance. It also captures datasets from thousands of databases around the internet, so you’re truly searching through a wide range of options. The uploaders of datasets include universities and international organizations, such as Harvard and the World Health Organization.

Papers with Code

Papers with Code has over four thousand (and counting) datasets available. These datasets are uploaded by the community. You can easily filter these datasets by modality, task, and language. The database also links to other databases that offer further varieties of datasets.

DataFlair

DataFlair links to over 70 machine learning datasets, and includes useful information like the source code as well as project ideas. For example, in a listing for a dataset that features handwritten digits, DataFlair suggests creating an image classification algorithm to recognize handwritten digits from a paper. The site is a useful jumping-off point for new ideas.

EliteDataScience

EliteDataScience includes a curated list of free datasets and their favorite aggregators. The datasets are organized by use case, so you’ll discover datasets for deep learning, natural language processing, web scraping, and more.

UCI ML Repository

UCI features over 500 machine learning datasets that are sortable by file type, task, area of application, and subject. Many of their datasets include links to academic papers that you can use for benchmarking.

GitHub Awesome Public Datasets

Awesome Public Datasets is an open-source collection of public datasets hosted on GitHub. Review the table of contents to select a topic, which ranges from Agriculture to Transportation with many options in between. GitHub also includes a collection for general machine learning models. Most of the datasets linked are free.

Azure Public Datasets

Microsoft Azure has a database of public datasets that developers can use for prototyping and testing. The categories include US government and agency data, other statistical and scientific data, and online service data. In addition, you can read documentation on SQL and how to build mobile and web apps.

Snowflake Data Marketplace

Snowflake gives data scientists, business intelligence and analytics professionals, and everyone who desires data-driven decision-making access to more than 650 live and ready-to-query datasets from over 175 third-party data providers and data service providers.

Registry of Open Data on AWS

AWS has a registry that features datasets available through AWS resources. Users can share their own datasets or add examples of how to use specific datasets. Over 280 searchable datasets are available in the registry.

KDNuggets

KDNuggets has a comprehensive list of data repositories, where you’ll be able to find a wide variety of datasets. The list features over 75 repositories, including some that are international.

Appen

Appen offers a variety of off-the-shelf training datasets. Our catalog includes 250+ licensable datasets across 80 languages with multiple dialects included. The datasets cover many machine learning use cases, including speech recognition and natural language processing, and cover a range of file types (text, image, video, speech, and audio). For example:

Fully transcribed speech datasets for broadcast, call center, in-car, and telephony applications
Pronunciation lexicons, including both general and domain-specific (e.g. names, places, natural numbers)
Part-of-speech-tagged lexicons and thesauri
Text corpora notated for morphological information and named entities.

We offer datasets of only the highest levels of quality to support your AI needs.

Computer Vision Datasets

These databases and datasets include image data to service your computer vision project.

ImageNet

ImageNet is an image database organized according to the nouns of the WordNet hierarchy, where each node is illustrated by hundreds to thousands of associated images. The repository’s data is free to researchers.

MNIST Database

MNIST features images of handwritten digits. It includes a training set of 60,000 examples and a test set of 10,000 examples.

IMDB-Wiki Dataset

IMDB-Wiki Dataset provides one of the largest public collections of face images, with over 500,000 images. The images show celebrities and were collected from IMDb and Wikipedia. Each image has a gender and age label attached.

LabelMe Dataset

LabelMe Dataset was built using the LabelMe annotation tool. The tool enables the user to outline an object and add a label to that object. This dataset can be used for image recognition projects.

MS COCO Dataset

MS COCO stands for Microsoft Common Objects in Context Dataset, and was published for the Common Objects in Context challenge. It includes over 120,000 images, and each image has multiple annotations for object detection, segmentation, and other image annotation tasks. The set covers 91 object categories.

Chars74K

Chars74K includes, as the name suggests, about 74,000 images of characters. The data supports character recognition in natural images (for instance, an image of a restaurant sign).

Kinetics-700

Kinetics-700 includes a selection of YouTube video links labeled with human-focused actions. There are over 650,000 video clips across 700 human actions.

Places2 Database

Places2 Database is a dataset released by MIT that has more than 10 million images of over 400 scenes. It can help on projects covering scene classification and scene parsing.

Open Images

Open Images dataset is one of the largest datasets featuring object location annotations. It has over 9 million images that are each labeled with object bounding boxes, segmentation, and other annotations. In all, there are 16 million bounding boxes across 600 classes.

MPII Human Pose Dataset

MPII Human Pose Dataset includes about 25,000 images covering 410 human activities. There are over 40,000 people included in the images, each with annotated body joints. The images were collected from YouTube videos.

Natural Language Processing Datasets

The following datasets feature natural language examples across text and audio that can be used for your natural language processing projects. These examples cover sentiment analysis, speech recognition, transcription, and more.

Google Blogger Corpus

Google Blogger Corpus includes close to 700,000 blog posts taken from blogger.com. In each entry, there are at least 200 English words. Overall, the blog posts include many commonly occurring English words.

Yelp Reviews

Yelp Reviews covers ratings and reviews for restaurants, and the dataset is rich with information related to this topic. The reviews are well suited to sentiment analysis.

WikiQA Corpus

WikiQA Corpus is a dataset featuring question and answer pairs compiled from Bing search data. For its more than 3,000 questions, it offers about 29,000 candidate sentences, roughly 1,500 of which are labeled as correct answer sentences.

M-AI Labs Speech Dataset

M-AI Labs Speech Dataset includes close to 1,000 hours of audio paired with transcriptions. Both female and male voices are represented across several languages.

LibriSpeech

LibriSpeech includes about 1,000 hours of speech data that has been segmented and aligned. The data was compiled from the reading of audiobooks from the LibriVox project.

WordNet

WordNet is a database of English words that are grouped together by meaning. There are 117,000 synsets (sets of words grouped by synonymy), which are then linked to related synsets. Use this for your next text classification project.

OpinRank Dataset

OpinRank Dataset features 300,000 reviews curated from Edmunds and TripAdvisor. They’re categorized by travel destination, hotel, and other relevant factors.

Multi-Domain Sentiment Dataset

Multi-Domain Sentiment Dataset consists of Amazon.com product reviews across four domains: DVDs, books, kitchen, and electronics. Each domain has a few thousand reviews with star ratings from 1 to 5 attached. As the name suggests, this could be a useful dataset for a sentiment analysis project.

Twitter Sentiment Analysis

Twitter Sentiment Analysis dataset includes over 1.5 million classified tweets. Each row of the dataset has a label: 1 for a positive sentiment and 0 for a negative sentiment.

20 Newsgroups

20 Newsgroups contains about 20,000 documents from, as the name suggests, 20 different newsgroups. There are many topics included, some of which are relatively similar. The dataset includes three versions: one in its initial form, one with dates removed, and one with duplicates removed.

Datasets by Industry

It’s worth mentioning several valuable resources to obtain industry-specific data.

US Government Data Portal

US Government Data Portal includes government data that the US has pledged to make available. By visiting the portal, you can search through over 300,000 datasets (for example, student loan data or healthcare provider charges data). Industry: Government

European Union Open Data Portal

European Union Open Data Portal offers a way to search through data from European Union institutions, such as population data, education, and more. Industry: Government

World Health Organization

World Health Organization features data covering important topics like world hunger, healthcare, and disease. Industry: Healthcare

Broad Institute

Broad Institute provides many datasets that cover cancer-related topics, from sequencing to classification. Industry: Healthcare

Google Finance

Google Finance includes over 40 years of stock market data, and is continually updated in real time. Industry: Finance

Berkeley DeepDrive

Berkeley DeepDrive was created by UC Berkeley and contains over 100,000 video clips of different geographic, environmental, and weather conditions. These clips are annotated with bounding boxes to detect objects, lane markings, and various forms of segmentation. The dataset can be used to help train autonomous vehicles. Industry: Automotive

Level5

Level5 was created by Lyft, the ride sharing company. The dataset features raw sensor camera and LiDAR data captured by multiple autonomous vehicles in a specific geographic area. The dataset is labeled with 3D bounding boxes of specific target objects. Industry: Automotive

USDA Open Data Catalog

USDA Open Data Catalog includes data captured by the US Department of Agriculture. The topics range from measured productivity of US agriculture to cost estimates of foodborne illnesses and much in between. Industry: Agriculture

Fashion-MNIST

Fashion-MNIST includes 60,000 training images and 10,000 test images of fashion industry products across 10 classes. These are useful for a product categorization project. Industry: Retail

eCommerce Search Relevance

eCommerce Search Relevance dataset features links to products, what rank those products achieved on a page, the search query that provided that result, and other relevant attributes. The data was collected from five major English-language ecommerce sites. Industry: Retail

To find datasets in industries not mentioned here, simply search through the data repositories above using the appropriate industry tag.

Expert Insight from Monchu Chen - Principal Data Scientist

What to consider when selecting a dataset

When starting a new project, it’s best not to rush to the first available dataset. Take a step back and look at the user needs that your applications or services are going to serve. Sometimes, the same product design opportunity could be addressed by different AI-driven features. The potential solutions you identify may rely on different ML models, which can differ in cost to develop and build and will likely require different approaches to training data. Once you are ready to move forward, here are some tips for selecting publicly available datasets to kick-start your development when you don’t have access to a dedicated budget for curating your own collections.

Subset of a dataset?

When choosing a dataset, don’t be frightened by the complexity of the whole dataset. Sometimes, you could extract a subset of the whole, and that can be exactly what you need for your ML project.
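As a minimal sketch of this idea, the example below keeps only the classes a project needs and draws a fixed-size random sample from each. The record layout (dictionaries with "id" and "label" keys) and the function name are hypothetical; real datasets will need their own loading code.

```python
import random
from collections import defaultdict


def stratified_subset(records, label_key, keep_labels, per_label, seed=0):
    """Return up to `per_label` randomly chosen records for each kept label."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group the records we care about by their label.
    by_label = defaultdict(list)
    for record in records:
        if record[label_key] in keep_labels:
            by_label[record[label_key]].append(record)

    # Sample from each group without replacement.
    subset = []
    for label in sorted(by_label):
        group = by_label[label]
        subset.extend(rng.sample(group, min(per_label, len(group))))
    return subset


# Hypothetical toy dataset: 1,000 records spread evenly over 10 classes.
records = [{"id": i, "label": i % 10} for i in range(1000)]

# Keep 20 examples each of classes 3 and 7 only.
subset = stratified_subset(records, "label", {3, 7}, per_label=20)
print(len(subset))
```

Sampling per class rather than from the whole pool keeps the subset balanced even when the original dataset is not.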

Combining multiple datasets?

Sometimes, the dataset you chose might not fit exactly with what you need to develop your model. Consider combining multiple datasets (or subsets) to form a training set with a better resemblance to the total population of the use case you want to tackle.
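Combining datasets usually means mapping their labels onto one shared scheme first. The sketch below merges a source labeled 1/0 with a source labeled by star rating, then drops repeated texts. The sample rows, the function names, and the choice to treat 4 to 5 stars as positive, 1 to 2 stars as negative, and 3 stars as ambiguous are all assumptions made for illustration.

```python
def from_binary(rows):
    """Convert (text, 0/1) rows to the shared record format."""
    for text, flag in rows:
        label = "positive" if flag == 1 else "negative"
        yield {"text": text, "label": label, "source": "binary"}


def from_stars(rows):
    """Convert (text, 1-5 stars) rows; skip 3-star rows as ambiguous."""
    for text, stars in rows:
        if stars == 3:
            continue
        label = "positive" if stars >= 4 else "negative"
        yield {"text": text, "label": label, "source": "stars"}


def combine(*streams):
    """Concatenate record streams, dropping texts already seen."""
    seen = set()
    combined = []
    for stream in streams:
        for record in stream:
            key = record["text"].strip().lower()  # crude normalization
            if key in seen:
                continue
            seen.add(key)
            combined.append(record)
    return combined


# Hypothetical rows standing in for two separately sourced datasets.
binary_rows = [("Great service", 1), ("Never again", 0)]
star_rows = [("great service ", 5), ("It was fine", 3), ("Broke in a week", 1)]

combined = combine(from_binary(binary_rows), from_stars(star_rows))
for record in combined:
    print(record)
```

Keeping a "source" field on every record makes it possible to check later whether one dataset dominates the combined training set.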

API available?

Many datasets come with APIs or libraries for easy data access and transformation. This could save you valuable time early in your journey.

Sample projects available?

You can also look for people who have worked on projects that use popular datasets and have made their work public in repositories such as GitHub. Use their source code, models, or even pre-trained models as a foundation, or simply as a reference when making your data choices.

License issues?

Just like software, datasets come with different types of licenses. Some may require you to share your work on that specific dataset. Others may limit your applications to non-commercial use only. A typical strategy is to separate your code as far as possible from the datasets. The best way to stay safe is to seek legal advice before choosing a dataset for your applications.

Short/Long term consideration?

When making short-term decisions, such as selecting your very first dataset to work with, it’s best to consider its long-term impact. Look at the big picture and you may find the second-best choice at the beginning may save you a lot of time, effort and budget, when you need to transition from a public domain dataset to your own curated dataset.

What We Can Do For You

As you decide to further enrich the off-the-shelf datasets, you can leverage our data collection and annotation services and our platform to get the data your machine learning model needs to work at scale. As a global leader in our field, we give our clients the capability to quickly obtain large volumes of high-quality data across multiple data types, including image, video, speech, audio, and text, for their specific AI program needs. We offer several data solutions and services to best fit your needs, including our off-the-shelf datasets. With more than 25 years of expertise, we’ll work with you to maximize the efficiency of your data pipeline.

To discuss your training data needs, contact us.
