What Kind of Data Do I Need?
Before you begin searching for the right dataset(s), ask yourself a few critical questions to guide the search:
- What am I trying to accomplish with AI?
- Do I have sufficient in-house data I can leverage for this project?
- What data do I wish I had?
- What use cases do I need my data to cover?
- What edge cases do I need my data to cover?
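One way to make the last two questions concrete is to tag your existing records with the scenarios they cover and count what is missing. The following is a minimal sketch, not a prescribed method: the `coverage_gaps` helper, the `cases` field, and the scenario names are all hypothetical and only illustrate the idea.

```python
from collections import Counter


def coverage_gaps(records, required_cases, min_examples=1):
    """Return {case: count} for each required case seen fewer than min_examples times."""
    # Count how many records are tagged with each case.
    counts = Counter(case for record in records for case in record.get("cases", []))
    return {case: counts[case] for case in required_cases if counts[case] < min_examples}


# Hypothetical in-house records, each tagged with the scenarios it covers.
records = [
    {"id": 1, "cases": ["daytime", "highway"]},
    {"id": 2, "cases": ["daytime", "city"]},
    {"id": 3, "cases": ["night", "city"]},
]

# Hypothetical use cases and edge cases the model must handle.
required = ["daytime", "night", "highway", "city", "heavy_rain"]

gaps = coverage_gaps(records, required, min_examples=2)
print(gaps)
```

Any case that shows up in `gaps` is a candidate for additional data, whether collected in-house or sourced elsewhere.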
Why Off-the-Shelf Datasets?
Your team may end up deciding that you want to use off-the-shelf datasets to train your model. These options are growing increasingly prevalent in the field of AI for one reason: building AI is hard. Most AI projects fail to reach deployment due to a variety of factors:
- Low budgets. Investing in AI often requires a large amount of capital.
- Lack of talent. A skills gap persists not only in tech, but in AI and ML specifically. The industry lacks enough highly skilled individuals to staff all of the existing AI initiatives, much less future ones. This gap may only widen over time as the industry grows.
- Early in the AI journey. Organizations must be set up properly to build AI. That means having the right internal processes, strategies, and collaboration in place to achieve success.
- Low data quality or not enough data. This last factor proves to be one of the biggest hurdles in AI. ML models typically require lots of data to perform accurately, and acquiring that data can be challenging depending on the use case. In addition, transforming poor-quality data into high-quality, labeled data can be a time-consuming, inefficient process.
Off-the-shelf datasets can help with several of these hurdles:
- Compliance. Customers and authorities are imposing growing data security requirements that make it more difficult for companies to use in-house data. Some companies naturally have access to a lot of data given the work they do, but that doesn’t mean the data can be used for ML models, especially if doing so would violate customer privacy.
- Reduced bias. Responsible AI is discussed more frequently than ever before, as companies realize the importance of mitigating bias in their models. When companies rely on in-house data, it can be difficult to detect and reduce bias. With an off-the-shelf dataset, you can research the source of the data to understand whether the provider incorporated bias checks while creating it. A trusted provider will supply a diverse, high-quality dataset.
- Fast time-to-market. Collecting and preparing data is very time-consuming, and it often takes up most of a data scientist’s time on a project. With off-the-shelf datasets, the work is mostly done (although you’ll still want to check the dataset’s quality yourself). This leads to a faster time-to-market in an industry where speed matters.
- Cost-effective. Aggregating, reviewing, and preparing in-house data can be a costly process. Many off-the-shelf datasets available online are free or inexpensive compared to that alternative. If your AI budget is limited, leveraging off-the-shelf datasets may be the right path.
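Checking a dataset’s quality yourself can start with a few basic counts: missing values, exact duplicate rows, and how the labels are distributed. The sketch below uses only Python’s standard library; the `audit_rows` helper, the column names, and the sample data are hypothetical, and a real audit would add checks specific to your use case.

```python
import csv
import io
from collections import Counter


def audit_rows(rows, label_field):
    """Summarize basic quality signals for an iterable of dict rows."""
    rows = list(rows)
    missing = Counter()   # empty values per column
    labels = Counter()    # label distribution (a rough balance check)
    seen = set()
    duplicates = 0        # exact repeats of an earlier row
    for row in rows:
        for field, value in row.items():
            if value is None or str(value).strip() == "":
                missing[field] += 1
        key = tuple(row.items())
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
        label = (row.get(label_field) or "").strip()
        if label:
            labels[label] += 1
    return {
        "rows": len(rows),
        "missing": dict(missing),
        "duplicates": duplicates,
        "labels": dict(labels),
    }


# Hypothetical sample standing in for a downloaded CSV file.
SAMPLE = "\n".join([
    "text,label",
    "good product,positive",
    "bad product,negative",
    "good product,positive",
    ",negative",
    "fine,",
])

report = audit_rows(csv.DictReader(io.StringIO(SAMPLE)), label_field="label")
print(report)
```

To audit a real file, pass `csv.DictReader(open(path, newline=""))` instead of the in-memory sample. A heavily skewed `labels` count or a large number of duplicates is a signal to look more closely before training.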
The Best Places to Find Datasets
The internet is full of quality off-the-shelf datasets. The following lists cover many of the best places to search for and discover datasets online, in no particular order. We start with data repositories, then list the best datasets for specific use cases.
Data Repositories
Data repositories feature collections of datasets from all over the web.
Kaggle
Kaggle has one of the largest libraries of datasets available online, covering a range of topics like sports, medicine, and government. Its platform is community-driven, meaning users can upload their own datasets. Given the varied sources of data, it’s important to thoroughly check the quality of any datasets you use from Kaggle. Kaggle also features discussion about machine learning topics as well as tutorials on key processes.
Google Datasets
Google offers a dataset search engine, where you can search datasets by name. The engine lets you sort datasets by several features, such as file type, theme, last update, and relevance. It also captures datasets from thousands of databases around the internet, so you’re truly searching through a wide range of options. The uploaders of datasets include organizations such as Harvard and the World Health Organization.
Papers with Code
Papers with Code has over four thousand (and counting) datasets available, all uploaded by the community. You can easily filter these datasets by modality, task, and language. The database also links to other databases that offer further varieties of datasets.
DataFlair
DataFlair links to over 70 machine learning datasets, and includes useful information like the source code as well as project ideas. For example, in a listing for a dataset that features handwritten digits, DataFlair suggests creating an image classification algorithm to recognize handwritten digits from a paper. The site is a useful jumping-off point for new ideas.
EliteDataScience
EliteDataScience includes a curated list of free datasets and its favorite aggregators. The datasets are organized by use case, so you’ll discover datasets for deep learning, natural language processing, web scraping, and more.
UCI ML Repository
UCI features over 500 machine learning datasets that are sortable by file type, task, area of application, and subject. Many of its datasets include links to academic papers that you can use for benchmarking.
GitHub Awesome Public Datasets
GitHub hosts an open-source collection of public datasets. Review the table of contents to select a topic; topics range from Agriculture to Transportation, with many options in between. GitHub also includes a collection for general machine learning models. Most of the datasets linked are free.
Azure Public Datasets
Microsoft Azure has a database of public datasets that developers can use for prototyping and testing. The categories include US government and agency data, other statistical and scientific data, and online service data. In addition, you can read documentation on SQL and how to build mobile and web apps.
Snowflake Data Marketplace
Snowflake gives data scientists, business intelligence and analytics professionals, and everyone who desires data-driven decision-making access to more than 650 live and ready-to-query datasets from over 175 third-party data providers and data service providers.
Registry of Open Data on AWS
AWS has a registry that features datasets available through AWS resources. Users can share their own datasets or add examples of how to use specific datasets. Over 280 searchable datasets are available in the registry.
KDnuggets
KDnuggets has a comprehensive list of data repositories, where you’ll be able to find a wide variety of datasets. The list features over 75 repositories, including some that are international.
Appen
Appen offers a variety of off-the-shelf training datasets. Our catalog includes 250+ licensable datasets across 80 languages with multiple dialects included. The datasets cover many machine learning use cases, including speech recognition and natural language processing, and cover a range of file types (text, image, video, speech, and audio). For example:
- Fully transcribed speech datasets for broadcast, call center, in-car, and telephony applications
- Pronunciation lexicons, including both general and domain-specific (e.g. names, places, natural numbers)
- Part-of-speech-tagged lexicons and thesauri
- Text corpora notated for morphological information and named entities.