Datasets Resource Center


Created and curated for teams working on world-class AI applications




Off-the-Shelf Datasets



Our high-quality licensable datasets to jumpstart your AI projects

We offer an extensive catalog of ‘Off-the-Shelf’ datasets, with over 250 licensable datasets comprising over 11,000 hours of audio, over 25,000 images and over 8.7 million words across 80 languages and multiple dialects. Our Off-the-Shelf datasets are designed to improve model accuracy and overall performance, and to deliver high-quality data quickly and at scale for specific AI program needs. Among our offerings, you will find datasets across multiple data types, including image, video, speech, audio, and text. We are constantly building new datasets to meet the needs of our global customer base.


Learn more







Open-Source Public Datasets


Curated recommendations from our data scientists for your AI projects


Machine Learning and Artificial Intelligence applications require significant amounts of data to train. You can search our recommended resources for open datasets to access, modify, reuse, and share. Use these publicly available datasets to inform the development of AI and ML applications, or as simple datasets for benchmarking a solution or comparing different algorithms before tackling a real dataset. These open datasets are a great option to consider for access to data that lies outside the scope of your organization.


Dataset Finders



Use Kaggle to find datasets, explore and build models, and work with other data scientists and Machine Learning engineers. Explore and analyze a collection of over 50,000 public datasets on everything from bone X-rays to results from boxing bouts.
Learn More
Explore over 500 datasets from the UC Irvine Machine Learning Repository through a searchable interface. The datasets span many topics and vary in size, from only a few cases (or “instances”) up to over 43 million, and from only 1 or 2 variables (or “attributes”) to over a million.
Learn More


Computer Vision



Computer vision enables computers to identify and process objects in images and videos in the same way that humans do, by emulating parts of the complexity of the human vision system. Leverage Machine Learning for image applications such as enabling self-driving cars to make sense of their surroundings, facial recognition, augmented and mixed reality, or automating the detection of symptoms in X-ray and MRI scans in healthcare. Build a robust Computer Vision model using a rich collection of Computer Vision datasets.


Accelerate AI development using 1000+ High-quality Open Datasets. Choose from 50+ application scenarios, 30+ annotation types, and 10+ data formats.
Learn More
These datasets include diverse topics from recognizing objects to reconstructing a 3D room, from finding a person in a video to identifying a shirt in a photo. The datasets can be sorted by published date or topic, and users can search with keywords to locate images appropriate to their needs.
Learn More
Use these open datasets to build facial recognition applications, virtual reality gadgets, sensory detection, holographic imaging and much more.
Learn More
More than 3,000 Machine Learning Datasets. Find datasets by task and modality, compare usage over time, browse benchmarks and more.
Learn More
Open-source datasets for Computer Vision Machine Learning models across a wide array of domains: animals, board games, self-driving cars, medicine, thermal imagery, aerial drone images, and even synthetically generated data. You can freely download images and annotations in any format: VOC XML, COCO JSON, YOLOv3 flat text files, even TFRecords.
Learn More
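As a rough illustration of how the annotation formats named above differ, the sketch below converts a single COCO-style bounding box (absolute pixels, `[x_min, y_min, width, height]`) into a YOLO-style text line (class index followed by the normalized box center and size). The function name and the sample values are hypothetical and are not taken from any of the listed datasets.

```python
def coco_bbox_to_yolo_line(bbox, image_width, image_height, class_index):
    """Convert one COCO bbox to a YOLO annotation line.

    bbox is [x_min, y_min, width, height] in absolute pixels (COCO convention).
    The result is "class x_center y_center width height", with all four
    geometry values normalized to the 0..1 range (YOLO convention).
    """
    x_min, y_min, box_w, box_h = bbox
    x_center = (x_min + box_w / 2) / image_width
    y_center = (y_min + box_h / 2) / image_height
    norm_w = box_w / image_width
    norm_h = box_h / image_height
    return f"{class_index} {x_center:.6f} {y_center:.6f} {norm_w:.6f} {norm_h:.6f}"


# Hypothetical example: a 200x100 box at (100, 50) inside a 400x200 image.
line = coco_bbox_to_yolo_line([100, 50, 200, 100], 400, 200, class_index=0)
print(line)  # 0 0.500000 0.500000 0.500000 0.500000
```

In practice the class index also needs remapping, because COCO category IDs are arbitrary integers while YOLO expects a contiguous zero-based index.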


Speech Corpora



Recording and transcribing new Speech Corpora to create acoustic models and train Speech Recognition engines can be time-consuming and expensive. Use open databases of speech audio files and text transcriptions to quickly and cheaply build transcribed Speech Corpora containing utterances from many speakers in a variety of acoustic conditions.


A central place for speech resources, OpenSLR hosts speech and language resources, such as training Corpora for Speech Recognition, and software related to Speech Recognition.
Learn More

Candlewill


A collection of Speech Corpora for Automatic Speech Recognition (ASR) and Text-To-Speech (TTS).
Learn More

Edresson


The dataset contains 71,358 words in total, 13,311 of them distinct, and approximately 10 hours and 28 minutes of speech from a single speaker, recorded at 48 kHz, in a total of 3,632 audio files in WAV format. Audio files range from 0.67 to 50.08 seconds in length.
Learn More
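The totals quoted above imply a few derived figures that can help when sizing a training run. The short calculation below is only a sketch: it treats the stated duration as exact even though the source gives it as approximate, and the variable names are illustrative.

```python
# Figures quoted in the dataset description above.
total_words = 71_358
distinct_words = 13_311
audio_files = 3_632
total_seconds = 10 * 3600 + 28 * 60  # 10 h 28 min, stated as approximate

# Derived statistics (rough, because the duration is approximate).
mean_clip_seconds = total_seconds / audio_files
words_per_minute = total_words / (total_seconds / 60)
distinct_word_share = distinct_words / total_words

print(f"mean clip length: {mean_clip_seconds:.1f} s")
print(f"speaking rate: {words_per_minute:.0f} words per minute")
print(f"share of distinct words: {distinct_word_share:.2f}")
```

The mean clip length of roughly ten seconds sits well inside the stated 0.67 to 50.08 second range, which suggests most clips are short utterances with a few long outliers.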
Designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of Automatic Speech Recognition systems. Contains a total of 6300 sentences, 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States.
Learn More

VoxCeleb


Audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. This dataset contains 7,000+ speakers, 1 million+ utterances and 2,000+ hours of both audio and video.
Learn More

msang


A Twitter Corpus built with the aim of representing and analyzing hate speech against some minority groups in Italy: immigrants in particular, but also Muslims and Roma. The corpus contains the tweets' IDs and their annotations.
Learn More

VoxForge


Transcribed speech for use in Speech Recognition engines. VoxForge categorizes and makes available all submitted audio files (Speech Corpus) and Acoustic Models.
Learn More

homink


A Korean read Speech Corpus of about 120 hours from the National Institute of Korean Language (NIKL).
Learn More

siddiquelatif


The URDU dataset contains emotional utterances of Urdu speech gathered from Urdu talk shows. It contains 400 utterances of four basic emotions: Angry, Happy, Neutral, and Sad. There are 38 speakers (27 male and 11 female).
Learn More
The Common Voice dataset, an open-source dataset of voices, currently consists of over 7,000 validated hours in 60 languages and includes demographic metadata like age, sex, and accent that can help improve the accuracy of Speech Recognition engines. Each entry in the dataset consists of a unique MP3 and corresponding text file.
Learn More
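A corpus organized as audio clips with matching text files can be loaded by pairing files that share a name. The sketch below assumes a flat directory in which `clip.mp3` sits next to `clip.txt`; that layout, the function name, and the file names are hypothetical, and the actual distribution format of any given corpus may differ.

```python
import tempfile
from pathlib import Path


def pair_clips_with_transcripts(root):
    """Return (mp3 file name, transcript text) for every clip that has a .txt twin."""
    root = Path(root)
    pairs = []
    for mp3_path in sorted(root.glob("*.mp3")):
        txt_path = mp3_path.with_suffix(".txt")
        if txt_path.exists():  # skip clips that have no transcript
            text = txt_path.read_text(encoding="utf-8").strip()
            pairs.append((mp3_path.name, text))
    return pairs


# Demo on a throwaway directory with placeholder (empty) audio files.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "clip_0001.mp3").write_bytes(b"")
    (tmp_path / "clip_0001.txt").write_text("hello world\n", encoding="utf-8")
    (tmp_path / "clip_0002.mp3").write_bytes(b"")  # deliberately missing transcript
    demo_pairs = pair_clips_with_transcripts(tmp_path)

print(demo_pairs)  # [('clip_0001.mp3', 'hello world')]
```

Skipping unmatched clips rather than raising an error keeps the loader tolerant of partial downloads, at the cost of silently dropping data, so logging the skipped names may be worthwhile in real use.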
A large database of sentences and translations to see examples of how words are used in the context of a sentence.
Learn More
Made from audio talks and their transcriptions, the dataset contains 1,495 audio talks in NIST SPHERE format (SPH), 1,495 transcripts in STM format, a dictionary with pronunciations (159,848 entries) and selected monolingual data for language modeling.
Learn More



Data Collection



If you need a more customized dataset for your specific use case, we provide data collection as a standalone service as well as part of a multi-component deliverable, such as an annotated image dataset or an ASR speech database, which typically includes audio data, transcription, a pronunciation lexicon, and a language-specific document. Our data collection services span a variety of data types and collection methodologies for a range of environments to best meet your unique data requirements.

Learn more

