Off-the-Shelf AI Training Datasets

Access hundreds of ready-to-use AI training datasets, the culmination of Appen’s 25+ years of expertise in multimodal data collection, transcription, and annotation.

What Are Off-the-Shelf AI Datasets?

Pre-existing AI training datasets are a fast and affordable way to deploy your model for a variety of use cases. The effectiveness of any AI model depends on the quality and diversity of its training data, and off-the-shelf datasets provide large volumes of data that are immediately available to license.

  • 290+ datasets
  • 80+ languages
  • 80+ countries
  • 10K+ hours
  • 80K+ images
  • 10M+ words

Off-the-Shelf vs. Custom AI Training Datasets

The choice between off-the-shelf datasets and custom AI data collection depends on the specific requirements, budget, and timeline of your project. Off-the-shelf datasets are ideal for general applications where quick deployment and cost-effectiveness are priorities, while custom datasets are best suited for specialized tasks where precision, customization, and flexibility are essential for achieving superior performance.

Recently added datasets

Dataset Name | Dataset ID | Description
Location entrance videos | HUMAN_BODY_VID002 | 2.85 hours of synchronized footage from 130 sessions, capturing groups of people moving through doorway entrances in the UK and Philippines.
Action videos | HUMAN_BODY_VID003 | 300 videos of prompted actions, e.g. "zip up a jacket".
Selfie image and video collection | IMG_VID_SELFIE_US | 1400 selfie sessions of participants making various facial expressions under different conditions, e.g. "while blinking", "while wearing a scarf".
Arabic (Levantine) scripted microphone | ARU_ASR002 | 32 hours of microphone recordings of speech prompts for ASR model development. Text prompts only; transcription can be developed upon request.
Text message conversations | eng_USA_SMS003 | 100 US English text message conversations.

Our Appen China team has developed a number of new datasets that are now ready for delivery:

Dataset Name | Dataset ID | Description

LLM applications
Chinese news text summaries corpus | DMXWB_corpus_CN | 20000 summaries in Chinese of main events and themes from news data in 15 domains.
Chinese command and control prompt response corpus | DSDH_corpus_CN | 20000 app commands, and question & response pairs in Chinese, tagged with categories and intents, for use with TV player controls, lifestyle services, and device control.
Chinese instruction set sentence corpus | ZLJ_corpus_CN | Chinese corpus of 200,000 sentences of instruction sets for LLM training including questions and answers, multi-turn dialogues, logical reasoning, programming code, text rewriting, roleplay, long text instructions and text generation instructions.
Chinese multidisciplinary test questions corpus | MTQ_CN | Chinese corpus of 300k+ prompt response pairs for junior-high school subjects, including Geography, Chemistry, History, Biology, Math, Physics, Chinese language, and Politics.
Code Q&A Dataset | DM_CNRD | A specialized dataset featuring 12M programming Q&A pairs in English to train LLMs for technical problem-solving.
Chinese and English related texts | GLWB_CN | ~400k texts of long article content sourced from publicly available books, including title, author and language metadata.

Image datasets for OCR and object detection
Arabic printed text | IMG_OCR_ARU002_CN | 20,000 annotated text images with 50 bounding boxes per image, perfect for OCR model training.
Vehicle tail light images | IMG_WD_CN | 30k+ images annotated for object detection in automotive AI.
Home environment pictures | IMG_HOME_CN | 10k images of living rooms and studies.
Electric vehicles in elevators | IMG_DDC_CN | 17k annotated images.
Baking Pictures | IMG_BAKE_CN | 6k images of bread, cakes and cookies.

Audio datasets for ASR and TTS model development
Turkish (Turkey) scripted speech | TUR_ASR003_CN | Over 700 hours of smartphone recordings of speech prompts for ASR model development.
Indonesian (Indonesia) conversational telephony | IND_DH_ASR001_CN | 150 hours of natural conversation, transcribed and timestamped.
Russian + German Female TTS | ED_TTS001_CN | 2.3 hours of female voice talent 48kHz studio microphone recordings.
Cantonese (China) business dialogues | YYDH_ASR001_CN | 98 hours of business meetings and conversations with transcription and timestamping, across a variety of industries.

Datasets in development

This data has already been collected and is currently undergoing quality checks and relevant annotation. Most datasets are expected to be ready for delivery in Q1 2025, but can be prioritized upon request.

Dataset Name | Dataset ID | Description

LLM training: text or multimodal datasets
Harmful and harmless prompts and responses | eng_USA_LLM001 | 300 US English prompts and responses annotated for Harm category, Intensity, Voice, and Phrasing.
Adversarial prompts for LLM red teaming | eng_USA_LLM002 | 500 adversarial prompts have been collected, with a total of 1000 planned for development. Please enquire about our optional benchmarking service to rate harm levels in model responses to these prompts.
Chatbot conversations | eng_USA_LLM003 | 1800 real-world conversations between a user and a chatbot. Domains include: financial, retail, entertainment, IT.
Text descriptions for images | - | We are annotating our existing image library IMG_TAG_CN with English text descriptions, for LLM image generation applications.

Image and video datasets for OCR, object and action recognition applications
Product labels | IMG_OCR_USE_ProductLabels | 60,000 images of various products including the label, with category annotations, for OCR applications.
Receipts | IMG_OCR_USE_RECEIPTS | 4500 document images of US English receipts, bills or invoices, to be annotated with bounding boxes and transcribed text, for OCR applications. PII will be redacted.
Symbols | IMG_SYMBOLS_US | 1500 images of symbols (e.g. recycling or laundering instructions) with text descriptions.
Street signs | IMG_OCR_USE_STREET002 | 3500 photos of US street signs, to be annotated with bounding boxes, transcribed text and a description of the sign. For OCR and LLM image applications.
Hand gesture videos | HUMAN_BODY_VID004 | 5000 videos of prompted hand gestures, e.g. thumbs up or a wave, for body movement recognition.
Object videos | VID_OBJECT_US | 5500 videos of various everyday objects (e.g. desk, kettle) taken from different angles, distances and lighting conditions, for object recognition.
Garments | IMG_VID_GARMENTS_US | A collection of images and videos of ~300 items of clothing, for retail applications.

Audio datasets for ASR, voice assistant model training (all US English). QA of the audio and transcription is in progress.
Scripted sentences smartphone recordings | USE_ASR005 | ~500 hours of US English speech, with participants reading out a total of 9000 unique prompted sentences.
Device commands smartphone recordings | USE_ASR006 | 40 hours of US English speakers giving a device command in response to a variety of prompts, such as "Tell the device to disable shuffle mode".
Answers to questions smartphone recordings | USE_ASR007 | 65 hours of US English speakers spontaneously responding to 1000 unique question prompts such as "What's your favourite food?"
Conversational speech smartphone recordings | USE_ASR008 | 2 hours of natural conversation on a selected topic, e.g. hobbies, history. Includes some toxic speech.

Types of AI Training Datasets

AI data comes in many forms, with diverse options available to suit the needs of your project. Training your model on high-quality data is crucial to maximizing your AI model’s performance.

Speech

Audio files with corresponding timestamped transcription for applications such as automatic speech recognition, language identification, and voice assistants.

Key features:

  • Speech types: scripted (including TTS), conversational, broadcast
  • Recording types: microphone, telephony (mobile, landline), smartphone
  • Environments: quiet (home, office, studio), noisy (public place, in-car, roadside)
  • Audio quality: 8kHz – 96kHz

Text

Tailored, ethically-sourced text datasets that drive smarter insights for more accurate language processing and machine learning models.

Text datasets include:

  • Pronunciation Dictionaries (Lexicons): 5.4M words in 75 languages
  • Part-of-speech (POS) dictionaries: 3.2M words in 18 languages
  • Named Entity Recognition (NER): 344k+ entity labels in 9 languages
  • Inverse Text Normalization: 36k+ test cases in 7 languages

Image

115k+ images in 14+ languages to develop diverse applications such as optical character recognition (OCR) and facial recognition software.

Featured image datasets include:

  • 15.8K images of documents in 14 languages with mixed premium and challenging conditions for OCR
  • 13.5K human facial images of 99 participants in various lighting conditions, angles, and expressions.

Video

High-quality video data to enhance AI models, like multi-modal LLMs, for tasks such as object detection, gesture recognition, and video summarization.

Featured video dataset:

  • 130 sessions documenting human body movement of 100 diverse participants in the United Kingdom and the Philippines
  • Multi-camera recordings in several locations with varied background, weather, and lighting conditions.

Location

Precise location data for insights into user movements and interactions with specific points of interest, enabling location-based analytics and targeted strategies.

  • Accurate GPS signals collected in-app from SDKs
  • Global: 200+ countries
  • Compliant: 100% user opt in
  • Scale: 1.5+ billion devices and 500+ billion events

"We were expanding to a new market. Although we had a fully localized software, we were lacking resources, so our clients could not optimally use it. Appen helped us out with French lexicon data."
Ines Wendler, Product Manager, MediaInterface

Benefits of Using Pre-Existing AI Training Datasets

Appen's datasets are carefully constructed through a detailed data annotation process and reviewed by experienced annotators, providing a reliable foundation for training models that perform well across various applications.

Speed

Immediately available for rapid deployment

Cost

Licensed datasets are an economical solution

Quality

Developed by Appen’s internal data experts

How to Choose the Right Data for Your AI Project

The most important factors to consider when selecting data for your AI project are the quality, size, and accuracy of the dataset. Make sure your data is ethically sourced to provide your model with reliable and diverse information.

How much data do you need to train AI?

The amount of data needed to train an AI model depends on the model type and task complexity. Simple tasks, like basic image recognition, may need thousands of labeled images, while complex tasks like NLP or advanced computer vision often require millions of data points. For example:

  • Image Recognition: Training a basic model might require 10,000 to 100,000 images per class, whereas more complex models like those used in self-driving cars need millions of images.
  • Natural Language Processing (NLP): Language models like GPT-3 were trained on hundreds of billions of words to achieve high accuracy across a wide range of tasks.
  • Custom Models: Smaller datasets might work if the model is fine-tuned on pre-trained networks (transfer learning), but more data will generally lead to better performance.
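
As a rough illustration of the image recognition figures above, the per-class range can be turned into a back-of-the-envelope budget. This is only a sketch: the function name `estimate_image_budget` is hypothetical, and it assumes that the 10,000 to 100,000 images-per-class rule of thumb scales linearly with the number of classes, which real projects may not follow.

```python
# Back-of-the-envelope sizing for a basic image classifier.
# Assumption: the 10,000-100,000 images-per-class range quoted above
# scales linearly with the number of classes. The function name is
# illustrative and not part of any Appen tool.

def estimate_image_budget(num_classes, per_class_low=10_000, per_class_high=100_000):
    """Return (low, high) total labeled images for num_classes classes."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    return num_classes * per_class_low, num_classes * per_class_high


low, high = estimate_image_budget(num_classes=5)
print(f"5 classes: {low:,} to {high:,} labeled images")
```

Comparing an estimate like this with the image counts listed for a catalog dataset gives a first indication of whether an off-the-shelf dataset could cover a project on its own or is better suited to fine-tuning a pre-trained model.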

Get Started with Off-the-Shelf AI Training Datasets

Appen’s extensive catalog of off-the-shelf (OTS) datasets spans multiple data types and industries, providing comprehensive coverage for various AI applications. These datasets are crafted to the highest standards of quality and accuracy, ensuring reliable training data for AI models.
