Off-the-Shelf AI Training Datasets

Access hundreds of ready-to-use AI training datasets, the culmination of Appen’s 25+ years of expertise in multimodal data collection, transcription, and annotation.

What Are Off-the-Shelf AI Datasets?

Pre-existing AI training datasets are a fast and affordable way to deploy your model for a variety of use cases. The effectiveness of any AI model depends on the quality and diversity of its training data, and off-the-shelf datasets provide access to large amounts of that data at low cost.

290+ Datasets
80+ Languages
80+ Countries
10K+ Hours
80K+ Images
10M+ Words

Off-the-Shelf vs. Custom AI Training Datasets

The choice between off-the-shelf datasets and custom AI data collection depends on the specific requirements, budget, and timeline of your project. Off-the-shelf datasets are ideal for general applications where quick deployment and cost-effectiveness are priorities, while custom datasets are best suited for specialized tasks where precision, customization, and flexibility are essential for achieving superior performance.

Recently added datasets

Dataset Name | Dataset ID | Description
English (United States) product labels | IMG_OCR_USE_ProductLabels | 58,000 images of various products, including the label, annotated with the product category, brand name, packaging type, product description, angle and image quality.
English (United States) device commands | USE_ASR006 | 32 hours of US English speakers giving a device command in response to a variety of prompts, such as "Tell the device to disable shuffle mode".
Roomba view images | IMG_SDJ_CN | 82,000 images taken from the perspective of a robotic vacuum cleaner, at 2K+ resolution.

Offensive wordlists in 14 languages:

Lists of several thousand offensive words in 14 language varieties (Gulf Arabic, Dutch, French, German, Indonesian, Italian, Portuguese – Brazil and Portugal, Spanish – Spain and 3 Latin American varieties, Swedish, Tagalog), which have been labelled for 11 offensive categories (Blasphemy, Derogatory, Drugs, Extolling Alcohol Consumption, Gambling, People, Political, Religiously Extremist Nature, Sex, Sexual Impropriety, Undermining The State).

The words are rated on both a Slang-Standard scale and Offensiveness scale, and further annotated for inflection (Noun, Verb, Adj, Adv, Other), and spelling/regional variants where applicable.

This data can be used to moderate content and to train models that recognise offensive content and distinguish between offensive and non-offensive terms.
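As a rough illustration of the moderation use case, a minimal Python sketch might look like the following. The entry fields, the 1-5 offensiveness scale, the threshold, and the placeholder terms are all assumptions made for illustration; they do not describe the delivered file format or rating scale.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class WordEntry:
    """One wordlist row. Field names are hypothetical, not the delivered schema."""
    term: str
    category: str       # one of the offensive categories, e.g. "Derogatory"
    offensiveness: int  # assumed 1-5 scale; the actual scale is not specified here


# Placeholder entries standing in for a licensed wordlist.
WORDLIST = [
    WordEntry("badword", "Derogatory", 4),
    WordEntry("mildword", "Gambling", 1),
]


def flag_terms(text, wordlist, min_offensiveness=3):
    """Return wordlist entries found in `text` at or above the given rating."""
    lookup = {entry.term.lower(): entry for entry in wordlist}
    hits = []
    # Simple whole-word, case-insensitive matching.
    for token in re.findall(r"\w+", text.lower()):
        entry = lookup.get(token)
        if entry is not None and entry.offensiveness >= min_offensiveness:
            hits.append(entry)
    return hits


for hit in flag_terms("This has a BADWORD and a mildword.", WORDLIST):
    print(hit.term, hit.category, hit.offensiveness)
```

Exact token matching like this misses inflected forms and spelling variants, which is where the variant annotations in the wordlists would come in.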

Dataset Name | Dataset ID
Arabic (Gulf/Levantine) Offensive Wordlist | ARB_NER002
Indonesian (Indonesia) Offensive Wordlist | ind_IND_NER001
Dutch (Netherlands) Offensive Wordlist | nld_NLD_NER001
French (France) Offensive Wordlist | fra_FRA_NER001
German (Germany) Offensive Wordlist | deu_DEU_NER001
Italian (Italy) Offensive Wordlist | ita_ITA_NER001
Portuguese (Brazil) Offensive Wordlist | por_BRA_NER001
Portuguese (Portugal) Offensive Wordlist | por_PRT_NER001
Spanish (Argentina) Offensive Wordlist | spa_ARG_NER001
Spanish (Spain) Offensive Wordlist | spa_ESP_NER001
Spanish (Colombia) Offensive Wordlist | spa_COL_NER001
Spanish (Mexico) Offensive Wordlist | spa_MEX_NER001
Swedish (Sweden) Offensive Wordlist | swe_SWE_NER001
Tagalog (Philippines) Offensive Wordlist | tgl_PHL_NER001

Datasets in development

This data has already been collected and is currently undergoing quality checks and relevant annotation. Most are expected to be ready for delivery in Q1 2025, but can be prioritized upon request.

Dataset Name | Dataset ID | Description

LLM training: text or multimodal datasets

Harmful and harmless prompts and responses | eng_USA_LLM001 | 300 US English prompts and responses annotated for Harm category, Intensity, Voice, and Phrasing.
Adversarial prompts for LLM red teaming | eng_USA_LLM002 | 500 adversarial prompts have been collected, with a total of 1,000 planned for development. Please enquire about our optional benchmarking service to rate harm levels in model responses to these prompts.
Chatbot conversations | eng_USA_LLM003 | 1,800 real-world conversations between a user and a chatbot. Domains include financial, retail, entertainment and IT.
Text descriptions for images | (not listed) | We are annotating our existing image library IMG_TAG_CN with English text descriptions, for LLM image generation applications.

Image and video datasets for OCR, object and action recognition applications

Receipts | IMG_OCR_USE_RECEIPTS | 4,500 document images of US English receipts, bills or invoices, to be annotated with bounding boxes and transcribed text, for OCR applications. PII will be redacted.
Symbols | IMG_SYMBOLS_US | 1,500 images of symbols (e.g. recycling or laundering instructions) with text descriptions.
Street signs | IMG_OCR_USE_STREET002 | 3,500 photos of US street signs, to be annotated with bounding boxes, transcribed text and a description of the sign. For OCR and LLM image applications.
Hand gesture videos | HUMAN_BODY_VID004 | 5,000 videos of prompted hand gestures, e.g. thumbs up or a wave, for body movement recognition.
Object videos | VID_OBJECT_US | 5,500 videos of various everyday objects (e.g. desk, kettle) taken from different angles, distances and lighting conditions, for object recognition.
Garments | IMG_VID_GARMENTS_US | A collection of images and videos of ~300 items of clothing, for retail applications.

Audio datasets for ASR and voice assistant model training (all US English). QA of the audio and transcription is in progress.

Scripted sentences smartphone recordings | USE_ASR005 | ~500 hours of US English speech, with participants reading out a total of 9,000 unique prompted sentences.
Answers to questions smartphone recordings | USE_ASR007 | 65 hours of US English speakers spontaneously responding to 1,000 unique question prompts such as "What's your favourite food?"
Conversational speech smartphone recordings | USE_ASR008 | 2 hours of natural conversation on a selected topic, e.g. hobbies, history. Includes some toxic speech.

Types of AI Training Datasets

AI data comes in many forms, with diverse options available to suit the needs of your project. Training your model on high-quality data is crucial to maximize your AI model’s performance.

Speech

Audio files with corresponding timestamped transcription for applications such as automatic speech recognition, language identification, and voice assistants.

Key features:

Speech types: scripted (including TTS), conversational, broadcast
Recording types: microphone, telephony (mobile, landline), smartphone
Environments: quiet (home, office, studio), noisy (public place, in-car, roadside)
Audio quality: 8 kHz to 96 kHz sampling rates

Text

Tailored, ethically-sourced text datasets that drive smarter insights for more accurate language processing and machine learning models.

Text datasets include:

Pronunciation Dictionaries (Lexicons): 5.4M words in 75 languages
Part-of-speech (POS) dictionaries: 3.2M words in 18 languages
Named Entity Recognition (NER): 344k+ entity labels in 9 languages
Inverse Text Normalization: 36k+ test cases in 7 languages

Image

115k+ images in 14+ languages to develop diverse applications such as optical character recognition (OCR) and facial recognition software.

Featured image datasets include:

15.8K images of documents in 14 languages with mixed premium and challenging conditions for OCR
13.5K human facial images of 99 participants in various lighting conditions, angles, and expressions.

Video

High-quality video data to enhance AI models, like multi-modal LLMs, for tasks such as object detection, gesture recognition, and video summarization.

Featured video dataset:

130 sessions documenting human body movement of 100 diverse participants in the United Kingdom and the Philippines
Multi-camera recordings in several locations with varied background, weather, and lighting conditions.

Location

Precise location data for insights into user movements and interactions with specific points of interest, enabling location-based analytics and targeted strategies.

Accurate GPS signals collected in-app from SDKs
Global: 200+ countries
Compliant: 100% user opt-in
Scale: 1.5+ billion devices and 500+ billion events

We were expanding to a new market. Although we had a fully localized software, we were lacking resources, so our clients could not optimally use it. Appen helped us out with French lexicon data.

Ines Wendler

Product Manager, MediaInterface

Benefits of Using Pre-Existing AI Training Datasets

Appen's datasets are carefully constructed through a detailed data annotation process and reviewed by experienced annotators, providing a reliable foundation for training models that perform well across various applications.

Speed

Immediately available for rapid deployment

Cost

Licensed datasets are an economical solution

Quality

Developed by Appen’s internal data experts

How to Choose the Right Data for Your AI Project

The most important factors to consider when selecting data for your AI project are the quality, size, and accuracy of the dataset. Make sure your data is ethically sourced to provide your model with reliable and diverse information.

How much data do you need to train AI?

The amount of data needed to train an AI model depends on the model type and task complexity. Simple tasks, like basic image recognition, may need thousands of labeled images, while complex tasks such as NLP or advanced computer vision often require millions of data points. For example:

Image Recognition: Training a basic model might require 10,000 to 100,000 images per class, whereas more complex models like those used in self-driving cars need millions of images.
Natural Language Processing (NLP): Large language models such as GPT-3 were trained on hundreds of billions of tokens (words and word pieces) to achieve high accuracy across a wide range of tasks.
Custom Models: Smaller datasets might work if the model is fine-tuned on pre-trained networks (transfer learning), but more data will generally lead to better performance.
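To turn the per-class guideline above into a rough project-level figure, multiply it by the number of classes. The short Python sketch below does this arithmetic; the function name is hypothetical, and the defaults simply reuse the 10,000 to 100,000 images-per-class range quoted above rather than any measured requirement.

```python
def estimate_image_count(num_classes, per_class_low=10_000, per_class_high=100_000):
    """Return a (low, high) estimate of total labeled images for a basic classifier.

    The defaults reuse the rule-of-thumb range from the text; real needs vary
    with task difficulty and whether transfer learning is used.
    """
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    return num_classes * per_class_low, num_classes * per_class_high


low, high = estimate_image_count(5)
print(f"{low:,} to {high:,} images")  # 50,000 to 500,000 images
```

With transfer learning, the per-class figures passed in would typically be much smaller, which is why the range is exposed as parameters rather than fixed.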
