
Multimodal AI Models – Part 1: Exploring Datasets for Training

Published on June 27, 2024

With the rapid advancement of artificial intelligence (AI), we find ourselves amidst a technological revolution that is reshaping industries and transforming the way we interact with technology. Multimodal AI systems integrate multiple types of data, such as image, video, speech, sound, and text.

By combining these modalities, AI models gain richer contextual information that allows them to achieve more human-like cognitive capabilities. Multimodal AI can improve accuracy and robustness; for example, identifying objects and environments in video can provide context for accompanying text or audio. It can also play a crucial role in accessibility by providing solutions for individuals with diverse needs, for example, converting visual content to descriptive audio to “narrate the world”.

Multimodal Generative AI can create rich, diverse content for a range of applications, for example, building an immersive multisensory virtual environment. With the rise of Large Language Models (LLMs) and their impressive human-like text interactions, Multimodal LLMs are driving the next frontier in AI, enabling a new era of realistic and natural interaction between humans and machines.

Appen, a leading AI data company, plays a vital role in this domain by providing high-quality human-in-the-loop training and AI model evaluation data. By harnessing the power of diverse datasets, Appen enables AI models to achieve superior performance in multimodal tasks.

Challenges in Multimodal AI

Despite the promise of Multimodal AI, most AI systems today handle only a single modality. Some key challenges include:

  • Data Availability: Multimodal AI models need large, diverse datasets for training and validation. Multimodal pairs required for training are limited in volume and availability. Large existing open-source datasets tend to be concentrated in more mature pairs, such as text-image, and are typically general-purpose datasets. Custom datasets are required to improve multimodal AI performance across more modalities and tailor models for specific applications.
  • Annotation Quality: Compared to single modalities, the annotation of multimodal data tends to be more complex. For example, video content can involve timestamping events, contextualizing actions, and providing a series of descriptions. These open-ended descriptions may require specialised domain expertise and annotation in an instructional format, further complicating the annotation process.
  • Evaluation Metrics: The absence of agreed-upon benchmarks and evaluation metrics poses a significant challenge to multimodal AI systems. Metrics are dependent on context and use case and can be highly subjective. Developing matrix-style metrics that allow evaluation across intersecting modalities also poses a challenge, as sketched below.
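To make the last point concrete, a matrix-style evaluation might score a model on each intersecting pair of modalities. The sketch below is a minimal, hypothetical illustration in Python; the modality names and scores are placeholder assumptions, not an established benchmark.

```python
# Minimal sketch of a matrix-style evaluation across intersecting modalities.
# The modalities and scores below are hypothetical placeholders, not a
# standard benchmark.
modalities = ["text", "image", "audio", "video"]

# Hypothetical per-pair scores, e.g. text-to-image retrieval accuracy.
# In practice each cell would come from a task-specific metric.
scores = {
    ("text", "image"): 0.81,
    ("text", "audio"): 0.74,
    ("text", "video"): 0.69,
    ("image", "audio"): 0.62,
    ("image", "video"): 0.77,
    ("audio", "video"): 0.58,
}

def evaluation_matrix(modalities, scores):
    """Arrange pairwise scores into a symmetric modality x modality matrix."""
    matrix = {m: {n: None for n in modalities} for m in modalities}
    for (a, b), value in scores.items():
        matrix[a][b] = value
        matrix[b][a] = value  # assume the metric is symmetric for this sketch
    return matrix

for row, cols in evaluation_matrix(modalities, scores).items():
    print(row, cols)
```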

Training Data for Multimodal LLMs

With LLMs growing in popularity, humans are increasingly interacting with visual data using open-ended natural language. Queries about an image may range from something as simple as ‘Which vegetables do I have in my fridge?’ to more complex knowledge-based queries that require additional synthesis, such as ‘What meals can I cook with these ingredients?’ These queries can relate to different forms of input, including video, where they may refer to a sequence of frames, the audio track, or the speech content of the video.

Example: Multimodal Prompts and Responses

To train Multimodal LLMs, a large and diverse set of visual data and accompanying prompts or prompt-response pairs is required. Additional annotations can be included in the prompts and responses to link keywords in the text with objects and events in the visual, which can further enrich the data and improve performance.
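As an illustration, a single training record might pair the visual media with a prompt, a response, and annotations that ground keywords in the text to regions of the image. The structure below is a minimal sketch; the field names, file name, and coordinates are illustrative assumptions rather than a fixed schema.

```python
# Minimal sketch of a multimodal prompt-response training record.
# Field names, the file name, and coordinates are illustrative assumptions,
# not a fixed schema.
prompt_response_pair = {
    "media": {"type": "image", "uri": "fridge_shelf_001.jpg"},
    "prompt": "Which vegetables do I have in my fridge?",
    "response": "You have carrots, broccoli, and a red bell pepper.",
    # Grounding annotations link keywords in the text to regions in the image,
    # enriching the pair beyond plain text.
    "annotations": [
        {"keyword": "carrots", "bbox": [120, 340, 210, 420]},
        {"keyword": "broccoli", "bbox": [230, 300, 330, 400]},
        {"keyword": "red bell pepper", "bbox": [350, 310, 440, 410]},
    ],
}
```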

Example: Video to Text Data

Before LLMs can answer queries about different modalities, models need to be trained to ‘understand’ this data. This process involves creating paired datasets with a text description or narration of the contents of the video.

In these datasets, text is added to video to describe or narrate what is happening. Unlike transcription for subtitling, which captures the speech content of the video, this text describes the events in the video and may link them together in a narrative sequence. Timestamps can be added to link visual cues with their description in the text. The visual media itself can also be annotated and linked to annotations in the text to highlight key visuals and further enrich the data.
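One possible representation of such a pairing is a list of timestamped narration segments, optionally linked to annotated regions in individual frames. The record below is a minimal sketch, assuming hypothetical field names, timestamps, and descriptions.

```python
# Minimal sketch of a video-to-text paired record with timestamped narration.
# The file name, timestamps, descriptions, and linked regions are illustrative.
video_text_record = {
    "video": "cooking_demo_clip.mp4",
    "narration": [
        {
            "start": 0.0, "end": 6.5,
            "text": "A chef rinses vegetables at the sink.",
        },
        {
            "start": 6.5, "end": 14.0,
            "text": "The chef chops carrots on a wooden board.",
            # Optional link from a keyword to an annotated region in a frame.
            "linked_regions": [
                {"keyword": "carrots", "frame_time": 8.2, "bbox": [410, 220, 520, 300]}
            ],
        },
        {
            "start": 14.0, "end": 20.0,
            "text": "The chopped carrots are added to a simmering pot.",
        },
    ],
}
```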

With a text description or narration of the video, an LLM can now answer queries about it, for example “what is happening in this video clip”, “who are the actors in this episode”, or “give me the instructions for this game level as shown in this walkthrough”.

Example: Visual and Audio Transcription and Captioning

The audio content of a video, as well as any on-screen text, provides important contextual data for a Multimodal AI, allowing the model to summarise not only the visuals in a video but also what was said and shown. While capturing speech content is important for any video containing speech, capturing in-video text is particularly important for videos such as presentations, news bulletins, or sports events where scores are displayed.

In addition to transcribing the audio or in-video text, timestamps can be added to link audio and visual cues with their corresponding text. Annotations can also be added to link transcribed text with its location in the visual.

Not all audio is speech, and a video may contain other sounds such as animal noises, ambient noise, or music. For these, datasets that describe the audio are needed, with timestamps that link key sound events with their description in the text.
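Bringing these together, the speech transcript, on-screen text, and non-speech sound descriptions can be stored as separate timestamped tracks for the same video. The layout below is a minimal sketch with assumed field names and illustrative values.

```python
# Minimal sketch of timestamped transcription and captioning tracks for one
# video: speech, on-screen text (with its location), and non-speech sounds.
# All field names and values are illustrative assumptions.
captioning_record = {
    "video": "news_bulletin_clip.mp4",
    "speech": [
        {"start": 0.0, "end": 4.2, "speaker": "anchor",
         "text": "Good evening, here are tonight's headlines."},
    ],
    "on_screen_text": [
        {"start": 0.0, "end": 10.0, "text": "BREAKING NEWS",
         "bbox": [40, 620, 420, 680]},  # location of the text in the frame
    ],
    "sound_events": [
        {"start": 4.2, "end": 6.0, "description": "news theme music plays"},
    ],
}
```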

Next Steps

The dataset types above are just a few examples of the custom datasets Appen has created for our clients over 25 years to suit their multimodal training data needs.

Our team brings extensive experience in designing datasets and collection methodologies that ensure diversity, annotation processes that deliver optimal data enrichment, and the tools to support linked annotation across modalities.

Contact us today to unlock the full potential of your multimodal AI models with Appen's unrivalled expertise.
