Types of Errors We See with Training Data: How to Recognize and Avoid Common Data Errors
It’s helpful to contrast AI development with traditional software development. In traditional software, you write code that’s deterministic (i.e., every time you run it with the same inputs, you’ll receive the same outputs). But with AI development, it’s not the code that’s most important, it’s the data. Specifically, the labeling of the data.
High-quality, accurately labeled data is crucial to building a high-performing model. But poor-quality data isn’t always obvious. To illustrate this, let’s start by defining what training data actually is. Each unit of data contains a file (the image, text, snippet of audio, or video), file attributes (the labels assigned to the file that give it meaning), and the attributes of the label (including when it was labeled, who labeled it, and under what conditions).
For instance, let’s say we’re building a model that uses LiDAR data. LiDAR works by sending out pulses to capture distances between the sensor and target objects, like cars or pedestrians. When working with LiDAR, an example task for an annotator may be to draw a 3D bounding box, or cuboid, around a car. Training data for this model might feature a JSON file that specifies where the cuboid is located, its dimensions, and what it contains (in this case, a car). During this annotation process, there are numerous opportunities for errors to be introduced. Being cognizant of these potential errors will help you create a more complete, representative dataset.
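To make this concrete, here is a minimal sketch of what one such annotation record might look like. The field names and values are illustrative assumptions, not a standard LiDAR annotation schema:

```python
import json

# Hypothetical annotation record for one LiDAR cuboid.
# Field names are illustrative, not a standard schema.
annotation = {
    "label": "car",                                   # what the cuboid contains
    "position": {"x": 12.4, "y": -3.1, "z": 0.9},     # cuboid center, meters
    "dimensions": {"width": 1.8, "height": 1.5, "depth": 4.2},
    "yaw": 0.35,                                      # rotation around the z-axis, radians
    "annotator_id": "a-1042",                         # who labeled it
    "labeled_at": "2021-06-01T14:22:05Z",             # when it was labeled
}

print(json.dumps(annotation, indent=2))
```

Note that the record captures all three parts of a unit of training data described above: the file’s content (the cuboid geometry), its label, and the attributes of the label itself.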
Three Common Data Errors
The following errors in training data are three of the most common Appen sees during the annotation process.
1. Labeling Errors
Labeling errors are one of the most common issues in developing high-quality training data. There are several types of labeling errors that can occur. Imagine, for example, that you provide your data annotators with a task: draw bounding boxes around the cows in an image. The intended output is a tight bounding box around each cow. Here are the type of errors that may occur with this task:
Missing labels: The annotator misses placing a bounding box around one of the cows.
Incorrect fit: The bounding box isn’t tight enough around each cow, leaving unnecessary gaps around them.
Misinterpretation of instructions: The annotator places one bounding box around all of the cows in the image, instead of one for each cow.
Handling occlusion: The annotator draws the bounding box at the expected full size of a partially hidden cow, instead of around only the visible portion of the cow.
These types of mistakes can occur with many types of projects. It’s essential to provide annotators with clear instructions to avoid these scenarios.
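Some of these errors can be caught programmatically. For example, fit errors can be flagged by comparing a submitted box against a trusted gold-standard box using intersection-over-union (IoU), a standard overlap metric. This sketch assumes axis-aligned 2D boxes in `(x1, y1, x2, y2)` form:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the overlapping region (if any).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

gold = (50, 40, 150, 120)       # trusted "tight" box around the cow
submitted = (30, 20, 180, 140)  # annotator's loose box with unnecessary gaps
print(round(iou(gold, submitted), 2))  # prints 0.44 -- below a typical pass threshold
```

A loose box scores well below 1.0 against a tight one, so reviewing any submission with IoU under a chosen threshold (say 0.8) is a cheap way to surface incorrect-fit errors.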
2. Unbalanced Training Data
The composition of your training data is something you’ll want to carefully consider. An unbalanced dataset causes bias in model performance. Data imbalance occurs in the following cases:
Class imbalance: This happens when you don’t have a representative set of data. If you’re training your model to identify cows, but you only have image data of dairy cows on a sunny, green pasture, your model will perform well at identifying cows under those conditions but poorly under any others, such as different breeds, weather, or terrain.
Data recency: All models degrade over time because the real world environment evolves. A perfect real-life example of this is the coronavirus. If you searched “corona” in 2019, you would likely see results for Corona beer listed at the top. In 2021, however, the results would be filled with articles on the coronavirus. A model needs to be regularly updated on new data as changes like this occur.
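Class imbalance is straightforward to detect before training: tally the labels per class and flag any class whose share falls below a threshold. The labels and the 10% cutoff here are illustrative assumptions:

```python
from collections import Counter

# Hypothetical labels drawn from an image dataset's annotations.
labels = ["dairy_cow"] * 950 + ["beef_cow"] * 40 + ["calf"] * 10

counts = Counter(labels)
total = sum(counts.values())
for cls, n in counts.most_common():
    share = n / total
    # Flag classes below an (assumed) 10% representation threshold.
    flag = "  <-- under-represented" if share < 0.10 else ""
    print(f"{cls:10s} {n:5d}  ({share:.1%}){flag}")
```

Running a check like this before each training cycle makes it obvious when more examples of a rare class need to be collected or labeled.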
3. Bias in Labeling Process
When we talk about training data, bias often gets brought up. Bias can be introduced in the labeling process if you’re using a very homogenous group of annotators, but also when your data requires specific knowledge or context for accurate labeling. For example, let’s say you want an annotator to identify breakfast foods in images. Included in your dataset are images of popular dishes from around the world: black pudding from the UK, hagelslag (sprinkles on toast) from the Netherlands, and Vegemite from Australia. If you ask American annotators to label this data, they would likely struggle to identify these dishes and would likely make mistaken judgments on whether or not they were breakfast foods. The result would be a dataset biased toward an American mindset. Ideally, you’d instead have annotators from around the world to ensure that you’re capturing accurate information for each culture’s dish.
As an AI practitioner, what can you do to avoid these common errors? Implement quality checks throughout your data labeling process to ensure you’re catching mistakes before they influence your model. You may choose to leverage AI to double-check annotators’ judgments (a method known as smart labeling) before they submit them. And always have a human-in-the-loop to monitor model performance for any bias. Reducing bias is paramount: in addition to recruiting a diverse group of annotators (with the domain knowledge your data requires), there are several other ways to reduce bias in your data.
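One common quality check worth sketching is inter-annotator agreement: route the same item to several annotators, accept the majority label when agreement is high, and queue low-agreement items for expert review. The function name and the 70% threshold below are illustrative assumptions, not a specific Appen workflow:

```python
from collections import Counter

def consensus(judgments, min_agreement=0.7):
    """Majority label, agreement ratio, and whether the item passes review.

    judgments: labels submitted by different annotators for the same item.
    min_agreement: assumed threshold below which the item needs expert review.
    """
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(judgments)
    return label, agreement, agreement >= min_agreement

# Three of four annotators agree -> 75% agreement, accepted.
print(consensus(["cow", "cow", "cow", "horse"]))  # prints ('cow', 0.75, True)
```

Items that fail the threshold are exactly the ones most likely to involve the errors above, such as ambiguous occlusion or unclear instructions, so routing them to a reviewer concentrates human effort where it matters.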