
Dynamic Judgments: Accelerate Your Data Pipeline

Published on March 13, 2025

Introduction: The Wisdom of the Crowd

The concept of “wisdom of the crowd” is key to many human-in-the-loop tasks: agreement among multiple people is a robust way to resolve ambiguous questions. Traced back to Aristotle and popularized by Francis Galton in the early 20th century, this concept is often used to find the best answer for ambiguous units in data annotation.

In a famous experiment, Francis Galton observed that when 800 people guessed the weight of an ox, the median estimate of 1,207 pounds was remarkably accurate, deviating by less than 1% from the true weight of 1,198 pounds.

This demonstrated that aggregating the opinions of many non-experts can yield results as reliable as those of a single expert. The same principle underpins community-driven platforms such as Wikipedia, Quora, and Reddit.

The Challenge in Data Annotation

For AI training data, the wisdom of the crowd means that when a task doesn’t require deep domain knowledge, collecting inputs from multiple well-trained contributors usually leads to high-quality results.

The key question is how many non-expert judgments are needed to confidently trust the final decision. Collecting as many as possible isn’t always practical since we must balance cost, quality, and scope.

Deciding how many people to ask in order to get a fair answer can be tricky. It’s common to collect up to 10 judgments for complex, subjective tasks like content moderation. Simpler tasks usually need fewer, but even then contributors sometimes fail to agree.

You might choose to gather a fixed set of 3 judgments, but if your contributors don’t reach agreement, you’ve essentially paid for those judgments without getting a final label. To avoid this, you might be tempted to systematically collect more, say 10 judgments, to guarantee a critical mass for agreement.

While overcollecting increases the chances of reaching agreement, it slows down your process and raises costs without improving quality. Dynamically detecting when agreement has been reached, and when more judgments are needed, speeds up project delivery while maintaining high-quality results.

Our “Dynamic Judgment” feature in Appen’s AI Data Platform (ADAP) lets you set a minimum and maximum number of judgments per unit as a basic setting, with additional advanced settings for finer control.

Cost Efficiency vs. High Confidence Approaches

The maximum number of judgments can be set as a fixed number (e.g., collect 3 judgments, up to 5) or as a confidence score (e.g., collect judgments until a confidence score of 0.8 is reached). The former collects at most an exact number of judgments, while the latter keeps judgments coming in until the collected data reaches the set confidence threshold. Confidence is a score automatically calculated in your ADAP job; it reflects the level of agreement between contributors, weighted by their trust scores (similar to inter-annotator agreement), and indicates how much trust you can place in the aggregated judgment.
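To make the idea concrete, here is a minimal sketch of how a trust-weighted confidence score could be computed. ADAP’s exact formula isn’t shown here; the weighting scheme, function name, and data shapes below are illustrative assumptions.

```python
from collections import defaultdict

def confidence(judgments):
    """Aggregate judgments for one unit and return (best_answer, confidence).

    `judgments` is a list of (answer, trust) pairs, where trust is the
    contributor's trust score in [0, 1]. Confidence here is the
    trust-weighted share of the winning answer -- an illustrative
    stand-in for ADAP's internal calculation.
    """
    weights = defaultdict(float)
    for answer, trust in judgments:
        weights[answer] += trust
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / total

# Three contributors agree; one lower-trust contributor disagrees:
label, score = confidence([("chihuahua", 0.9), ("chihuahua", 0.8),
                           ("muffin", 0.6), ("chihuahua", 0.85)])
print(label, round(score, 2))  # chihuahua 0.81 -> clears a 0.8 threshold
```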

Imagine you use the fixed number as a limit, setting a minimum of 3 judgments and a maximum of 5. If agreement is reached early, fewer judgments are collected, which helps limit costs. However, confidence may vary across units, since it depends on the number of judgments each unit receives.

If you instead use a confidence threshold as the limit and set it to 0.8, the system keeps collecting judgments until that confidence level is met. Some units may reach the threshold with just 3 judgments, while others may require more (e.g., 7). This ensures consistent reliability across units but can increase costs when agreement takes longer to achieve. The choice between the two depends on whether cost efficiency or high confidence is the priority.
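The two stopping rules can be expressed as a single collection loop. The sketch below builds on the `confidence` helper above; `get_judgment`, the parameter names, and the stopping logic are illustrative assumptions, not ADAP’s internal implementation.

```python
import random

def collect_dynamically(get_judgment, min_judgments=3, max_judgments=7,
                        threshold=0.8):
    """Request judgments until the confidence threshold is met.

    `get_judgment` is a callable returning one (answer, trust) pair,
    e.g. a request routed to the next available contributor. Stops early
    once the threshold is reached (after the minimum), or at the hard cap.
    """
    judgments = []
    while len(judgments) < max_judgments:
        judgments.append(get_judgment())
        if len(judgments) >= min_judgments:
            _, score = confidence(judgments)  # helper from the sketch above
            if score >= threshold:
                break
    return confidence(judgments), len(judgments)

def simulated_contributor():
    """Stand-in for requesting one judgment from the crowd."""
    answer = "cat" if random.random() < 0.85 else "dog"
    return answer, random.uniform(0.7, 1.0)  # (answer, trust score)

(label, score), used = collect_dynamically(simulated_contributor)
print(label, round(score, 2), f"after {used} judgments")
```

Units where contributors quickly converge stop at the minimum, while contentious units keep collecting up to the cap, which is exactly the cost/confidence trade-off described above.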

Handling Complex Job Designs

Whichever approach you choose, cost efficiency or high confidence, you can further refine how many judgments are collected.

Some job designs are extremely complex and involve multiple questions within a single unit, requiring selective application of dynamic judgments. With ADAP, job designers can specify which parts of a job should dynamically collect answers.

For example, if annotators must classify an image as either a Chihuahua or a muffin and also count how many are in the image, you can apply dynamic judgments only to the classification question while using a fixed number of judgments for the count. Alternatively, you could seek agreement for both questions. This flexibility ensures agreement is reached efficiently on the most critical aspects of your task.
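As a rough sketch of what selective application looks like, the snippet below attaches a per-question policy to the Chihuahua-or-muffin example. The configuration keys and `needed_more` helper are hypothetical; in practice, ADAP exposes this through job design settings rather than code like this.

```python
# Hypothetical per-question settings for the Chihuahua-or-muffin job.
# Key names ("mode", "threshold", etc.) are illustrative, not ADAP's schema.
questions = {
    "classification": {"mode": "dynamic", "min": 3, "max": 7, "threshold": 0.8},
    "count":          {"mode": "fixed",   "judgments": 3},
}

def needed_more(question, judgments):
    """Decide whether a question still needs judgments, per its settings."""
    cfg = questions[question]
    if cfg["mode"] == "fixed":
        return len(judgments) < cfg["judgments"]
    if len(judgments) < cfg["min"]:
        return True
    _, score = confidence(judgments)  # helper from the first sketch
    return score < cfg["threshold"] and len(judgments) < cfg["max"]
```

Here only the classification question keeps collecting until agreement is reached, while the count stops at a flat 3 judgments.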

What Dynamic Judgment Means for Your Project

Leveraging the wisdom of the crowd ensures reliable decisions through agreement among contributors. By dynamically adjusting AI data collection against confidence thresholds, ADAP’s Dynamic Judgment feature maximizes quality while minimizing unnecessary judgments, cutting costs and speeding up delivery without compromising results.

If you’re ready to optimize your training data, then let’s talk.
