In November 2022, the public release of ChatGPT, the now widely recognized Large Language Model (LLM), along with its inherent alignment challenges, marked a significant turning point in public interest in the "human-in-the-loop" process.
Until then, the art of AI data preparation, labeling, human judgment, and human computation remained largely mysterious to most. These tasks were carried out primarily within the data science organizations of large tech companies, which worked discreetly to develop machine learning models, often in partnership with companies like Appen. Data operations, or "data ops," was still a niche activity, learned in the field rather than formalized by major consulting firms.
With the rise of LLMs and the increasing awareness of the working conditions of click workers—who endlessly collect, annotate, or review biased or harmful data to feed these large models—the public has grown more curious about why and how data is prepared.
Data ops is an art where quality is key.
Far from being a straightforward process, preparing data for models involves multiple steps, each contributing to the overall quality of the output. When preparing data, we always focus on its intended purpose: data consumption. We don't prepare, analyze, label, evaluate, compare, or rank data just for the sake of it, but with the goal of making it as consumable as possible. For data to be ingestible, it must be accurate, explainable, and structured. Discrepancies or differing opinions on labels for a given data point may arise, but these should not result from errors.
Our mission in data ops is to ensure, through guardrails and quality control measures, that the data points are reliable. However, there is no one-size-fits-all approach, and depending on the type of data we are handling and its ultimate use, we design tailored strategies accordingly.
When implementing quality in human tasks, control is often seen as the main lever. Ensuring that human contributors perform well, stay focused, and follow strict guidelines is crucial to achieving quality. In reality, multiple levers need to be pulled simultaneously to unlock higher quality.
As we design human tasks, there's a common misconception that more control automatically leads to better quality. But quality isn't achieved through control alone. We often concentrate on controlling human contributors instead of doing everything possible to help them deliver data at the expected level of quality. Simply adding more rounds of QA after contributors' work is done, as so often happens, does little to help them.
Instead, creating favorable conditions for human workers to input higher-quality data will be much more beneficial. This approach reduces the need for extensive QA, minimizes rework, and lowers the attrition rate among workers. Well-known concepts such as risk mitigation and operational excellence can be adapted to enhance quality in data preparation.
The typical process of completing data preparation with humans-in-the-loop involves curating a crowd, designing a task, gathering human contributions, and delivering the expected output. By introducing quality improvement mechanisms at each of these steps, we can move the quality needle far more effectively than by merely expanding the QA review phase. Ideally, the earlier in the process we intervene, the better, as this broadens the scope for compliant inputs and reduces the risk of discarding judgments later.
We should approach this process by thinking backward from the output: if units reviewed during the QA phase are of poor quality, it’s because we allowed issues to arise at earlier stages. Every time contributors engage with a task, it’s an opportunity to support them in delivering the highest quality, reducing the number of units that end up rejected by the reviewer.
Engage with contributors
By treating contributors as partners and striving to make their work easier, we tend to reduce the number of low-quality units in the output.
Engaging with contributors can be done in several ways:
- It can involve clarifying the instructions or improving their usability: Best practices include developing instructions with clear steps, helpful tips, and multiple examples that show both positive and negative cases, including main, corner, and edge cases. Enhancing contributors' ability to navigate dense guidelines efficiently by offering direct communication channels or chatbot-like solutions to "converse" with the instructions can also make them more actionable, reducing the risk of errors.
- It can aim to prevent the submission of irrelevant answers: Using Smart Validators on an ad hoc basis or integrating them into our Smart Text feature helps prevent the submission of incorrect answers by detecting inaccurate language, misspellings, gibberish, duplicates, and more (a simplified sketch of this kind of check follows this list).
- It can involve enhancing contributors' abilities: AI models can support contributors by helping them focus on what truly matters. For example, pre-annotating data for their review or assisting them in reviewing their answers before submission can streamline their work.
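To make the idea of pre-submission checks concrete, here is a minimal sketch in Python. The heuristics, thresholds, and function names are hypothetical and deliberately simple; they are not the implementation behind Smart Validators.

```python
import hashlib
import re

# Hypothetical pre-submission checks: a tiny stand-in for the kind of
# validation described above, not Appen's actual Smart Validators.

seen_hashes: set[str] = set()  # hashes of previously accepted submissions


def looks_like_gibberish(text: str) -> bool:
    """Flag strings with almost no vowels or long repeated-character runs."""
    letters = re.sub(r"[^a-zA-Z]", "", text)
    if not letters:
        return True
    vowel_ratio = sum(c in "aeiouAEIOU" for c in letters) / len(letters)
    has_long_run = re.search(r"(.)\1{4,}", text) is not None
    return vowel_ratio < 0.2 or has_long_run


def validate_submission(text: str, min_words: int = 5) -> list[str]:
    """Return a list of issues; an empty list means the answer can be submitted."""
    issues = []
    if len(text.split()) < min_words:
        issues.append(f"Answer is shorter than {min_words} words.")
    if looks_like_gibberish(text):
        issues.append("Answer looks like gibberish.")
    digest = hashlib.sha256(text.strip().lower().encode()).hexdigest()
    if digest in seen_hashes:
        issues.append("Duplicate of a previous submission.")
    else:
        seen_hashes.add(digest)
    return issues


print(validate_submission("asdfgh qqqqq zxcv"))  # flagged: too short and gibberish
```

Because checks like these run before submission, contributors get immediate feedback and can correct their answer instead of having it rejected later during QA.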
Feedback, whether automated or human, is the key ingredient for improving quality. It trains contributors, reinforces good behavior, and uses incorrect responses to identify areas for improvement.
Monitor contributors’ behavior
Once we have everything in place to enhance contributors' ability to submit their best judgments, we can shift our focus to monitoring how they are actually performing.
The most common high-level approaches include manually reviewing a sample of labeled data, comparing how different contributors agree with each other to create consensus, or benchmarking their judgments against a ground truth.
Each of these techniques has its pros and cons: when reviewing a sample, there's no guarantee the data reviewed will be representative, which is why developing slicing strategies is crucial to ensure the data you review is insightful. When calculating inter-annotator agreement with methods like Fleiss' Kappa, it's important to account for chance agreement and possible false positives. Lastly, creating a reliable ground truth is often time-consuming.
However, investing in a combination of these strategies will help keep your task on solid ground. Carefully designed and diverse test questions can monitor contributors’ consistency throughout the task. Inter-annotator agreement among accurate contributors can provide insight into crowd consensus. Additionally, relevant data slicing to focus QA efforts on specific cases will ensure genuine agreement between trustworthy workers.
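For readers unfamiliar with chance-corrected agreement, the sketch below computes Fleiss' Kappa directly from a matrix of rating counts. It is a plain implementation of the standard formula and is independent of any particular annotation platform.

```python
import numpy as np


def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' Kappa for an (items x categories) matrix of rating counts.

    counts[i, j] = number of annotators who assigned item i to category j.
    Every row must sum to the same number of annotators n.
    """
    n = counts.sum(axis=1)[0]                                   # annotators per item
    N = counts.shape[0]                                         # number of items
    p_j = counts.sum(axis=0) / (N * n)                          # category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))   # per-item agreement
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()               # observed vs. chance
    return (P_bar - P_e) / (1 - P_e)


# 4 items, 3 annotators each, 3 label categories
ratings = np.array([[3, 0, 0],
                    [2, 1, 0],
                    [0, 3, 0],
                    [1, 1, 1]])
print(round(fleiss_kappa(ratings), 3))  # -> 0.268
```

A Kappa close to 0 means the observed agreement is roughly what chance alone would produce, which is exactly the false-positive risk mentioned above.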
Quality at the core
Our approach at Appen is to envision tech solutions and combine them to streamline the data preparation process.
We base our product development on research that spans disciplines, from psychology and game theory to mathematics and data science.
We don’t seek to implement AI just for the sake of it but always start by addressing the problems we need to solve.
Below are examples of how we improve quality by implementing innovative and slightly unconventional solutions.
1. Curate AI experts at the college level: Assessment AI + math experts for HBGM
A common way to assess the domain expertise of workers is to give them multiple-choice questionnaires (MCQs), but creating these exam materials is extremely time-consuming. We need to ensure that the questions are highly relevant and well-scoped to accurately assess the workers' mastery of their domain. Additionally, we aim to refresh these quizzes frequently, not only to keep up with evolving domains but also to prevent the correct answers from being shared among workers.
To tackle this, we developed a prompt engineering and human input bootstrapping approach to generate domain quizzes at scale – explore this technique in more detail in the Chain of Thought prompting eBook. Our validation study showed that it is possible to save up to 30 hours when creating 150 questions, and we expect the savings to grow as demand for domain-specific MCQs increases. Importantly, these time savings do not come at the expense of quality or factual correctness—the AI-generated MCQs meet the same standards as those created solely by humans. In our evaluation, 93.1% of AI-generated MCQs were considered of good quality, compared to 87.3% of human-generated MCQs, and factual correctness was equivalent between the two types.
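As a rough illustration of how such a bootstrapping step might look, here is a minimal sketch. The prompt wording, the `call_llm` placeholder, and the JSON schema are assumptions for illustration, not the actual pipeline described in the eBook; the key idea is seeding the model with human-written examples and keeping an expert review step afterward.

```python
import json

# Placeholder for whatever model client is actually used; plug in your own.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("connect your LLM client here")


MCQ_PROMPT = """You are an exam writer for {domain}.
Here are {k} expert-written example questions showing the expected style and difficulty:
{seed_questions}

Write {n} new multiple-choice questions on {topic}.
Think step by step about what makes each distractor plausible, then output JSON only:
a list of objects with keys "question", "options", and "answer_index".
"""


def generate_mcqs(domain: str, topic: str, seed_questions: list[str], n: int = 5) -> list[dict]:
    """Bootstrap new quiz items from a handful of human-written seed questions.

    Generated items are drafts: a domain expert still reviews them before use,
    mirroring the human validation step described above.
    """
    prompt = MCQ_PROMPT.format(domain=domain, topic=topic, n=n,
                               k=len(seed_questions),
                               seed_questions="\n".join(seed_questions))
    return json.loads(call_llm(prompt))
```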
2. Leverage multimodal approach to review uploaded images before submission
Data collection tasks are usually long-running and large-scale, making it difficult to ensure high quality by relying solely on a sampling-based QA strategy. These tasks are also hard to guardrail using test questions, as it’s not always easy to benchmark collected data against a ground truth. This often leads to overcollection, which negatively impacts both project timelines and budgets.
Post-processing techniques are a good initial step to spot potential quality issues in the collected data, but they don’t prevent overcollection. This is why we’ve developed solutions to stop data submission by workers if it doesn’t meet the guidelines. We can either develop specific machine learning models or rely on LLMs with a specially engineered prompt. The latter solution allows us to quickly adapt to a wide variety of situations.
In recent projects, we used different LLMs to review answers before submission and highlight non-compliant elements in the contributors’ attempted submissions. This approach tackles three issues in one go: we prevent overcollection, increase the quality of the collected data, and train contributors on what is expected.
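A simplified sketch of this kind of pre-submission gate is shown below. The guideline checklist, the prompt wording, and the `ask_multimodal_llm` placeholder are illustrative assumptions rather than a production setup.

```python
# Illustrative only: `ask_multimodal_llm` stands in for whatever vision-capable
# model endpoint a project uses, and the checklist mimics hypothetical
# collection guidelines.

REVIEW_PROMPT = """You are reviewing an image submitted for a data collection task.
Guidelines:
- The subject must be fully visible and in focus.
- No faces or other personally identifiable information may appear.
- The image must not be a screenshot or a photo of a screen.

List every guideline the image violates. If none, reply exactly: COMPLIANT.
"""


def ask_multimodal_llm(prompt: str, image_bytes: bytes) -> str:
    raise NotImplementedError("connect your multimodal model client here")


def review_before_submission(image_bytes: bytes) -> tuple[bool, str]:
    """Return (accepted, feedback) so the contributor can fix issues and retry."""
    verdict = ask_multimodal_llm(REVIEW_PROMPT, image_bytes).strip()
    if verdict == "COMPLIANT":
        return True, "Thanks, your image meets the guidelines."
    return False, f"Please address the following before resubmitting:\n{verdict}"
```

Returning the model's feedback to the contributor doubles as on-the-job training, which is the third benefit mentioned above.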
3. Use LLMs to evaluate annotators' input using a rubric approach
Quality assurance is a costly process that takes time, especially if you want to be thorough and review all submissions. We need to enhance the reviewers' ability by helping them focus only on the submissions that meet most of the requirements and are worth reviewing for feedback.
In some cases, the quality of submissions is so far from expectations that it's not worth deliberating on whether they should be reviewed at all. For such cases, we use LLMs, along with rubrics and prompt engineering, to identify submissions that we can confidently mark as unworthy of review. When adopting this approach, we carefully derive the rubrics from the guidelines to avoid discrepancies between human and LLM judgments. We also prioritize a low false positive rate and high accuracy, as we don’t want to discard judgments the models are uncertain about.
This approach improves quality in multiple ways: we quickly identify contributors who shouldn’t participate in the task, save review time for QA specialists or project managers, and increase their capacity to focus on improving the quality of the relevant submissions.
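The sketch below illustrates the general shape of such a rubric-based triage. The rubric items, the threshold, and the `call_llm` placeholder are assumptions for illustration; in practice the rubrics are derived from each project's guidelines, as noted above.

```python
# Rubric-based triage sketch: each criterion is judged by an LLM, and only
# submissions that clearly fail most criteria are filtered out before review.

RUBRIC = [
    "The response follows the required output format.",
    "The response is written in fluent, natural language.",
    "The response addresses every part of the instructions.",
]


def call_llm(prompt: str) -> str:
    raise NotImplementedError("connect your LLM client here")


def score_submission(submission: str) -> list[bool]:
    """Ask the model to judge each rubric criterion with a strict YES/NO answer."""
    results = []
    for criterion in RUBRIC:
        prompt = (f"Criterion: {criterion}\n\nSubmission:\n{submission}\n\n"
                  "Answer YES only if the criterion is clearly met, otherwise NO.")
        results.append(call_llm(prompt).strip().upper().startswith("YES"))
    return results


def needs_human_review(submission: str, min_passed: int = 2) -> bool:
    """Route to human reviewers unless the submission clearly fails most criteria.

    Keeping the bar for auto-rejection high mirrors the low-false-positive goal
    described above: when in doubt, a human still looks at the submission.
    """
    return sum(score_submission(submission)) >= min_passed
```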
4. Entropy-informed LLM annotation
Replacing humans with AI wherever possible might sound cheaper and sometimes more reliable, and it's a common request from customers looking to leverage human data. However, the better approach is to augment human capability, allowing for the annotation of large volumes of diverse data in less time, at lower cost, and with improved quality. Of course, LLMs can generate relevant predictions about the class of a snippet, image, or video, but this alone isn’t enough. We still need humans involved to kickstart the process and ensure the model’s outputs are accurate and make sense, as LLMs can still make mistakes.
One challenge is that LLMs don’t provide a confidence level with their answers, so we needed to find a way to use LLMs to ease the data annotation process without compromising the relevance of the output.
We developed an approach that combines multiple prompts and/or multiple LLMs and calculates the entropy of the resulting predictions to decide whether the AI's annotation is reliable enough or requires human review. Our field studies show that we can maintain an accuracy level of 87% while saving up to 62% of AI data annotation costs and reducing the required time by a factor of 3.
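The sketch below shows the core of an entropy-based routing decision, assuming the ensemble's labels for an item have already been collected. The threshold value is illustrative, not the one used in our field studies.

```python
import math
from collections import Counter


def prediction_entropy(labels: list[str]) -> float:
    """Shannon entropy (in bits) of the label distribution produced by several
    prompts and/or several LLMs for the same item."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def route_annotation(labels: list[str], threshold: float = 0.8) -> str:
    """Accept the majority label when the ensemble agrees, otherwise send the
    item to a human annotator."""
    if prediction_entropy(labels) <= threshold:
        return Counter(labels).most_common(1)[0][0]
    return "HUMAN_REVIEW"


print(route_annotation(["cat", "cat", "cat", "cat"]))   # confident -> "cat"
print(route_annotation(["cat", "dog", "cat", "bird"]))  # uncertain -> "HUMAN_REVIEW"
```

Low entropy means the prompts and models converge on the same label, so the annotation can be accepted automatically; high entropy signals disagreement and routes the item to a human.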
Quality is about combining all tools
Quality is rarely the result of a single tool in a process; instead, it comes from how effectively you combine the most relevant tools at the critical stages. Neither test questions alone nor dynamic judgments alone can achieve the quality level necessary to ensure the data feeding your models is top-notch.
To ensure the right data at the end of the process, you need to focus on:
- The right workers
  - Source or identify contributors.
  - Develop domain expertise evaluation tools.
- The right job design
  - Plan the task flow and break questions into small increments to reduce cognitive load.
  - Implement smart routing for your data.
  - Include features that automate outlier detection:
    - Use test questions or honeypots to monitor individual contributors' performance.
    - Decide how many judgments per unit you need, considering dynamic judgment solutions.
    - Use inter-annotator agreement to ensure the crowd is heading in the right direction.
    - Prevent submission of erroneous answers (e.g., improper formatting, language issues, non-compliance with guidelines, rubric-based evaluations).
- The right execution
  - Use behavioral metrics to maintain crowd quality.
  - Conduct spot-check campaigns.
  - Set up alerts for specific issues (e.g., missing test questions, time per page).
- The right QA approach
  - Randomly sample data.
  - Use advanced analyses based on multiple data points, such as time per unit, inter-annotator agreement, accuracy on specific classes, or unbalanced answer distributions.
  - Define the objective of your QA:
    - Correct inaccurate data.
    - Educate the crowd and provide feedback.
In summary, the most successful human data campaigns are those that combine multiple quality tools, securing each step of the process and minimizing flaws along the way.