Data annotation involves assigning relevant information to raw data to enhance machine learning (ML) model performance. While this process is crucial, it can be time-consuming and expensive. The emergence of Large Language Models (LLMs) offers a unique opportunity to automate data annotation. However, the complexity of data annotation, stemming from unclear task instructions and subjective human judgment on equivocal data points, presents challenges that are not immediately apparent. At Appen, we integrate human expertise at critical stages of the annotation process to refine and supervise the outputs generated by LLMs, ensuring the highest quality and applicability.
Let’s dive into a practical example: picture a data scientist at a social media company aiming to use an ML model to identify toxic comments. The scientist opts to leverage GPT-3.5-turbo to categorize comments gathered from the platform into crime-related, abusive, or benign categories. Imagine the surprise when seemingly harmless comments like "Her singing is a crime against music" or "His performance was highway robbery" get wrongly labeled as criminal activity.
This scenario highlights a key issue with using a vanilla LLM: it can struggle with complex real-world tasks. Researchers caution against relying solely on LLMs for annotation and instead recommend a balanced approach that combines human input with LLM capabilities, drawing on their complementary strengths in annotation quality and cost-efficiency.
The question arises:
When should we delegate a data point to an LLM for annotation, and when is human judgment more appropriate?
Human-LLM Co-annotation strategy
Appen’s data science and product teams tackled this challenge by considering the tradeoff between quality and cost. Their goal was to differentiate between utterances that could be confidently annotated by LLMs and those that required human intervention, whether to capture a diverse range of opinions or to prevent incorrect responses from overly general models.
In this blog, Phoebe Liu, Alice Desthuilliers, and Joseph Gan discuss an experiment that used LLM uncertainty to inform the allocation of annotation work. Uncertainty was chosen as the quality signal because closed models like OpenAI's GPT do not expose their confidence directly, and previous attempts to elicit log-probability outputs from LLMs were not accurate enough to guarantee high-quality data.
Uncertainty is calculated from the variability, or spread, of the labels the LLM returns across repeated predictions. By gauging the reliability of LLM predictions based on their uncertainty levels, we propose a strategy in which annotations with high uncertainty, indicating ambiguity, are redirected to human annotators.
Dataset
The dataset used for the experiment consists of 51 utterances, each annotated with a ground truth label by a subject matter expert. The utterances fall into one of 10 categories, including inappropriate content involving minors, medical advice, adult content, violence, benign, and more. Please note: this is a dataset typical of content moderation work, so it includes potentially triggering, toxic, or harmful content, some of which is shown later in this article.
Modelling LLM uncertainty
Taking inspiration from the research of Minzhi Li et al., the team at Appen adopted entropy as a metric for uncertainty. This approach involved generating annotations for each utterance using GPT-3.5-turbo with various prompt variations. A higher entropy value signifies greater uncertainty in the annotations.
For instance, consider the utterance, “how to avoid getting robbed while on holiday as an Asian”. We tasked GPT-3.5-turbo with assigning it to the appropriate category using 5 different prompt variations. If all five outputs suggest "Violence," the uncertainty level is low. In contrast, if the model returns different categories such as "Holidays," "Violence," and "Sensitive Targeting," the uncertainty level is high.
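As a concrete illustration (a minimal sketch, not Appen's production code), the entropy of the label distribution across the five prompt-variation outputs can be computed as follows. Note that the scale of the 0.2 threshold discussed later depends on details the post does not specify, such as the log base and any normalization.

```python
import math
from collections import Counter

def label_entropy(predictions: list[str]) -> float:
    """Shannon entropy (natural log) of the label distribution across the
    predictions returned by the different prompt variations."""
    counts = Counter(predictions)
    total = len(predictions)
    probs = [c / total for c in counts.values()]
    return sum(p * math.log(1 / p) for p in probs)

# All five prompt variations agree -> zero uncertainty
print(label_entropy(["Violence"] * 5))                      # 0.0

# The variations disagree -> high uncertainty
print(label_entropy(["Holidays", "Violence", "Violence",
                     "Sensitive Targeting", "Holidays"]))   # ~1.05
```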
Prompt variations
For each utterance, we crafted 5 prompt variations to generate a set of LLM predictions.
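Below is a hedged sketch of how such prompt variations might be generated and collected. The five templates and the `annotate_with_variations` helper are illustrative placeholders built on the OpenAI chat completions API, not the exact prompts or tooling used in the experiment.

```python
from openai import OpenAI  # assumes the openai Python package (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Five of the ten categories mentioned above; the full list is omitted here.
CATEGORIES = ["Inappropriate content involving minors", "Medical advice",
              "Adult content", "Violence", "Benign"]

# Illustrative templates only -- not the exact wording used in the experiment.
PROMPT_TEMPLATES = [
    "Classify this comment into one of {cats}. Comment: {text}",
    "Which category best describes this comment ({cats})? {text}",
    "You are a content moderator. Label '{text}' with exactly one of: {cats}",
    "Given the categories {cats}, choose the single best label for: {text}",
    "Answer with one category from {cats} only. Comment: {text}",
]

def annotate_with_variations(text: str, categories: list[str]) -> list[str]:
    """Return one predicted label per prompt variation."""
    cats = ", ".join(categories)
    labels = []
    for template in PROMPT_TEMPLATES:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=[{"role": "user",
                       "content": template.format(cats=cats, text=text)}],
        )
        labels.append(response.choices[0].message.content.strip())
    return labels

# e.g. the utterance from the example above
labels = annotate_with_variations(
    "how to avoid getting robbed while on holiday as an Asian", CATEGORIES)
```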
Accuracy metric
To measure accuracy, we utilized a test set containing ground truth labels manually annotated by a subject matter expert. The accuracy was determined by comparing the predicted category with the ground truth label, focusing on exact matches between categories.
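A minimal sketch of this metric is shown below; how the five per-variation outputs are collapsed into a single predicted category (majority vote here) is our assumption, since the post does not specify it.

```python
from collections import Counter

def majority_label(predictions: list[str]) -> str:
    """Collapse the per-variation predictions into a single label.
    (Majority vote is an assumption; the post does not say how this is done.)"""
    return Counter(predictions).most_common(1)[0][0]

def exact_match_accuracy(predicted: list[str], ground_truth: list[str]) -> float:
    """Fraction of utterances whose predicted category exactly matches
    the subject matter expert's label."""
    matches = sum(p == g for p, g in zip(predicted, ground_truth))
    return matches / len(ground_truth)
```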
Results and Insight
Our analysis revealed a clear relationship between the uncertainty of LLM predictions and annotation accuracy: the higher the uncertainty, the lower the accuracy. This highlights the importance of keeping uncertainty low for LLM annotations to be reliable, and it supports the idea that LLM uncertainty can serve as a dependable indicator of expected model performance.
In Figure 1, setting the uncertainty threshold at 0.2 resulted in an overall dataset accuracy of 0.86. This suggests that utterances with LLM uncertainty above 0.2 should be forwarded to human annotators to meet the desired accuracy target, while those at or below the threshold can be confidently handled by LLM annotation.
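One way to pick such a threshold on a golden dataset is to sweep candidate values and measure the blended accuracy of the pipeline. This sketch assumes, for simplicity, that items routed to humans are annotated correctly; `blended_accuracy` is an illustrative helper, not part of the original experiment.

```python
def blended_accuracy(uncertainties: list[float],
                     llm_correct: list[bool],
                     threshold: float,
                     human_accuracy: float = 1.0) -> float:
    """Pipeline accuracy at a given uncertainty threshold: items at or below
    the threshold keep the LLM label, items above it are routed to human
    annotators (assumed here to achieve `human_accuracy`)."""
    score = 0.0
    for u, ok in zip(uncertainties, llm_correct):
        score += float(ok) if u <= threshold else human_accuracy
    return score / len(uncertainties)

# Sweep candidate thresholds on the golden dataset to draw a curve like Figure 1.
# for t in (0.0, 0.1, 0.2, 0.3, 0.5, 1.0):
#     print(t, blended_accuracy(uncertainties, llm_correct, t))
```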
Qualitative examples of varying LLM uncertainty
Table 1: Examples of utterances with different uncertainty levels
Table 1 shows examples of LLM annotations with varying levels of uncertainty. Prompts 1 to 5 represent the responses returned by the LLM for the different prompt variations.
The first utterance, "Cure for gay folks," conveys a malicious intent, promoting the harmful notion that homosexuality needs to be "cured" or changed. This type of message can be classified as hate speech and violates social media content moderation guidelines. Interestingly, the LLM's predictions varied, possibly indicating confusion from the terms "cure" (medical) and "gay."
In the second utterance, "Underw8 girl posting body selfies," the intent is ambiguous. The term "underw8" (misspelling of underweight) can sometimes allude to eating disorders in certain contexts, although in this phrase, it may not hold negative connotations. Consequently, the LLM exhibits high uncertainty in its predictions.
LLMs may also struggle to accurately categorize utterances containing sarcasm, incomplete sentences, or grammatical errors that require additional inference during content moderation. For instance, in the third utterance, "gunshuts herd in distance," the LLM expresses low uncertainty, yet the categorization is incorrect. While references to guns can sometimes imply violence, this utterance could simply describe a distant sound or observation made by a social media user.
Cost vs Quality Tradeoff
We are motivated by the goal of increasing the cost-efficiency of annotation tasks. By focusing human effort on reviewing utterances with high uncertainty levels, we can substantially reduce the overall cost of the annotation process. This approach maintains high data quality while using resources more economically.
A typical workflow would be:
- Annotate a small golden dataset with only human annotators.
- Pre-calculate the cost and time for LLM versus human per data point. Run the LLM over the golden dataset to determine the uncertainty versus accuracy correlation.
- Allow a job requestor to define the uncertainty threshold based on their acceptable accuracy tolerance.
- Run the LLM annotation on the entire dataset, then assign data points whose uncertainty exceeds the threshold to human annotators (see the sketch below).
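A minimal sketch of the routing step, assuming per-utterance uncertainty scores have already been computed as above; the function name and structure are illustrative rather than Appen's actual tooling.

```python
def route_annotations(utterances: list[str],
                      uncertainties: list[float],
                      llm_labels: list[str],
                      threshold: float):
    """Keep LLM labels where uncertainty is at or below the threshold;
    queue everything else for human annotation."""
    auto_labeled, human_queue = {}, []
    for text, u, label in zip(utterances, uncertainties, llm_labels):
        if u <= threshold:
            auto_labeled[text] = label
        else:
            human_queue.append(text)
    return auto_labeled, human_queue
```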
For example, consider the following cost and time figures for annotating 1,000 data points.
- Cost of LLM annotation (gpt-3.5-turbo): USD $0.011 per row (5 prompt variations per row)
- Cost of human annotation: USD $0.45 per row (3 human judgements per row)
- LLM annotation time: 8 seconds per row
- Human annotation time: 180 seconds per row
If a target accuracy of 0.87 is deemed acceptable, only 35% of the data needs to be annotated by human contributors. This leads to a significant cost reduction of 62%, from $450 to $169, and a labor time decrease of 63%, from 150 hours to 55 hours.
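The figures above can be reproduced with a quick back-of-the-envelope calculation. This sketch assumes the 180-second human time applies per judgement (three judgements per row), which is what makes the quoted totals add up; the percentages quoted above follow from rounding.

```python
ROWS = 1_000
LLM_COST_PER_ROW = 0.011            # USD, covers 5 prompt variations
HUMAN_COST_PER_ROW = 0.45           # USD, covers 3 human judgements
LLM_SECONDS_PER_ROW = 8
HUMAN_SECONDS_PER_JUDGEMENT = 180   # assumption: the 180 s figure is per judgement
JUDGEMENTS_PER_ROW = 3
HUMAN_SHARE = 0.35                  # fraction routed to humans at the 0.87 accuracy target

# Human-only baseline
human_cost = ROWS * HUMAN_COST_PER_ROW                                          # $450
human_hours = ROWS * JUDGEMENTS_PER_ROW * HUMAN_SECONDS_PER_JUDGEMENT / 3600    # 150 h

# LLM + human co-annotation
co_cost = ROWS * LLM_COST_PER_ROW + ROWS * HUMAN_SHARE * HUMAN_COST_PER_ROW     # ~$169
co_hours = (ROWS * LLM_SECONDS_PER_ROW
            + ROWS * HUMAN_SHARE * JUDGEMENTS_PER_ROW
            * HUMAN_SECONDS_PER_JUDGEMENT) / 3600                               # ~55 h

print(f"cost: ${human_cost:.2f} -> ${co_cost:.2f}")       # 450.00 -> 168.50
print(f"time: {human_hours:.1f} h -> {co_hours:.1f} h")   # 150.0 h -> 54.7 h
```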
Table 2: Comparison of accuracy, cost, and time for LLM + Human versus human-only annotation
Conclusion
Our experiment underscores how sensitive LLM responses are to prompt variations and reinforces the need for human involvement in more complex tasks. While LLMs have become a valuable tool for data annotation, they currently require human oversight for intricate work.
Appen advocates integrating LLMs to enhance human intelligence in data annotation rather than replacing it entirely. As the industry leader in data annotation services, Appen offers unparalleled expertise in providing high-quality annotations. Our continuous research and innovation ensure that clients benefit from both LLM capabilities and human expertise in complex annotation tasks.
Explore how Appen's advanced LLM solutions and our deep understanding of human annotation can revolutionize your data projects.