
Boosting Data Quality with Appen's Human-centric AI Detector Model

Published on April 10, 2024

This blog post is part of our three-part series on human data in the AI training process.

In our Deciphering AI from Human Generated Text article, Appen discussed how generative AI has made it easy to produce text, speech, and audio that are difficult to distinguish from human-created content. This has become problematic for many crowdsourcing data platforms and AI model builders who require human-authored data.

Human data is generally more accurate, nuanced, diverse, representative, and higher quality than synthetic data. Today, we’ll discuss how Appen addresses this challenge by ensuring authentic data with our human-centric AI detector model.

Differentiating between human-generated and AI-generated content is paramount for many use cases, including academia, journalism, and notably training data for machine learning models.

Such training data involves subject matter experts writing content of varying lengths and topics, which is later used to train or evaluate AI models. Because data consumers need to be certain of their data's origin, it is critical that Appen implement measures ensuring that datasets created by human experts are authentic.

Following the release of the first Large Language Models (LLMs), the AI community quickly focused on the challenge of distinguishing between human and machine-created content. Soon after, AI detectors emerged, promising sufficient accuracy to serve as risk indicators. They were generally perceived as reliable truth sources across various applications.

Appen has since shown that three publicly available AI detectors fail to achieve satisfactory true positive and false positive rates for practical production use. This shortfall is particularly problematic for crowdsourcing use cases, which are significant to Appen and our customers.

Appen's benchmark results

In previous blog posts, Appen Data Scientists Arjun Patel and Phoebe Liu highlighted the issue of AI-generated text going undetected in crowdsourced data annotation and collection tasks. They noted that, while promising, current AI detection models heavily rely on contextual cues, resulting in unacceptably high false positive rates for data annotation contexts.  As LLMs become increasingly adept at producing text indistinguishable from human writing, the challenge of identifying synthetic content is expected to intensify.

Patel and Liu evaluated three different APIs, aiming for a false positive rate below 9% while maximizing the true positive rate by adjusting decision thresholds. Their findings show that although some models achieve impressive true positive rates, up to 91% in some cases, this came with concerningly high false positive rates, reaching 73% for the model with the best TPR, GPTZero's document-level detector. Constraining the false positive rate (FPR) to at or below 9% sharply reduced the achievable true positive rate (TPR), highlighting how difficult this trade-off is to manage.

Model              FPR    TPR
Sapling            0.07   0.05
GPTZero document   0.07   0.15
OpenAI GPT2        0.08   0.15

Table 1: Maximized TPR for FPR below 0.09
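
To make this threshold tuning concrete, here is a minimal sketch of how one might choose an operating point that maximizes TPR subject to an FPR ceiling of 0.09, using scikit-learn's ROC utilities. The variable names and synthetic scores are illustrative only, not the evaluation code used for these benchmarks.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Illustrative inputs: 1 = AI-generated, 0 = human-written.
# In practice these would be benchmark labels and detector scores.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
scores = np.clip(0.3 * y_true + rng.normal(0.4, 0.2, size=1000), 0, 1)

# roc_curve returns FPR/TPR for every candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Keep only operating points that satisfy the FPR ceiling,
# then pick the one with the highest TPR.
max_fpr = 0.09
feasible = fpr <= max_fpr
best = np.argmax(tpr[feasible])

print(f"threshold: {thresholds[feasible][best]:.3f}")
print(f"FPR: {fpr[feasible][best]:.3f}  TPR: {tpr[feasible][best]:.3f}")
```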

Given the advancements in LLMs, how can we effectively mitigate the risk of AI-generated texts in crowdsourced tasks without mistakenly flagging genuine human contributions as machine-generated?

Our observations of the human creative writing process

To understand how crowd workers create content and whether we could identify AI-generated text, we analyzed real-user behavior, focusing on common writing workflows used in authentic content creation, versus those aiming to circumvent the process with the aid of an LLM.

Authentic workflows in the annotation platform typically include:

  • Read a prompt, ideate briefly, and write a response from beginning to end.
  • Read a prompt, spend time writing and revising within the annotation platform, and submit the final piece.
  • Engage in typing "sprints" with rapid writing followed by slower, more thoughtful passages.
  • Get distracted while writing and navigate away from the task screen multiple times.

Conversely, workflows suggesting the use of LLM assistance may involve:

  • Write a prompt into an LLM playground and paste the generated response back into the annotation platform.
  • Write a prompt into an LLM playground, edit the generated response in the external playground, and paste the edited response back into the annotation platform.
  • Generate a response in an LLM playground externally and retype it verbatim into the task.
  • Generate a response in an LLM playground externally, paste it into the annotation platform, and make further edits.

By analyzing these workflows, we can gain clearer insights into detecting AI content, showing that a process-based approach can be more revealing than text analysis alone. Although we cannot directly observe whether writers use external tools, their device interactions often betray the use of such AI aids. This method not only helps us differentiate between human- and AI-written content but also leverages variation in the behavioral data, as contextual cues alone offer little lift and limited additional insight.
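
As an illustration of what such process-based signals might look like, the sketch below derives a few simple behavioral features from a hypothetical interaction event log (keystrokes, paste events, and focus changes). The event schema and feature names are assumptions made for illustration, not a description of Appen's production feature set.

```python
from dataclasses import dataclass

@dataclass
class Event:
    kind: str          # "keystroke", "paste", "focus_lost", "focus_gained"
    timestamp: float   # seconds since the task started
    chars: int = 0     # characters added by this event

def behavioral_features(events: list[Event]) -> dict[str, float]:
    """Summarize one writing session into simple process-based features."""
    pastes = [e for e in events if e.kind == "paste"]
    total_chars = sum(e.chars for e in events) or 1
    duration = max((e.timestamp for e in events), default=0.0)
    return {
        # Share of the final text that arrived via paste events:
        # one large paste is a strong hint of external drafting.
        "pasted_char_ratio": sum(e.chars for e in pastes) / total_chars,
        # How often the contributor left the task screen.
        "focus_losses": float(sum(e.kind == "focus_lost" for e in events)),
        # Overall typing rate; verbatim retyping tends to be fast and steady.
        "chars_per_second": total_chars / duration if duration else 0.0,
    }

# Example: a session dominated by one large paste after leaving the screen.
session = [
    Event("focus_lost", 5.0),
    Event("focus_gained", 65.0),
    Event("paste", 66.0, chars=850),
    Event("keystroke", 70.0, chars=1),
]
print(behavioral_features(session))
```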

A human-centric model based on a behavioral approach

Leveraging Appen's wealth of experience with a human-centric approach to data collection, we uncovered essential behavioral patterns among contributors on our platform.

We used these findings to develop and evaluate a machine learning model, incorporating key behavior features identified through statistical analysis. Performance assessments of our custom AI-detector model involved 5-fold stratified cross-validation, aggregating results across all folds and assessing metrics such as accuracy, F1 score, false positive rate (FPR), and true positive rate (TPR).
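
For readers who want to reproduce this style of evaluation on their own data, the sketch below runs 5-fold stratified cross-validation and aggregates accuracy, F1, FPR, and TPR across folds. The placeholder features, labels, and classifier are assumptions for illustration and do not reflect Appen's actual model or features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold

# Placeholder data standing in for behavioral features and human/AI labels.
X, y = make_classification(n_samples=2000, n_features=12,
                           weights=[0.7, 0.3], random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {"accuracy": [], "f1": [], "fpr": [], "tpr": []}

for train_idx, test_idx in cv.split(X, y):
    model = GradientBoostingClassifier(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])

    tn, fp, fn, tp = confusion_matrix(y[test_idx], preds).ravel()
    results["accuracy"].append(accuracy_score(y[test_idx], preds))
    results["f1"].append(f1_score(y[test_idx], preds))
    results["fpr"].append(fp / (fp + tn))
    results["tpr"].append(tp / (tp + fn))

# Aggregate each metric across the five folds.
for name, values in results.items():
    print(f"{name}: {np.mean(values):.2f}")
```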

Model              Accuracy   F1     FPR    TPR
Sapling            0.62       0.71   0.67   0.90
GPTZero            0.70       0.70   0.26   0.66
GPTZero document   0.61       0.71   0.73   0.91
OpenAI GPT2        0.51       0.31   0.16   0.21
Our Model          0.90       0.91   0.11   0.91

Table 2: Performance metrics for third-party APIs and our model

Our model surpasses all tested competitor APIs on various metrics, notably:

  • Outperforming the best API accuracy by 20 points.
  • Achieving the highest TPR overall while maintaining the lowest FPR, at 0.11.
  • Reducing the lowest competitor FPR by 0.05 points.

By setting a maximum FPR of 0.09, we match the best observed FPR of 0.07 while boosting TPR by an impressive 73 percentage points, representing a sixfold increase in performance!

Model              FPR    TPR
Sapling            0.07   0.05
GPTZero document   0.07   0.15
OpenAI GPT2        0.08   0.15
Our Model          0.07   0.88

Table 3: Maximized TPR for FPR below 0.09

The low FPR as a North Star

Maintaining an extremely low FPR was our guiding principle for this research, as detailed in a previous blog post. Our goal was to ensure trust among our contributors while minimizing the potential risk from malicious players. Therefore, it was critical to improve the TPR without compromising the mandated FPR. This balancing act aids in more effectively detecting AI-generated texts in tasks, while keeping the risk of false alerts minimal.

To understand how this improvement might be realized, especially in a low FPR context, let's consider the challenges our benchmarked third-party models faced. These models, primarily based on content, dealt with a high noise level concerning writing style, length, syntax, and other factors. They also faced potential bias risks, such as incorrectly flagging non-native English contributors' work as AI-generated more often than that of native English contributors.

Our model pivots toward a human-centered approach, revolving around behavioral patterns observed during creative writing. Its additional training features are more resilient in distinguishing AI-generated texts from human ones, particularly when contextual cues are ambiguous but behavioral signals are definite. Future iterations could integrate content-based features with behavioral ones to boost performance where behavioral signals are similar across classes but contextual cues vary.

Additionally, these models could form part of an AI detection pipeline, wherein the ML model's output, combined with historical contributor signals, could empower a subject matter expert to make a more informed decision on contributors potentially using an LLM as an aid in their content creation process.
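
One hypothetical way to wire such a pipeline together is sketched below: the detector's per-submission score is blended with simple historical signals for the contributor, and only sufficiently suspicious cases are routed to a subject matter expert for review. The field names, thresholds, and weighting are illustrative assumptions, not Appen's production logic.

```python
from dataclasses import dataclass

@dataclass
class Submission:
    contributor_id: str
    detector_score: float    # 0..1, higher = more likely AI-generated
    prior_flag_rate: float   # share of this contributor's past work flagged
    prior_submissions: int

def route_submission(sub: Submission,
                     score_threshold: float = 0.8,
                     combined_threshold: float = 0.6) -> str:
    """Decide whether a submission is accepted or escalated to SME review.

    Illustrative policy: the model alone never rejects work, it only
    escalates, and contributor history raises or lowers the bar.
    """
    # New contributors have little history, so rely on the model score alone.
    if sub.prior_submissions < 5:
        return "sme_review" if sub.detector_score >= score_threshold else "accept"

    # Blend the current score with the contributor's track record.
    combined = 0.7 * sub.detector_score + 0.3 * sub.prior_flag_rate
    if combined >= combined_threshold and sub.detector_score >= score_threshold:
        return "sme_review"
    return "accept"

print(route_submission(Submission("c-123", detector_score=0.92,
                                  prior_flag_rate=0.4, prior_submissions=20)))
```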

Conclusion

Appen's quest to enhance the quality of human-generated data for AI model training has seen numerous milestones and innovative methods. This project faced two key challenges: upholding contributors' confidence by maintaining a low False Positive Rate (FPR) and creating a reliable solution to protect crowdsourced tasks from AI-generated text infiltration.

Given the high FPRs of existing AI detectors that focus mainly on contextual cues, we shifted towards a behavioral approach. By studying behavioral patterns during text creation, our Data Science team devised a more robust method to differentiate human and AI-generated content. This strategy dovetails beautifully with Appen's deep expertise in human-in-the-loop systems, enabling us to draw on our invaluable experience in observing and analyzing human interactions across various tasks.

As a result, we've developed a unique model that excels in assuring high-quality human data. Significantly surpassing competitor APIs in accuracy, our model boasts the highest True Positive Rate (TPR) while keeping the FPR the lowest. This advanced capability uniquely positions Appen to offer superior data quality assurance, ensuring that customers' AI models are trained exclusively on authentic, human-generated data.

In our drive for a low FPR, we've nurtured the trust of our crowd workers while substantially raising TPR, improving our AI-generated text detection efficiency. This pioneering approach epitomizes Appen's commitment to excellence and our sustained mission to uphold data integrity and quality essential for AI model training.

Other articles in this series:

Navigating the AI Detection Landscape

Deciphering AI from Human Generated Text: The Behavioral Approach
