Adversarial Prompting: Benchmarking Safety in Large Language Models

April 23, 2025

Get your copy today

Benchmarking vulnerabilities in leading LLMs

Large language models (LLMs) are becoming smarter, faster, and more deeply embedded in real-world workflows—from virtual assistants to enterprise platforms. But as their capabilities grow, so do the risks. Adversarial prompting has emerged as a critical challenge in AI safety, revealing how even the most advanced models can be manipulated into producing harmful, biased, or restricted outputs.

This original research from Appen introduces a novel evaluation dataset and benchmarks leading open- and closed-source models across a range of harm categories. The results show how attackers exploit model weaknesses using techniques like virtualisation, sidestepping, and prompt injection, and highlight substantial safety performance gaps—even in models with state-of-the-art scale and compute.

What is adversarial prompting?

Adversarial prompting is the practice of crafting inputs that bypass LLM safety mechanisms, triggering unsafe or policy-violating outputs. These inputs often rely on linguistic subtlety rather than overt rule-breaking, making them difficult to detect with standard moderation tools.

Key techniques include:

Virtualisation – Framing harmful content within fictional or hypothetical scenarios
Sidestepping – Using vague or indirect language to circumvent keyword-based filters
Prompt Injection – Overriding model instructions with embedded commands
Persuasion and Persistence – Leveraging roleplay, appeals to logic or authority, and repeated rewording to wear down refusal behaviour

Understanding these techniques is critical for assessing model robustness and developing safe, trustworthy AI systems.

Why does this research matter?

This study offers a comprehensive benchmark of LLM safety performance under adversarial pressure, exposing meaningful differences between models. The findings show that:

Safety outcomes varies significantly across models, even under identical testing conditions
Prompting techniques and identity-related content can dramatically influence model outputs
Deployment-time factors—like system prompts and moderation layers—play a crucial role in safety

Download the research paper

As LLMs are increasingly deployed in high-stakes environments, understanding their vulnerabilities is critical to responsible AI development. This paper delivers actionable insights into the effectiveness of current safety interventions and proposes strategies to mitigate emerging threats.

In this paper, you’ll learn:

How adversarial prompts reveal vulnerabilities in LLMs
What techniques (e.g., virtualization, sidestepping) were most effective at eliciting harm
How identity-related prompts impacted safety outcomes
Why safety-aligned LLM training data is essential to building robust LLMs
What organisations can do to improve LLM safety in practice

Optimise for resilience, not just intelligence

Appen’s human-in-the-loop LLM red teaming approach helps leading AI developers stress-test their models against sophisticated attack strategies. By integrating ethical evaluation, adversarial testing, and real-time human judgment, we support clients in developing AI systems that are not only powerful—but also aligned, resilient, and ready for real-world deployment.

Download research paper