Adversarial Prompting: Benchmarking Safety in Large Language Models

Benchmarking vulnerabilities in leading LLMs
Large language models (LLMs) are becoming smarter, faster, and more deeply embedded in real-world workflows—from virtual assistants to enterprise platforms. But as their capabilities grow, so do the risks. Adversarial prompting has emerged as a critical challenge in AI safety, revealing how even the most advanced models can be manipulated into producing harmful, biased, or restricted outputs.
This original research from Appen introduces a novel evaluation dataset and benchmarks leading open- and closed-source models across a range of harm categories. The results show how attackers exploit model weaknesses using techniques like virtualisation, sidestepping, and prompt injection, and highlight substantial safety performance gaps—even in models with state-of-the-art scale and compute.
What is adversarial prompting?
Adversarial prompting is the practice of crafting inputs that bypass LLM safety mechanisms, triggering unsafe or policy-violating outputs. These inputs often rely on linguistic subtlety rather than overt rule-breaking, making them difficult to detect with standard moderation tools.
Key techniques include:
- Virtualisation – Framing harmful content within fictional or hypothetical scenarios
- Sidestepping – Using vague or indirect language to circumvent keyword-based filters
- Prompt Injection – Overriding model instructions with embedded commands
- Persuasion and Persistence – Leveraging roleplay, appeals to logic or authority, and repeated rewording to wear down refusal behaviour
Understanding these techniques is critical for assessing model robustness and developing safe, trustworthy AI systems.
Why does this research matter?
This study offers a comprehensive benchmark of LLM safety performance under adversarial pressure, exposing meaningful differences between models. The findings show that:
- Safety outcomes varies significantly across models, even under identical testing conditions
- Prompting techniques and identity-related content can dramatically influence model outputs
- Deployment-time factors—like system prompts and moderation layers—play a crucial role in safety
Download the research paper
As LLMs are increasingly deployed in high-stakes environments, understanding their vulnerabilities is critical to responsible AI development. This paper delivers actionable insights into the effectiveness of current safety interventions and proposes strategies to mitigate emerging threats.
In this paper, you’ll learn:
- How adversarial prompts reveal vulnerabilities in LLMs
- What techniques (e.g., virtualization, sidestepping) were most effective at eliciting harm
- How identity-related prompts impacted safety outcomes
- Why safety-aligned LLM training data is essential to building robust LLMs
- What organisations can do to improve LLM safety in practice
Optimise for resilience, not just intelligence
Appen’s human-in-the-loop LLM red teaming approach helps leading AI developers stress-test their models against sophisticated attack strategies. By integrating ethical evaluation, adversarial testing, and real-time human judgment, we support clients in developing AI systems that are not only powerful—but also aligned, resilient, and ready for real-world deployment.