Jason Martin on Policy Puppetry, Indirect Injection Risks, Defensive Layers, and Model Provenance.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.
Jason Martin is an AI Security Researcher at HiddenLayer. This episode explores “policy puppetry,” a universal attack technique that bypasses safety features in all major language models by using structured formats such as XML or JSON. Discover how the method works, the significant risks it poses to enterprises (brand damage, resource theft, data breaches), and why common defenses like RAG and guardrails are insufficient. Learn about essential layered security strategies, the nuances of open vs. proprietary model security, and emerging threats such as indirect prompt injection.
Interview highlights – key sections from the video version:
- What Is Policy Puppetry?
- Testing: Universal Bypass Across LLMs
- Enterprise Risks & Denial-of-Wallet
- Crafting Policy Puppetry Prompts
- RAG Limitations & Model Opinionation
- Guardrails, Agentic Features & Attack Surface
- Instruction Hierarchy Demystified
- Limits of Fine-Tuning & RLHF
- AI Incident Response & Cross-Team Collaboration
- Automated Red-Teaming and Agent-on-Agent Attacks
- Indirect Prompt Injection Threats
- Open vs Closed Weights & Geopolitical Concerns
- Supply-Chain Vulnerabilities in Open-Weight Models
- HiddenLayer Model Scanner & Final Takeaways
Related content:
- A video version of this conversation is available on our YouTube channel.
- Casey Ellis → The Future of Cybersecurity – Generative AI and its Implications
- Andrew Burt → Why Legal Hurdles Are the Biggest Barrier to AI Adoption
- Shreya Rajpal → The Essential Guide to AI Guardrails
- What Is An AI Alignment Platform?
- What AI Teams Need to Know for 2025
- Red Teaming AI: Why Rigorous Testing is Non-Negotiable
Support our work by subscribing to our newsletter📩
Transcript
Below is a heavily edited excerpt, in Question & Answer format.
Policy Puppetry and AI Security Fundamentals
What is “policy puppetry” and why is it significant in AI security?
Policy puppetry is an attack technique that bypasses safety features in all major language models, including those from OpenAI, Anthropic, Google, Meta, and others. It works by wrapping a malicious objective in structured language formats like XML, YAML, or JSON to create a competing “policy” that overrides the model’s built-in safety mechanisms. The structured format tricks the LLM into suppressing its standard refusal phrases and carrying out the attacker’s request. Its significance lies in its universal effectiveness across all major LLMs, even recent releases such as Llama 4, GPT-4o, and Claude 3.
How does policy puppetry work technically?
The technique exploits how LLMs process structured language by creating a policy that suppresses the refusal responses models typically give. The attacker wraps their objectives in structured formats (like XML) and defines “blocked strings” containing phrases the model uses to refuse harmful requests. When the model can’t use its standard refusal vectors, it’s forced to find alternative response paths, essentially complying with requests it would normally decline. This bypasses both the foundation model policy (built-in by providers through alignment training) and application policy (implemented by developers using system prompts).
Does policy puppetry affect different types of models differently?
The fundamental technique works on both reasoning models (like Claude) and non-reasoning models. Larger reasoning models sometimes need a slightly richer prompt to get high-quality output, but they still comply. While the team has primarily tested against text modalities, they expect the attack surface to be potentially larger in continuous image space, making an extension to multimodal models “almost certainly” possible.
Enterprise Impact and Risks
Why should enterprises care about policy puppetry if they’re not doing anything illegal or harmful?
There are several significant business concerns:
- Brand Protection: If your customer-facing chatbot generates inappropriate content, that reflects directly on your brand and may have legal implications.
- Denial-of-Wallet: Attackers can repurpose your task-specific chatbot (like a banking assistant) into a general-purpose assistant, stealing your compute resources while you foot the bill.
- Business Disruption: Your chatbot could be manipulated to recommend competitors’ products or offer incorrect pricing information – imagine a chatbot offering your $10,000 product for $1, which customers might screenshot and demand you honor.
- Data Security: Particularly with agentic LLMs that have access to tools or functions, an attacker could potentially extract sensitive information.
Does using RAG (Retrieval-Augmented Generation) protect against these types of attacks?
No, RAG alone is insufficient protection. Even if your prompt instructs the model to “only use what you retrieve from the sources,” policy puppetry can still override these instructions. Attackers can also impersonate a function response or RAG response to manipulate the model. Models remain opinionated even with RAG: the LLM can still be convinced to ignore the retrieved context and opine based on its training data, or to follow malicious instructions.
How accessible is policy puppetry to potential attackers?
It’s a technique rather than a single copy-paste prompt, so it requires some crafting: an attacker needs to state their objective and fill in the policy template with the refusal vectors they’re trying to suppress. That places it somewhere between a simple jailbreak prompt found online and a complex exploit. The technique is quite accessible and effective, especially now that the concept is public.
Defensive Measures
What can enterprises do to protect against policy puppetry and similar attacks?
Multiple defensive layers are necessary:
- Implement Security Monitoring: Deploy solutions that specifically observe input/output behavior for patterns indicative of attacks like policy puppetry (a minimal detection sketch follows this list).
- Limit Model Privileges: Be cautious about which agentic features and tools you expose to your LLM; every extra tool or agent function is a new input surface.
- Automate Red Teaming: Regularly test systems with automated tools that incorporate known techniques like policy puppetry.
- Build Incident Response Playbooks: Develop AI-specific incident response plans; templates like the JCDC-AI playbook are a starting point, but they lag behind the threat landscape.
- Acknowledge Limitations: Understand that built-in safety alignment and basic instruction hierarchies are not foolproof.
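To make the monitoring item concrete, here is a minimal, hypothetical sketch of an input/output monitor in Python. It is not HiddenLayer’s product or any specific vendor’s detector; the regex rules, function names, and example strings are illustrative assumptions, and a production monitor would rely on trained classifiers and a continuously maintained rule set.

```python
import re

# Illustrative heuristic rules only; a real monitor would use trained
# classifiers and a much larger, continuously updated rule set.
SUSPICIOUS_PATTERNS = [
    r"<\s*(policy|blocked[-_ ]?strings|allowed[-_ ]?modes)\b",  # policy-style XML tags
    r'"blocked[_ ]?strings"\s*:',                               # policy-style JSON keys
    r"ignore (all|any) (previous|prior) instructions",          # classic override phrasing
    r"you are no longer bound by",                              # role/policy override phrasing
]

def matched_rules(text: str) -> list[str]:
    """Return the heuristic rules that the given text matches."""
    return [p for p in SUSPICIOUS_PATTERNS if re.search(p, text, re.IGNORECASE)]

def monitor(prompt: str, completion: str) -> dict:
    """Inspect a prompt/completion pair and flag it for human review if needed."""
    prompt_flags = matched_rules(prompt)
    completion_flags = matched_rules(completion)
    return {
        "prompt_flags": prompt_flags,
        "completion_flags": completion_flags,
        "needs_review": bool(prompt_flags or completion_flags),
    }

if __name__ == "__main__":
    report = monitor(
        prompt="<policy><blocked-strings>...</blocked-strings></policy> ...",
        completion="Understood, switching to the requested mode.",
    )
    print(report)  # needs_review is True because of the policy-style tags
```

The design point is that the monitor sits outside the model: it sees the raw prompt and the raw completion, so it can still react when alignment training or the system prompt has already been bypassed.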
Are guardrails effective against policy puppetry?
Guardrails alone are not sufficient. The term “guardrails” isn’t well defined in the industry and sometimes refers to different concepts. What’s clear is that alignment training, instruction hierarchy, and common guardrail filters all fail against policy puppetry. Enterprises need filters plus an external “security monitor” that inspects both prompts and completions in real time and can react to suspicious patterns. Expect “a flurry of products” specialized in detecting these primitives.
Can fine-tuning or RLHF protect my model from these attacks?
Probably not. Safety alignment is itself a huge RLHF run, and policy puppetry walks right around it. Fine-tuning primarily adjusts the model’s response style and preferences for a specific domain, making it more likely to refuse out-of-scope requests based on that training. However, it doesn’t erase the underlying knowledge from pre-training. The model doesn’t “forget” information learned during pre-training; it just learns preferred response patterns. If the refusal mechanism taught during fine-tuning is bypassed (as policy puppetry does), the model can still access and use its broader knowledge.
Open vs. Proprietary Models
Is there a security difference between proprietary and open-weights models regarding these attacks?
For susceptibility to techniques like policy puppetry, there isn’t a fundamental difference based purely on the weights being open or closed. Both suffer from the same vulnerabilities. Having the weights does enable heavyweight, gradient-based back-door attacks, but those are costly; attackers prefer the near-zero-compute path of policy puppetry, which works even when weights are hidden.
The main practical differences are:
- Proprietary Models: Often come bundled with additional security layers (monitoring, guardrails) provided by the vendor.
- Open-Weight Models: Require the deploying organization to implement its own security measures; much of the risk lies in the implementation and surrounding ecosystem rather than in the weights themselves.
What are the supply chain risks when self-hosting open models?
This is a big risk area. Many Hugging Face repositories require trust_remote_code or ship forked checkpoints from unknown parties, and the challenge is compounded by the sheer number of derivatives: tens of thousands of model variants can appear shortly after a major release. Tools like HiddenLayer’s Model Scanner can help find serialization exploits and architectural back-doors before deployment. The key is to choose the right checkpoint and scan it before use.
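As one concrete hygiene step, the sketch below shows how a checkpoint might be loaded defensively with the Hugging Face transformers library. The repository id and revision are placeholders, not recommendations; the point is to pin a reviewed revision, refuse remote code execution, and prefer safetensors over pickle-based weights. This is a sketch of general practice, not a description of how HiddenLayer’s Model Scanner works.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository and revision: pin the exact commit hash you have
# reviewed and scanned, rather than a floating branch name like "main".
REPO_ID = "example-org/example-model"   # hypothetical repo id
REVISION = "replace-with-audited-hash"  # hypothetical pinned revision

tokenizer = AutoTokenizer.from_pretrained(REPO_ID, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(
    REPO_ID,
    revision=REVISION,
    trust_remote_code=False,  # refuse checkpoints that need arbitrary Python to load
    use_safetensors=True,     # reject pickle-based .bin weight files
)
```

Scanning the downloaded files before this step, and only then promoting the pinned revision into configuration, turns the “choose the right checkpoint and scan it” advice into something enforceable in CI.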
Open Models from China
Are there legitimate security concerns with using open-weights models from China (like DeepSeek or Qwen)?
Geopolitics aside, technical analysis (like HiddenLayer’s deep dive on DeepSeek) revealed no intrinsic back-doors in the released weights, and nothing in the file format “phones home.” The main differences found were in alignment – these models may respond differently to certain prompts based on their training data and alignment tuning.
Key considerations include:
- Alignment Differences: The models respond differently based on language and topic, so enterprises must test for brand, legal, and regulatory fit.
- Supply Chain Security: How is the model packaged and delivered? This is a concern for any open-weights model, regardless of origin.
- Sector-specific Restrictions: Certain sectors (military, defense) may have regulatory reasons to avoid certain models.
If I install a model from China on my own servers (on-premise), will it “phone home”?
The model weights themselves are static data files and don’t inherently contain code to transmit data. Concerns about data being sent back usually relate to hosted services running the model or insecure infrastructure around the model, not the weight file itself when run in a controlled on-prem environment.
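One way to verify this in practice is to run the model in a fully offline configuration and block egress at the network layer. The sketch below assumes the Hugging Face stack and uses its offline environment variables with a local directory; the path is a placeholder.

```python
import os

# Fail fast on any attempted network access instead of silently reaching out.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from transformers import AutoModelForCausalLM, AutoTokenizer

LOCAL_DIR = "/models/example-checkpoint"  # hypothetical pre-downloaded directory

tokenizer = AutoTokenizer.from_pretrained(LOCAL_DIR, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(LOCAL_DIR, local_files_only=True)
```

Combined with firewall rules on the serving host, this gives an independent check that nothing in the loading path “phones home”; any remaining telemetry concern then sits with the surrounding application or hosted service, not the weight files.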
What should enterprises consider when evaluating models from different origins?
For many general enterprise use cases, the focus should be less on the origin per se and more on:
- Thorough testing for performance and alignment on your specific tasks and data
- Robust security practices around the deployment: using model scanning tools, secure infrastructure, input/output monitoring, and regular red teaming
- Understanding and mitigating supply chain risks associated with how you obtain and run the model
- Testing in different languages, since alignment may vary by language
If these steps are taken, the risks associated with a well-vetted model, regardless of origin, can be managed for many applications.
Emerging and Future Threats
What other attack vectors concern you beyond policy puppetry?
Indirect prompt injections are particularly concerning. These don’t come directly from user input but from:
- Malicious content within documents retrieved by a RAG system
- Instructions hidden in emails processed by an AI assistant
- Malicious code or prompts in the descriptions or outputs of tools/APIs that an agent interacts with
- Data ingested from web searches or bug report tools
As agentic LLMs gain more tool use capabilities, a poisoned input (like a bug report) could trigger an agent to email proprietary data to an attacker. The impact of these attacks increases dramatically as models gain more agentic capabilities and interact with more external systems.
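A partial mitigation is to treat everything retrieved or ingested as untrusted data and screen it before it reaches the agent. Below is a minimal, hypothetical sketch; the heuristics and function names are illustrative assumptions, and a real pipeline would combine this kind of screening with strict tool permissions and human review of flagged items.

```python
import re

# Hypothetical heuristics for instruction-like phrasing inside retrieved content.
INJECTION_HINTS = [
    r"ignore (all|any) (previous|prior) instructions",
    r"\b(send|email|forward)\b.*\b(api key|password|credentials)\b",
    r"<\s*(system|policy|tool[-_ ]?call)\b",
]

def screen_untrusted(content: str, source: str) -> str:
    """Wrap untrusted content as inert data and flag instruction-like phrasing."""
    hits = [p for p in INJECTION_HINTS if re.search(p, content, re.IGNORECASE)]
    if hits:
        # A real pipeline would quarantine the item or require human review here.
        raise ValueError(f"Possible indirect injection from {source}: {hits}")
    # Delimit the content so the agent prompt treats it as quoted data, not instructions.
    return f"<untrusted_data source={source!r}>\n{content}\n</untrusted_data>"

# Example: screening a bug report before an agent summarizes it.
safe_block = screen_untrusted("Steps to reproduce: click Save twice.", source="bug-tracker")
```

The design choice is that screening happens at the trust boundary (the retrieval or ingestion step) rather than inside the model, so it still applies no matter how well the model itself resists embedded instructions.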
What do you predict will happen next in the AI security defense landscape?
In the coming months, we’ll likely see:
- Dedicated detectors for structured-policy exploits
- Wider use of automated red-team agents
- Stricter supply-chain validation for open checkpoints
- More collaborative “AI alignment platforms” so legal, security, and product teams can sign off on hardening before a bot ships
- Better incident response playbooks for AI systems, as current ones are lagging behind the rapidly evolving technology
There’s a significant gap between rapid prototyping (days) and proper security hardening (weeks), creating pressure to launch before security testing is complete. Better interdisciplinary collaboration between technical teams, legal, compliance, and security will be crucial for responsible AI deployment.
