Improving instruction hierarchy in frontier LLMs

Executive Summary

The instruction hierarchy (IH) problem has emerged as one of the most consequential challenges in deploying large language models (LLMs) safely at scale. As of early 2026, research from OpenAI, Princeton University, Zoom, and independent security firms has converged on a critical finding: modern LLM architectures treat all input tokens with equal weight, creating systemic vulnerabilities that allow malicious actors to override safety guardrails, extract system prompts, and hijack agentic systems. This report examines the state of instruction hierarchy research, its practical implications for AI tool users and developers, and how the landscape of frontier model pricing intersects with security considerations when choosing an LLM for production deployment.

The Core Problem: Flat Instruction Processing

At the architectural level, most frontier LLMs process system messages, user prompts, and external data as a flat sequence of tokens. There is no native mechanism to differentiate a high-trust system instruction from a low-trust user input or untrusted external data. This design flaw is not incidental — it is structural, and it has profound consequences for security.

As described in the ICLR 2025 paper on Instructional Segment Embedding (ISE), “modern LLM architectures treat all input tokens equally without formal mechanisms to differentiate instructions,” meaning that “lower-priority user prompts may override more critical system instructions, including safety protocols” (ICLR 2025 - ISE Paper).

This creates three primary attack surfaces:

Prompt injection: Malicious instructions embedded in external data sources subvert the original system instructions.
Prompt extraction: Attackers elicit the model to reveal its proprietary system prompt.
Jailbreaking: Users craft inputs that cause the model to produce harmful content in violation of its safety policy.

The instruction hierarchy framework, introduced by Wallace et al. (2024) and further developed by OpenAI’s IH-Challenge dataset, proposes a trust-ordered policy for resolving these conflicts. The hierarchy ranks instructions by source: system/developer instructions carry the highest trust, followed by user instructions, followed by tool outputs and external data (OpenAI IH-Challenge Paper).

Related: How to Use AI Without Getting Fired: A Professional’s Guide (2026)

The Policy Puppetry Attack: A Wake-Up Call

In April 2025, HiddenLayer published research demonstrating a novel bypass technique called “Policy Puppetry” that works against virtually all major frontier models — including those from OpenAI, Google, Microsoft, Anthropic, Meta, DeepSeek, Qwen, and Mistral. The attack combines an internally developed policy technique with roleplay to bypass model alignment and produce outputs that violate AI safety policies across categories including CBRN (Chemical, Biological, Radiological, and Nuclear), mass violence, self-harm, and system prompt leakage (HiddenLayer Research).

Related: Google Gemini 3.1 Pro: Stronger Reasoning, Lower API Pricing Pressure, and What Changed

What makes this attack particularly alarming is its transferability. A single prompt can be adapted to work across different model architectures, inference strategies (including chain-of-thought and reasoning modes), and alignment approaches. HiddenLayer describes it as “the first post-instruction hierarchy alignment bypass that works against almost all frontier AI models,” highlighting that RLHF (Reinforcement Learning from Human Feedback) alone is insufficient to guarantee robust safety behavior.

The attack exploits a systemic weakness in how LLMs are trained on instruction or policy-related data. Because the vulnerability is rooted in training data and methodology rather than a patchable software bug, it is described as “difficult to patch” — a sobering assessment for organizations deploying LLMs in sensitive environments.

Research Responses: Toward Robust Instruction Hierarchy

Instructional Segment Embedding (ISE)

The most architecturally ambitious response to the IH problem comes from researchers at Princeton University and Zoom Video Communications. Their ISE technique, published at ICLR 2025, embeds instruction priority information directly into the model at the architectural level — inspired by BERT’s segment embeddings. Rather than relying on prompt engineering or fine-tuning alone, ISE gives the model a formal mechanism to distinguish between system instructions, user prompts, and data inputs.

Experimental results are promising: ISE achieves an average robust accuracy increase of up to 15.75% on the Structured Query benchmark and 18.68% on the Instruction Hierarchy benchmark, while also improving instruction-following capability by up to 4.1% on AlpacaEval (ICLR 2025 - ISE Paper). This dual improvement — better safety and better instruction following — is significant because it counters the common assumption that safety and capability are in tension.

IH-Challenge: OpenAI’s Training Dataset Approach

OpenAI’s IH-Challenge paper introduces a reinforcement learning training dataset specifically designed to improve instruction hierarchy robustness on frontier LLMs. The paper acknowledges that “robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as over-refusing” (OpenAI IH-Challenge Paper).

Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial training is reported to improve robustness against jailbreaks, system prompt extraction, and agentic prompt injections. This approach is complementary to ISE — where ISE modifies the architecture, IH-Challenge improves the training data and process.

Soft Instruction De-escalation (SIC)

For agentic systems that interact with external environments, the SIC (Soft Instruction Control) pipeline proposed at ICLR 2026 offers a multi-stage sanitization approach. SIC unconditionally rewrites incoming data to neutralize potential instructions, injects canary instructions to detect if the rewriter itself has been compromised, applies multiple independent rewrite passes, and uses a detection module to inspect output for residual instruction-like content. If imperative instructions remain, the agent halts (OpenReview - SIC).

Related: Can AI Really Write SEO Content? 5 Tools Tested on 50 Articles

This defense-in-depth strategy is particularly relevant for RAG (Retrieval-Augmented Generation) pipelines and tool-augmented agents, where untrusted external data is a constant threat vector.

AlignSentinel: Attention-Map-Based Detection

Published in February 2026, AlignSentinel introduces a three-class classifier that uses features derived from LLM attention maps to distinguish between misaligned instructions (attacks), aligned instructions (benign), and non-instruction inputs. Existing detection defenses tend to flag any input containing instructions as malicious, leading to high false-positive rates. AlignSentinel’s alignment-aware approach reduces this problem by accounting for the instruction hierarchy context (AlignSentinel - arXiv).

Practical Implications for AI Tool Users and Developers

For Developers Building on LLM APIs

The instruction hierarchy problem has direct, actionable implications for anyone building applications on top of frontier LLM APIs:

Do not rely solely on system prompt instructions for security. Policy Puppetry demonstrates that system prompts can be bypassed or leaked. Defense must be layered.
Treat external data as untrusted. Any data retrieved from the web, databases, or user-uploaded files should be treated as a potential prompt injection vector. Consider sanitization pipelines like SIC before passing external content to the model.
Monitor for anomalous outputs. HiddenLayer’s AISec Platform demonstrates that real-time monitoring can detect Policy Puppetry attacks. Organizations deploying LLMs in sensitive environments should invest in similar detection infrastructure.
Prefer models with stronger IH training. As OpenAI’s IH-Challenge work matures and gets incorporated into production models, models fine-tuned on adversarial IH datasets will offer meaningfully better security posture.
Avoid over-relying on RLHF alignment as a security guarantee. The Policy Puppetry research makes clear that RLHF-aligned models are not immune to bypass. Alignment is a necessary but insufficient condition for security.

For Enterprise Deployments

The systematic literature review on LLM defenses (published January 2026, covering 88 studies) provides a structured taxonothe of mitigation strategies building on NIST’s adversarial machine learning framework. Key takeaways for enterprise deployments include the importance of model-agnostic, open-source defense tools, and the need for proactive red-teaming rather than reactive patching (arXiv - Systematic Literature Review).

Pricing Landscape: Balancing Security and Cost

Security considerations do not exist in a vacuum — they must be weighed against cost and capability. As of March 2026, the LLM API pricing landscape has shifted dramatically, with prices falling 80% year-over-year for some flagship models.

Current Pricing Overview (March 2026)

Provider	Model	Input (per 1M tokens)	Output (per 1M tokens)	Notes
Google	Gemini 2.0 Flash-Lite	$0.075	$0.30	Cheapest mainstream option
Google	Gemini 2.5 Flash	$0.30	$2.50	Fast mid-tier
Google	Gemini 2.5 Pro (≤200K)	$1.25	$10.00	Long documents, analysis
Google	Gemini 3.1 Pro (preview)	$2.00	$12.00	Next-gen flagship
OpenAI	Budget paid tier	$0.15	$0.60	High-throughput, budget
OpenAI	General paid tier	$2.50	$10.00	Multimodal workhorse
OpenAI	Higher-capability paid tier	$2.00	$8.00	High-quality general reasoning
OpenAI	GPT-5	$1.25	$10.00	New flagship
OpenAI	GPT-5.2 Pro	$21.00	$168.00	Premium, highest capability
Anthropic	Claude Haiku 4.5	$1.00	$5.00	Fast, budget-friendly
Anthropic	Mid-tier Claude tier	$3.00	$15.00	General-purpose
Anthropic	Paid Claude flagship tier	$5.00	$25.00	Flagship tier
xAI	Grok	$0.20	$0.50	Cost leader
DeepSeek	V3.2	$0.28	$0.42	Best value, unified chat+reasoning
Mistral	Nemo	$0.02	—	Ultra-cheap open alternative

(TLDL - LLM API Pricing March 2026) (IntuitionLabs - AI API Pricing 2026) (CloudIDR - LLM Pricing 2026)

Security vs. Cost Trade-offs

The pricing data reveals a competitive landscape where cost alone should not drive model selection for security-sensitive applications. Several considerations stand out:

Grok (xAI) leads on cost at $0.20/$0.50 per 1M tokens, but its security posture and IH robustness have not been as extensively documented in peer-reviewed research as OpenAI or Anthropic models.
DeepSeek V3.2 offers strong value at $0.28/$0.42, but as a Chinese-developed model, it may face additional scrutiny in regulated industries. HiddenLayer’s Policy Puppetry research confirmed it is vulnerable to the bypass technique.
Anthropic’s paid Claude flagship tier is materially more accessible than older flagship pricing. Anthropic’s strong safety posture and Constitutional AI approach make it a reasonable choice for high-stakes deployments.
GPT-5 and GPT-5.2 represent OpenAI’s most IH-hardened models, given the IH-Challenge fine-tuning work. For organizations where security is paramount, the premium may be justified.
Google Gemini offers a generous free tier on most models, making it attractive for prototyping and low-stakes applications, but organizations should validate IH robustness before production deployment.

The practical recommendation from the DEV Community’s 2026 LLM comparison guide is to run 80–95% of calls on a cheaper model tier and escalate only hard cases to premium models (DEV Community - Choosing an LLM in 2026). This tiered approach can be extended to security: route sensitive operations through more IH-robust models while using cheaper models for low-stakes tasks.

Comparative Assessment: Which Approach Works Best?

Based on the available evidence, no single solution to the instruction hierarchy problem is sufficient on its own. The most robust posture combines multiple layers:

Approach	Strength	Limitation	Best For
ISE (Architectural)	Addresses root cause; improves both safety and instruction-following	Requires model retraining; not available in off-the-shelf APIs	Model developers, research
IH-Challenge Fine-tuning	Improves robustness without architectural changes	Models can still learn shortcuts; adversarial arms race	API providers (OpenAI)
SIC Pipeline	Effective for agentic/RAG systems; defense-in-depth	Adds latency; imperfect rewriting	Tool-augmented agents
AlignSentinel Detection	Reduces false positives vs. naive detection	Requires attention map access; model-specific	Monitoring layers
Real-time Monitoring (HiddenLayer)	Catches attacks in production	Reactive rather than preventive	Enterprise security teams

The ISE approach is the most theoretically sound because it addresses the problem at the architectural level rather than patching around it. However, it requires model retraining and is not yet available in production APIs. For practitioners today, the most actionable combination is: IH-aware model selection (preferring models with documented IH training) + SIC-style input sanitization for agentic pipelines + real-time monitoring for production systems.

Opinion and Conclusion

The instruction hierarchy problem is not a niche research concern — it is a production security issue affecting every organization deploying LLMs today. The Policy Puppetry attack’s cross-model effectiveness is a clear signal that the industry has been over-relying on RLHF alignment as a security guarantee. The research community has responded with promising architectural (ISE), training (IH-Challenge), and runtime (SIC, AlignSentinel) solutions, but none are yet universally deployed.

For AI tool users and developers, the practical takeaway is clear: security must be designed in, not bolted on. The falling cost of frontier models — with Gemini Flash-Lite at $0.075/M tokens and DeepSeek at $0.28/M — removes cost as an excuse for cutting corners on security architecture. Organizations should invest in layered defenses, prefer models with documented IH robustness, and treat external data as adversarial by default.

The most significant near-term development to watch is whether OpenAI’s IH-Challenge fine-tuning translates into measurable robustness improvements in GPT-5 and successor models, and whether other providers follow suit. Until architectural solutions like ISE become standard in production models, the security gap between what LLMs promise and what they deliver will remain a meaningful risk.