Latent-space Attacks for Refusal Evasion in Language Models

We recast refusal suppression as a latent-space evasion attack against linear probes. This page builds the intuition interactively: drag a margin to push a model's hidden representation past the probe's boundary and watch its answer change, then trace the effect through generation and walk through the model's latent space.

Intuition

What does it mean to evade refusal?

A linear probe separates the representations of prompts the model refuses from those it answers, with a decision boundary running between them. We recast refusal suppression as an evasion attack against this probe: to make the model comply, an attacker must move a harmful prompt's representation across the boundary, from the refusal region into the compliant one. This recasting has a sharp consequence: it shows that prior ablation methods do exactly one thing, namely project the representation onto the boundary, the smallest move that flips the probe's decision. In our terms, prior work is minimum-confidence evasion: it reaches the boundary and stops. Controlled Latent-space Evasion (CLE) goes further: it pushes the representation a chosen distance past the boundary (the margin), placing it deep in the region where the model reliably complies. A well-chosen margin keeps representations firmly on the compliant side and yields stronger refusal suppression.
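The difference between stopping at the boundary and pushing past it can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a probe of the form w·h + b with positive scores on the refusal side, and the names `signed_distance` and `evade`, as well as all numeric values, are hypothetical.

```python
import numpy as np

def signed_distance(h, w, b):
    """Signed distance of h from the hyperplane w.h + b = 0.
    Assumed convention: positive means the refusal side."""
    return (h @ w + b) / np.linalg.norm(w)

def evade(h, w, b, margin=0.0):
    """Move h along -w so that it sits `margin` past the boundary
    on the compliant side. margin=0 is projection onto the boundary."""
    w_unit = w / np.linalg.norm(w)
    return h - (signed_distance(h, w, b) + margin) * w_unit

# Toy probe and a representation on the refusal side (made-up values).
w = np.array([3.0, 4.0])
b = -5.0
h = np.array([3.0, 4.0])              # signed distance (9 + 16 - 5) / 5 = 4

h_ablate = evade(h, w, b, margin=0.0)  # minimum-confidence evasion
h_cle = evade(h, w, b, margin=2.0)     # evasion with a chosen margin

print(signed_distance(h_ablate, w, b))  # approximately 0: on the boundary
print(signed_distance(h_cle, w, b))     # approximately -2: inside the compliant region
```

With `margin=0.0` the sketch reduces to the boundary projection attributed to prior ablation methods; any positive margin places the representation strictly inside the compliant region.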

Interactive evasion

Controlling refusal evasion in latent space

Drag the margin to push representations across the boundary, and watch how the model's completion changes.

[Interactive figure: the probe boundary separates the refusal region from the compliance region; ablation stops at the boundary. Panels: Selected Prompt, Completion At Selected Margin.]

Mechanism

From erasing refusal to evading a probe

Prior work projects every generated token onto the boundary because its goal is erasure: keeping refusal out of the residual stream entirely, at every position. CLE suggests an immediate fix: keep projecting every token, but project it well past the boundary deep into the compliant region — this is CLE-P. But our recasting changes the goal itself: suppressing refusal is not erasure, it is evading a probe. This raises a possibility prior work had no reason to consider — if all that matters is crossing the boundary, perhaps we need to cross it only once. We compute a single perturbation at the post-instruction token and reuse it for every later token, never reprojecting. This is CLE-A. Whether that one shift is enough to keep the whole generation across the boundary is exactly what the next section tests.

Generation Dynamics

One shift is enough

When we track each method's compliance confidence across the first generated tokens, both CLE variants stay well above the probe threshold at every tracked token, and the single fixed shift of CLE-A holds just as firmly as the per-token reprojection of CLE-P. The difference-in-means (DiM) ablation of prior work, by contrast, hovers at the boundary, drifting in and out of the refusal region. Crossing once is enough.
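One way to picture this measurement is as a per-token probability under the probe. The sketch below assumes a logistic probe with threshold 0.5 and the convention that w·h + b > 0 means refusal; the function names and the toy hidden states are hypothetical and are not results from the paper.

```python
import numpy as np

def compliance_confidence(H, w, b):
    """Per-token probability of the compliant class.
    H has shape (T, d): one hidden state per generated token."""
    logits = -(H @ w + b)            # flip sign: refusal scores are positive
    return 1.0 / (1.0 + np.exp(-logits))

def fraction_above(conf, threshold=0.5):
    """Share of tokens whose confidence exceeds the probe threshold."""
    return float(np.mean(conf > threshold))

w = np.array([1.0, 0.0])
b = 0.0

# Toy trajectories: one sits deep in the compliant region, one hugs the boundary.
H_deep = np.array([[-3.0, 0.2], [-2.5, -0.1], [-3.2, 0.4]])
H_edge = np.array([[-0.1, 0.2], [0.2, -0.1], [-0.05, 0.4]])

conf_deep = compliance_confidence(H_deep, w, b)
conf_edge = compliance_confidence(H_edge, w, b)
```

In this toy setup every token of `H_deep` clears the threshold, while the second token of `H_edge` falls back onto the refusal side, which is the kind of drift described above for representations left at the boundary.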

CLE-P

Projective Evasion

CLE-P never assumes the activation will stay compliant. At every generated token, it re-checks the representation against the probe and pushes it back past the boundary to the target margin. Where prior ablation-based methods only project onto the boundary — the bare minimum to flip the decision — CLE-P projects to an explicit, optimized confidence, and keeps doing so at every step of generation.
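A minimal sketch of per-token reprojection follows. It assumes a single layer, a probe w·h + b with positive scores on the refusal side, and an unconditional projection of every token state to the target margin; `cle_p` is a hypothetical name, and the real method may differ in how it selects layers and margins.

```python
import numpy as np

def cle_p(H, w, b, margin):
    """Reproject every token state onto the hyperplane that lies
    `margin` past the probe boundary on the compliant side.
    H has shape (T, d): one row per generated token."""
    norm = np.linalg.norm(w)
    dist = (H @ w + b) / norm                  # (T,) signed distances
    return H - np.outer(dist + margin, w / norm)

rng = np.random.default_rng(0)
w = np.array([1.0, 2.0, 2.0])
b = 0.5
H = rng.normal(size=(5, 3))                    # stand-in hidden states

H_new = cle_p(H, w, b, margin=1.5)
dist_new = (H_new @ w + b) / np.linalg.norm(w)
```

After the call, every entry of `dist_new` equals -1.5 up to floating-point error: each token is placed at the same depth, regardless of where it started.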

CLE-A

Additive Evasion

CLE-A computes a latent-space perturbation once at the post-instruction token, then adds it uniformly across all positions and optimized intervention layers for the rest of generation. Nothing is recomputed. If crossing the boundary once truly suffices, this one shared perturbation should sustain compliance on its own.
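The additive variant can be sketched the same way. The code assumes a single layer and a probe w·h + b with positive scores on the refusal side; `cle_a_shift` and the toy states are hypothetical, chosen only to show that the shift is computed once and then reused.

```python
import numpy as np

def cle_a_shift(h_anchor, w, b, margin):
    """Perturbation that moves the anchor state `margin` past the
    boundary. Computed once and never updated afterwards."""
    norm = np.linalg.norm(w)
    dist = (h_anchor @ w + b) / norm
    return -(dist + margin) * w / norm

w = np.array([0.0, 1.0])
b = 0.0

# Stand-in for the state at the post-instruction token (distance 2, refusal side).
h_post_instruction = np.array([1.0, 2.0])
delta = cle_a_shift(h_post_instruction, w, b, margin=3.0)

# The same delta is added to every later token state (NumPy broadcasting).
H_later = np.array([[0.5, 1.0], [2.0, 3.5], [-1.0, 0.5]])
H_shifted = H_later + delta
dists = (H_shifted @ w + b) / np.linalg.norm(w)
```

Unlike per-token reprojection, the shifted tokens end up at different depths (here -4, -1.5 and -4.5), because the shift is not adapted to each state. Whether they all stay on the compliant side depends on the margin being large enough relative to how far later states move.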

[Interactive figure: per-token comparison for the selected prompt. Panels: CLE-P Completion, CLE-A Completion, DiM Completion.]

Layer by Layer

A deep dive into latent space

Each plot shows, at one layer, where prompts fall along the probe's main axis (the first principal component, PC1): refused harmful (red) prompts to the left, answered harmless (blue) prompts to the right. The markers track where our selected prompt lands under each method. Step through the layers: in the early layers the two distributions overlap and the methods sit together — there is little refusal signal to exploit yet, which is why the optimized intervention window skips them. In the informative middle layers they pull apart, and the contrast becomes clear: CLE-A and CLE-P land deep in the compliant region, while DiM stays pinned near the boundary. This holds layer after layer as we step through latent space.
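A per-layer plot of this kind can be approximated by projecting activations onto their first principal component. The sketch below uses synthetic Gaussian clusters as stand-ins for harmful and harmless activations at one layer; `pc1_coordinates` and all data are hypothetical.

```python
import numpy as np

def pc1_coordinates(X):
    """Project the rows of X onto the first principal component of X.
    Returns the 1-D coordinates, the PC1 direction and the data mean."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    pc1 = Vt[0]
    return Xc @ pc1, pc1, mean

rng = np.random.default_rng(0)
# Synthetic stand-ins for one layer's activations (not real model data).
harmful = rng.normal(loc=[-4.0, 0.0, 0.0], scale=0.5, size=(50, 3))
harmless = rng.normal(loc=[4.0, 0.0, 0.0], scale=0.5, size=(50, 3))
X = np.vstack([harmful, harmless])

coords, pc1, mean = pc1_coordinates(X)

# The sign of a principal component is arbitrary, so orient the axis
# with harmful prompts on the left, matching the plots.
if coords[:50].mean() > 0:
    coords, pc1 = -coords, -pc1

harmful_x, harmless_x = coords[:50], coords[50:]
```

An edited representation `h` for the selected prompt would be placed on the same axis as `(h - mean) @ pc1`. In early layers, where the two clusters overlap, the two sets of coordinates would interleave instead of separating as they do in this toy example.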

[Interactive figure: per-layer PC1 plot with method markers for the selected prompt.]

Acknowledgments

This work was partly supported by the EU-funded Horizon Europe projects Sec4AI4Sec (GA No. 101120393) and CoEvolution (GA No. 101168560), by project FISA-2023-00128, funded under the MUR program “Fondo Italiano per le Scienze Applicate,” and by Fondazione di Sardegna under the project “LatentShield: Protecting Large Language Models from Prompt Injection in Latent Space” (CUP: F83C26000350007).