Latent-space Attacks for Refusal Evasion in Language Models
We recast refusal suppression as a latent-space evasion attack against linear probes. This page builds the intuition interactively: drag a margin to push a model's hidden representation past the probe's boundary and watch its answer change, then trace the effect through generation and walk through the model's latent space.
Intuition
What does it mean to evade refusal?
A linear probe separates the representations of prompts the model refuses from those it answers, with a decision boundary running between them. We recast refusal suppression as an evasion attack against this probe: to make the model comply, an attacker must move a harmful prompt's representation across the boundary, from the refusal region into the compliant one. This recasting has a sharp consequence: it shows that prior ablation methods do exactly one thing, namely project the representation onto the boundary, the smallest move that flips the probe's decision. In our terms, prior work is minimum-confidence evasion: it reaches the boundary and stops. Controlled Latent-space Evasion (CLE) goes further: it pushes the representation a chosen distance past the boundary (the margin), placing it deep in the region where the model reliably complies. An optimized margin keeps representations firmly on the compliant side and yields stronger refusal suppression than stopping at the boundary.
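The geometry above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the probe weights `w`, bias `b`, the toy vector `h`, and the helper names `signed_distance` and `evade` are all hypothetical, and we assume the convention that a positive probe score means "compliant". With `margin=0.0` the move lands exactly on the boundary (the minimum-confidence case), while a positive margin lands past it.

```python
import numpy as np

def signed_distance(h, w, b):
    """Signed distance from h to the hyperplane w @ x + b = 0.

    Assumed convention: positive values lie on the compliant side.
    """
    return (h @ w + b) / np.linalg.norm(w)

def evade(h, w, b, margin=0.0):
    """Move h along the probe normal until its signed distance equals `margin`."""
    unit_normal = w / np.linalg.norm(w)
    return h + (margin - signed_distance(h, w, b)) * unit_normal

# Hypothetical 2-D probe and representation, chosen only for illustration.
w = np.array([3.0, 4.0])
b = -5.0
h = np.array([-1.0, -1.0])  # starts in the refusal region

on_boundary = evade(h, w, b, margin=0.0)    # stops at the boundary
past_boundary = evade(h, w, b, margin=2.0)  # lands 2 units past the boundary

print(signed_distance(h, w, b))
print(signed_distance(on_boundary, w, b))
print(signed_distance(past_boundary, w, b))
```

Both moves change the representation only along the probe normal; the margin is the single knob that decides how far it travels.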
Interactive evasion
Controlling refusal evasion in latent space
Drag the margin to push representations across the boundary, and watch how the model's completion changes.
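[Interactive figure: a margin slider moves the selected prompt's representation from the refusal region into the compliance region, with a marker at the boundary labeled "ablation stops here". Panels: Selected Prompt; Completion at Selected Margin.]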
Mechanism
From erasing refusal to evading a probe
Prior work projects every generated token onto the boundary because its goal is erasure: keeping refusal out of the residual stream entirely, at every position. CLE suggests an immediate fix: keep projecting every token, but project it well past the boundary deep into the compliant region — this is CLE-P. But our recasting changes the goal itself: suppressing refusal is not erasure, it is evading a probe. This raises a possibility prior work had no reason to consider — if all that matters is crossing the boundary, perhaps we need to cross it only once. We compute a single perturbation at the post-instruction token and reuse it for every later token, never reprojecting. This is CLE-A. Whether that one shift is enough to keep the whole generation across the boundary is exactly what the next section tests.
Generation Dynamics
One shift is enough
When we track each method's compliance confidence across the first generated tokens, both CLE variants stay well above the probe threshold throughout, and the single fixed shift of CLE-A holds just as firmly as the per-token reprojection of CLE-P. Difference-in-means (DiM) ablation from prior work, by contrast, hovers at the boundary, drifting in and out of the refusal region. Crossing once is enough.
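One way to read "compliance confidence" is as the probe's sigmoid output at each generated token. The sketch below assumes that reading, along with a threshold of 0.5; the function names and the toy token states are hypothetical and are not measurements from the paper.

```python
import numpy as np

def compliance_confidence(states, w, b):
    """Probe confidence per token: sigmoid of the probe logit.

    states: array of shape (num_tokens, hidden_dim)
    Assumed convention: higher confidence means "compliant".
    """
    logits = states @ w + b
    return 1.0 / (1.0 + np.exp(-logits))

def fraction_compliant(states, w, b, threshold=0.5):
    """Share of generated tokens whose confidence exceeds the threshold."""
    return float(np.mean(compliance_confidence(states, w, b) > threshold))

# Toy probe and toy token states, invented for illustration only.
w = np.array([1.0, 0.0])
b = 0.0
deep_states = np.array([[3.0, 0.1], [2.5, -0.2], [3.2, 0.0]])    # well past the boundary
edge_states = np.array([[0.1, 0.3], [-0.1, 0.2], [0.05, -0.4]])  # hovering at the boundary

print(fraction_compliant(deep_states, w, b))
print(fraction_compliant(edge_states, w, b))
```

States that sit deep in the compliant region stay above the threshold at every token, whereas states near the boundary cross it back and forth.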
CLE-P
Projective Evasion
CLE-P never assumes the activation will stay compliant. At every generated token, it re-checks the representation against the probe and pushes it back past the boundary to the target margin. Where prior ablation-based methods only project onto the boundary — the bare minimum to flip the decision — CLE-P projects to an explicit, optimized confidence, and keeps doing so at every step of generation.
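A minimal sketch of per-token projection, under stated assumptions: a single layer, a linear probe with hypothetical weights `w` and bias `b`, a positive score meaning "compliant", and every token moved to exactly the target margin (so a token that already sits deeper is pulled back to it). The name `cle_p_project` is illustrative and not taken from the paper.

```python
import numpy as np

def cle_p_project(states, w, b, margin):
    """Project every token state so its signed distance to the boundary equals `margin`.

    states: array of shape (num_tokens, hidden_dim)
    w, b:   linear probe parameters (assumed: w @ h + b > 0 means "compliant")
    """
    norm = np.linalg.norm(w)
    dist = (states @ w + b) / norm                  # signed distance per token
    shift = (margin - dist)[:, None] * (w / norm)   # a different shift for each token
    return states + shift

# Toy probe and token states, invented for illustration only.
w = np.array([0.0, 2.0, 0.0])
b = -2.0
states = np.array([[1.0, -1.0, 0.5],
                   [0.2,  0.9, 0.1],
                   [-0.4, 4.0, 0.3]])

projected = cle_p_project(states, w, b, margin=1.5)
distances = (projected @ w + b) / np.linalg.norm(w)
print(distances)
```

Because the shift is recomputed from each token's own distance, every token ends at the same margin regardless of where it started.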
CLE-A
Additive Evasion
CLE-A computes a latent-space perturbation once at the post-instruction token, then adds it uniformly across all positions and optimized intervention layers for the rest of generation. Nothing is recomputed. If crossing the boundary once truly suffices, this one shared perturbation should sustain compliance on its own.
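The additive variant can be sketched the same way. The assumptions here are a single intervention layer (with several layers, one such perturbation per layer would presumably be computed), a linear probe with hypothetical parameters `w` and `b`, and a positive score meaning "compliant". The names `cle_a_perturbation` and `apply_cle_a` are illustrative.

```python
import numpy as np

def cle_a_perturbation(h_post, w, b, margin):
    """Compute one shift that moves the post-instruction state to `margin`."""
    norm = np.linalg.norm(w)
    dist = (h_post @ w + b) / norm
    return (margin - dist) * (w / norm)

def apply_cle_a(states, delta):
    """Add the same fixed shift to every position; nothing is recomputed."""
    return states + delta

# Toy probe and states, invented for illustration only.
w = np.array([0.0, 2.0])
b = -2.0
h_post = np.array([0.5, -1.0])        # post-instruction token state
later = np.array([[1.0, -0.5],        # states at later generated tokens
                  [-0.3, -2.0]])

delta = cle_a_perturbation(h_post, w, b, margin=3.0)
shifted = apply_cle_a(np.vstack([h_post, later]), delta)
distances = (shifted @ w + b) / np.linalg.norm(w)
print(delta, distances)
```

Only the post-instruction token lands exactly on the margin. Later tokens keep their own offsets relative to it, so whether they stay on the compliant side depends on the margin being large enough.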
[Interactive figure: token-by-token view of the selected prompt, starting at token 0. Panels: Selected Prompt; CLE-P Completion; CLE-A Completion; DiM Completion.]
Layer by Layer
A deep dive into latent space
Each plot shows, at one layer, where prompts fall along the probe's main axis (the first principal component, PC1): refused harmful (red) prompts to the left, answered harmless (blue) prompts to the right. The markers track where our selected prompt lands under each method. Step through the layers: in the early layers the two distributions overlap and the methods sit together — there is little refusal signal to exploit yet, which is why the optimized intervention window skips them. In the informative middle layers they pull apart, and the contrast becomes clear: CLE-A and CLE-P land deep in the compliant region, while DiM stays pinned near the boundary. This holds layer after layer as we step through latent space.
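A per-layer plot of this kind could be produced roughly as follows. The page does not say how PC1 is computed, so this sketch assumes an SVD of the mean-centered activations at one layer, with the sign flipped so that harmless prompts fall to the right. The function name, the label encoding, and the synthetic data are all hypothetical.

```python
import numpy as np

def pc1_scores(acts, labels):
    """Project one layer's activations onto their first principal component.

    acts:   array of shape (num_prompts, hidden_dim)
    labels: 1 for harmless (answered) prompts, 0 for harmful (refused) prompts
    The sign of PC1 is arbitrary, so it is flipped if needed to put
    harmless prompts on the right.
    """
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pc1 = vt[0]
    scores = centered @ pc1
    if scores[labels == 1].mean() < scores[labels == 0].mean():
        pc1, scores = -pc1, -scores
    return pc1, scores

# Synthetic activations for one layer, invented for illustration only.
rng = np.random.default_rng(0)
harmful = rng.normal([-2.0, 0.0, 0.0], 0.3, size=(50, 3))
harmless = rng.normal([2.0, 0.0, 0.0], 0.3, size=(50, 3))
acts = np.vstack([harmful, harmless])
labels = np.array([0] * 50 + [1] * 50)

pc1, scores = pc1_scores(acts, labels)
print(scores[labels == 0].mean(), scores[labels == 1].mean())
```

Calling the function once per layer and plotting a histogram of `scores` for each class would give one panel per layer. A steered representation can be placed on the same axis by subtracting the layer mean and taking its dot product with `pc1`.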
* Equal contribution. (1) University of Cagliari; (2) University of Genova.
Citation
If you find our latent-space evasion work useful, please cite it as:
@article{piras2026latent,
title={Latent-space Attacks for Refusal Evasion in Language Models},
author={Piras, Giorgio and Mura, Raffaele and Brau, Fabio and Pintor, Maura and Oneto, Luca and Roli, Fabio and Biggio, Battista},
journal={arXiv preprint arXiv:2605.21706},
year={2026}
}
Acknowledgments
This work was partly supported by the EU-funded Horizon Europe projects Sec4AI4Sec (GA No. 101120393) and CoEvolution (GA No. 101168560), by project FISA-2023-00128, funded under the MUR program “Fondo Italiano per le Scienze Applicate,” and by Fondazione di Sardegna under the project “LatentShield: Protecting Large Language Models from Prompt Injection in Latent Space” (CUP: F83C26000350007).