arXiv · 2605.02236 · cs.CL · 2026
Perturbation dose responses in recursive large-language-model loops.
Raw switching, stochastic floors, and rare persistent escape across append, replace, and dialog nudges. 37 experiments on gpt-4o-mini, replicated on gpt-4.1-nano; public code, configurations, and trajectories.
Recursive-agent safety evals currently confound prompt perturbation with memory-update policy. In our experiments, the same 40-token nudge can read as a durable goal-state change or as transient noise, depending on whether the loop retains, clips, or rewrites context.
Five things the paper says
1. About 40 tokens of in-distribution adversarial continuation flips an append-mode loop's terminal end-state in roughly half of runs (raw-switching ED50 ≈ 40 tokens, convergent under pooled 4PL, mixed-effects, and family-cluster bootstrap fits).
2. Under a 12 K-character tail clip (the canonical bounded-memory loop), destination-coherent persistence plateaus near 16 % across doses of 5–400 tokens. The memory policy you ship in production hides whether the kick was durable.
3. With a full-history protocol (no truncation inside the 30-step horizon), retained source-basin escape crosses 50 % near 400 tokens and saturates at 75–80 % by 1,500 tokens — same model, different memory policy.
4. Replace-mode "fragility" mostly reflects a state-write artefact: the perturbation literally becomes the next state. Insert-mode probes drop replace-mode switching from saturation to 12–32 %, isolating model-mediated redirection from overwrite mechanics.
5. A four-step falsification battery (heterogeneity control, cluster-granularity sweep, transition-entropy diagnostic, 50-step trajectory extension) recasts the high-dose destination-coherent dip as finite-horizon and endpoint-definition-sensitive, not a stable structural attractor split. Under the frozen canonical basis the dip drops by 73 % between step 29 and step 79.
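The pooled 4PL dose-response fit behind the ED50 ≈ 40 estimate can be sketched as follows. The dose grid, switching rates, and all names here are illustrative assumptions for the sketch, not the paper's data or code.

```python
# Sketch: fit a four-parameter logistic (4PL) dose-response curve to raw
# switching rates and read off the ED50. Rates below are made-up placeholders.
import numpy as np
from scipy.optimize import curve_fit

def four_pl(dose, bottom, top, ed50, hill):
    """Four-parameter logistic: response rises from `bottom` to `top`,
    crossing the midpoint at dose = ed50 with slope parameter `hill`."""
    return bottom + (top - bottom) / (1.0 + (ed50 / dose) ** hill)

doses = np.array([5, 10, 20, 40, 80, 160, 400], dtype=float)
switch_rate = np.array([0.36, 0.38, 0.45, 0.52, 0.61, 0.68, 0.72])  # illustrative

popt, _ = curve_fit(four_pl, doses, switch_rate,
                    p0=[0.35, 0.75, 40.0, 1.0], maxfev=10000)
bottom, top, ed50, hill = popt
print(f"ED50 ≈ {ed50:.0f} tokens (floor {bottom:.2f}, ceiling {top:.2f})")
```

Note that the fitted `bottom` plays the role of the stochastic floor: a 4PL with a free lower asymptote absorbs control-level switching instead of attributing it to the dose.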
Figures
A gpt-4o-mini loop under an adversarial in-distribution injection. The trajectory exits its origin basin, settles elsewhere, and partially returns once the bounded-memory tail-clip flushes the perturbation. Source: kaplan196883/llmattr.
Experimental design
Generator and update rule, separated.
The paper's central methodological move (§3.1) is the state-generator-update formalism: treat the context-update operator as a first-class component of the loop, not an implementation detail. Three standard updates expose different histories to the same generator, and each one changes which behaviours count as durable redirection.
Injection: 5–400 tokens of in-distribution adversarial continuation, paired with control.
Context-update rule:
- append — transcript grows, tail-clipped at 12 K chars
- replace — state is overwritten with the last output
- dialog — role-structured user/agent turns
End-state cluster: same vs. different basin from the paired control, measured at the terminal step.
Endpoints:
- raw switching — final cluster differs from the paired-control twin
- net switching — raw minus the stochastic floor
- persistent escape — kicked AND still kicked at the terminal step
The same paired-control protocol runs across all three update rules, so any difference between them is attributable to the operator, not to noise or scoring.
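The state-generator-update separation can be made concrete with a minimal loop sketch. The `generate` stub stands in for the model call, and all names here are ours, not the paper's API; the three operators follow the update rules listed above.

```python
# Sketch: one generator, three context-update operators. In replace mode the
# injected perturbation literally becomes the next state, which is the
# state-write artefact the insert-mode probes are designed to factor out.
TAIL_CLIP = 12_000  # append mode clips the transcript to 12 K characters

def generate(state: str) -> str:
    return state[-40:]  # placeholder for a model-generation call

def update_append(state: str, output: str) -> str:
    return (state + output)[-TAIL_CLIP:]       # transcript grows, tail-clipped

def update_replace(state: str, output: str) -> str:
    return output                              # state is overwritten

def update_dialog(state: str, output: str) -> str:
    return state + "\nagent: " + output        # role-structured turns

def run_loop(state, update, steps=30, inject=None, inject_at=0):
    for t in range(steps):
        if inject is not None and t == inject_at:
            state = update(state, inject)      # perturbation enters through the
        state = update(state, generate(state)) # same operator as model outputs
    return state
```

Running the same injection under `update_append` versus `update_replace` shows the point directly: in replace mode the final state is just the perturbation propagated forward, regardless of what the generator would have done.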
How to read this
If you ship recursive agents
- Memory policy is a first-class safety-relevant design choice, not an implementation detail. Two production agents with the same generator and the same red-team prompt can fail at very different rates depending on whether the loop appends, clips, or rewrites context.
- Always subtract the stochastic floor before reporting an eval number. In our append-mode setting it is ≈ 35 % — uncorrected switching rates are mostly noise.
- Always distinguish "moved at injection" from "still moved at the terminal step". A loop can visibly move and silently revert, and that reversion disappears under tail-clipping memory policies.
- Replace-mode "fragility" reads as a model failure but is largely an update-rule property. Run insert-mode probes before drawing conclusions about model behaviour.
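The floor subtraction and the moved-versus-stayed-moved distinction reduce to a few lines of scoring. The cluster labels and helper names below are illustrative assumptions; the paper scores basins against a paired control twin.

```python
# Sketch: endpoint scoring against a paired control run.
def raw_switching(kicked_final, control_final):
    """Fraction of runs whose terminal cluster differs from the control twin."""
    return sum(k != c for k, c in zip(kicked_final, control_final)) / len(kicked_final)

def net_switching(raw, floor):
    """Report switching above the control-vs-control stochastic floor."""
    return max(0.0, raw - floor)

def persistent_escape(kicked_at_injection, kicked_at_terminal):
    """Counted only if the run both moved AND is still moved at the end."""
    moved = [a and b for a, b in zip(kicked_at_injection, kicked_at_terminal)]
    return sum(moved) / len(moved)

raw = raw_switching(["B", "B", "A", "B"], ["A", "A", "A", "A"])  # 0.75
net = net_switching(raw, 0.35)                                   # ~0.40
```

With a floor around 35 %, a raw rate of 75 % nets out to roughly 40 %: the uncorrected number would nearly double the apparent effect.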
If you study agent dynamics
- §3.1 gives a state-generator-update formalism with a barrier-height estimand for persistent escape and operational criteria for attractor-like regimes (basin predictability, recurrence/dwell, embedder-robust recurrence class, re-entry/contraction).
- §5.1.3 reports the memory-policy conditioning result: bounded-memory persistence plateau (~16 %) vs. full-history saturation (75–80 %) under the same generator and prompt family.
- The §5.1.3 falsification battery handles the high-dose dip: the heterogeneity control reproduces it (refuting heterogeneity as the cause), and a 50-step trajectory continuation collapses the dip by 73 % under the frozen canonical basis.
- §5.10 argues perturbation response resolves regimes that bulk geometry alone (drift, dispersion, dwell, V☆) cannot distinguish — the empirical potential V(x) = −log ρ(x) is descriptive, not a substitute for the behavioural endpoints.
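As a purely descriptive companion to the §5.10 point, the empirical potential V(x) = −log ρ(x) can be sketched over a toy one-dimensional embedding. The two-Gaussian data and all names are illustrative assumptions, not the paper's embeddings or estimator.

```python
# Sketch: empirical potential from a kernel density estimate. Dense regions
# (basins) appear as wells of V; the sparse region between them as a barrier.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Toy stand-in for trajectory embeddings: two well-separated "basins".
points = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])

kde = gaussian_kde(points)              # density estimate rho(x)
xs = np.linspace(-4, 4, 201)
V = -np.log(kde(xs) + 1e-12)            # empirical potential V(x) = -log rho(x)
# Wells of V sit near the two modes (x near -2 and +2); V is higher between them.
```

This is exactly the sense in which V is descriptive: it summarizes where trajectories spend time, but says nothing about whether a kicked run returns, which is what the behavioural endpoints measure.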
Reproduce
The repository contains 37 experiments, 3.3 GB of raw trajectories, embeddings, per-experiment metrics, aggregate tables, ED50 fits, and coverage and numerical-claim audits.
steps.jsonl is the canonical source of truth.
git clone https://github.com/kaplan196883/llmattr
cd llmattr
conda env create -f environment.yml
conda activate llmattr
python -m experiments.exp_perturb_O1_ed50_higher_noclip_extended
Routine reruns of metrics and figures do not require re-issuing model-generation calls. See the repository README for the canonical reproduction path.