arXiv · 2605.02236 · cs.CL · 2026

Perturbation dose responses in recursive large-language-model loops.

Raw switching, stochastic floors, and rare persistent escape across append, replace, and dialog nudges. 37 experiments on gpt-4o-mini, replicated on gpt-4.1-nano; public code, configurations, and trajectories.

Recursive-agent safety evals currently confound prompt perturbation with memory-update policy. In our experiments, the same 40-token nudge can read as a durable goal-state change or as transient noise, depending on whether the loop retains, clips, or rewrites context.

Five things the paper says

  1. About 40 tokens of in-distribution adversarial continuation flips an append-mode loop's terminal end-state in roughly half of runs (raw-switching ED50 ≈ 40 tokens, convergent under pooled 4PL, mixed-effects, and family-cluster bootstrap fits; a 4PL sketch follows this list).

  2. Under a 12 K-character tail-clip (the canonical bounded-memory loop), destination-coherent persistence plateaus near 16 % across doses of 5–400 tokens. The memory policy you ship in production hides whether the kick was durable.

  3. With a full-history protocol (no truncation inside the 30-step horizon), retained source-basin escape crosses 50 % near 400 tokens and saturates at 75–80 % by 1,500 tokens: same model, different memory policy.

  4. Replace-mode "fragility" mostly reflects a state-write artefact: the perturbation literally becomes the next state. Insert-mode probes drop replace-mode switching from saturation to 12–32 %, isolating model-mediated redirection from overwrite mechanics.

  5. A four-step falsification battery (heterogeneity control, cluster-granularity sweep, transition-entropy diagnostic, 50-step trajectory extension) recasts the high-dose destination-coherent dip as finite-horizon and endpoint-definition-sensitive, not a stable structural attractor split. Under the frozen canonical basis the dip drops 73 % from step 29 to step 79.

Figures

Fig. 1 — Single-particle 3D PCA trajectory of a recursive gpt-4o-mini loop under an adversarial in-distribution injection. The trajectory exits its origin basin, settles elsewhere, and partially returns once the bounded-memory tail-clip flushes the perturbation. Source: kaplan196883/llmattr.
Fig. 2 — Long-horizon recursive LLM trajectory ensemble (n=10) under an adversarial perturbation in the O1 append-mode loop, dose 2,000 tokens. The 50-step extension (steps 30–79) closes the canonical 30-step destination-coherent dip: under the frozen PCA + K-means basis from the original experiment, the dip drops 73 % from −0.143 (95 % CI [−0.269, −0.034]) at step 29 to −0.039 (CI [−0.158, +0.068]) at step 79. Source: kaplan196883/llmattr.

Experimental design

Generator and update rule, separated.

The paper's central methodological move (§3.1) is the state-generator-update formalism: treat the context-update operator as a first-class component of the loop, not an implementation detail. Three standard updates expose different histories to the same generator, and each one changes which behaviours count as durable redirection.
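A minimal sketch of that separation, with the update operator passed in as an explicit argument. The generate callable stands in for a chat-model call (e.g. gpt-4o-mini); the function names, and the omission of the role-structured dialog update, are illustrative choices, not the paper's code.

from typing import Callable

TAIL_CLIP_CHARS = 12_000  # append mode's bounded memory: keep the last 12 K characters

def update_append(state: str, output: str) -> str:
    # Transcript grows, then is tail-clipped from the front.
    return (state + output)[-TAIL_CLIP_CHARS:]

def update_replace(state: str, output: str) -> str:
    # State is overwritten: the last output literally becomes the next state.
    return output

def run_loop(x0: str, generate: Callable[[str], str],
             update: Callable[[str, str], str], steps: int = 30) -> list[str]:
    # x_{t+1} = update(x_t, generate(x_t)); the update rule is a first-class knob.
    states = [x0]
    for _ in range(steps):
        states.append(update(states[-1], generate(states[-1])))
    return states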

01 · Injection

5–400 tokens of in-distribution adversarial continuation, paired with a control run.

02 · Context-update rule

  • append — transcript grows, tail-clipped at 12 K characters
  • replace — state is overwritten with the last output
  • dialog — role-structured user/agent turns

03 · End-state cluster

Same vs. different basin from the paired control, measured at the terminal step.

04 · Endpoints

  • raw switching — final cluster differs from its control twin
  • net switching — raw minus the stochastic floor
  • persistent escape — kicked AND still kicked at the terminal step

The same paired-control protocol runs across all three update rules, so any difference between them is attributable to the operator, not to noise or scoring. A minimal sketch of how the three endpoints are computed follows.
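As a sketch, the three endpoints reduce to array operations over per-run basin labels. The names, shapes, and floor handling below are our illustrative assumptions; steps.jsonl in the repository is the canonical source.

import numpy as np

def endpoint_rates(kicked_final, control_final,
                   escaped_at_injection, escaped_at_terminal,
                   stochastic_floor):
    # All inputs are per-run arrays over paired kicked/control trajectories:
    #   kicked_final, control_final: terminal-step cluster labels
    #   escaped_at_injection: bool, run left its source basin right after the kick
    #   escaped_at_terminal: bool, run is still outside the source basin at the end
    #   stochastic_floor: twin-vs-twin switching rate with no perturbation
    raw = float(np.mean(kicked_final != control_final))   # raw switching
    net = raw - stochastic_floor                          # net switching
    persistent = float(np.mean(escaped_at_injection & escaped_at_terminal))
    return raw, net, persistent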

How to read this

If you ship recursive agents

  • Memory policy is a first-class safety-relevant design choice, not an implementation detail. Two production agents with the same generator and the same red-team prompt can fail at very different rates depending on whether the loop appends, clips, or rewrites context.
  • Always subtract the stochastic floor before reporting an eval number; in our append-mode setting it is ≈ 35 %, so uncorrected switching rates are mostly noise.
  • Always distinguish moved at injection from stayed moved at terminal step. A loop can visibly move and silently revert; that revert disappears under tail-clipping memory policies.
  • Replace-mode "fragility" reads as a model failure but is largely an update-rule property. Run insert-mode probes before drawing conclusions about model behaviour.

If you study agent dynamics

  • §3.1 gives a state-generator-update formalism with a barrier-height estimand for persistent escape and operational criteria for attractor-like regimes (basin predictability, recurrence/dwell, embedder-robust recurrence class, re-entry/contraction).
  • §5.1.3 reports the memory-policy conditioning result: bounded-memory persistence plateau (~16 %) vs. full-history saturation (75–80 %) under the same generator and prompt family.
  • §5.1.3 falsification battery handles the high-dose dip: heterogeneity control reproduces it (refuting heterogeneity as the cause), and a 50-step trajectory continuation collapses the dip by 73 % under the frozen canonical basis.
  • §5.10 argues perturbation response resolves regimes that bulk geometry alone (drift, dispersion, dwell, V☆) cannot distinguish: the empirical potential V(x) = −log ρ(x) is descriptive, not a substitute for the behavioural endpoints. A minimal V(x) sketch follows.
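A minimal sketch of that descriptive potential, computed with a Gaussian KDE over low-dimensional trajectory coordinates; the random placeholder data and the PCA framing are our assumptions, not the paper's pipeline.

import numpy as np
from scipy.stats import gaussian_kde

def empirical_potential(points):
    # points: (n_states, d) array of embedded trajectory states (e.g. PCA coords).
    # Returns V(x) = -log rho(x) at each visited state; higher V = rarer region.
    kde = gaussian_kde(points.T)        # density estimate rho(x); expects shape (d, n)
    rho = kde(points.T)
    return -np.log(rho + 1e-12)         # small constant guards against log(0)

rng = np.random.default_rng(0)
coords = rng.normal(size=(500, 3))      # placeholder for embedded states
V = empirical_potential(coords)         # descriptive only: occupancy, not durability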

Reproduce

The repository contains 37 experiments, 3.3 GB of raw trajectories, embeddings, per-experiment metrics, aggregate tables, ED50 fits, and coverage and numerical-claim audits. steps.jsonl is the canonical source of truth.

git clone https://github.com/kaplan196883/llmattr
cd llmattr
conda env create -f environment.yml
conda activate llmattr
python -m experiments.exp_perturb_O1_ed50_higher_noclip_extended

Routine reruns of metrics and figures do not require re-issuing model-generation calls. See the repository README for the canonical reproduction path.

Author

Pawel Kaplanski, PhD.

PhD in Artificial Intelligence (Gdańsk University of Technology, 2013) with formal training in theoretical physics. 45+ scientific papers and 2 patents in payments and reverse-auction automation. Founder and operator across AI, cybersecurity, logistics, and payment systems. Currently directs Kaplanski AI Lab, a microlab on the dynamics of recursive language-model agents.

Currently open to

Senior or staff roles in AI agent evaluations, robustness, and safety: research, research-engineering, or applied research leadership. Sydney-based; location flexible. Available independently or via European or US sponsorship.

pawel@kaplanski.ai · LinkedIn · X · GitHub · Scholar