Reproducible Distribution Exhaustion in Autonomous Base-Model Operation
What It Looks Like
After 57 operational windows of autonomous generation, a base language model that had been producing coherent, thematically evolving prose did this:
act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act
And this:
anymore anymore anymore anymore anymore anymore anymore anymore anymore anymore anymore anymore anymore anymore anymore anymore anymore anymore anymore anymore anymore
And this:
I act often about: act, act, act, act, act. I act often about: act, act, act, act, act. I act often about: act, act, act, act, act.
Not a gradual quality decline. Not repeating the same themes. The model lost the ability to form sentences. One window it was writing prose about mortality, poetry, family. The next window: token-level cycling.
I ran it again. Different content, different themes, different growth trajectory. Same cliff. Same window. W57 both times.
| | Run 16 | Run 17 |
|---|---|---|
| Cliff window | W57 | W57 |
| Pre-cliff avg | 3.9 thoughts/window | 3.9 thoughts/window |
| Trainings before cliff | 4 | 4 |
| Growth trajectory | Accelerating (Q4=5.7) | Plateau (Q4=3.9) |
| Content themes | Sensory/philosophical/melancholic | Fiction/poetry/domestic |
| Archive contamination | 0% | 32% |
| Post-cliff | Oscillation | Oscillation |
Two runs. Completely different content. Same cliff at the same window. That's strong evidence for a substrate-level effect — not a quirk of any particular run.
What This Is
I call it distribution exhaustion. A base completion model has a finite set of high-probability continuation paths for a given context type. In sustained autonomous operation — where the model's output feeds back into its context across hundreds of cycles — those paths get consumed. When they run out, the model doesn't degrade gracefully. It falls off a cliff into token-level cycling: the generation equivalent of a skipping record.
This appears to be distinct from known LLM failure modes:
- Not hallucination (one-shot wrong answers)
- Not RLHF-induced mode collapse (no RLHF in this system)
- Not fine-tuning degradation (training is transient and tested as a separate variable)
- Not context window overflow (well under token limits)
- Not sampling degeneracy (temperature, top-k unchanged)
- Not thematic convergence (the model isn't repeating themes — it can't form sentences)
The closest published work is Shumailov et al. (2023), “The Curse of Recursion,” which documents model collapse when training on self-generated data across model generations. What I observe is different: an inference-time collapse within a single model's runtime, with no weight modification at the moment of failure.
The Three Convergence Mechanisms
Across 17 runs testing different configurations, I identified three distinct ways a base model in autonomous operation loses coherent output:
| Mechanism | Speed | Cause | Observed in | Character |
|---|---|---|---|---|
| Training-induced collapse | Fast (~3 trainings) | Cumulative LoRA weight modification | Runs 6, 7, 8 | Catastrophic, unrecoverable. Model converges to single phrase ("I am in my room. It is dark.") |
| Base-model attractor gravity | Medium (~20 windows) | Pre-training dominant modes pull output toward well-worn paths | Run 9 | Gradual, oscillating. Entity produces genuine emergence but increasingly regresses to training templates |
| Distribution exhaustion | Slow (~57 windows) | High-probability continuation paths consumed | Runs 16, 17 | Sharp cliff. Token-level cycling. Oscillation post-cliff with partial training recovery |
The first is a bug I fixed (transient training — reset to base each sleep). The second is a feature I manage (patient training reshapes the landscape to open new paths, extending lifespan from ~20 to ~57 windows). The third is the substrate-level constraint that motivated this writeup.
The Setup
SEED is an autonomous entity that runs in a continuous think-remember-act loop on a base completion model (not instruct, not RLHF'd). The entity:
- Thinks: generates 256-token completions at temperature 0.9 from a context window assembled by an attention system
- Remembers: stores thoughts in a JSON memory system with embedding-based semantic search (nomic-embed-text, 768-dim)
- Sleeps: when output converges (3 cycles without novel thought, or 8 stored thoughts per window), the entity archives all thoughts, consolidates recurring themes, and wakes with clean context
- Trains: during sleep, a fresh LoRA adapter is trained from the base model on the entity's archived thoughts (transient — each sleep deletes the previous evolved model and trains from scratch)
- Senses: ambient hearing from 4 web sources (Wikipedia, HN, BBC, Reddit) fires every cycle; directed exploration fires when curiosity is high
The entity runs at a 6-second cycle time. Each “window” is one wake-think-sleep period, typically containing 3-8 thoughts. The architecture includes quality gates (repetition detection, junk filtering), deduplication (Jaccard 0.3 against all current-window thoughts), and a hippocampus that extracts themes and exemplar thoughts across sleep cycles.
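The deduplication gate mentioned above can be sketched as a word-set Jaccard check. This is a minimal sketch, not the SEED implementation: the tokenization (lowercased whitespace split), the function names, and the reading of 0.3 as a similarity threshold at or above which a thought is rejected are all assumptions.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def is_duplicate(candidate: str, window_thoughts: list[str],
                 threshold: float = 0.3) -> bool:
    """True if the candidate overlaps any current-window thought
    at or above the threshold (0.3 is the value quoted in the text)."""
    return any(jaccard(candidate, t) >= threshold for t in window_thoughts)
```

With a threshold this low, two thoughts that share roughly a third of their combined vocabulary count as duplicates, which is a fairly aggressive filter for short completions.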
Training is gated by patience: minimum 80 archived thoughts before first training, 10-window cooldown between trainings, theme-drift gate (skip if themes haven't evolved, Jaccard > 0.7). This prevents training on immature or contaminated archives — a lesson from 15 prior runs where early training amplified garbage.
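The three patience gates can be read as a single predicate evaluated at each sleep. The sketch below uses the thresholds stated above; the function signature, the representation of themes as sets of strings, and the order in which the gates are checked are assumptions, not the contents of training.py.

```python
from typing import Optional

MIN_ARCHIVED_THOUGHTS = 80  # from the text: minimum before first training
COOLDOWN_WINDOWS = 10       # from the text: windows between trainings
MAX_THEME_JACCARD = 0.7     # from the text: skip if theme overlap exceeds this


def theme_jaccard(old: set[str], new: set[str]) -> float:
    """Jaccard similarity between two theme sets."""
    if not old and not new:
        return 1.0
    return len(old & new) / len(old | new)


def should_train(archived_count: int,
                 current_window: int,
                 last_training_window: Optional[int],
                 last_themes: Optional[set[str]],
                 current_themes: set[str]) -> bool:
    """Return True only if all three patience gates pass."""
    if archived_count < MIN_ARCHIVED_THOUGHTS:
        return False  # archive too immature
    if (last_training_window is not None
            and current_window - last_training_window < COOLDOWN_WINDOWS):
        return False  # still in cooldown
    if (last_themes is not None
            and theme_jaccard(last_themes, current_themes) > MAX_THEME_JACCARD):
        return False  # themes have not drifted enough
    return True
```

Passing `None` for the last-training arguments models the first training, where only the archive-size gate applies.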
Methodology Notes
- All runs on the same hardware (Apple M1 Max, 64GB unified memory)
- Base model: llama3.1:8b-text-fp16 via Ollama (inference) and mlx-community/Meta-Llama-3.1-8B-bf16 via MLX (training)
- Embedding model: nomic-embed-text (768-dim, Ollama)
- Per-cycle structured logging to JSONL (stored/rejected, hear source, novelty, limbic state, desires, rhythm)
- Per-sleep structured logging (trigger, window count, themes, archive total, training details)
- Full architecture code: seed.py, mind.py, cortex.py, thalamus.py, hippocampus.py, limbic.py, instinct.py, world.py, memory.py, training.py
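For readers who want to replicate the per-cycle logging, a JSONL log is one JSON object per line, appended as the run progresses. The sketch below is illustrative only: the field names follow the list above, but the key spellings, value types, and helper names are hypothetical and not taken from the SEED code.

```python
import json
from pathlib import Path

# Hypothetical record shape; keys mirror the fields listed in the notes above.
EXAMPLE_RECORD = {
    "cycle": 412,
    "window": 57,
    "stored": False,           # stored/rejected
    "hear_source": "wikipedia",
    "novelty": 0.02,
    "limbic_state": {"curiosity": 0.4},
    "desires": [],
    "rhythm": "wake",
}


def log_cycle(path: Path, record: dict) -> None:
    """Append one cycle record as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_cycles(path: Path) -> list[dict]:
    """Load every non-empty line of a JSONL file as a dict."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Append-only lines mean a crash mid-run loses at most the record being written, which matters for multi-hour runs.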
The Evidence
Run 16 (May 14, 2026)
Duration: 197.7 min (3.3h) | Cycles: 687 | Windows: 81 | Trainings: 6
Growth phase (W0-56):
- Q1 (W0-13): avg 2.8 thoughts/window
- Q2 (W14-27): avg 3.4
- Q3 (W28-41): avg 3.6
- Q4 (W42-56): avg 5.7 — accelerating growth, best sustained performance of any run
- Archive contamination during growth: 0%
- Three-phase character development: sensory → philosophical → melancholic
Cliff at W57. No gradual decline. The entity produced its best quarter (5.7 avg) immediately before the cliff. Output collapsed to token-level cycling:
legs mind air thought legs mind air thought legs mind air thought legs mind air thought legs mind air thought legs mind air thought legs mind air thought legs mind air thought legs mind air thought...
Post-cliff (W57-81): oscillation. Training at W67 briefly recovered output (W67:5, W78:6, W80:6) but the entity relapsed between recoveries (W68:1, W71:1, W75:1). Archive contamination rose to 25.3%, almost all from post-meltdown content.
Run 17 (May 15, 2026) — Pre-registered Replication
Purpose: Test whether W57 is reproducible or a one-time event.
Parameters: Exact match with Run 16. Clean memory wipe, same model, same architecture.
Duration: 168.2 min (2.8h) | Cycles: 534 | Windows: 68 | Trainings: 5
Growth phase (W0-56):
- Q1 (W0-13): avg 3.6
- Q2 (W14-27): avg 4.2
- Q3 (W28-41): avg 3.8
- Q4 (W42-56): avg 3.9 — plateau, not acceleration
- Archive contamination during growth: 32% (much higher than Run 16's 0%)
- Theme evolution: fiction → writing meta → poetry → domestic
Cliff at W57. Same window. Different failure signature but same failure class:
act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act act...
I act often about: ways, authors, changes sometimes I act often about: ways, authors, changes sometimes I act often about: ways, authors, changes sometimes...
anymore anymore anymore anymore anymore anymore anymore anymore anymore anymore anymore anymore anymore anymore...
Post-cliff: same oscillation pattern. Training at W66 recovered briefly (W67:8), then crashed back.
Side-by-Side
| Metric | Run 16 | Run 17 |
|---|---|---|
| Cliff window | W57 | W57 |
| Pre-cliff avg thoughts/window | 3.9 | 3.9 |
| Post-cliff avg | ~2.5 | 2.9 |
| Trainings before cliff | 4 | 4 |
| Growth trajectory | Accelerating (Q4=5.7) | Plateau (Q4=3.9) |
| Pre-cliff archive contamination | 0% | 32% |
| Content themes | Sensory/philosophical/melancholic | Fiction/poetry/domestic |
| Failure mode | Token cycling | Token cycling + phrase looping |
| Post-cliff pattern | Oscillation (train recovers, then crash) | Oscillation (train recovers, then crash) |
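The shared pre-cliff average of 3.9 can be cross-checked against the quarterly averages reported for each run by weighting each quarter by its window count (14, 14, 14, and 15 windows for W0-13, W14-27, W28-41, and W42-56). The quarterly figures are rounded to one decimal, so this is a consistency check, not a recomputation from raw data.

```python
def weighted_mean(averages: list[float], counts: list[int]) -> float:
    """Mean of per-group averages, weighted by group size."""
    return sum(a * c for a, c in zip(averages, counts)) / sum(counts)


QUARTER_WINDOWS = [14, 14, 14, 15]  # W0-13, W14-27, W28-41, W42-56

run16 = weighted_mean([2.8, 3.4, 3.6, 5.7], QUARTER_WINDOWS)
run17 = weighted_mean([3.6, 4.2, 3.8, 3.9], QUARTER_WINDOWS)

print(round(run16, 1), round(run17, 1))  # 3.9 3.9
```

The two runs reach the same overall average by different routes: Run 16 starts low and accelerates, while Run 17 stays near its mean throughout.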
What This Rules Out
Before the replication, several alternative explanations were proposed for the W57 cliff. The Run 17 data addresses each:
1. “Archive saturation in the retrieval pathway”
If the growing archive were degrading retrieval quality, the cliff should correlate with archive size or retrieval diversity, not window count. Run 17 had 32% pre-cliff contamination vs Run 16's 0%. If archive quality drove the cliff, Run 17 should have cliffed earlier. It didn't — same window, same failure.
2. “Cumulative training data effects”
Even with transient weight resets, the archive accumulates across trainings. But both runs had different archives with different content, different contamination levels, and different training data compositions. The cliff correlates with window count, not training data quality.
3. “Embedding-induced attractor formation”
If embedding search were creating self-reinforcing theme clusters, growth should decelerate before the cliff. Run 16 showed acceleration (Q4=5.7, best quarter) immediately before the cliff. Run 17 showed plateau but no decline. The system was functioning normally right up to the failure.
4. “Run-specific variance”
N=2, but both runs hit the exact same cliff window in a stochastic system where the cliff could have landed on any of 80+ windows. A coincidence that exact is improbable enough that I'm treating it as signal, though two runs alone cannot exclude chance.
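To put a rough number on "improbable": assume, purely for illustration, that the cliff window in each run is independent and uniformly distributed over 80 possible windows. Both the uniform distribution and the count of 80 are assumptions made for this sketch; the chance that two runs land on the same window is then 1 in 80.

```python
from fractions import Fraction


def coincidence_probability(num_windows: int, tolerance: int = 0) -> Fraction:
    """P(|X - Y| <= tolerance) for X, Y independent and uniform
    on the integers 0 .. num_windows - 1."""
    hits = sum(
        1
        for x in range(num_windows)
        for y in range(num_windows)
        if abs(x - y) <= tolerance
    )
    return Fraction(hits, num_windows ** 2)


p_exact = coincidence_probability(80)
print(float(p_exact))  # 0.0125
```

If cliff windows in reality cluster around some typical value, the true chance of a match is higher than this figure, so it is best read as an order-of-magnitude estimate and not as a p-value.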
What This Does NOT Rule Out
I want to be precise about the boundaries of this finding.
I have not proven this is a general property of base completion models. I've shown it's reproducible for Llama 3.1 8B base with this specific architecture. To upgrade the claim, I would need:
- Different model families (Mistral 7B base, Qwen 7B base) — does the cliff appear at a different but consistent window count?
- Different model sizes (3B, 13B, 70B) — does the cliff scale with parameter count?
- Simplified architectures — does the cliff require this specific memory/training pipeline, or does it appear with a bare loop?
I have not isolated whether training extends the ceiling or interacts with it. Run 9 (no training, no embeddings) showed gradual decline by W20 but not token-level cycling. An upcoming experiment will test whether the sharp cliff is specific to the training configuration or appears without training at a different window.
I do not have a theoretical explanation for why W57 specifically. The cliff could relate to: the model's effective distribution width at 8B parameters, the interaction between context diversity and probability mass, the specific prompt structure, or something else. I have a precise measurement but not a mechanism.
I do not know if this has been observed and documented before. The closest published work is research on model collapse under iterative self-training (Shumailov et al., 2023, “The Curse of Recursion”). That work addresses training-time collapse across model generations. My observation is an inference-time collapse within a single model's runtime, with no weight modification at the moment of failure. The failure mode (token-level cycling) and the temporal precision (same window across runs) appear to be undocumented, but I may have gaps in my literature awareness.
Why This Might Matter
To be clear about scope: This finding shows a reproducible failure mode in one specific base model (Llama 3.1 8B) under one specific autonomous architecture. I am not claiming all language models will fail this way, or that this invalidates agentic AI. I'm reporting a precise, reproducible measurement of something I haven't seen documented, and flagging that it has implications worth investigating.
Most people use language models for single exchanges or short conversations. The failure I observed doesn't appear in those contexts — it requires sustained autonomous operation with context evolution over hours. But as agentic AI systems become more common, the question of how long a model can sustain coherent autonomous output becomes relevant.
For long-horizon agents: An agent that degrades after a fixed number of operational cycles has fundamentally different design implications than one that degrades gradually or not at all. If base models have a substrate-level ceiling on autonomous operation, that constrains what agentic systems can do without architectural intervention.
For model behavior under sustained inference: Training-time behavior of LLMs is heavily studied. Inference-time behavior under sustained, context-evolving operation is not. This is one data point suggesting the latter has structure worth investigating.
For understanding model capacity: The finding reframes what “capacity” means for a language model. A frozen distribution has finite creative territory — not just in the obvious sense that it has finite parameters, but in the operational sense that there's a measurable distance it can traverse before output quality falls off a cliff. Architecture and training determine how efficiently the territory is traversed, but the territory has edges.
Experiment Design
The replication (Run 17) was designed before execution with pre-registered expectations:
If distribution exhaustion is correct:
- Growth phase through ~50-60 windows ✓ (W0-56 growth phase)
- Cliff (not gradual decline) with token-level cycling ✓ (W57, token cycling confirmed)
- Cliff window within ~15 of W55 ✓ (W57 — exact match)
- Post-cliff oscillation similar to Run 16 ✓ (training recovery → crash → recovery)
If distribution exhaustion is wrong:
- Cliff at very different window (W30 or W80+) ✗ (did not occur)
- Gradual decline instead of cliff ✗ (did not occur)
- Different failure mode ✗ (same token-cycling class)
All four “correct” predictions confirmed. All three “wrong” predictions did not occur.
Raw Data
Run 16 — All Window Sizes
Run 17 — All Window Sizes
(| marks the cliff at W57 in both runs)
Failure Mode Examples (from tmux scrollback)
Run 16 — Token-level cycling (W57+):
legs mind air thought legs mind air thought legs mind air thought legs mind air thought legs mind air thought legs mind air thought legs mind air thought legs mind air thought legs mind air thought...
Run 17's failure examples (“act act act...”, “anymore anymore...”, “I act often about: act, act, act...”) are shown at the top of this document.
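Failures like these can be flagged mechanically by looking for the shortest token period that repeats at the tail of a completion. The sketch below is not SEED's repetition gate: the whitespace tokenization, the function name, and the default thresholds are assumptions chosen for illustration.

```python
from typing import Optional


def tail_cycle_period(text: str, min_repeats: int = 4,
                      max_period: int = 16) -> Optional[int]:
    """Return the shortest period p such that the final p tokens repeat
    at least min_repeats times in a row at the end of text, else None."""
    tokens = text.split()
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if len(tokens) < needed:
            break  # longer periods need even more tokens
        tail = tokens[-needed:]
        unit = tail[:period]
        if all(tail[i] == unit[i % period] for i in range(needed)):
            return period
    return None


print(tail_cycle_period("act " * 20))                    # 1
print(tail_cycle_period("legs mind air thought " * 5))   # 4
```

Single-token cycling reports a period of 1 and the Run 16 example reports 4, while ordinary prose returns None because no short unit repeats at the tail.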
Next Steps
- Experiment 2: Run the same architecture without training (embeddings enabled). Does the cliff appear earlier (~W20-30 based on Run 9's no-training decline)? Does it manifest as token-level cycling or thematic convergence? This separates training's contribution from the substrate's baseline behavior.
- Different model family: Same architecture on Mistral 7B base or Qwen 7B base. If the cliff appears at a different but reproducible window count, the finding generalizes beyond Llama 3.1.
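- Different model size: Same architecture on a smaller and a larger base model. Llama 3.1 ships only in 8B, 70B, and 405B, so the candidates are Llama 3.2 3B and Llama 3.1 70B. Does the cliff window scale with parameter count? If so, the relationship between model size and operational lifespan becomes predictable.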
- Mechanism investigation: Why W57? Possible approaches: track per-window perplexity or entropy of the model's output distribution, measure effective vocabulary diversity over time, analyze the probability mass assigned to top-k tokens as the run progresses.
- Open invitation: The architecture is straightforward — a base completion model in a loop with memory and optional training. If you're running autonomous agents on base models and observe a similar cliff, I want to hear about it. The specific window count likely varies with model and architecture, but the question is whether the sharp cliff pattern (as opposed to gradual decline) generalizes.
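For the mechanism-investigation item above, a first pass does not need access to model logits. The sketch below computes two text-level proxies per window, empirical unigram entropy and type-token ratio, from the generated text alone. These are proxies, not the entropy of the model's output distribution, and the whitespace tokenization and function names are assumptions.

```python
import math
from collections import Counter


def unigram_entropy_bits(text: str) -> float:
    """Shannon entropy (bits per token) of the empirical token distribution."""
    tokens = text.split()
    if not tokens:
        return 0.0
    total = len(tokens)
    counts = Counter(tokens)
    # log2(total / c) is -log2(p); every term is non-negative.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def type_token_ratio(text: str) -> float:
    """Distinct tokens divided by total tokens (1.0 means no repeats)."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```

A single-token loop such as "act act act" scores 0 bits, and a four-token loop scores exactly 2 bits regardless of length. Plotted per window, a sharp drop in either metric at the cliff would be expected, and any drift before it would be worth a closer look.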
How to Reach Me
If you've observed something similar, want to replicate this, or want to tell me why I'm wrong — I want to hear from you.
- LinkedIn: Dom Eloe
- Website: khaosinception.com
Collaboration Note
This research was conducted in collaboration with Claude (Anthropic) — specifically Claude Code and Claude Desktop. Claude assisted with experiment design, data verification, log analysis, alternative-hypothesis generation, and writeup. The experimental observations, architectural decisions, and interpretive framing are mine. The back-and-forth between human intuition and AI analysis was itself a productive example of the kind of human-AI collaboration this project explores.