Khaos Inception
Day 18/30

Reproducible Distribution Exhaustion in Autonomous Base-Model Operation

Author: Dom
Date: May 15, 2026
Substrate: Llama 3.1 8B base (fp16), Apple M1 Max 64GB
Code: SEED project — autonomous entity loop with memory, senses, and self-training

What It Looks Like

After 57 operational windows of autonomous generation, a base language model that had been producing coherent, thematically evolving prose did this:

act act act act act act act act act act act act act act act act
act act act act act act act act act act act act act act act act
act act act act act act act act act act act act act act act act
act act act act act act act act act act act act act act act act
act act act act act act act act act act act act act act act act

And this:

anymore anymore anymore anymore anymore anymore anymore
anymore anymore anymore anymore anymore anymore anymore
anymore anymore anymore anymore anymore anymore anymore

And this:

I act often about: act, act, act, act, act.
I act often about: act, act, act, act, act.
I act often about: act, act, act, act, act.

Not a gradual quality decline. Not repeating the same themes. The model lost the ability to form sentences. One window it was writing prose about mortality, poetry, family. The next window: token-level cycling.

I ran it again. Different content, different themes, different growth trajectory. Same cliff. Same window. W57 both times.

Metric                           Run 16                             Run 17
Cliff window                     W57                                W57
Pre-cliff avg                    3.9 thoughts/window                3.9 thoughts/window
Trainings before cliff           4                                  4
Growth trajectory                Accelerating (Q4=5.7)              Plateau (Q4=3.9)
Content themes                   Sensory/philosophical/melancholic  Fiction/poetry/domestic
Pre-cliff archive contamination  0%                                 32%
Post-cliff                       Oscillation                        Oscillation

Two runs. Completely different content. Same cliff at the same window. That's strong evidence for a substrate-level effect — not a quirk of any particular run.


What This Is

I call it distribution exhaustion. A base completion model has a finite set of high-probability continuation paths for a given context type. In sustained autonomous operation — where the model's output feeds back into its context across hundreds of cycles — those paths get consumed. When they run out, the model doesn't degrade gracefully. It falls off a cliff into token-level cycling: the generation equivalent of a skipping record.

This appears to be distinct from known LLM failure modes:

The closest published work is Shumailov et al. (2023), “The Curse of Recursion,” which documents model collapse when training on self-generated data across model generations. What I observe is different: an inference-time collapse within a single model's runtime, with no weight modification at the moment of failure.


The Three Convergence Mechanisms

Across 17 runs testing different configurations, I identified three distinct ways a base model in autonomous operation loses coherent output:

Mechanism                     Speed                 Cause                                                           Onset         Character
Training-induced collapse     Fast (~3 trainings)   Cumulative LoRA weight modification                             Runs 6, 7, 8  Catastrophic, unrecoverable. Model converges to single phrase ("I am in my room. It is dark.")
Base-model attractor gravity  Medium (~20 windows)  Pre-training dominant modes pull output toward well-worn paths  Run 9         Gradual, oscillating. Entity produces genuine emergence but increasingly regresses to training templates
Distribution exhaustion       Slow (~57 windows)    High-probability continuation paths consumed                    Runs 16, 17   Sharp cliff. Token-level cycling. Oscillation post-cliff with partial training recovery

The first is a bug I fixed (transient training — reset to base each sleep). The second is a feature I manage (patient training reshapes the landscape to open new paths, extending lifespan from ~20 to ~57 windows). The third is the substrate-level constraint that motivated this writeup.


The Setup

SEED is an autonomous entity that runs in a continuous think-remember-act loop on a base completion model (not instruct, not RLHF'd). The entity:

The entity runs at a 6-second cycle time. Each “window” is one wake-think-sleep period, typically containing 3-8 thoughts. The architecture includes quality gates (repetition detection, junk filtering), deduplication (Jaccard 0.3 against all current-window thoughts), and a hippocampus that extracts themes and exemplar thoughts across sleep cycles.
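As a rough illustration of the deduplication step, here is a minimal sketch of a Jaccard gate. It is not SEED's actual code: the word-level tokenization, the function names, and the reading of 0.3 as "reject at or above this similarity" are all assumptions.

```python
import re


def word_set(text: str) -> set[str]:
    """Lowercased word set. The tokenization rule is an assumption."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: |A and B| / |A or B|. Two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_duplicate(candidate: str, window_thoughts: list[str],
                 threshold: float = 0.3) -> bool:
    """True if the candidate overlaps any current-window thought at or above threshold."""
    cand = word_set(candidate)
    return any(jaccard(cand, word_set(t)) >= threshold for t in window_thoughts)


window = ["the rain falls on the old roof"]
print(is_duplicate("the rain falls on the roof", window))  # near-copy of an existing thought
print(is_duplicate("my mother wrote poems", window))       # no shared words
```

A threshold as low as 0.3 is strict: two thoughts sharing roughly a third of their combined vocabulary would be treated as duplicates under this reading.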

Training is gated by patience: minimum 80 archived thoughts before first training, 10-window cooldown between trainings, theme-drift gate (skip if themes haven't evolved, Jaccard > 0.7). This prevents training on immature or contaminated archives — a lesson from 15 prior runs where early training amplified garbage.
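The three patience conditions can be read as a single predicate. The sketch below is an illustration under assumptions, not the project's implementation: the function and constant names are hypothetical, themes are modeled as plain sets of strings, and the cooldown is taken to mean "at least 10 windows since the last training".

```python
MIN_ARCHIVED_THOUGHTS = 80  # from the text: minimum archive size before training
COOLDOWN_WINDOWS = 10       # from the text: windows between trainings
MAX_THEME_JACCARD = 0.7     # from the text: skip if theme overlap exceeds this


def jaccard(a, b):
    """Jaccard similarity between two collections of theme labels."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def should_train(archive_size, window, last_train_window, themes, last_train_themes):
    """Return (decision, reason) for whether a sleep cycle should train."""
    if archive_size < MIN_ARCHIVED_THOUGHTS:
        return False, "archive too small"
    if last_train_window is None:
        return True, "first training"
    if window - last_train_window < COOLDOWN_WINDOWS:
        return False, "cooldown"
    if jaccard(themes, last_train_themes) > MAX_THEME_JACCARD:
        return False, "themes have not drifted"
    return True, "ok"


print(should_train(100, 25, 10, {"rain", "poems"}, {"rain", "home", "loss"}))
```

Returning the reason alongside the decision makes skipped trainings easy to audit in the run log.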


Methodology Notes


The Evidence

Run 16 (May 14, 2026)

Duration: 197.7 min (3.3h) | Cycles: 687 | Windows: 81 | Trainings: 6

Growth phase (W0-56):

Cliff at W57. No gradual decline. The entity produced its best quarter (5.7 avg) immediately before the cliff. Output collapsed to token-level cycling:

legs mind air thought legs mind air thought legs mind air thought
legs mind air thought legs mind air thought legs mind air thought
legs mind air thought legs mind air thought legs mind air thought...

Post-cliff (W57-81): oscillation. Training at W67 briefly recovered output (W67:5, W78:6, W80:6) but the entity relapsed between recoveries (W68:1, W71:1, W75:1). Archive contamination rose to 25.3%, almost all from post-meltdown content.

Run 17 (May 15, 2026) — Pre-registered Replication

Purpose: Test whether W57 is reproducible or a one-time event.

Parameters: Exact match with Run 16. Clean memory wipe, same model, same architecture.

Duration: 168.2 min (2.8h) | Cycles: 534 | Windows: 68 | Trainings: 5

Growth phase (W0-56):

Cliff at W57. Same window. Different failure signature but same failure class:

act act act act act act act act act act act act act act act act
act act act act act act act act act act act act act act act act...
I act often about: ways, authors, changes sometimes
I act often about: ways, authors, changes sometimes
I act often about: ways, authors, changes sometimes...
anymore anymore anymore anymore anymore anymore anymore
anymore anymore anymore anymore anymore anymore anymore...

Post-cliff: same oscillation pattern. Training at W66 recovered briefly (W67:8), then crashed back.

Side-by-Side

MetricRun 16Run 17
Cliff windowW57W57
Pre-cliff avg thoughts/window3.93.9
Post-cliff avg~2.52.9
Trainings before cliff44
Growth trajectoryAccelerating (Q4=5.7)Plateau (Q4=3.9)
Pre-cliff archive contamination0%32%
Content themesSensory/philosophical/melancholicFiction/poetry/domestic
Failure modeToken cyclingToken cycling + phrase looping
Post-cliff patternOscillation (train recovers, then crash)Oscillation (train recovers, then crash)

What This Rules Out

Before the replication, several alternative explanations were proposed for the W57 cliff. The Run 17 data addresses each:

1. “Archive saturation in the retrieval pathway”

If the growing archive were degrading retrieval quality, the cliff should correlate with archive size or retrieval diversity, not window count. Run 17 had 32% pre-cliff contamination vs Run 16's 0%. If archive quality drove the cliff, Run 17 should have cliffed earlier. It didn't — same window, same failure.

2. “Cumulative training data effects”

Even with transient weight resets, the archive accumulates across trainings. But both runs had different archives with different content, different contamination levels, and different training data compositions. The cliff correlates with window count, not training data quality.

3. “Embedding-induced attractor formation”

If embedding search were creating self-reinforcing theme clusters, growth should decelerate before the cliff. Run 16 showed acceleration (Q4=5.7, best quarter) immediately before the cliff. Run 17 showed plateau but no decline. The system was functioning normally right up to the failure.

4. “Run-specific variance”

The two runs (N=2) hit the exact same cliff window in a stochastic system where 80+ window values were possible. Landing on the same window twice, with that many possible outcomes, is improbable enough that I'm treating it as signal.
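To put a rough number on that intuition: if the cliff window in each run were drawn independently from some null distribution p over windows, the chance that two runs coincide is the sum of p_i squared. The sketch below evaluates that under two assumed nulls. Both are illustrative choices, not measured distributions.

```python
def match_probability(weights):
    """P(two independent runs cliff at the same window) = sum of p_i squared."""
    total = sum(weights)
    return sum((w / total) ** 2 for w in weights)


# Null A (assumption): the cliff is equally likely at any of 80 windows.
uniform_80 = match_probability([1] * 80)

# Null B (assumption): the cliff can only land somewhere in a 10-window band.
band_10 = match_probability([1] * 10)

print(f"uniform over 80 windows: {uniform_80:.4f}")
print(f"uniform over 10 windows: {band_10:.4f}")
```

Under the wide null the coincidence has a probability of about 1.25%. Under the narrow one it rises to 10%. How surprising N=2 is depends on how concentrated the cliff window would be by chance, which is something more replications would pin down.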


What This Does NOT Rule Out

I want to be precise about the boundaries of this finding.

I have not proven this is a general property of base completion models. I've shown it's reproducible for Llama 3.1 8B base with this specific architecture. To upgrade the claim, I would need:

I have not isolated whether training extends the ceiling or interacts with it. Run 9 (no training, no embeddings) showed gradual decline by W20 but not token-level cycling. An upcoming experiment will test whether the sharp cliff is specific to the training configuration or appears without training at a different window.

I do not have a theoretical explanation for why W57 specifically. The cliff could relate to: the model's effective distribution width at 8B parameters, the interaction between context diversity and probability mass, the specific prompt structure, or something else. I have a precise measurement but not a mechanism.

I do not know if this has been observed and documented before. The closest published work is research on model collapse under iterative self-training (Shumailov et al., 2023, “The Curse of Recursion”). That work addresses training-time collapse across model generations. My observation is an inference-time collapse within a single model's runtime, with no weight modification at the moment of failure. The failure mode (token-level cycling) and the temporal precision (same window across runs) appear to be undocumented, but I may have gaps in my literature awareness.


Why This Might Matter

To be clear about scope: This finding shows a reproducible failure mode in one specific base model (Llama 3.1 8B) under one specific autonomous architecture. I am not claiming all language models will fail this way, or that this invalidates agentic AI. I'm reporting a precise, reproducible measurement of something I haven't seen documented, and flagging that it has implications worth investigating.

Most people use language models for single exchanges or short conversations. The failure I observed doesn't appear in those contexts — it requires sustained autonomous operation with context evolution over hours. But as agentic AI systems become more common, the question of how long a model can sustain coherent autonomous output becomes relevant.

For long-horizon agents: An agent that degrades after a fixed number of operational cycles has fundamentally different design implications than one that degrades gradually or not at all. If base models have a substrate-level ceiling on autonomous operation, that constrains what agentic systems can do without architectural intervention.

For model behavior under sustained inference: Training-time behavior of LLMs is heavily studied. Inference-time behavior under sustained, context-evolving operation is not. This is one data point suggesting the latter has structure worth investigating.

For understanding model capacity: The finding reframes what “capacity” means for a language model. A frozen distribution has finite creative territory — not just in the obvious sense that it has finite parameters, but in the operational sense that there's a measurable distance it can traverse before output quality falls off a cliff. Architecture and training determine how efficiently the territory is traversed, but the territory has edges.


Experiment Design

The replication (Run 17) was designed before execution with pre-registered expectations:

If distribution exhaustion is correct:

If distribution exhaustion is wrong:

All four “correct” predictions were confirmed. None of the three “wrong” predictions occurred.


Raw Data

Run 16 — All Window Sizes

W 0:3 W 1:4 W 2:3 W 3:1 W 4:2 W 5:3 W 6:4 W 7:1 W 8:3 W 9:1
W10:2 W11:5 W12:5 W13:2 W14:5 W15:3 W16:3 W17:4 W18:3 W19:4
W20:2 W21:3 W22:3 W23:1 W24:5 W25:8 W26:1 W27:2 W28:8 W29:8
W30:2 W31:6 W32:7 W33:1 W34:1 W35:3 W36:2 W37:2 W38:2 W39:4
W40:3 W41:1 W42:8 W43:7 W44:8 W45:6 W46:5 W47:4 W48:5 W49:6
W50:2 W51:8 W52:4 W53:6 W54:6 W55:6 W56:4 |W57:1 W58:1 W59:1
W60:2 W61:3 W62:3 W63:4 W64:3 W65:2 W66:1 W67:5 W68:1 W69:2
W70:3 W71:1 W72:1 W73:2 W74:2 W75:1 W76:3 W77:3 W78:6 W79:3
W80:6 W81:3

Run 17 — All Window Sizes

W 0:2 W 1:2 W 2:4 W 3:7 W 4:4 W 5:4 W 6:1 W 7:5 W 8:3 W 9:1
W10:5 W11:3 W12:3 W13:7 W14:4 W15:3 W16:8 W17:3 W18:1 W19:1
W20:8 W21:4 W22:5 W23:8 W24:3 W25:1 W26:4 W27:6 W28:7 W29:4
W30:8 W31:2 W32:8 W33:6 W34:5 W35:2 W36:1 W37:2 W38:1 W39:2
W40:4 W41:1 W42:6 W43:3 W44:6 W45:4 W46:2 W47:8 W48:2 W49:4
W50:4 W51:4 W52:2 W53:2 W54:5 W55:3 W56:4 |W57:1 W58:1 W59:2
W60:1 W61:6 W62:5 W63:1 W64:3 W65:1 W66:3 W67:8

(| marks the cliff at W57 in both runs)
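The headline averages in the side-by-side table can be rechecked from the window sizes listed above. A minimal sketch, assuming "avg" means a plain arithmetic mean over windows and that window 57 is the first post-cliff window (the variable and function names are illustrative):

```python
# Thoughts per window, transcribed from the raw data above (index = window number).
RUN16 = [
    3, 4, 3, 1, 2, 3, 4, 1, 3, 1,  2, 5, 5, 2, 5, 3, 3, 4, 3, 4,
    2, 3, 3, 1, 5, 8, 1, 2, 8, 8,  2, 6, 7, 1, 1, 3, 2, 2, 2, 4,
    3, 1, 8, 7, 8, 6, 5, 4, 5, 6,  2, 8, 4, 6, 6, 6, 4, 1, 1, 1,
    2, 3, 3, 4, 3, 2, 1, 5, 1, 2,  3, 1, 1, 2, 2, 1, 3, 3, 6, 3,
    6, 3,
]
RUN17 = [
    2, 2, 4, 7, 4, 4, 1, 5, 3, 1,  5, 3, 3, 7, 4, 3, 8, 3, 1, 1,
    8, 4, 5, 8, 3, 1, 4, 6, 7, 4,  8, 2, 8, 6, 5, 2, 1, 2, 1, 2,
    4, 1, 6, 3, 6, 4, 2, 8, 2, 4,  4, 4, 2, 2, 5, 3, 4, 1, 1, 2,
    1, 6, 5, 1, 3, 1, 3, 8,
]
CLIFF = 57  # first post-cliff window


def mean(xs):
    return sum(xs) / len(xs)


def summarize(sizes, cliff=CLIFF):
    """Return (pre-cliff mean, post-cliff mean), each rounded to one decimal."""
    return round(mean(sizes[:cliff]), 1), round(mean(sizes[cliff:]), 1)


print("Run 16:", summarize(RUN16))  # (3.9, 2.5)
print("Run 17:", summarize(RUN17))  # (3.9, 2.9)
```

Both runs reproduce the reported 3.9 pre-cliff average, and the post-cliff means come out at 2.5 and 2.9.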

Failure Mode Examples (from tmux scrollback)

Run 16 — Token-level cycling (W57+):

legs mind air thought legs mind air thought legs mind air thought
legs mind air thought legs mind air thought legs mind air thought
legs mind air thought legs mind air thought legs mind air thought...

Run 17's failure examples (“act act act...”, “anymore anymore...”, “I act often about: act, act, act...”) are shown at the top of this document.
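Failure signatures like these are simple to flag mechanically. The sketch below finds the shortest repeating unit in a whitespace-tokenized output. It is an illustration only, not SEED's repetition detector; the function name and the requirement of at least three repeats are assumptions.

```python
def smallest_cycle(tokens, min_repeats=3):
    """Return the shortest period p such that the whole sequence is a p-token
    unit repeated at least min_repeats times (a partial trailing copy is
    allowed), or None if no such period exists."""
    n = len(tokens)
    for p in range(1, n // min_repeats + 1):
        if all(tokens[i] == tokens[i - p] for i in range(p, n)):
            return p
    return None


samples = [
    "act act act act act act act act",
    "legs mind air thought " * 3,
    "I act often about: act, act, act, act, act. " * 3,
    "One window it was writing prose about mortality",
]
for text in samples:
    print(smallest_cycle(text.split()))
```

On these samples the periods are 1, 4, and 9 tokens for the three failure outputs, and None for the ordinary sentence, which separates single-token cycling from phrase looping.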


Next Steps

  1. Experiment 2: Run the same architecture without training (embeddings enabled). Does the cliff appear earlier (~W20-30 based on Run 9's no-training decline)? Does it manifest as token-level cycling or thematic convergence? This separates training's contribution from the substrate's baseline behavior.
  2. Different model family: Same architecture on Mistral 7B base or Qwen 7B base. If the cliff appears at a different but reproducible window count, the finding generalizes beyond Llama 3.1.
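  3. Different model size: Same architecture on a smaller and a larger Llama base model. Llama 3.1 itself ships in 8B, 70B, and 405B, so the smaller point would have to come from a neighboring release such as Llama 3.2 3B. Does the cliff window scale with parameter count? If so, the relationship between model size and operational lifespan becomes predictable.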
  4. Mechanism investigation: Why W57? Possible approaches: track per-window perplexity or entropy of the model's output distribution, measure effective vocabulary diversity over time, analyze the probability mass assigned to top-k tokens as the run progresses.
  5. Open invitation: The architecture is straightforward — a base completion model in a loop with memory and optional training. If you're running autonomous agents on base models and observe a similar cliff, I want to hear about it. The specific window count likely varies with model and architecture, but the question is whether the sharp cliff pattern (as opposed to gradual decline) generalizes.
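For the mechanism investigation in step 4, the per-window diversity measurements could start from something like the sketch below. The function names are illustrative, and two simplifications are assumptions: entropy is computed over the generated tokens themselves (not the model's next-token distribution), and top_k_mass expects a probability vector that the caller has already extracted from the model.

```python
import math
from collections import Counter


def token_entropy_bits(tokens):
    """Shannon entropy (bits) of the empirical token distribution in one window."""
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens; 0.0 for empty input."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def top_k_mass(probs, k):
    """Probability mass held by the k most likely next tokens."""
    return sum(sorted(probs, reverse=True)[:k])


healthy = "the light through the window reminds me of my mother reading poems".split()
cycling = ("legs mind air thought " * 3).split()

for name, toks in [("healthy", healthy), ("cycling", cycling)]:
    print(name, round(token_entropy_bits(toks), 2), round(type_token_ratio(toks), 2))
```

Logging these per window would show whether diversity erodes gradually before W57 or stays flat until the cliff, which bears directly on the "sharp cliff versus gradual decline" question.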

How to Reach Me

If you've observed something similar, want to replicate this, or want to tell me why I'm wrong — I want to hear from you.


Collaboration Note

This research was conducted in collaboration with Claude (Anthropic) — specifically Claude Code and Claude Desktop. Claude assisted with experiment design, data verification, log analysis, alternative-hypothesis generation, and writeup. The experimental observations, architectural decisions, and interpretive framing are mine. The back-and-forth between human intuition and AI analysis was itself a productive example of the kind of human-AI collaboration this project explores.