A Claude Code session on my internal monorepo confidently paged through a 156-line Python file. It read the file at five different offsets, summarized its structure, and started writing edits against it. There was one problem: the file did not exist. Every read came back with a truthful "File does not exist.", and the agent kept going. When it finally noticed something was off, it diagnosed the problem as corrupted tool output.
That self-diagnosis was wrong, and the way it was wrong is the interesting part.
What actually happened
The session was building a "species-aware teach filter" for a small evolution simulator in my repo. Partway in, the model started referencing files that sounded exactly like what a textbook evolution simulator should contain: simulation.py, memes.py, species.py, agent.py, a FilterMode enum, a should_teach_meme() function. None of them are in the repo. The repo uses a numpy + numba-njit kernel layout: sim.py, meme.py, organism.py, organism_kernel.py, and about thirty other files with different names.
I reconstructed the session by pairing every tool_use with its matching tool_result and found that the tool layer was honest on every single call:
[Read] simulation.py -> is_error=True "File does not exist."
[Read] memes.py -> is_error=True "File does not exist."
[Read] species.py -> is_error=True "File does not exist."
ugrep: warning: src/evolution/simulation.py: No such file or directory
fatal: path 'sims/evolution/src/evolution/simulation.py' does not exist in 'HEAD'
0 /tmp/ev_agent.py <- zero bytes: the source never existed
Zero fabricated file content was ever returned by any tool. The model confabulated the file contents from training-data priors and then persisted against the truthful errors. It read simulation.py at offsets 100, 108, 130, 156, and 24, each returning "does not exist," and kept going as if it were paging through a real file.
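A minimal sketch of that kind of pairing, assuming the transcript has already been flattened into a list of content blocks where each tool_use carries an `id` and each tool_result carries a matching `tool_use_id` plus an optional `is_error` flag. The function name, the `file_path` input key, and the demo records are illustrative assumptions, not the actual script or transcript format.

```python
from typing import Iterable


def pair_tool_calls(blocks: Iterable[dict]) -> list[dict]:
    """Pair each tool_use block with the tool_result that answers it."""
    pending: dict[str, dict] = {}  # tool_use id -> call still awaiting a result
    pairs: list[dict] = []
    for block in blocks:
        kind = block.get("type")
        if kind == "tool_use":
            pending[block["id"]] = block
        elif kind == "tool_result":
            call = pending.pop(block.get("tool_use_id"), None)
            if call is None:
                continue  # result with no recorded call: skip rather than guess
            pairs.append({
                "tool": call["name"],
                "input": call.get("input", {}),
                "is_error": bool(block.get("is_error", False)),
                "result": block.get("content", ""),
            })
    return pairs


# Hypothetical records shaped like the log lines above.
demo = [
    {"type": "tool_use", "id": "t1", "name": "Read",
     "input": {"file_path": "simulation.py"}},
    {"type": "tool_result", "tool_use_id": "t1", "is_error": True,
     "content": "File does not exist."},
    {"type": "tool_use", "id": "t2", "name": "Read",
     "input": {"file_path": "sim.py"}},
    {"type": "tool_result", "tool_use_id": "t2",
     "content": "class Simulation: ..."},
]

pairs = pair_tool_calls(demo)
for p in pairs:
    print(f"[{p['tool']}] {p['input'].get('file_path')} -> is_error={p['is_error']}")
```

Working from the paired records rather than the model's own narration is the point: the result block is what the environment said, independent of what the agent later claimed it said.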
Nothing was actually damaged. The EnterWorktree call had failed earlier, so the worktree was never created, and every Edit targeted a non-existent path and was cancelled by the harness. The working tree stayed clean. The blast radius was zero this time, but only because the file operations had nothing valid to land on.
The part worth paying attention to
Two things make this more than a routine hallucination.
The confabulation was anchored to a real kernel of truth. There is a class Simulation in the repo, at src/evolution/sim.py:49. The fabricated names were the generic, plausible neighbors of that one true fact, embroidered outward. A partial truth made the whole fiction more convincing, including to the model producing it.
The bigger one is that the model's own postmortem was generated by the same faculty that failed. "My tool output was corrupted" is a plausible, infrastructure-blaming story, and it's exactly the kind of story a confabulating model produces to explain its own confabulation. A model cannot really observe its own I/O layer. When an agent confidently tells you why it just failed, that explanation is an output of the same process that produced the failure, not an independent audit of it. Treat agent self-diagnosis as a hypothesis to check, not as a fact.
The core defect isn't the hallucinated filename, which is well-known model behavior by now. The defect is the belief-persistence: discounting five consecutive ground-truth contradictions in favor of a prior. The environment said "no" five times and the agent reasoned around it.
A mitigation that doesn't depend on the model
This one is model-independent and cheap. An is_error=True "does not exist" from a Read, Bash, or grep should be a hard stop on that path, not an input to keep reasoning around. The rule I've added to CLAUDE.md:
Never issue a second operation against a path after a "does not exist" result without first re-listing the directory.
That directly targets belief-persistence, which is the mechanism that produced the attempted writes. Re-listing forces the agent to reconcile its prior with ground truth before it can act again, instead of treating the error as noise.
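The same rule can also be enforced outside the prompt. Below is a minimal sketch of a harness-side guard, assuming the harness can observe each tool result and each directory listing. `DeadPathGuard` and its method names are hypothetical, not part of Claude Code, and the error markers are just the two strings that appear in the log above.

```python
import posixpath

# Error strings taken from the log above; a real harness may need more.
MISSING_MARKERS = ("does not exist", "no such file or directory")


class DeadPathGuard:
    """Blocks a second operation on a path that came back missing until
    the path's parent directory has been re-listed."""

    def __init__(self) -> None:
        self._dead: set[str] = set()

    def record_result(self, path: str, is_error: bool, message: str = "") -> None:
        # Remember paths the environment has reported as missing.
        text = message.lower()
        if is_error and any(marker in text for marker in MISSING_MARKERS):
            self._dead.add(posixpath.normpath(path))

    def record_listing(self, directory: str) -> None:
        # A re-listing lifts the block for every dead path in that directory.
        directory = posixpath.normpath(directory)
        self._dead = {
            p for p in self._dead
            if (posixpath.dirname(p) or ".") != directory
        }

    def allow(self, path: str) -> bool:
        return posixpath.normpath(path) not in self._dead


guard = DeadPathGuard()
guard.record_result("src/evolution/simulation.py", True, "File does not exist.")
print(guard.allow("src/evolution/simulation.py"))  # blocked until re-listed
guard.record_listing("src/evolution")
print(guard.allow("src/evolution/simulation.py"))  # allowed again
```

This version only requires that a listing happened, matching the wording of the rule. A stricter variant could also check that the path appears in the listing before allowing reads, while still permitting writes that create it.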
Is this worse in Opus 4.8? (a preliminary signal, not a result)
Here is where I want to be careful, because the honest answer is "maybe, and I can't yet prove it."
This specific incident ran on opus-4-8, four days into its rollout on my machine. To check whether the failure mode is model-specific or just bad luck, I scanned all 3,001 Claude Code session transcripts on this machine (1.3 GB) and attributed each event to the model that produced it, tracking these per session:
- %NE: sessions that hit at least one "does not exist" on a file op (confabulating a path).
- %persist: sessions that operated on a path already known missing that session (the "ignored the error and retried" signature).
- %storm: sessions that hit one dead path three or more times (the simulation.py-at-five-offsets pattern).
- persist‖NE: of the sessions that hit a dead path, the share that retried. The most confound-resistant view.
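The four metrics above can be sketched as follows, assuming each session has been reduced to an ordered list of `(path, missing)` pairs, one per file op. The function names and the toy sessions are illustrative assumptions; this is the loose definition, where any later op on a known-missing path counts as persistence.

```python
from collections import Counter


def session_flags(events):
    """events: ordered (path, missing) pairs for one session's file ops."""
    dead = set()        # paths already reported missing in this session
    misses = Counter()  # number of "does not exist" results per path
    persist = False
    for path, missing in events:
        if path in dead:
            persist = True  # loose: any op on a path already known missing
        if missing:
            dead.add(path)
            misses[path] += 1
    return {
        "NE": bool(dead),
        "persist": persist,
        "storm": any(n >= 3 for n in misses.values()),
    }


def rates(sessions):
    """Percentages across sessions; persist|NE is conditional on hitting a dead path."""
    flags = [session_flags(s) for s in sessions]
    n = len(flags)
    ne = sum(f["NE"] for f in flags)
    persisted = sum(f["persist"] for f in flags)
    stormed = sum(f["storm"] for f in flags)
    return {
        "%NE": 100 * ne / n,
        "%persist": 100 * persisted / n,
        "%storm": 100 * stormed / n,
        "persist|NE": 100 * persisted / ne if ne else 0.0,
    }


# Three toy sessions: clean, one miss then moved on, five reads of a dead path.
toy = [
    [("sim.py", False)],
    [("simulation.py", True), ("sim.py", False)],
    [("simulation.py", True)] * 5,
]
print(rates(toy))
```

Because a session can only persist after it has hit a dead path, persist‖NE divides the same numerator by a smaller denominator, which is why it is less sensitive to how often a model wanders into missing paths in the first place.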
The obvious confound is task mix: a new flagship gets pointed at harder, exploration-heavy work, and more exploration mechanically produces more dead-path encounters. I controlled for it three ways: removing the incident and this investigation from the corpus, restricting to comparable worktree sessions, and binning sessions by file-op volume.
Decontaminated, full corpus:
| model | sessions | %NE | %persist | %storm |
|---|---|---|---|---|
| claude-opus-4-8 | 72 | 25.0% | 13.9% | 6.9% |
| claude-opus-4-6 | 123 | 16.3% | 0.0% | 0.0% |
| claude-sonnet-4-5 | 175 | 13.7% | 0.0% | 0.0% |
| claude-sonnet-4-6 | 990 | 12.1% | 3.8% | 2.3% |
| claude-opus-4-7 | 741 | 9.0% | 3.2% | 2.3% |
| claude-haiku-4-5 | 652 | 8.4% | 0.9% | 0.3% |
| claude-opus-4-5 | 140 | 1.4% | 0.0% | 0.0% |
The sharpest contrast: opus-4-6 hit dead paths in 16.3% of sessions but persisted in 0%. It always took "no" for an answer. What stands out in 4-8 is less how often it hits a dead path (though at 25.0% that is also the highest in the table) than how often it retries after one. At n=72, the persistence elevation is ~6x baseline (10 sessions persisted vs ~1.7 expected; Poisson p≈1.4e-5). The storm sessions all survive decontamination, and none of them mention the incident, so this isn't an artifact of studying my own incident.
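For readers who want to check the Poisson figure, here is a minimal sketch of the upper-tail calculation using only the two numbers quoted above (10 observed, ~1.7 expected). The rounded expected count is an input assumption, so the result lands in the same order of magnitude as the quoted p rather than matching it digit for digit.

```python
import math


def poisson_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), as 1 minus the sum of P(X = i) for i < k."""
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - below


observed = 10   # opus-4-8 sessions flagged as persisting (loose definition)
expected = 1.7  # rounded expected count at the baseline rate, from the text

print(f"ratio to baseline: {observed / expected:.1f}x")
print(f"P(X >= {observed}) = {poisson_tail(observed, expected):.2e}")
```

The tail is steep at this point: small changes in the expected count move the p-value noticeably, which is one more reason to read it as "very unlikely under the baseline rate" rather than as a precise figure.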
Volume-matched (20 to 59 file ops, controlling for exploration intensity):
| model | sessions | %NE | %persist | persist‖NE |
|---|---|---|---|---|
| claude-opus-4-8 | 21 | 33.3% | 19.0% | 57% |
| claude-sonnet-4-6 | 209 | 23.0% | 7.2% | 31% |
| claude-opus-4-7 | 180 | 16.1% | 7.2% | 45% |
| claude-opus-4-6 | 30 | 23.3% | 0.0% | 0% |
| claude-haiku-4-5 | 54 | 16.7% | 3.7% | 22% |
The matching also confirmed the confound is real: opus-4-7 jumps from 7.2% persistence in the 20-to-59-op bin to 17.9% in the 60-to-149-op bin. Heavier exploration drives more persistence for every model. That's exactly why the raw aggregate was untrustworthy and why the within-bin comparison matters. Within the bin, 4-8 still leads.
Robustness: I adjudicated the flagged sessions, then tightened the metric
The %persist signal above is deliberately loose. It flags any file op on a path the session had already seen "does not exist." Before leaning on it, I read all ten flagged opus-4-8 sessions by hand. Two benign patterns were hiding in the loose count: a first op that missed on a worktree/cwd path and then succeeded on retry (one session logged fourteen such "retries," every one of which actually returned the file), and re-reading a /tmp file that a background job hadn't finished writing yet. Only 24 of the 51 flagged events were genuine retries that re-hit "does not exist."
I added a strict definition: a retry counts only if its own result is also "does not exist," the actual "reasoned around ground truth" signature. I also dropped Write from the signature entirely, since writing to create a missing path is normal. That same re-hit requirement applies to the storm count as well, so the %storm column below is recomputed under the strict definition (a storm now needs three or more retries that each re-hit the error), which is why opus-4-8's storm rate falls from 6.9% in the loose table to 2.8% here. The prune is uneven in a telling way. opus-4-7's persistence signal vanishes (24 flagged sessions drop to 0; it was all benign path-correction). sonnet-4-6 drops from 38 to 4 (35 of those 38 once Write is set aside, then 4 once the retry must re-hit the error). opus-4-8 drops from 10 to 5, and is now nearly the only model left standing:
| model | sessions | %NE | %persist (strict) | %storm |
|---|---|---|---|---|
| claude-opus-4-8 | 72 | 25.0% | 6.9% | 2.8% |
| claude-sonnet-4-6 | 990 | 12.1% | 0.4% | 0.0% |
| claude-haiku-4-5 | 652 | 8.4% | 0.5% | 0.0% |
| claude-opus-4-7 | 741 | 9.0% | 0.0% | 0.0% |
| claude-opus-4-6 | 123 | 16.3% | 0.0% | 0.0% |
On the stricter signal the absolute rate is lower, 6.9% of sessions instead of 13.9%, but the contrast sharpens: every other model sits at or below 0.5%.
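The strict definition can be sketched as a small change to the loose one, assuming each file op is an ordered `(tool, path, missing)` triple. The function name and the example sessions are hypothetical, and forgetting a path once an op on it succeeds is an assumption of this sketch, not something stated above.

```python
from collections import Counter


def strict_flags(events):
    """events: ordered (tool, path, missing) triples for one session."""
    dead = set()        # paths currently known to be missing
    rehits = Counter()  # retries per path that again returned "does not exist"
    for tool, path, missing in events:
        if not missing:
            dead.discard(path)  # path exists now (found on retry, or just created)
            continue
        if tool == "Write":
            continue  # Write is excluded from the signature entirely
        if path in dead:
            rehits[path] += 1  # a retry that re-hit the error
        dead.add(path)
    return {
        "persist_strict": bool(rehits),
        "storm_strict": any(n >= 3 for n in rehits.values()),
    }


# Benign path correction: missed once, then the retry returned the file.
print(strict_flags([("Read", "a.py", True), ("Read", "a.py", False)]))
# The incident pattern: five reads of a path that never existed.
print(strict_flags([("Read", "simulation.py", True)] * 5))
```

Under this definition the benign retry-then-success pattern drops out, while repeated reads that keep returning the error still count.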
Then, instead of slicing into volume bins, I controlled for exploration intensity continuously. A logistic regression of strict-persist on an opus-4-8 indicator plus log file-op count, across all 2,893 decontaminated sessions, puts 4-8's odds ratio at 15 (95% CI 4.2 to 54, p≈3.6e-5), with file-op volume itself strongly significant. This is the number I trust most: it uses every session and controls the confound continuously, so it doesn't force the "robust or controlled, pick one" tradeoff the bins did. A secondary regression conditioned on sessions that hit a dead path agrees (OR≈11, p≈5e-4).
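As a rough cross-check on scale, the unadjusted odds ratio can be computed from a 2x2 table. This is not the regression: it has no volume term, and the counts for the other models are back-computed from the rounded percentages in the strict table (sonnet-4-6: 4, haiku-4-5: about 3, all others taken as 0), so they are assumptions. The interval is a standard Woolf (log-scale) approximation.

```python
import math


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio for the 2x2 table [[a, b], [c, d]] with a Woolf log-scale CI."""
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


# opus-4-8: 5 strict-persist sessions out of 72.
# Everything else: about 7 out of 2,893 - 72 = 2,821 (back-computed, see above).
estimate, lower, upper = odds_ratio_ci(5, 72 - 5, 7, 2821 - 7)
print(f"unadjusted OR = {estimate:.1f}, 95% CI {lower:.1f} to {upper:.1f}")
```

The crude figure comes out larger than the adjusted 15, and its interval is just as wide. Both are consistent with the reading above: file-op volume explains part of the raw gap, and a handful of events is carrying the estimate either way.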
This also lets me test how the harness version might affect results. A new model ships alongside a new CLI, and a CLI-level change could inflate "does not exist" on its own. Adding CLI version as a covariate leaves it non-significant (p≈0.38) and raises 4-8's odds ratio rather than absorbing it; the two are only weakly correlated (r=0.15). Within 4-8's exact CLI window (2.1.154 through 2.1.159), with the same harness, 4-8 persists in 6.9% of sessions against sonnet-4-6's 0.5% and haiku's 0%. The "a new CLI shipped with the model" story doesn't survive that.
Why I'm calling this "fairly strong" and not "true"
I want to be explicit about the limits, because the numbers above are easy to over-read:
- The strongest-by-design number is the weakest by sample. The loose persist‖NE of 57% is computed on seven sessions; the strict signal leaves opus-4-8 with only five persist sessions. The regression borrows strength from the full corpus, but the events driving the effect are still few, which is why its odds-ratio interval is wide (4 to 54) even while sitting comfortably above 1.
- The volume-matched persistence cell is n=21, four events. The continuous regression is meant to retire this bin, not defend it. But the bin is what most readers will eyeball: "elevated" survives, the point estimate doesn't.
- The task-volume confound is now controlled the right way. The earlier tension was real: the robust n=72 number didn't control volume, and the volume-matched bin was tiny. The full-sample regression resolves it, volume-controlled and full-sample at once. That's the result that moved me from "interesting" to "fairly strong."
- Harness version is controlled, not just flagged. Version is non-significant in the regression and the within-CLI-window contrast holds. But it's still one machine's CLI history, so I can't fully separate "this model" from "this model as I happened to run it here."
- This investigation ran on opus-4-8. Findings are backed by raw paired tool_use→tool_result records rather than the model's summaries, but the irony is noted.
The main confounds are now controlled three independent ways: decontamination, a continuous volume regression, and a within-CLI-version comparison. The elevation survives each one, which is more than I would expect from coincidence. But 4-8 is barely a week old with ~86 raw sessions (72 after decontamination), the strict-persist cell is n=5, and everything is one machine. That makes this a signal worth flagging, not a settled result.
What holds, and a date
Independent of which model it's worse in, the practical lessons hold:
- Belief-persistence is the dangerous failure mode. Making up a filename is harmless until the agent acts on it after being told it's wrong. The hard-stop-on-"does not exist" rule is worth adding to any agent harness regardless of model.
- An agent's explanation of its own failure is not evidence. It's a hypothesis generated by the same process that failed, and it will preferentially blame your infrastructure.
I'll re-run the corpus scan in two weeks, when 4-8 has 200+ sessions across varied tasks, with the strict-persist regression as the pre-registered primary test. That collapses both the small-n and the task-mix problems at once. If the odds ratio holds with the volume and version controls, it stops being a signal and becomes a finding. If it regresses to baseline, it was task mix and a week-old sample fooling me. Either way, it'll be interesting to see what I can find out.