Stop Splitting Your Code for AI Agents

So a senior engineer on my team put together a plan to refactor a 2,800-line JavaScript file into 20 TypeScript modules. The file runs AI loop execution inside ECS containers. Config loading, S3 operations, git operations, process spawning, token parsing, state uploads, event reporting, and a 9-step execution flow. All in one file with // ---- comment delimiters between sections. Zero automated tests.

His complaint was fair. He couldn't grok the file. Couldn't find things, couldn't trace the flow, couldn't tell sections apart. So the fix: split into 20 files, bring in tsx as a runtime dependency, write unit tests for each new module, update the Dockerfile.

I pushed back. Hard.

What I told the team

Here's what I actually said to my team: this is genuinely hard. We're being asked to rethink how we organize code and I don't think most of us have really processed what that means.

Decades of engineering practice optimized code organization for human comprehension. Small files, single responsibility, IDE navigation with cmd+click across imports. Good practices. They evolved for humans sitting in front of an editor.

When agents are the primary editors, the math flips. The best agents running Opus 4.5 solve about 50% of multi-file tasks on SWE-Bench Pro. Even that number is inflated compared to real work, since benchmark tasks are scoped to a repo the agent can fully explore. In practice you're also dealing with private packages, build systems, deploy configs. The gap only gets wider.

I wanted to make sure I wasn't just being the stubborn person who doesn't want their code touched. So I spent a weekend down the research rabbit hole. The data is pretty overwhelming.

Tests first or don't bother

That file has zero automated tests. Everything gets manually tested. The refactor plan was to restructure all the code, THEN write tests to verify the new structure works.

Think about that. The tests would verify the refactored code works. They wouldn't verify it works the same as the original. You've changed the structure and the verification target at the same time. Something breaks? Good luck figuring out if it's a refactor bug or a test bug.

The safe order is obvious. Tests against the existing code first. Get green. Refactor. Confirm tests still pass. But nobody wants to write tests against a file they're about to blow up, so the plan just skipped that step.

And this file mostly only gets modified by agents these days. Who's volunteering for the manual QA pass when we could be shipping features instead?

Down the rabbit hole

I started digging into the research and it is honestly kind of shocking how uniform the results are.

SWE-Bench Verified was the standard coding agent benchmark for a while. 500 real GitHub issues, real codebases, real patches. Top models were scoring 80%+ on it. Sounds great, until you look at the dataset. 85.8% single-file tasks. Only 14.2% multi-file. Real-world codebases run closer to 50/50.

Even back in early 2025, before Opus 4.5 or GPT-5 existed, the single-file vs multi-file gap was already brutal. Jatin Ganhotra's analysis broke down the Verified results by task type:

| Agent | Single-file | Multi-file | Drop |
|---|---|---|---|
| Augment Agent v0 | 71.6% | 28.2% | -43pp |
| W&B Programmer | 70.4% | 29.6% | -41pp |
| AgentScope | 69.9% | 23.9% | -46pp |
| Amazon Q Developer | 60.6% | 21.1% | -40pp |
| SWE-agent + Claude 3.5 | 37.3% | 11.3% | -26pp |

These are 2024-era models. Not a single agent could crack 30% on multi-file tasks. Even combining every top system's best results into a collective upper bound: ~90% single-file vs ~54% multi-file. Models have gotten way better since then. But the pattern hasn't gone away.

In February 2026, OpenAI stopped reporting SWE-Bench Verified scores entirely. Their audit found that every frontier model, GPT-5.2 included, showed training data contamination on the dataset. 59% of the hardest unsolved problems had flawed test cases. The 80%+ scores that made headlines? Partly contaminated.

This is where SWE-Bench Pro comes in. Scale AI built it to fix these problems. 1,865 tasks across 41 repos, multiple languages, no contamination. Average solution: 4.1 files modified, 107 lines changed. Here's what the current top agents score:

| Agent | Model | SWE-Bench Pro |
|---|---|---|
| Auggie | Opus 4.5 | 51.8% |
| Cursor | Opus 4.5 | 50.2% |
| Claude Code | Opus 4.5 | 49.8% |
| Codex | GPT-5.2 | 46.5% |
| SWE-Agent baseline | Opus 4.5 | 45.9% |

Best agents in existence, best models available, topping out around 50% when tasks require multi-file coordination. Compare to 80%+ on the old contaminated benchmark where most tasks were single-file. Also note: same model (Opus 4.5), different agent architecture, 6 point spread. The scaffolding matters almost as much as the model. But even the best scaffolding barely cracks half.

Gets worse at scale. SWE-EVO (December 2025) tested agents on longer evolution tasks averaging 21 files per task, 874 tests per instance. GPT-5 scored 21%. Compare that to 65% on standard SWE-Bench. A 44 point drop. The paper says the difficulty comes from "semantic reasoning challenges rather than technical interface issues." The models can use the tools fine. They just can't think across that many files.

The failure analysis is interesting too. For the strongest model, over 60% of unresolved cases were classified as "Instruction Following" mistakes. Not syntax errors, not tool failures. The model literally lost track of what it was supposed to be doing while bouncing between files.

ML-Dev-Bench straight up lists "an inability to deal with larger number of files" as a specific observed limitation. Not inferred from numbers. Directly observed as a failure mode.

Every benchmark. Same conclusion. Models are substantially better than they were a year ago, but the gap between single-file and multi-file performance is structural, not just a scaling problem.

Why models fall apart

So models consistently choke on multi-file tasks. Why?

Chroma tested 18 frontier models in mid-2025 on what they call "context rot." GPT-4.1, Claude 4, Gemini 2.5, Qwen3. Every single one degrades as input length increases. Even on simple retrieval tasks.

Here's the part that actually surprised me. Models performed worse on coherent, logically structured text than on randomly shuffled text. Coherent text creates stronger positional patterns that amplify recency bias. Your nicely organized 20-module codebase with logical import chains? The model might process it worse than if you mashed all the content together randomly.

The Lost in the Middle paper from Stanford (2023, but the underlying attention mechanism hasn't changed) measured a U-shaped attention curve. Models perform best on information at the beginning and end of their context, worst in the middle. Over 30% performance drops for stuff positioned centrally. When an agent reads 20 files of code, a chunk of it lands in that dead zone where the model basically stops paying attention.

LangChain's multi-needle research puts a number on it. Single-needle retrieval (finding one fact) degrades around 73K tokens. Multi-needle retrieval (finding multiple related facts)? 25K tokens. Three times earlier. They tested GPT-4 with 10 needles at 24.8K tokens and it only found 4. The last 4, positioned near the end. Recency bias doing its thing.

Cross-file code changes are multi-needle problems. The agent needs the function definition in one file, the type in another, the test in a third, the config in a fourth. Every file is another needle to track. And the data says models fail at exactly this.

Anthropic published a whole guide on context engineering for agents last September that basically confirms this from the builder's side. Their recommendation: agents should use lightweight tools like head and tail to analyze data without loading full files into context. Keep the context lean. Every token you load competes with the model's ability to reason about the tokens already there.

The token tax

You'll hear "same code, same tokens" as a defense. The raw code is the same size either way, right? 2,800 lines of JavaScript is roughly 20K tokens whether it's in one file or twenty.

Except that's not what actually happens. You don't just copy-paste lines across files and add imports. Splitting introduces indirection. Wrapper functions that delegate to the real logic. Factory patterns for things that used to be inline. Getters and setters around what used to be direct property access. Interface layers between modules that didn't need to exist when everything lived in one place. Re-exports so consumers don't need to know your internal file structure.

Think about the difference between functional style and Java-style OOP. A single file with direct function calls, data flowing through arguments, everything visible in one scroll. Versus 20 modules with AbstractConfigProviderFactory, a getter for every field, re-exports for organizational reasons. The 20-module version has more code. Way more. All that indirection creates tokens that flat out didn't exist in the single-file version. The real-world token cost of those 20 modules is probably double the original before any agent tooling enters the picture.
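To make the indirection concrete, here's a hypothetical before/after sketch. Every name below is invented for illustration; the point is that the post-split version does the exact same lookup through layers that generate tokens the single-file version never had.

```javascript
// Single-file version: direct property access, zero indirection.
const config = { s3Bucket: "my-bucket", region: "us-east-1" };
const bucket = config.s3Bucket;

// Post-split version: the same lookup, routed through the kind of
// layers a 20-module refactor tends to grow (all names invented).
class JsonConfigLoader {            // would live in config-loader.js
  load() { return { s3Bucket: "my-bucket", region: "us-east-1" }; }
}
class ConfigProvider {              // would live in config-provider.js
  constructor(loader) { this.loader = loader; }
  get(key) { return this.loader.load()[key]; }
}
// index.js would then re-export ConfigProvider so callers never see the split.
const bucket2 = new ConfigProvider(new JsonConfigLoader()).get("s3Bucket");

console.log(bucket === bucket2); // same value either way
```

Same value out. One line of logic became two classes, a constructor, and a re-export, and every caller now pays for reading them.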

Then the operational cost stacks on top.

Single file: agent reads task-runner.mjs, ~20K tokens. Done.

Twenty modules: file discovery is ~500 tokens. Per-file read overhead from tool calls and file headers runs ~500 tokens per file. A typical task touches 8-12 files to trace a flow, so another ~4-6K in read overhead. Stack that on top of the doubled code from all that indirection and you're easily at 40-50K tokens for what started as 20K of actual logic.

Context costs you quality. The more tokens loaded, the worse the model reasons about all of them.

Anthropic's own best practices page says it: "Most best practices are based on one constraint: Claude's context window fills up fast, and performance degrades as it fills." They call out the "infinite exploration" anti-pattern where an agent reads hundreds of files and fills the context window. Splitting one file into twenty basically engineers that anti-pattern into your codebase on purpose.

Jon Roosevelt's analysis of Claude Code found that 54-58% of context gets consumed by tool overhead and file reading before any reasoning starts. 12.7% goes to tool inventory definitions. Another 22.5% reserved for conversation history. Every file you force the agent to read eats into what's left for actually thinking.

The SWE-Bench Pro data backs this up. Augment's agent beat Cursor and Claude Code on the same model (Opus 4.5) by investing in smarter retrieval. Their system loads less context, but more relevant context. The article says it directly: "The difference comes down to what context the agent sees before it starts writing code."

The microservices confusion

You'll find articles arguing "modular architecture helps AI agents." They're right, kind of. But they're talking about microservices. Real service boundaries. HTTP/gRPC APIs, independent deployment, separate databases.

That is completely different from splitting one file into 20 files in the same directory that share mutable singletons via imports.

A microservice has a clear API contract. An agent can reason about it without loading everything else into context. Twenty JavaScript modules importing from each other in a dependency web? The agent needs most of them loaded to understand any one of them.
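Here's a hypothetical sketch of that coupling, compressed into one block with comments marking the would-be files (module and function names are invented). The "boundary" between the modules is an illusion, because the real contract is the shared mutable state:

```javascript
// state.js — a mutable singleton every module imports.
const runState = { step: 0, artifacts: [] };

// git-ops.js — mutates the shared state as a side effect.
function checkout(branch) {
  runState.step = 3;
  runState.artifacts.push(`checked out ${branch}`);
}

// uploader.js — only correct if some OTHER module ran first.
function upload() {
  // To understand this check, an agent has to load git-ops.js too,
  // and every other module that touches runState.
  if (runState.step < 3) throw new Error("checkout must run first");
  return `uploading ${runState.artifacts.length} artifact(s)`;
}

checkout("main");
console.log(upload());
```

A real service boundary would replace `runState` with an explicit request/response contract. Files importing a singleton just smear one program across twenty contexts.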

What actually works

If the goal is making code agent-friendly without the multi-file penalty:

JSDoc type annotations. Zero runtime change. Gives agents rich inline context about function signatures, parameter types, return types. All without leaving the file.
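A minimal sketch of what that looks like; the function name and shape here are invented, not from the actual file:

```javascript
/**
 * Uploads a run-state snapshot to S3 and returns the written key.
 * (Hypothetical signature — the real file's names will differ.)
 *
 * @param {{ bucket: string, prefix: string }} target - Destination location.
 * @param {Record<string, unknown>} state - Serializable run state.
 * @param {number} [retries=3] - Max upload attempts.
 * @returns {Promise<string>} The S3 key that was written.
 */
async function uploadState(target, state, retries = 3) {
  const key = `${target.prefix}/state-${Date.now()}.json`;
  // ...actual S3 call would go here...
  return key;
}
```

An agent reading this gets the full contract (types, defaults, return shape) in the same screen of text as the implementation, with no `.d.ts` file or build step.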

Tests against the existing file. Write them first. Against the code as it exists now. Safety net for future changes and it helps agents understand expected behavior.

Section comments as navigation. Those // ---- delimiters in the 2,800-line file? Actually a good feature. Add brief descriptions. An agent can scan those headers and jump straight to the relevant section.
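With one-line descriptions on the delimiters, a table of contents falls out of a single pass over the file. A toy sketch (section names invented, modeled on the file's actual delimiter style):

```javascript
// A hypothetical header scan over a file that uses described delimiters.
const source = `
// ---- CONFIG LOADING ---- reads config.json, merges env overrides
function loadConfig() { /* ... */ }

// ---- S3 OPERATIONS ---- upload/download of run state
function uploadState() { /* ... */ }

// ---- EXECUTION FLOW ---- the 9-step main() sequence
function main() { /* ... */ }
`;

const sections = source
  .split("\n")
  .filter((line) => line.startsWith("// ----"))
  .map((line) => line.replace(/\/\/ ----\s*/, "").split(" ---- ")[0]);

console.log(sections); // ["CONFIG LOADING", "S3 OPERATIONS", "EXECUTION FLOW"]
```

That's the navigation the 20-file split was supposed to buy, without the multi-file penalty.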

Consistent patterns. If every handler follows the same shape, the agent can predict structure from one example. Matters way more than which file things live in.
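For instance, two handlers with the same shape (all names invented; a sketch, not the file's actual handlers). After seeing the first, an agent can write or modify the second without reading anything else:

```javascript
// Every handler: log -> do the one thing -> report -> return.
async function handleStateUpload(ctx) {
  ctx.log("state-upload: start");
  const result = await ctx.s3.put(ctx.stateKey, ctx.state);
  ctx.report("state-upload", result);
  return result;
}

async function handleEventFlush(ctx) {
  ctx.log("event-flush: start");
  const result = await ctx.queue.flush(ctx.events);
  ctx.report("event-flush", result);
  return result;
}
```

The predictable shape is the real interface. A model can pattern-match it from one example far more reliably than it can trace an import graph.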

Where that leaves me

For what it's worth, my own setup: I use Opus 4.6 for basically everything through Claude Code. Sonnet 4.6 when I want speed on something straightforward. Haiku occasionally for a left-field review of Opus's work, catching things the bigger model gets tunnel vision on. And when Opus gets stuck in a debugging loop, I'll summarize the problem and hand it to Codex for a fresh perspective. Works surprisingly well. Different models, different blind spots.

Stepping away from owning the code at a low level still doesn't feel right. It goes against everything I've internalized about "clean code" over 15 years. I'm not totally there yet.

But the data is what it is.

I'm not arguing every codebase should be one giant file. If you have real module boundaries with clean interfaces, split them. But split into 3-4 files at genuine boundaries. Not 20 because someone decided 200 lines per file is a rule.

SWE-Bench Pro scores climbed from ~23% when it launched in late 2025 to ~50% by early 2026. Models are getting better fast. But even with that trajectory, multi-file coordination is still where everything falls apart. Restructuring your codebase today based on where models might be in six months is a bet, not an engineering decision.

Write your tests. Add your type annotations. Keep your section comments clean. And maybe leave that big file alone ;)