3 February 2026 · 11 min read

AI Agents Don't Fail in Dev. Your Repo Does.

AI agents fail in production codebases because the repo isn't built for them. Four layers separate agent-ready codebases from the rest.

ai-agents · software-development · engineering · ai-native

When an AI agent fails on a real ticket, most teams blame the model. They try a different one, adjust the prompt, or conclude that agent-driven development isn't ready for production yet. That diagnosis is almost always wrong.

The model is rarely the problem. I've run agent-driven development across production codebases, and the failure mode is consistent. The bottleneck is not the AI's capability; it is the infrastructure the AI is working inside. Four layers separate a codebase from real, reliable agent-driven development. Most teams are missing two or three of them.

Fix the layers, and agents close tickets. Leave the layers missing, and agents fail in ways that feel like an AI problem but are actually a repo problem. This post names the four layers, explains why most teams get stuck at the first one, and describes what agent-driven development actually looks like when all four are in place.

The Four Layers That Separate Agent-Ready Codebases from the Rest

These four layers are not optional features. They are the infrastructure stack that makes AI agents functional in a production environment. Every layer that is missing is a failure point.

Layer one: context infrastructure. An agent navigating your codebase needs to understand the system before it can make a decision. That means documented architecture, module boundaries, naming conventions, and decisions that have already been made. In practice, this means files like CLAUDE.md, agents.md, or architecture decision records that explain the system in plain language. Without this layer, an agent is guessing. It will produce code that compiles and fails in integration. Context infrastructure is not documentation for humans who forgot things. It is the operating manual for agents that have never seen your codebase before.
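As a sketch of what that operating manual can look like, here is a minimal context file. The module names, conventions, and ADR reference are invented for illustration; the point is that it states boundaries and decisions an agent can act on, not history a human might remember.

```markdown
# CLAUDE.md

## Architecture
- `api/` — HTTP handlers only; no business logic.
- `core/` — domain logic; pure functions, no I/O.
- `store/` — all database access goes through this module.

## Conventions
- Functions in `core/` follow `verb_noun` naming.
- Never call `store/` from `api/`; always go through `core/`.

## Decisions
- Order updates use optimistic locking (see ADR-0007).
```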

Layer two: a test suite agents can actually run. This is not a question of whether you have tests. It is a question of whether your tests are fast, deterministic, and produce output an agent can read and act on. Agents iterate by running tests, reading the result, and adjusting. If your test suite takes forty minutes to run, is non-deterministic, or produces stack traces that require human interpretation, the agent cannot close the loop. It needs a tight feedback cycle: run tests, read output, know what changed, try again. A slow or flaky test suite breaks that cycle entirely.

Layer three: CI feedback an agent can interpret and act on. Most CI pipelines were designed for human engineers who can read a wall of logs and extract what went wrong. Agents cannot do that reliably. Machine-readable errors, structured output, and a clear signal about what failed and why: that is what makes CI useful to an agent. If your CI pipeline outputs pages of interleaved log lines with no structure, the agent is stuck. It cannot diagnose the failure because the failure information is buried in noise designed for humans. CI for agents needs to be precise: what failed, which file, which assertion, what was expected.

Layer four: ticket types with acceptance criteria an agent can verify. This is the layer almost every team is missing, and it is the one that determines whether agents can actually close work. A ticket that says "improve the checkout flow" gives an agent nothing to verify. A ticket that says "the checkout_total function should return the correct value when a discount code is applied, verified by this test case passing with these inputs and outputs" gives the agent a target it can hit and confirm. Acceptance criteria that are agent-verifiable are just well-written acceptance criteria. They are specific, testable, and unambiguous. But writing them this way requires a discipline most teams have not built yet.

Most Teams Have Layer One. Almost None Have Layer Four.

The pattern I see in teams I've worked with is predictable. When they decide to try AI agent software development, they start by adding a CLAUDE.md or equivalent context file. That is layer one. It takes a day or two, and it feels like real progress. Engineers feel like they have done the thing. They have not done the thing.

Then they try an agent on a real ticket. It fails. They improve the context file. It fails again, differently. They add more documentation. More failures. The team concludes the agent "isn't there yet" and moves on. What actually happened is that the agent found the edge of layer one and had nowhere to go.

What they have not done is build layers two, three, and four. Their test suite is slow. Their CI output is a mess of unstructured logs. Their tickets say things like "fix the auth bug" with no acceptance criteria at all. The agent has context but nowhere to go with it. It knows what the system is supposed to do in general. It has no way to verify whether it did it. That gap is not an AI limitation. It is a missing feedback loop.

The reason teams get stuck at layer one is that it feels like the most AI-specific thing to build. Context files for AI tools are new. Tests and CI are old. Engineers have lived with imperfect test suites for years. They don't instinctively connect a flaky test suite to an agent problem. But the agent needs the test suite more than any human engineer does. The human can run one test, see something unexpected, and use judgment. The agent cannot exercise judgment outside its loop. If the loop is broken, the agent is stuck.

Layer four is rarely built at all because it requires changing how tickets are written, which means changing how product managers and engineers communicate requirements. That is a process and culture change, not a technical one. It is also, in my experience, the most valuable change a team can make regardless of AI agents. Verifiable acceptance criteria are just better requirements. Agents make the gap between "good requirements" and "bad requirements" much more expensive.

What Happens When All Four Layers Are in Place

I want to be specific about what this looks like, because "AI-driven development" is often described in abstractions that make it sound experimental. It is not experimental when the infrastructure is in place. It is a different workflow with real, observable properties.

The agent picks up a ticket. The ticket has specific acceptance criteria: a function should behave a certain way with defined inputs and outputs. The agent reads the context infrastructure, understands the module it needs to touch, and makes an initial change. It runs the test suite. The tests fail with structured output that tells it exactly which assertion failed and why. The agent reads that output, adjusts the implementation, and runs the tests again. This cycle takes minutes, not hours.

When the tests pass, the agent checks the CI pipeline. If CI fails, the error is structured and interpretable. The agent diagnoses the failure, fixes it, and re-runs. When CI passes, the agent submits a pull request. The PR includes a summary of what changed and why, derived from the acceptance criteria in the ticket.

The engineer reviews the PR. Not to figure out what the agent did, because the PR explains it. To assess whether the implementation is architecturally sound, whether it introduces risk at the boundary the agent could not see, and whether the approach aligns with decisions the team has made. That review takes ten to fifteen minutes. The engineer merges or requests a change. If they request a change, the agent handles it.

This is not a prototype. I've seen this pattern running in production teams today. The agents are not closing every ticket. They are closing a significant share of well-specified, bounded tickets: the kind that used to take a mid-level engineer half a day. The engineers are doing more architectural work, more review, more design. The ratio of humans to tickets is different. The output per engineer is different.

That outcome is not available to teams missing layers two, three, or four. It is only available to teams that built the infrastructure first.

The Repo Is an Infrastructure Problem, and Infrastructure Has to Be Built

The most common mistake I see when teams hit agent failure is changing the model. They switch from one AI coding tool to another, or from one agent framework to another, trying to find one that works better in their codebase. This is the wrong variable to change.

Switching models does not fix missing context infrastructure. It does not make your test suite faster or more deterministic. It does not restructure your CI output. It does not add acceptance criteria to your tickets. The new model runs into the same missing layers and fails in the same ways. The team concludes that agents don't work for their codebase. They are correct, but for the wrong reasons.

The GitHub Octoverse 2024 report showed that AI usage in software development has grown significantly, but adoption of the underlying practices that make AI reliable (consistent test discipline, structured CI, documented architecture) has lagged far behind. Teams are running faster tools on slower foundations. The gap between AI capability and AI-ready infrastructure is widening.

The teams that are getting real output from AI agents in software development are treating it as an infrastructure problem. They are not asking "Which agent should we use?" They are asking "What does our repo need to be for agents to work reliably inside it?" The answer to that question is the four layers. Building them is not glamorous. It is the same unglamorous work that separates teams with sustainable velocity from teams that are stuck cycling through tools and blaming the model.

There is also a compounding effect worth naming. Every sprint a team runs agents on broken infrastructure is a sprint where they build false intuitions about what agents can and cannot do. They develop workarounds for problems that should not exist. They train their engineers to treat agent failures as normal rather than diagnostic. The cost of that institutional learning is hard to measure and significant. The longer you wait to build the infrastructure, the more of that debt you accumulate.

Context infrastructure can be built in a week. A faster, deterministic test suite is weeks to months of focused work, depending on where you're starting. Structured CI output is a configuration project, usually a few days. Verifiable acceptance criteria require a process change that takes a quarter to embed properly.

None of those timelines are unreasonable. All of them require treating the repo as the product, not just the container the product lives in. Teams that have made that shift are not experimenting with agents. They are running them.

The Teams Getting Real Results Built the Infrastructure First

The pattern across every team I've seen get meaningful output from AI agents is the same. They did not start by picking the best agent. They started by asking whether their codebase was a place an agent could work. Most of the time, the honest answer was no. Then they built toward yes.

The four layers are not a checklist to complete once. They are an ongoing standard. Context files go stale when architecture changes. Test suites drift into flakiness when they're not maintained. CI output gets messy as the pipeline grows. Ticket quality varies by who's writing them. Keeping all four layers in good shape is the engineering discipline that makes agent-driven software development possible.

The teams that skipped this work and went straight to agents did not get bad results from bad models. They got bad results from good models working in broken infrastructure. That distinction matters because it tells you where to invest. The model is not the bottleneck. The repo is.

If your agents are failing, don't switch the model. Audit the four layers. You'll find the gap there.


I help engineering teams close the gap between "we use AI tools" and "AI actually changed how we deliver." Book a 20-minute call and I'll tell you where the leverage is.
