The Velocity Trap: AI Is Making Teams Faster and More Broken
PRs are up. So are incidents. The 2026 data shows a pattern I've watched play out across engineering teams, and the fix is not what most leaders expect.
AI is making engineering teams faster. It is also making them more fragile. The two things are connected, and most leaders are only tracking the first one.
In 2026, the average engineering team using AI coding tools is producing 20% more pull requests per engineer. That sounds like a win. It's the number that gets shared in board decks and team all-hands.
But the same dataset shows that incidents per pull request are up 23.5%. Change failure rates are up 30%.
Your team is shipping more. It is also breaking more. And if you're only measuring the first number, you have no idea the second one is happening.
I've watched this exact pattern play out across engineering teams over the past eighteen months. The teams that avoided it didn't adopt AI more slowly. They built different foundations first. This post is about what those foundations are, why most leaders skip them, and how to tell whether your team is already in the trap.
Your Velocity Metrics Are Only Showing Half the Picture
The standard engineering metrics were designed for a world where code production was the bottleneck. Before AI, the constraint was how fast engineers could write code. Metrics like PR frequency and story points closed made sense as proxies for productivity because more code shipped meant more value delivered.
AI removed that constraint. Engineers can now produce substantially more code in the same time. But production was never the only constraint. The other half of the system (code review, testing discipline, architectural coherence, incident response) was not sped up by AI. In many cases it got harder, because there was more output entering it at once.
When you add more input to a system without increasing its capacity to handle that input, the system degrades. That's not a failure of AI. That's a systems problem that AI made visible faster than it would have shown up otherwise.
The 23.5% increase in incidents per PR is not random. It's the predictable output of more code entering a system that wasn't designed to handle more code. If you're measuring deployment frequency and cycle time and the signals look positive, you are looking at half the picture.
Why Velocity Without Foundations Is a Liability Transfer
AI amplifies whatever culture and process you already have. A team with strong code review culture, clear ownership, and solid test coverage will get faster and more reliable with AI. A team with weak specifications, low review discipline, and unclear ownership will get faster, and noisier.
The velocity gain is real in both cases. The stability story is completely different. And most teams, when they roll out AI tools, are measuring velocity and nothing else.
This is the liability transfer. The productivity gain shows up immediately in sprint metrics and board reports. The cost shows up three months later in your incident rate, your on-call load, and the time your most senior engineers spend firefighting rather than building. The gain is visible and attributable. The cost is diffuse and often blamed on other things: complexity, growth, bad luck.
That's not bad luck. That's a system being amplified faster than it can handle.
How a Celebrating Team Quietly Built a Crisis
About four months into running a transformation, one team was celebrating. Their sprint velocity was up significantly. Engineers were energised. The tools were working.
Meanwhile, I was watching their incident rate climb steadily week on week. The team was producing more features, but each one was carrying more hidden risk than before. The faster they moved, the faster the risk compounded.
The root cause wasn't the AI. The AI was just a magnifier. The underlying issue was that their codebase had poor boundary definition: modules with unclear contracts, test coverage that missed integration points, and a code review process that had always been optimistic rather than rigorous. Before AI, they were too slow to hit that wall consistently. With AI, they hit it every sprint.
We stopped, spent six weeks building the foundations that should have been there first, and then re-engaged with the tools. Velocity came back. Incidents dropped below their pre-AI baseline. The difference wasn't the tools; it was the system those tools were running on.
That six-week pause felt expensive at the time. Looking back, it was the cheapest thing we did. The alternative was compounding a fragile system for another six months and then spending twice as long cleaning it up.
Most Teams Declare Victory at Stage Two. Stage Three Is Already Underway.
Engineering teams go through a recognisable pattern when they adopt AI tooling. Understanding where your team sits tells you what's coming next.
Stage 1: Tool access. The team gets Cursor, Copilot, or equivalent. Some engineers use it heavily, some don't. There's no coordinated approach. Individual variation is high.
Stage 2: Individual productivity gains. Engineers who engage with the tools start moving faster personally. Velocity metrics start to tick up. Leadership notices and declares the rollout a success. This is often where the announcement goes out.
Stage 3: The velocity trap. More PRs, more incidents. The system starts showing strain. Senior engineers spend more time in review. On-call gets busier. The team may not connect this to AI adoption at all, attributing it to complexity, growth, or bad luck.
Stage 4: Foundation work. Either deliberately or after a sufficiently painful incident, the team pauses and builds the underlying structure: context infrastructure, review standards, quality measurement, clear ownership. This is unglamorous and often resisted because it looks like slowing down.
Stage 5: Compound advantage. AI is now amplifying a system that was designed for it. Velocity is higher than Stage 2. Quality is higher than pre-AI. The team is operating in a fundamentally different gear.
Most teams I see are stuck between Stages 2 and 3. They've declared victory on AI adoption and moved on to the next initiative, not realising that Stage 2 velocity gains are fragile and Stage 3 is already underway.
Stage 4 is the work that actually matters. The teams that skipped it paid for it later, usually in a way that was harder to fix than if they'd done it upfront.
Four Foundations That Separate Durable Gains from the Trap
Four things separate the teams that got sustainable results from those that fell into the velocity trap. They built these before or alongside their tool rollout, not after they had already paid the incident tax.
Structured context. AI generates code in the context you give it. If your codebase has no documented architecture decisions, no consistent naming conventions, no clear module boundaries, the AI is navigating blind. It will produce code that works in isolation and breaks in integration. Context infrastructure, including architecture decision records, documented conventions, and context files that explain the system to AI tools, is not an optional nice-to-have. It is the difference between AI amplifying your system and AI accelerating your technical debt.
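As a concrete illustration, here is what one of those architecture decision records might look like. The module names and decision are hypothetical; the point is that a short, explicit record gives both reviewers and AI tools the boundary information they would otherwise have to guess:

```markdown
# ADR-012: Payment retries live in the gateway module

## Status
Accepted

## Context
Retry logic was duplicated across three services, and generated code kept
adding a fourth variant because nothing documented the boundary.

## Decision
All payment retry and backoff behaviour belongs to `payment-gateway`.
Other modules call it; they never reimplement it.

## Consequences
Reviewers can reject any PR that adds retry logic elsewhere, and context
files can point AI tools at this record before code is generated.
```

A dozen of these records costs an afternoon to write and changes what "correct in the system" means from tribal knowledge into something checkable.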
Review culture that actually reviews. The engineers who got the best outcomes from AI were the ones who treated every AI-generated PR with the same scrutiny they would apply to a junior engineer's first commit. They accepted some suggestions, questioned others, and rejected the rest. They understood that AI produces confident-sounding code that can be architecturally wrong in ways that don't surface in tests. Review culture is a leadership decision. If your team sees reviews as a box to tick rather than a quality gate, AI will not change that; it will just give them more boxes to tick faster.
Quality measurement before velocity measurement. Most teams track deployment frequency and PR volume when they roll out AI tools. Almost none of them set a baseline for change failure rate, mean time to recovery, and incident frequency first. If you don't have a baseline, you can't tell whether your velocity gain is real productivity or deferred risk. In my experience, fewer than one in five engineering teams has any measurement of AI impact on quality at all. That is not a measurement gap. It is a leadership decision about what you are willing to know.
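Setting that baseline is less work than it sounds. A minimal sketch, assuming you can tag each deploy with whether it was linked to a production incident (the `Deploy` record and the linkage rule are illustrative, not a standard):

```python
from dataclasses import dataclass


@dataclass
class Deploy:
    # True if this deploy was linked to a production incident
    # within some agreed window, e.g. 48 hours.
    caused_incident: bool


def change_failure_rate(deploys):
    """Fraction of deploys that led to an incident -- the number to
    record BEFORE the AI rollout, so later readings mean something."""
    if not deploys:
        return 0.0
    return sum(d.caused_incident for d in deploys) / len(deploys)


# Hypothetical quarter of data: 40 deploys, 6 linked to incidents.
baseline = change_failure_rate([Deploy(i < 6) for i in range(40)])
print(f"{baseline:.1%}")  # 15.0%
```

One number, computed once a quarter, is enough to tell deferred risk apart from real productivity.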
Clear ownership at module level. If no one clearly owns the module that AI-generated code is going into, no one has the context to assess whether the generated code is correct in the system, not just syntactically valid. Ownership clarity, which many teams treat as a soft cultural issue, becomes a hard technical requirement when AI is in the loop.
Most Teams Were Not Ready for AI to Amplify What They Already Had
Seventy-four percent of companies reported no tangible AI value by the end of 2024. By mid-2025, two-thirds were still stuck in pilot stage. The reason, in most of the cases I've seen, is not that the tools don't work. It's that the underlying engineering system wasn't ready to be amplified.
AI doesn't transform a weak engineering culture into a strong one. It transforms whatever you already have, faster. If what you have is strong, that's an extraordinary outcome. If what you have is fragile, the amplification is painful, and the pain is delayed just long enough that most leaders don't connect it to the AI rollout.
There's also a compounding factor that doesn't get talked about enough. The engineers who are most enthusiastic about AI tools, and who adopt them earliest, tend to be your most productive engineers. They push the limits, figure out how to use them for complex tasks, and generate the most output. They're also the ones whose output goes through the same review and testing system as everyone else. If that system is weak, their amplified productivity becomes amplified risk, and it's your best people creating it.
The question worth asking honestly is not "have we adopted AI?" but "are we ready for AI to reveal what our engineering system actually is?"
Five Signals That Tell You Whether You're Already in the Trap
If you're an engineering leader wondering whether your team is in the velocity trap, check these before you look at any AI-specific metrics.
Code review depth. How long does a PR review actually take on average? If the answer is under ten minutes for anything non-trivial, your reviews are likely not catching architectural issues. Look at the last ten PRs that went to production. How many had substantive comments versus approval with minor notes?
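That check can be rough and still be useful. A crude sketch, assuming you can export review comments from your code host as plain strings (the word-count threshold is an assumption, a cheap proxy for "substantive", not a real measure of review quality):

```python
def substantive_ratio(prs, min_words=15):
    """Share of the last ten PRs that received at least one review
    comment long enough to plausibly be substantive."""
    def has_substance(pr):
        return any(len(c.split()) >= min_words for c in pr["comments"])
    recent = prs[-10:]
    return sum(has_substance(pr) for pr in recent) / max(len(recent), 1)


# Hypothetical export: one rubber-stamp review, one real one.
prs = [
    {"comments": ["LGTM"]},
    {"comments": ["This couples the retry logic to the payment client; "
                  "can we move it behind the gateway interface instead?"]},
]
print(substantive_ratio(prs))  # 0.5
```

If the ratio is near zero, you already know what your review gate is catching.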
Ownership clarity. Pick five modules in your core codebase. For each one, can you name, without hesitation, who owns it and who has the deepest context on it? If you're uncertain on more than one, ownership is unclear enough that AI-generated contributions to those modules are going in without adequate review context.
Incident trend. Pull your last three months of incident data alongside your AI adoption timeline. If your incident rate started climbing at the same time adoption went up, you're in the trap. Most teams have this data; most leaders haven't looked at it this way.
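Looking at it this way is a few lines of work. A minimal sketch with hypothetical dates, comparing the weekly incident rate before and after an adoption date:

```python
from datetime import date


def weekly_rate(incidents, start, end):
    """Incidents per week in the half-open window [start, end)."""
    weeks = (end - start).days / 7
    return sum(start <= d < end for d in incidents) / weeks


# Hypothetical incident log and rollout date.
adoption = date(2026, 1, 5)
incidents = [date(2025, 12, 1), date(2025, 12, 20),
             date(2026, 1, 12), date(2026, 1, 19),
             date(2026, 2, 2), date(2026, 2, 9)]

before = weekly_rate(incidents, date(2025, 11, 10), adoption)
after = weekly_rate(incidents, adoption, date(2026, 3, 2))
print(f"before={before}/wk after={after}/wk")
```

With these made-up numbers the rate doubles after adoption. Correlation isn't proof, but a climb that starts on the rollout date deserves a harder look than "complexity".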
Test coverage quality, not just quantity. High coverage numbers mean nothing if the tests are testing implementation rather than behaviour. Ask your engineers: if someone refactored this module and all the tests still passed, would you trust the refactor? If the honest answer is no, your coverage is a false signal.
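The distinction is easiest to see side by side. A toy example (the `Cart` class is invented for illustration): both tests pass today, but only one of them would survive a correct refactor of the internals:

```python
class Cart:
    """Toy cart with a flat 10% discount (hypothetical example)."""
    def __init__(self):
        self._items = []

    def add(self, name, price):
        self._items.append((name, price))

    def total(self):
        return sum(price for _, price in self._items) * 0.9


def test_discount_implementation():
    # Implementation-coupled: asserts on internal storage, so it
    # fails if the list becomes a dict, even when behaviour is intact.
    cart = Cart()
    cart.add("book", 100)
    assert cart._items == [("book", 100)]


def test_discount_behaviour():
    # Behaviour-focused: asserts only on the public contract,
    # so a correct refactor keeps it green.
    cart = Cart()
    cart.add("book", 100)
    assert cart.total() == 90
```

A suite full of the first kind produces high coverage numbers and zero confidence, which is exactly the false signal described above.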
Baseline. Do you know your change failure rate for the last three months? If not, you are measuring AI value against nothing, and you will not be able to tell whether what you're seeing is progress or a problem building quietly.
If those answers are fuzzy, that's where to start. Not with a different AI tool, not with a different rollout strategy, but with the system that the AI will be amplifying.
The teams that got to 40% faster delivery and a 70% reduction in manual QA didn't start with better tools. They started with an honest answer to those questions. Then the tools did what they were supposed to do.
I help engineering teams close the gap between "we use AI tools" and "AI actually changed how we deliver." Book a 20-minute call and I'll tell you where the leverage is.
Working on something similar?
I work with founders and engineering leaders who want to close the gap between what their technology can do and what it's actually delivering.