AI-Generated Code Broke Your Code Review Process. Here Is How to Fix It.
AI tools increased PR volume by 98% on some teams. Code review processes designed for human-paced output cannot handle that. The bottleneck is now review, not production, and the fix is not reviewing faster.
Code review was designed for a world where writing code was the bottleneck. A developer would spend hours or days writing a feature, submit a PR, and a reviewer would spend thirty to sixty minutes reading through it carefully. The ratio of write time to review time was roughly right. Review kept pace with production.
AI tools changed the ratio. A developer using Claude Code can now produce in two hours what previously took two days. The review burden on the other side of the PR did not change. The number of PRs being submitted did.
Faros AI's analysis of real engineering organisations found that teams using AI tools merged 98% more PRs per engineer, while code review time per PR increased 91%. The combined effect: code review has become the primary bottleneck in engineering delivery in organisations where it was previously an afterthought.
Most teams have not adapted their review process to this reality. They are using a process designed for one production rate to review code produced at a completely different rate. The result is the degraded review quality you can observe in the data: change failure rates up 30%, incidents per PR up 23.5%, rework rates climbing. The volume increased. The quality gates did not scale with it.
The Review Process Was Already Fragile
It is worth being honest about where code review was before AI made things worse.
In most engineering organisations, code review was a cultural norm enforced primarily by social obligation. Engineers reviewed PRs because it was expected, not because the process was well-designed. The review quality was uneven: some reviewers caught architectural problems and pushed back substantively; others left a few comments and approved. The consistency was low. The investment in making review better was minimal.
AI-generated code has exposed this fragility. The volume increase made the existing weaknesses visible in the data: if your review process was marginal before, the addition of 98% more PRs does not make it better. It accelerates the failure.
The AI code review problem is partly a new problem: reviewing AI-generated code at AI-generated volume is something no engineering team has done before. It is also partly an old problem: code review was already underdeveloped, and the old problem is now impossible to ignore.
Why Reviewing More Is Not the Answer
The instinctive response to a volume problem is to increase throughput: ask engineers to do more reviews, reduce the expected review time per PR, or batch reviews into dedicated time blocks. These approaches consistently fail for a specific reason.
Code review is not a throughput problem. It is a cognitive load problem. A reviewer who reads fifty lines of code carefully will catch more problems than a reviewer who reads five hundred lines of code quickly. The review quality per line falls faster than the review quantity rises. Adding review time does not solve the problem if the time is spread across more code at lower quality.
The organisations that have navigated this well have not solved it by asking reviewers to do more. They have solved it by changing what reviewers are asked to review and how.
What a Review Process for AI-Generated Volume Actually Looks Like
The review processes that work at AI-generated volume share three characteristics that distinguish them from the traditional model.
Risk-tiered review. Not all code changes carry equal risk, and not all code changes warrant equal review investment. A change to a utility function in a well-tested module with clear ownership carries different risk than a change to authentication logic or a database migration. A risk-tiered review process routes changes to the appropriate level of scrutiny based on actual risk, not organisational habit.
In practice, this means defining categories: changes that can be merged with lightweight review, changes that require senior engineer review, changes that require security review or architecture sign-off. AI-generated code that passes automated checks and touches low-risk modules can move through the first category. Changes touching critical infrastructure move through the higher categories regardless of whether a human or AI generated them.
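A minimal sketch of what that routing can look like in practice. The path prefixes and tier names below are illustrative assumptions about a hypothetical repository, not a standard; the point is that the tiers are explicit and machine-checkable, not left to habit.

```python
# Risk-tiered review routing: map the files a PR touches to a review tier.
# All prefixes and tier labels here are invented for illustration.

LIGHTWEIGHT = "lightweight"
SENIOR = "senior-review"
SIGN_OFF = "security-or-architecture-sign-off"

# Hypothetical mapping from code areas to required scrutiny.
HIGH_RISK_PREFIXES = ("auth/", "payments/", "migrations/")
MEDIUM_RISK_PREFIXES = ("api/", "infra/")

def review_tier(changed_files):
    """Return the strictest tier warranted by the files a PR touches."""
    if any(f.startswith(HIGH_RISK_PREFIXES) for f in changed_files):
        return SIGN_OFF
    if any(f.startswith(MEDIUM_RISK_PREFIXES) for f in changed_files):
        return SENIOR
    return LIGHTWEIGHT
```

The same classification could live in a CODEOWNERS file or a CI label step; what matters is that "which tier does this change belong to" is answered by the system, not by whoever happens to pick up the PR.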
Automated first-pass review. The mechanical problems in code review (linting violations, missing tests, style inconsistencies, obvious security antipatterns) do not need a human reviewer. Automated tooling handles them better, faster, and more consistently. A review process that requires a human to look at every PR before any automated checks run is wasting human attention on things machines can catch.
The pattern that works is: automated checks run first, they catch the mechanical issues, humans review what remains. This is not new. Most teams have CI pipelines. The gap is that CI pipelines are often not integrated into the review workflow in a way that routes PRs based on their results. A PR that passes all automated checks with no warnings should enter the review queue differently than one with failing tests and lint violations.
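The routing-by-results idea can be sketched in a few lines. The check names and queue labels are assumptions for illustration; the logic is simply that a PR's CI outcome determines which queue it enters, rather than every PR landing in the same undifferentiated list.

```python
# Route a PR into a review queue based on its automated check results.
# Check names and queue labels are illustrative assumptions.

def review_queue(checks):
    """checks: dict mapping check name to 'pass', 'warn', or 'fail'."""
    if any(result == "fail" for result in checks.values()):
        # No human attention until CI is green.
        return "blocked-until-green"
    if any(result == "warn" for result in checks.values()):
        # Warnings surfaced to the reviewer up front.
        return "review-with-attention"
    # Clean PRs enter the lighter-touch queue.
    return "fast-track"
```

A PR with `{"lint": "pass", "tests": "pass", "security-scan": "warn"}` would land in the attention queue with its warning attached, while a fully green PR skips straight to fast-track review.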
Boundary-level review over line-level review. This is the deepest change, and the most counterintuitive.
Line-by-line review of AI-generated code is inefficient. A reviewer who reads every line of a two-hundred-line Claude-generated feature implementation will spend significant time and may still miss architectural problems because they are focused on local correctness rather than system behaviour.
Boundary-level review asks a different set of questions: Does this change behave correctly when called from outside? Does it handle the error cases? Does it integrate correctly with adjacent systems? Does it introduce dependencies or state that conflicts with the rest of the system? These questions are answered by reading the tests, the interface contracts, and the integration points, not by reading the implementation line by line.
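To make the contrast concrete, here is the kind of artefact a boundary-level reviewer reads first. The function, its error type, and its behaviour are all invented for illustration; the point is that the tests encode the contract and its error cases, so a reviewer can evaluate them without reading the implementation line by line.

```python
# Hypothetical module under review. A boundary-level reviewer reads the
# contract (docstring) and the tests below, not the implementation body.

class InvalidAmountError(ValueError):
    pass

def create_invoice(customer_id, amount_cents):
    """Contract: amount must be a positive integer of cents;
    new invoices always start in the 'pending' state."""
    if not isinstance(amount_cents, int) or amount_cents <= 0:
        raise InvalidAmountError("amount must be a positive integer of cents")
    return {"customer": customer_id, "amount": amount_cents, "status": "pending"}

# Boundary-level tests: do the error cases and the external behaviour hold?
def test_rejects_non_positive_amounts():
    try:
        create_invoice("c-1", 0)
        raise AssertionError("expected InvalidAmountError")
    except InvalidAmountError:
        pass

def test_new_invoices_start_pending():
    assert create_invoice("c-1", 500)["status"] == "pending"
```

If tests like these exist and are correct, the reviewer can trust the boundary and skim the internals; if they are missing, that gap is the review finding, regardless of how clean the implementation looks.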
This reframe is not a way to do less review. It is a way to do better review. The failure modes in AI-generated code are predominantly architectural and integration failures, not line-level syntax errors. Reviewing at the right level catches the problems that actually matter.
Using AI to Review AI
The question that comes up in most teams: can we use AI tools to review AI-generated code?
The short answer is: partly, for specific things, with clear understanding of what AI review cannot catch.
AI review tools are genuinely useful for the mechanical layer: security antipatterns, performance antipatterns, style consistency, test coverage gaps. This is the same category that automated tooling handles well. AI review is an extension of automated checking, not a replacement for human review.
What AI review cannot catch reliably is architectural intent: whether this change is the right approach to the problem, whether it fits the system it is going into, whether the abstractions are appropriate, whether the design decision embedded in the code will hold up over time. These require a reviewer who understands the system and can evaluate the change in context. That is a human capability that AI tools augment but do not replace.
The obvious risk is circularity: using AI to review AI-generated code at scale, without human oversight, produces a system that approves its own mistakes. The boundary-level human review is not optional. It is what catches the architectural errors that neither the generating model nor the reviewing model will catch consistently.
What Ownership Clarity Has to Do with Review Quality
A specific problem in review that AI adoption has made worse: when nobody clearly owns the module a change is going into, the review is lower quality.
The reviewer who owns a module brings context that a reviewer who does not own it cannot have: knowledge of the decisions that shaped the current implementation, awareness of the edge cases that have caused problems before, understanding of how this module interacts with the rest of the system at a level of detail that is not visible from reading the code. That context is the difference between review that catches real problems and review that looks for syntax errors.
AI-generated code going into a module with clear ownership gets better review, because the owner can evaluate whether the generated approach is right for that specific system. AI-generated code going into a module with unclear ownership gets shallow review, because nobody is positioned to evaluate it at depth.
Ownership clarity is not a soft culture issue. In an AI-native engineering environment, it is a hard quality requirement. Teams that have mapped module ownership explicitly, and built that ownership into the review assignment process, consistently get better outcomes from AI adoption than teams that leave ownership implicit.
The Three Things to Do Before the Volume Overwhelms You
If your team is already at the stage where PR volume has outpaced review capacity, three changes will have the most immediate impact.
Add risk categorisation to your PR template. This does not require building a new system. A simple checklist in your PR template that asks the author to identify whether the change touches authentication, payments, data migrations, or public-facing APIs, and routes those changes to senior review automatically, improves routing without process overhead. It takes thirty minutes to implement and immediately improves review quality for the highest-risk changes.
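Once the checklist exists in the template, the routing can be automated with a few lines in CI. A sketch, assuming a markdown checkbox format in the PR description; the checklist wording and the list of high-risk items are illustrative, not prescriptive.

```python
# Parse ticked risk checkboxes from a PR description and decide whether
# the PR needs senior review. Item names are illustrative assumptions.
import re

HIGH_RISK_ITEMS = ("authentication", "payments", "data migration", "public-facing api")

def needs_senior_review(pr_body):
    """True if any high-risk checklist item is ticked ([x]) in the PR body."""
    ticked = re.findall(r"- \[x\] (.+)", pr_body, flags=re.IGNORECASE)
    return any(item in line.lower() for line in ticked for item in HIGH_RISK_ITEMS)

example_body = """
## Risk checklist
- [ ] Touches authentication
- [x] Touches payments
- [ ] Data migration
"""
```

In a real pipeline this would run as a CI step that reads the PR body from the platform's API and adds a reviewer or a label; the parsing itself is the trivial part.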
Turn off the habit of reviewing in the order PRs arrive. First-in-first-out review is a fairness norm, not a quality norm. A system that prioritises high-risk PRs for high-quality review and routes low-risk PRs to lighter-touch review produces better outcomes than one that treats all PRs the same. The change is cultural: review leads decide on priority, not the PR queue.
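The replacement for the FIFO queue is a priority queue keyed on risk. A minimal sketch, with invented PR IDs and risk scores; the standard-library `heapq` does the ordering, and arrival order breaks ties so equally risky PRs still go in order.

```python
# A risk-weighted review queue instead of first-in-first-out.
# Risk scores and PR IDs are illustrative.
import heapq

def submit(queue, pr_id, risk_score, arrival_order):
    # Negate risk so the highest-risk PR pops first; arrival order
    # breaks ties between equally risky PRs.
    heapq.heappush(queue, (-risk_score, arrival_order, pr_id))

def next_for_review(queue):
    return heapq.heappop(queue)[2]

queue = []
submit(queue, "pr-101", risk_score=1, arrival_order=0)  # low-risk, arrived first
submit(queue, "pr-102", risk_score=5, arrival_order=1)  # high-risk, arrived later
```

Here the high-risk `pr-102` is reviewed before the earlier but low-risk `pr-101`, which is exactly the inversion of the FIFO habit.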
Treat tests as the primary review artifact for AI-generated implementations. When reviewing a significant AI-generated implementation, read the tests before reading the implementation. The tests describe the intended behaviour. If the tests are wrong or missing, the implementation is untrustworthy regardless of how clean it looks. If the tests are comprehensive and correct, the implementation can be read at a higher level of abstraction. This single change in review habit reduces the time required to review AI-generated code without reducing quality.
None of these changes require a new tool. They require understanding that the code review process was designed for a different production rate and making explicit decisions about how to adapt it.
I help engineering teams close the gap between "we use AI tools" and "AI actually changed how we deliver." Book a 20-minute call and I'll tell you where the leverage is.
Working on something similar?
I work with founders and engineering leaders who want to close the gap between what their technology can do and what it's actually delivering.