24 February 2026 · 11 min read

Build App with AI Agents: What Three Days Taught Me

Build an app with AI agents and the bottleneck is no longer code. I shipped a production assessment tool in three days. Here's what that actually required.

build-app-with-ai-agents · ai-native · engineering · software-delivery

Code is no longer the bottleneck to shipping software. Context, architecture, and test infrastructure are. When those are in place, an AI agent can close meaningful features in the time it used to take to write the ticket. I know because I just shipped a production assessment app in three days using AI agents, and the experience changed how I think about software delivery at every scale.

Most of the conversation about AI-assisted development focuses on individual features, on the "I used Copilot to write this function" level. What I want to talk about is what it's like to build an app with AI agents at the product level: standing up an entire product, end to end, in production, with real users. That's a different experience, and it has different lessons. The three-day build I just completed surfaced them quickly.

I Built a Production App with AI Agents in Three Days, and the Speed Is Not the Point

The product is the AI Engineering Maturity Assessment. It's a structured diagnostic tool for engineering leaders: 15 questions across 5 dimensions (context infrastructure, AI adoption practices, testing discipline, delivery velocity, and measurement). A scoring engine calculates a maturity level for each dimension. Users enter their email, receive a PDF of their results via email, and get a structured view of where their team sits and what to address first.

This is not a toy. It has a multi-step questionnaire flow, a scoring engine with dimension-level weighting, email delivery via Resend, a results page with a downloadable PDF, and session state management across the flow. The stack is Next.js, Tailwind, shadcn/ui, and TypeScript throughout. It runs on a real domain, handles real submissions, and delivers real results.

The reason the speed matters is not to impress anyone with a timeline. It's because the speed is evidence of something structural. The constraint in building this was never execution. The execution happened fast. What required time, judgment, and iteration was everything that came before the AI wrote a single line of code. That is the actual lesson, and it applies whether you're building an MVP, a feature for a production system, or a platform for a team.

Before the AI Could Build, I Had to Build the Context

The first thing I did when I started this build was not open a code editor. It was write a CLAUDE.md file. For those unfamiliar: CLAUDE.md is a project-level context file that Claude Code reads at the start of every session. It's the document that tells the AI what the codebase is, how it's structured, what decisions have been made and why, and what the agent should and should not do.

Writing it took roughly half a day. I documented the application architecture: where each concern lived, how the question flow was modelled, the scoring approach, the component conventions, the data types. I described what shadcn/ui components were in use and how they were being extended. I explained the email delivery logic and the PDF generation approach. I noted what was intentionally simple and should stay that way.
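To make that concrete, here is the rough shape such a file can take. This is an illustrative skeleton, not the actual document; the file paths and section names are hypothetical.

```markdown
# CLAUDE.md

## What this is
AI Engineering Maturity Assessment: a 15-question diagnostic across 5
dimensions, built with Next.js, Tailwind, shadcn/ui, and TypeScript.

## Structure
- `app/` — routes: questionnaire flow, results page, email/PDF endpoints
- `lib/scoring.ts` — scoring engine; dimension weights live here and nowhere else
- `lib/questions.ts` — question definitions; typed, never hard-coded in components
- `components/ui/` — shadcn/ui primitives; extend them, don't fork them

## Conventions
- All questionnaire state lives in a single session store
- Reuse existing shadcn/ui components before writing new ones
- Keep the scoring engine pure: no I/O, no React imports

## Do not
- Add a database; submissions go through the email pipeline only
- Introduce a second styling system alongside Tailwind
```

The "Do not" section earns its keep: it is where the things that are intentionally simple, and should stay that way, get stated out loud.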

Without that document, every feature prompt would have required the agent to infer context it didn't have. It would have made reasonable guesses based on what it had seen in similar codebases, and those guesses would have been wrong often enough to require constant correction. With the document, the agent could navigate the codebase with minimal rework. It knew the conventions because I had stated them. It knew what not to do because I had said so explicitly.

This is a principle that the best senior engineers understand intuitively: the investment in making a system legible pays for itself many times over. With AI agents, that principle is not optional. It's the foundation of everything that follows. Context is infrastructure. Treat it that way.

The Four Decisions That Made the AI Useful

When I look back at what made this build work, four specific decisions stand out. They are not AI tips. They are good engineering decisions that happen to matter significantly more when an agent is doing the execution.

The first was choosing a stack the agent knows well. I chose Next.js, Tailwind, and shadcn/ui deliberately because these are among the most widely used tools in the ecosystem. The agent had strong pattern knowledge of all three. When I asked for a component, I got something that matched the conventions of those libraries without extensive correction. Choosing an obscure or custom stack for a build like this would have produced a different result. The agent's effectiveness is a function of the breadth and quality of its training data on the relevant patterns.

The second was writing CLAUDE.md before touching components. I've already described why. The ordering matters. Every hour spent on context before writing features saved more than an hour of correction and rework later.

The third was starting with data models and types before UI. The scoring engine, the question structure, the result model: I defined all of these in TypeScript first, before asking the agent to build any interface. This gave the agent a structure to reason against. When it built the questionnaire flow, it was building against typed contracts, not inventing structure as it went. Type errors caught problems immediately rather than at review.
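A sketch of what "typed contracts first" can look like for this kind of assessment. The dimension names match the product; the weights, the 1–5 answer scale, and the 0–100 normalisation are illustrative assumptions, not the app's actual numbers.

```typescript
type Dimension =
  | "context-infrastructure"
  | "ai-adoption"
  | "testing-discipline"
  | "delivery-velocity"
  | "measurement";

interface Answer {
  questionId: string;
  dimension: Dimension;
  value: number; // answered on a 1–5 scale
}

interface DimensionScore {
  dimension: Dimension;
  score: number; // normalised to 0–100
}

interface AssessmentResult {
  dimensions: DimensionScore[];
  overall: number; // weighted 0–100
}

// Illustrative weights; they must sum to 1 for the overall score to stay in range.
const WEIGHTS: Record<Dimension, number> = {
  "context-infrastructure": 0.25,
  "ai-adoption": 0.2,
  "testing-discipline": 0.2,
  "delivery-velocity": 0.2,
  "measurement": 0.15,
};

// Assumes every dimension has at least one answer.
function score(answers: Answer[]): AssessmentResult {
  const dims = Object.keys(WEIGHTS) as Dimension[];
  const dimensions = dims.map((dimension) => {
    const values = answers
      .filter((a) => a.dimension === dimension)
      .map((a) => a.value);
    const mean = values.reduce((s, v) => s + v, 0) / values.length;
    // Map the 1–5 answer scale onto 0–100.
    return { dimension, score: ((mean - 1) / 4) * 100 };
  });
  const overall = dimensions.reduce(
    (s, d) => s + d.score * WEIGHTS[d.dimension],
    0
  );
  return { dimensions, overall };
}
```

Once contracts like these exist, the agent's UI work is constrained by the type checker rather than by prompt wording.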

The fourth was keeping a test running throughout. I didn't invest in comprehensive test coverage for a three-day build, but I did maintain a basic end-to-end test of the core flow: question progression, scoring, email capture, results render. Every time the agent made a change, the test told me immediately if something had broken. This is not a novel idea. It's the standard advice for any development process. What's different with AI agents is that the feedback loop matters even more, because the agent produces changes quickly and will continue confidently in a broken direction if you're not checking.
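The real check drove the UI end to end, but its shape is easy to show at the unit level. This is an illustrative stand-in: the state shape and function names are hypothetical, and the point is that invariants like "email capture only after the last question" fail loudly the moment the agent breaks them.

```typescript
interface FlowState {
  step: number;      // 0-based index into the questions
  answers: number[]; // one 1–5 value per answered question
  email?: string;
}

const TOTAL_QUESTIONS = 15;

function answer(state: FlowState, value: number): FlowState {
  if (value < 1 || value > 5) throw new Error(`answer out of range: ${value}`);
  return { ...state, answers: [...state.answers, value], step: state.step + 1 };
}

function captureEmail(state: FlowState, email: string): FlowState {
  if (state.answers.length !== TOTAL_QUESTIONS) {
    throw new Error("email capture before questionnaire is complete");
  }
  return { ...state, email };
}

// Walk the happy path: answer every question, then capture the email.
let state: FlowState = { step: 0, answers: [] };
for (let i = 0; i < TOTAL_QUESTIONS; i++) {
  state = answer(state, 3);
}
state = captureEmail(state, "lead@example.com");
```

Run after every agent change, a check like this turns "the agent broke the flow" from a discovery at review time into a discovery within seconds.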

What the Agent Got Wrong, and What That Tells You

The agent made mistakes. This is worth being explicit about, because a lot of the public conversation about AI coding tools presents them as either transformative magic or useless noise. The reality is more specific.

The first category of mistake was wrong component patterns early in the build, before I had locked the design system. The agent defaulted to patterns it had seen in similar projects, which weren't always the right patterns for the conventions I was establishing. Once I documented those conventions in CLAUDE.md, this problem went away almost entirely. The lesson: the agent can only follow conventions that are stated. It cannot infer conventions that exist only in your head.

The second category was inconsistent styling before the Tailwind configuration was settled. The agent applied class names that were individually valid but inconsistent with each other, producing an interface that worked but looked unresolved. Again, this resolved as soon as the design system was locked and documented. The fix was not a better prompt. It was better context.

The third category was a bug in the scoring logic that only surfaced with edge-case inputs. A specific combination of question responses produced a scoring result that fell outside the expected range. The bug was subtle, not the kind that shows up in a basic happy-path test. It required me to think through the logic carefully rather than trust that the agent had handled all cases correctly.
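The bug itself isn't reproduced here, but the class of bug is easy to illustrate with a hypothetical example: weights that don't sum to exactly 1 keep mid-range inputs looking plausible while pushing extreme inputs outside the expected band. A range guard catches it immediately; a happy-path test never does.

```typescript
// Hypothetical: these weights sum to 1.05, not 1.
const buggyWeights = [0.25, 0.25, 0.25, 0.2, 0.1];

function weightedOverall(dimensionScores: number[], weights: number[]): number {
  return dimensionScores.reduce((sum, s, i) => sum + s * weights[i], 0);
}

// Mid-range inputs produce a believable score…
const typical = weightedOverall([60, 55, 50, 45, 50], buggyWeights);

// …but the all-maximum edge case escapes the 0–100 range.
const extreme = weightedOverall([100, 100, 100, 100, 100], buggyWeights);

// The guard that surfaces the bug on the first extreme input:
function assertInRange(score: number): number {
  if (score < 0 || score > 100) throw new Error(`score out of range: ${score}`);
  return score;
}
```

The fix is trivial once seen. Seeing it required either an edge-case test or a human reading the logic carefully, and the agent supplies neither on its own.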

Each of these mistakes traced back to under-specified context or under-specified testing. The agent was not failing randomly. It was failing at the edges of what I had explicitly described and tested. That is the actual lesson. The agent is only as good as the scaffolding I gave it. It is a very fast executor within a well-defined space, and an inconsistent guesser outside of it.

The Bottleneck Is No Longer Code. It Is Architecture and Judgment.

The activities that took the most time in this build were not coding. They were: deciding what to build and what to exclude, deciding how to structure the data model, and reviewing what the agent produced to make sure it was right, not just functional.

Scope decisions are expensive regardless of how fast the execution is. The question of whether the results page should show dimension-level scores and an overall score, or just an overall score, is not a coding problem. It's a product problem that requires judgment about what the user needs and what the assessment is trying to communicate. That judgment took time. The code that followed took minutes.

Data modelling is similarly irreducible. How you represent a scored multi-dimensional assessment in a type-safe way has downstream consequences for the rendering logic, the email content, the PDF layout, and any future iteration on the tool. Getting the model right before building against it was the most consequential decision I made. The agent couldn't make that decision for me, and if I had let it, I would have inherited the agent's default assumptions rather than my own considered choices.

Review quality is a function of engineering judgment, not review time. Looking at what the agent produced and knowing whether it was correct required understanding the system well enough to spot errors that weren't syntactic. That's not a task you can automate within the same loop. It requires a human who understands what the system is supposed to do and can recognise when the implementation diverges.

This is the point that matters most for engineering leaders thinking about AI's impact on team structure and productivity. If you are a strong engineer, AI agents multiply your judgment. Your architectural decisions get executed faster. Your product instincts get explored more quickly. Your review time is well spent because you're spending it on things that actually require your expertise. If your judgment is weak, AI agents multiply your mistakes. The constraint has moved up the stack, and the stack now rewards different things than it used to.

Fast Execution Has Changed the Job, Not Eliminated the Need for Engineering Judgment

Before this build, my mental model of software delivery time was anchored to execution capacity. How many engineers, working how many hours, on how well-defined a codebase. AI has disrupted that model in a specific and important way.

The relevant constraints now are: how clear is the specification, how well-documented is the architecture, how strong is the review capacity, and how quickly can the feedback loop between generation and verification run. Those are the levers. Headcount is much less important for execution-heavy work than it used to be.

This has practical implications for how teams should be structured and how technical leaders should think about capability. A small team of engineers with strong architectural judgment, good context infrastructure, and a disciplined review process can now outship a larger team that doesn't have those things. The premium is on judgment, not execution.

It also has implications for what you should invest in before reaching for AI tools. Teams that jump straight to tooling without building context infrastructure and review discipline are going to produce faster at the cost of quality. I've watched this pattern in teams that got the sequencing wrong, and the recovery is expensive. The right order is: context first, test infrastructure second, tooling third. Not the reverse.

If you're thinking about what your team's current maturity looks like across these dimensions, the AI Engineering Maturity Assessment is the tool I built for exactly this. It covers context infrastructure, AI adoption practices, testing discipline, delivery velocity, and measurement. The irony of using it to assess readiness for the workflow I used to build it is not lost on me.

The three-day build was not a demonstration of what AI can do on its own. It was a demonstration of what happens when engineering judgment is applied to giving AI agents exactly what they need to be useful. The tools are fast. The judgment is still the hard part.


I help engineering teams close the gap between "we use AI tools" and "AI actually changed how we deliver." Book a 20-minute call and I'll tell you where the leverage is.

Working on something similar?

I work with founders and engineering leaders who want to close the gap between what their technology can do and what it's actually delivering.