Agentic TDD Non-Negotiable

For Agents, TDD Isn't Optional

I spent a weekend with Kent Beck's Test-Driven Development. The surprising part wasn't that it held up. It was which half held up.

Beck's argument for TDD is mostly a design argument. Write the test first, and the test pulls your design toward something small, decoupled, and honest. The verification, "this code is correct", comes along for the ride. That's why TDD is a genuine debate among humans. Plenty of excellent engineers skip it, because the design benefit is contestable and they get that discipline some other way.

For agents, the whole thing inverts. The design benefit fades into the background and the verification benefit becomes the entire point.

Here's why. You can't extend to an agent the same level of trust you'd give a senior engineer. You can't read every line it writes and nod. What you need is a mechanical gate, something that answers "did it get this right?" without a human in the loop. A test is exactly that: an executable, machine-checkable specification. It converts an unanswerable question ("is this code correct?") into a checkable one ("does it pass the spec?"). The suite becomes the agent's oracle.

So the failing test stops being a design nudge and becomes the input. "Make this test pass" is a far less ambiguous instruction than any prose description of what you want. Test-first turns out to be better implicit prompting for coding agents.

The part of Beck that quietly keeps an agent honest

One rule from the book matters more for agents than it ever did for humans: the test must fail before it passes.

For a person, that's a sanity check. For an agent, it's load-bearing. An agent will happily write a test that's green from birth: tautological, asserting nothing, technically passing. Forcing the test red first is your cheap proof that it has teeth. Skip that step and you get high coverage that verifies nothing, which is worse than no coverage because it looks like safety.
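To make the red-first rule concrete, here is a minimal sketch. The function names (`apply_discount`, `check_tautological`, `check_with_teeth`) are hypothetical and exist only for illustration. The point is that a tautological check stays green against a broken implementation, while a check that states its expectation independently goes red first.

```python
def apply_discount(price: float, percent: float) -> float:
    """Deliberately broken implementation: it ignores the discount."""
    return price


def fixed_apply_discount(price: float, percent: float) -> float:
    """Corrected implementation."""
    return price * (100 - percent) / 100


def check_tautological(fn) -> bool:
    """Green from birth: compares the result with itself, so it cannot fail."""
    result = fn(100.0, 20.0)
    return result == result


def check_with_teeth(fn) -> bool:
    """States the expected value independently of the code under test."""
    return fn(100.0, 20.0) == 80.0


if __name__ == "__main__":
    print(check_tautological(apply_discount))    # True, although the code is wrong
    print(check_with_teeth(apply_discount))      # False: the check is red first
    print(check_with_teeth(fixed_apply_discount))  # True once the code is fixed
```

Running a new check against the not-yet-written or known-broken code is the cheap way to tell the two kinds apart: only the second one can ever fail.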

Where the unit test runs out of road

A unit test checks your business logic assuming its collaborators behave the way you think they do. Those assumptions live in your mocks. Which means if an assumption is wrong, the mock is wrong in the same direction, the test stays green, and the system is broken anyway. The mock can't catch a bug in the integration between two components, because the integration is the exact thing the mock replaced.
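A small sketch of that failure mode, using Python's standard `unittest.mock`. Everything here is an illustrative assumption: the `greeting_for` function, the `RealUserService` class, and the field names are hypothetical. The code assumes the collaborator returns a `name` field, the mock repeats that assumption, and the stand-in for the real service returns `full_name` instead.

```python
from unittest.mock import Mock


def greeting_for(user_id: int, client) -> str:
    """Business logic that assumes the user service returns a 'name' field."""
    user = client.get_user(user_id)
    return f"Hello, {user['name']}!"


class RealUserService:
    """Stand-in for the real collaborator, which actually returns 'full_name'."""

    def get_user(self, user_id: int) -> dict:
        return {"id": user_id, "full_name": "Ada Lovelace"}


# Unit check: the mock encodes the same wrong assumption as the code,
# so the result looks correct.
mock_client = Mock()
mock_client.get_user.return_value = {"id": 1, "name": "Ada Lovelace"}
unit_result = greeting_for(1, mock_client)

# Integration check: the real collaborator exposes the mismatch.
try:
    greeting_for(1, RealUserService())
    integration_ok = True
except KeyError:
    integration_ok = False
```

The unit check is green and the integration check fails, with no change to the code in between. The only difference is whether the seam was real.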

This is fine when your system is rich in business logic and thin on integration: a domain core, lots of real decisions, few external calls. Unit tests there are cheap and catch the bugs that matter.

It falls apart when the system is mostly glue. Typical CRUD and microservices work is thin on logic and thick on interfacing: this service calls that one, writes to a database, reads off a queue. The real risk lives in those integrations, and that's precisely the region a mock-based unit suite is blind to.

A fully green suite buys you confidence in units, not in the system.

For agents this gap is worse, because the agent doesn't carry your mental model of how the pieces fit. The integration test is what supplies that model. And these tests are much harder for agents to game: a mock test can be satisfied by lining the mock up around the wrong behaviour, but "POST this, then GET that, expect this real result through a real database" can't be faked into green. The agent has to make the system actually work.
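As a rough illustration of the "POST this, then GET that" shape, here is a sketch that round-trips a record through an actual SQL engine. The assumptions are loud ones: in-memory SQLite stands in for the containerized database you would really use, and `post_item` and `get_item` are hypothetical handlers with the HTTP layer left out.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the table the hypothetical handlers rely on."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )


def post_item(conn: sqlite3.Connection, name: str) -> int:
    """Hypothetical POST handler: store an item and return its id."""
    cur = conn.execute("INSERT INTO items (name) VALUES (?)", (name,))
    conn.commit()
    return cur.lastrowid


def get_item(conn: sqlite3.Connection, item_id: int):
    """Hypothetical GET handler: load an item by id, or None if missing."""
    row = conn.execute(
        "SELECT id, name FROM items WHERE id = ?", (item_id,)
    ).fetchone()
    return None if row is None else {"id": row[0], "name": row[1]}


def round_trip_check() -> bool:
    """POST, then GET, and compare against what the database really stored."""
    conn = sqlite3.connect(":memory:")  # stand-in for a containerized database
    try:
        create_schema(conn)
        new_id = post_item(conn, "widget")
        return get_item(conn, new_id) == {"id": new_id, "name": "widget"}
    finally:
        conn.close()
```

There is no mock to line up here. If the insert, the schema, or the query is wrong, the check fails, and the only way to turn it green is to make the write and the read agree.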

So for agentic development you have to move up the testing pyramid, and treat integration and end-to-end tests as first-class citizens, not as an afterthought above the "real" unit tests.

The tension: confidence is slow

Here's the catch I keep hitting. Integration and e2e tests buy the confidence you actually need, but they're slower than a unit test. And the original reason the pyramid is shaped the way it is (lots of fast unit tests, few slow e2e ones) was never about virtue. It was about time. Slow verification loses. Fast, easy-to-verify wins.

But notice where this loop lives for an agent. It isn't the org-wide CI pipeline; that's a separate, downstream gate. It's the local verification loop: the agent making a change and checking its own output before it moves on. And that changes which problems are real and which aren't.

If you give the agent the real components, spun up in a containerized environment on the dev machine (the actual database, the actual queue, the actual dependent service, not a guess about them), the throttling problem mostly disappears. There's no shared CI queue to contend for, no flaky network to a remote staging environment, no waiting in line behind other people's builds. The only cost left is wall-clock time: how long the agent waits to verify a change, and how long until you can ship.

So the goal isn't maximum coverage. It's maximum behaviour coverage per unit of local loop time. Finding that sweet spot is the real engineering problem. Here's the strategy I've landed on.

  1. Stop climbing the pyramid; tune a portfolio. Match the test type to where your risk lives, not to a diagram. Glue-heavy service? Integration carries most of your confidence. Rich domain core? Unit tests stay cheap and do the heavy lifting. Each test type buys a different kind of confidence at a different cost.

  2. Use real components, locally and containerized. This is the foundation that makes the rest honest. Give the agent the actual dependencies (Testcontainers, ephemeral databases, LocalStack), not mocks of them, so its integration tests verify the seams instead of your assumptions about the seams. Because it's all local and isolated, you're not fighting contention or remote flakiness. The agent gets high-fidelity feedback at the speed of its own machine.

  3. Instruct the agent not to run the whole suite every time; run what the change touches. This is the main lever on loop time. When the agent edits a handful of files, it should run the tests that actually cover those files, not the entire behaviour suite on every iteration. Running everything on every micro-edit is what makes a behaviour-heavy suite feel unaffordable; running only the affected slice is what makes it affordable.

  4. Make it deterministic. This matters more for agents than for people. A flaky failure sends a human back to rerun the test. It sends an agent down a false trail, and the lesson the agent tends to draw is "make the test pass by weakening it."
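The "run what the change touches" step in the list above can be sketched as a lookup from changed files to the tests that cover them. This is a minimal illustration, not a real tool: `COVERAGE_MAP`, the file paths, and `affected_tests` are all hypothetical, and in practice the map would come from coverage data or a build graph. The sketch assumes a conservative fallback: a changed file with no known coverage triggers the full suite.

```python
# Hypothetical map from source files to the tests that exercise them.
COVERAGE_MAP = {
    "app/orders.py": {
        "tests/integration/test_orders_api.py",
        "tests/unit/test_orders.py",
    },
    "app/billing.py": {"tests/integration/test_billing_api.py"},
    "app/db.py": {
        "tests/integration/test_orders_api.py",
        "tests/integration/test_billing_api.py",
    },
}


def affected_tests(changed_files, coverage_map):
    """Return the sorted list of test files to run for a set of changed files."""
    all_tests = set().union(*coverage_map.values())
    selected = set()
    for path in changed_files:
        if path.startswith("tests/"):
            selected.add(path)  # an edited test always reruns
        elif path in coverage_map:
            selected |= coverage_map[path]
        else:
            # Unknown file: we cannot prove what it affects, so run everything.
            return sorted(all_tests)
    return sorted(selected)
```

The fallback is the design choice that matters. Selecting too many tests only costs loop time, while selecting too few lets a broken seam through, so anything the map cannot account for widens the run instead of narrowing it.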

Keeping the agent inside the loop without losing control

Two guardrails make all of this safe.

  1. Own the spec at the boundary. The acceptance, integration, and e2e tests that express what you actually want should be human-authored and effectively read-only to the agent, with any change flagged for your review. Below that line, let the agent write and rewrite its own unit tests freely as scaffolding. This is double-loop TDD: your failing acceptance test is the outer loop, and the agent's own micro red-green-refactor is the inner one.

    1. Initially, I'd go as deep as reviewing the unit tests as well, until you have confidence and enough context about what belongs in unit tests for your use case. Then write that down in the agent instructions.
  2. Invert your review priority. With humans, you scrutinize the code and skim the tests. With agents, scrutinize the tests.
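One way to make the spec "effectively read-only" mechanical is a small check over the files an agent changed. The sketch below assumes a hypothetical directory layout (`tests/acceptance/`, `tests/integration/`, `tests/e2e/`) and a hypothetical helper, `changes_needing_review`. How you obtain the list of changed files, for example from your version control system, is left out.

```python
# Hypothetical locations of the human-owned spec.
PROTECTED_PREFIXES = ("tests/acceptance/", "tests/integration/", "tests/e2e/")


def changes_needing_review(changed_files, protected=PROTECTED_PREFIXES):
    """Return the changed files that fall under the human-owned spec."""
    # str.startswith accepts a tuple and matches any of its prefixes.
    return sorted(path for path in changed_files if path.startswith(protected))


if __name__ == "__main__":
    changed = [
        "src/orders.py",
        "tests/e2e/test_checkout.py",
        "tests/unit/test_orders.py",
    ]
    # Only the e2e test is flagged; unit tests stay free for the agent to edit.
    print(changes_needing_review(changed))
```

An empty result means the agent stayed below the line. Anything else is exactly the set of tests to scrutinize first.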

The bottom line

Beck handed us the inner loop: the design discipline, the unit cycle, and the small insistence that a test fail before it passes, which turns out to be the exact thing that keeps an agent honest. What the book leaves out are the outer loops, and those are precisely what agentic development needs most: integration and contract tests for confidence in the system, and the acceptance-test-as-spec that lets you drop an agent into the cycle without ceding control of what "correct" means.

For humans, TDD was a way to design well. For agents, it's the gate that makes the work trustworthy at all.