Why TDD and agents are a natural pair
In Module 2 you saw that Claude verifies its own work by running your tests. Test-driven development takes that to its logical conclusion: if a passing test is what tells Claude it’s done, then writing the test first gives the agent an unambiguous target and an automatic stopping condition.
This flips the usual worry about AI-written code. The fear is “what if it writes something that looks right but isn’t?” The answer is to make right mean passes a test you reviewed — and to write that test before the implementation exists, so it can’t be retrofitted to match a flawed solution.
Specify with a failing test, and the agent has a precise definition of done it cannot argue with. The test is the spec, executable.
The loop, made explicit
Run the classic red–green–refactor cycle, but split the responsibilities so you own the specification and Claude owns the toil.
1. Red — you specify behaviour as a failing test.
> We're adding a loyalty discount: 5% off when a customer has more
than 10 completed orders, never stacking with a promo code. Write
the JUnit 5 tests for DiscountService that capture exactly that —
including the no-stacking rule and the boundary at exactly 10
orders. Do NOT implement the service yet. Run them and show me red.
Review the tests now, while they’re the only artifact. This is the most important review in the whole cycle: the tests are your specification. Are the boundaries right? Is “more than 10” strict? Does a test pin the no-stacking rule? If the spec is wrong here, everything downstream is wrong.
2. Green — Claude implements until the tests pass.
> Now implement DiscountService so those tests pass. Don't change the
tests. Run them and show me green.
Because the tests are frozen, Claude can’t cheat by weakening them. It has to make the real logic correct.
3. Refactor — clean up under the safety net.
> Tests are green. Refactor DiscountService for readability — extract
the eligibility check, name the magic numbers — and keep the suite
green throughout.
The discipline that makes this safe: review the tests before the implementation, and never let the same turn write both a test and the code that makes it pass for the first time. When that separation holds, green means something.
Choosing the right verifier
“Run the tests” is only as good as the tests. Tell Claude which tool fits the situation:
- Plain JUnit 5 + AssertJ for pure logic — calculators, mappers, domain rules. Fast, no mocks, the bulk of your suite. AssertJ’s fluent assertions also give Claude clearer failure messages to work from.
- Mockito for collaborators you must isolate — but set a budget. “Mock the
PaymentGateway; use the realOrderRepositoryagainst an in-memory list.” Over-mocking produces tests that pass while the system is broken. @SpringBootTestfor wiring and configuration — sparingly, because it’s slow.- Testcontainers for anything that touches the database, a queue, or another service. A test against a real Postgres in a container catches what an H2 stand-in cannot: even in PostgreSQL-compatibility mode, H2 diverges on native queries, vendor types (
jsonb, arrays, enums),ON CONFLICT, sequences andRETURNING, and real locking and constraint semantics — exactly the things a JPA mapping or a hand-written@Querygets wrong.
> Write an integration test for OrderRepository using Testcontainers
with a real Postgres image, not H2. Assert that the custom @Query
for findOverdue() returns the right rows across a timezone boundary.
Tip: Make the test commands explicit in
CLAUDE.md—./gradlew testfor the fast unit suite,./gradlew integrationTestfor the slow Testcontainers one. Then you can tell Claude “run the fast suite” during the loop and save the slow one for the end, and it knows what you mean.
Coverage is a map, not a destination
You can point Claude at coverage gaps:
> Run JaCoCo and show me the branches in the pricing package with no
coverage. For each genuinely meaningful gap, add a test. Skip
trivial getters and the generated MapStruct code.
But treat the number as a map of where to look, not a target to maximise. 100% coverage of shallow tests is weaker than 70% of tests that assert real behaviour. The phrase “genuinely meaningful gap” tells the agent to use judgment instead of carpet-bombing the report with assertions-on-everything.
A worked rhythm
For a non-trivial feature, the session looks like this:
- Describe the behaviour; Claude writes the tests; you review the tests carefully.
- Claude implements until green.
- Claude refactors with the suite as a net.
- You add the one edge case Claude missed (there’s usually one), and have it covered.
- Run the full suite, including integration tests, before you call it done.
You spent your attention on the specification and the final review — the parts that need a human — and delegated the typing.
What just happened
You learned to use tests as the agent’s specification and stopping condition: specify in a failing test, review it before any implementation exists, let the loop turn it green, and refactor under the net. Next, you’ll scale this discipline up to changes that span an entire codebase.
Key takeaways
- Tests are the agent’s definition of done — so write and review them first, before the implementation can bias them.
- Never let one turn both write a test and first make it pass; that separation is what makes “green” trustworthy.
- Match the verifier to the task: AssertJ for logic, Mockito sparingly, Testcontainers (not H2) for real persistence.
- Use coverage tools to find meaningful gaps, not as a number to maximise.