You shipped 15 features in a week with Claude Code. Three of them are silently broken, and you won’t find out until you click through the app by hand. Or until a user does.
This isn’t a feeling. CodeRabbit’s analysis of 470 GitHub PRs found AI-co-authored code has 1.7x more major issues than human-written code, with logic errors 75% more common. Tenzai tested five AI coding tools by building 15 identical web apps. Zero out of 15 implemented CSRF protection. Zero set security headers. All five tools introduced SSRF vulnerabilities. The bugs that slip through aren’t syntactic. They’re behavioral. AI produces code that runs, passes linting, and looks correct in a diff. The failures show up when a real user clicks through the actual flow.
I’ve been building North, a GTD task management app in Rust with Leptos, Axum, and Diesel, a stack I didn’t know, built almost entirely through vibe-coding. It moved fast. Around week three, it stopped being fun. Features I’d verified days ago were breaking under new changes, and I only found out by accident.
The core issue: vibe-coded apps don’t have a specification. If you haven’t written down what “working” means, it changes every commit.
The Problem: Silent Breakage at Scale
AI-generated PRs are big. They touch many files. Side effects are hard to spot even if you review the diff carefully. GitClear’s analysis of 211 million changed lines found that code reverted or updated within two weeks doubled from 2020 to 2024. Refactored lines dropped from 25% to under 10%. AI-assisted workflows produce more code, but that code churns faster and gets restructured less.
Here’s what happened in North: I had keyboard navigation working. Arrow keys moved between tasks, Shift+Arrow reordered them, Enter opened the editor. Then a refactor to the reactive signal system introduced a disposed-signal panic. The UI looked fine. No visible errors. But the browser console was screaming `RuntimeError: unreachable` from deep inside the WASM binary every time you navigated between tasks.
I fixed it. Two PRs later, a different change reintroduced the same class of bug. Different signal, same panic pattern.
There’s a perception problem too. METR ran a controlled trial with 16 experienced developers: those using AI tools took 19% longer to complete tasks, but believed AI had sped them up by 24%. If you can’t accurately gauge your own velocity, you definitely can’t gauge whether your features still work. That false confidence is exactly why testing infrastructure feels optional until something breaks in production.
Without automated checks, I was doing the same manual testing loop every time. And missing things. The app had grown past the point where I could hold all the expected behaviors in my head.
Step 1: Write a Regression Document
Before writing any test code, I wrote down what the app should do. Every feature, every interaction, as a checklist.
North’s `docs/regress.md` has 14 sections covering every page and feature:

```markdown
## 1. Auth

- [x] Navigate to `/login` — login form renders
- [x] Submit with wrong credentials — error message shown, no redirect
- [x] Submit with correct credentials — redirect to Inbox
- [x] Reload page — session persists (stays logged in)
- [x] Click Logout — redirect to `/login`, session cleared

## 3. Inbox

- [x] Page loads with task list
- [x] Type in inline input, press Enter — task created, appears in list
- [x] Click checkbox — task marked complete
- [x] Press Delete on selected task — confirmation, then deleted
```
Each section maps to a test file. The `[x]` marks mean that behavior is covered by an e2e test.
This document pulls triple duty. It’s a spec (what should the app do?), a test plan (what do we automate?), and an onboarding doc (how does the app work?). Writing it forced me to confront gaps in my own understanding of the app’s behavior.
Step 2: Containerize Your Test Runner
I wanted a test setup that’s isolated from the dev environment and works the same way for me and for CI. Playwright runs in its own Docker container alongside a test database and the app.
The `docker-compose.test.yml`:

```yaml
services:
  db:
    image: postgres:17-alpine
    environment:
      POSTGRES_DB: north_test
      POSTGRES_USER: north
      POSTGRES_PASSWORD: north
    tmpfs:
      - /var/lib/postgresql/data
    # Required: the `service_healthy` condition below needs a healthcheck.
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U north -d north_test"]
      interval: 2s
      timeout: 5s
      retries: 15

  app:
    build:
      context: .
      dockerfile: docker/dev/Dockerfile
    environment:
      DATABASE_URL: postgres://north:north@db:5432/north_test
    depends_on:
      db: { condition: service_healthy }
    command: >
      bash -c "diesel migration run &&
      cargo run --bin north-server --features ssr -- --seed &&
      cargo leptos watch"

  playwright:
    image: mcr.microsoft.com/playwright:v1.50.0-noble
    working_dir: /e2e
    volumes:
      - ./e2e:/e2e
    environment:
      BASE_URL: http://app:5000
    depends_on:
      - app
```
The database uses tmpfs for speed. No data persists between runs.
The `justfile` wraps the docker-compose commands:

```just
compose-test := "-p north-test -f docker-compose.test.yml"

# Launch Playwright UI mode (for humans)
playwright *args='test --ui-port=8080 --ui-host=0.0.0.0':
    docker compose {{ compose-test }} run --rm --service-ports playwright \
        npx playwright {{ args }}

# Run tests in already-running containers (for Claude Code)
playwright-exec *args='':
    docker compose {{ compose-test }} exec playwright \
        npx playwright test {{ args }}
```
The `playwright-exec` command is the important one. It runs tests inside already-running containers using `docker compose exec`, which means Claude Code can run the full e2e suite without spinning up and tearing down containers every time. Fast feedback loop.
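The Playwright container only knows where the app lives through the `BASE_URL` environment variable. The post doesn’t show the Playwright config itself, but a minimal `playwright.config.ts` sketch wiring that variable in might look like this (the `testDir` and fallback port are assumptions):

```typescript
import { defineConfig } from "@playwright/test";

export default defineConfig({
  testDir: "./tests",
  use: {
    // docker-compose.test.yml injects BASE_URL=http://app:5000;
    // fall back to localhost when running against a dev server directly.
    baseURL: process.env.BASE_URL ?? "http://localhost:5000",
    trace: "on-first-retry",
  },
});
```

With `baseURL` set, relative navigations like `page.goto("/inbox")` resolve against whichever host the environment provides, so the same specs run in Docker and on the host unchanged.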
Step 3: Cover the Regress Doc with E2E Tests
Each section of the regression document maps to a spec file. Here’s a test from `inbox.spec.ts`:

```typescript
import { expect } from "@playwright/test";
import { test } from "../fixtures/auth";
import { ApiHelper } from "../fixtures/api";

let api: ApiHelper;

test.describe("Inbox", () => {
  test.beforeEach(async ({ authenticatedPage }) => {
    // Reset task data via the API so every test starts from a clean slate.
    api = new ApiHelper(authenticatedPage.context());
    await api.deleteAllTasks();
  });

  test("creates task via inline input", async ({
    authenticatedPage: page,
  }) => {
    await page.goto("/inbox");

    // Wait for the empty state so the list is in a known condition.
    await page
      .locator('[data-testid="empty-task-list"]')
      .waitFor({ state: "visible" });

    await page.locator('[data-testid="inbox-add-task"]').click();
    const input = page.locator('[data-testid="inline-create-input"]');
    await input.fill("My new task");
    await input.press("Enter");

    const rows = page.locator('[data-testid="task-row"]');
    await expect(rows).toHaveCount(1);
    await expect(rows.first()).toContainText("My new task");
  });
});
```
A few conventions worth noting:
- The `authenticatedPage` fixture handles login. Every test starts with a logged-in session.
- `ApiHelper` sets up and tears down test data through the REST API, not through UI clicks. Tests only use the UI for the behavior they’re actually testing.
- `data-testid` selectors everywhere. CSS classes change with styling. Test IDs don’t.
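North’s `ApiHelper` isn’t shown in full in this post. As a minimal sketch of the convention (the `/api/tasks` endpoints, method names, and the `RequestLike` interface are assumptions, not North’s actual code), it wraps a request-capable context so tests can seed and clean data without touching the UI:

```typescript
// Structural stand-in for Playwright's APIRequestContext, so the helper
// can be exercised without a browser. In real tests you'd pass
// authenticatedPage.context().request, which satisfies the same shape.
interface RequestLike {
  get(url: string): Promise<{ json(): Promise<any> }>;
  post(url: string, opts?: { data?: unknown }): Promise<{ json(): Promise<any> }>;
  delete(url: string): Promise<unknown>;
}

export class ApiHelper {
  constructor(private request: RequestLike) {}

  // Remove every task so each test starts from a known-empty state.
  async deleteAllTasks(): Promise<void> {
    const res = await this.request.get("/api/tasks");
    const tasks: { id: number }[] = await res.json();
    for (const t of tasks) {
      await this.request.delete(`/api/tasks/${t.id}`);
    }
  }

  // Create a task directly through the API, bypassing the UI.
  async createTask(title: string): Promise<{ id: number }> {
    const res = await this.request.post("/api/tasks", { data: { title } });
    return res.json();
  }
}
```

Keeping setup and teardown in one helper is what makes the "tests only use the UI for the behavior they're testing" rule cheap to follow.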
North currently has 18 spec files with ~2,950 lines of test code covering auth, task CRUD, keyboard navigation, drag and drop, filters, settings, and more.
Step 4: Catch What You Can’t See
The bug that motivated this entire setup was invisible. WASM panics in Leptos don’t crash the page. The UI keeps rendering. But the browser console fills up with `RuntimeError: unreachable` and stack traces pointing to `__rust_abort`. You’d never catch this by looking at the screen.
The fix: a custom Playwright fixture that listens for console errors and fails the test if any appear.
```typescript
import { test as base, expect } from "@playwright/test";
// `AuthFixtures` and `loginViaUI` are defined elsewhere in the fixtures module.

export const test = base.extend<AuthFixtures>({
  page: async ({ page }, use) => {
    const errors: string[] = [];

    // Uncaught page exceptions (including WASM aborts) surface here.
    page.on("pageerror", (err) => errors.push(err.message));

    // Rust panics log "panicked at ..." to the console before aborting.
    page.on("console", (msg) => {
      if (msg.type() === "error" && msg.text().includes("panicked at")) {
        errors.push(msg.text());
      }
    });

    await use(page);

    // After the test body finishes, fail if anything panicked along the way.
    expect(errors, "Browser console errors detected").toEqual([]);
  },

  authenticatedPage: async ({ page }, use) => {
    await loginViaUI(page);
    await use(page);
  },
});
```
Every test in the suite now automatically fails if the app panics, even if the assertion on the UI passes. This caught three categories of bugs in North:
- Disposed-signal panics from accessing reactive signals after their owner was cleaned up
- Stale callback panics from event handlers referencing elements that were re-rendered
- Context access panics from calling `use_context()` inside event handlers instead of capturing at component creation
These bugs would have shipped. They didn’t show up in the UI. They only showed up in the console, and only if you happened to be looking.
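The filtering rule inside the fixture is small enough to factor out. As a hypothetical sketch (North’s actual code keeps it inline in the fixture above), extracting it into a standalone helper makes the rule itself unit-testable without a browser:

```typescript
// Hypothetical extraction of the fixture's filter: only console errors
// carrying a Rust panic marker should fail a test; other console noise
// (warnings, logs, unrelated errors) is ignored.
type ConsoleEntry = { type: string; text: string };

export function collectPanics(entries: ConsoleEntry[]): string[] {
  return entries
    .filter((e) => e.type === "error" && e.text.includes("panicked at"))
    .map((e) => e.text);
}
```

The narrow match matters: failing on every console error would make the suite flaky on third-party noise, while matching `panicked at` catches exactly the Rust panic output.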
Step 5: Teach Your AI to Use the Tests
This is the step that closes the loop. Add the test commands to your `CLAUDE.md` so the AI knows they exist:

```text
E2E tests (Playwright, run from host, not inside app container):

  just playwright        # Launch Playwright UI mode
  just playwright-exec   # Run tests in already-running containers
  just playwright-down   # Tear down test containers
```
Then create a skill or instruction file with your test conventions: which fixtures to use, how to write selectors, how to set up test data. In North, this lives in `.claude/skills/north-e2e/SKILL.md`. The TestDino Playwright Skill is a good reference for structuring this, with 70+ guides covering common Playwright patterns. Without a skill, Claude Code produces generic, fragile tests. With one, it follows your project’s conventions from the first attempt.
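The skill file’s actual contents aren’t reproduced in this post. As an illustrative sketch only (the bullet points are assumptions about what such a file would encode, not North’s real `SKILL.md`), it might distill the conventions above into a few hard rules:

```markdown
# north-e2e

Conventions for writing Playwright e2e tests in this repo.

- Import `test` from the auth fixtures module, never from
  `@playwright/test` directly, so the console-error check is always active.
- Seed and clean test data with `ApiHelper`; never set up state through UI clicks.
- Select elements only via `data-testid` attributes, never CSS classes.
- Every new test maps to a checklist item in `docs/regress.md`; add the
  item if it doesn't exist.
```

The value is less in any individual rule than in having them written down where the AI reads them on every task.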
The key workflow change: test-first, not test-after. Shipyard found that prompting Claude Code to write failing tests before the feature produces dramatically better results than bolting tests on afterward. When I find a bug in North, the process is: write a failing test, then fix the bug, then verify the test passes. Claude Code gets this right when the instruction is explicit.
The AI isn’t just writing application code anymore. It’s maintaining the behavioral specification alongside it.
What This Actually Cost
The Docker Compose setup took about two hours with Claude Code. The regression document took 30 minutes: I wrote one example section, had Claude Code review the codebase and generate the rest, then reviewed and tweaked it. The e2e tests took a few evenings of iterating over the regress doc, fixing actual regressions that the tests surfaced, and expanding coverage. Then I spent time optimizing the GitHub Actions workflow: parallelizing Build and Check steps, adding caches, cutting CI from 20 minutes to under 8.
What it bought: every subsequent PR gets validated against the full behavior spec in a pipeline that runs fast enough to not slow you down. Bugs that used to reappear stopped reappearing. I spend less time clicking through the app and more time building features.
Other teams are converging on the same pattern. OpenObserve built eight specialized Claude Code agents for their QA pipeline, growing from 380 to 700+ tests and catching a production bug in the process. Jampa.dev uses the Claude Code SDK to select which e2e tests to run per PR, cutting CI time by 84%. You don’t need that level of sophistication to start. A regression doc, a Playwright container, and a skill file will get you most of the value.
You don’t need 100% coverage. You need coverage on the things that break. Start with the regression doc, automate the critical paths, and expand from there. The spec is the foundation. The tests are just the enforcement.
North is open source if you want to see the full setup.