criteria.json
├── AC001: PASS ✓
├── AC002: FAIL ✗
└── AC003: UNPROVEN

plan.json
├── layer: unit_test
├── risk: high
└── approach: wiring_test

summary.md
├── proven: 12/16
├── defects: 2
└── verdict: DO NOT SHIP

evidence/
├── AC001_unit.log ✓
├── AC005_e2e.log ✓
└── AC009_wiring.log ✗

intent.md
├── behavior: webhook
├── risk: boundary
└── dim: variation

StoryProof

Your coding agent builds.
Ours verifies.

Derives what should be true from your spec. Tests what actually is. Shows you the gap.

$ curl -fsSL https://storyproof.pages.dev/install.sh | sh

Works with Claude Code, Codex, Cursor, Windsurf — or just you ;)

“The builder and the inspector should never share the same blind spots.”

— Eran Kahana, Stanford Law School CodeX

What slips through

Spec drift

Your agent built something close to what was asked. But “close” ships silent failures. The subscription status updates but the end date doesn’t. Not a crash. Not a test failure. Just... not right.
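To make that failure mode concrete, here is a minimal sketch. The names (`Subscription`, `handle_renewal`) and the 30-day period are assumptions for illustration, not StoryProof output or anyone's real code. The check that mirrors the code passes; the check the spec implies does not.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Subscription:
    status: str
    end_date: date


def handle_renewal(sub: Subscription, period_days: int = 30) -> Subscription:
    """Hypothetical handler with the drift described above."""
    sub.status = "active"
    # Missing: sub.end_date += timedelta(days=period_days)
    return sub


def status_is_active(sub: Subscription) -> bool:
    """The check a builder tends to write: it mirrors what the code does."""
    return sub.status == "active"


def end_date_extended(before: date, sub: Subscription, period_days: int = 30) -> bool:
    """The check the spec implies: a renewal extends the paid period."""
    return sub.end_date == before + timedelta(days=period_days)


start = date(2025, 1, 31)
renewed = handle_renewal(Subscription(status="past_due", end_date=start))
print("status check passes:", status_is_active(renewed))
print("end_date check passes:", end_date_extended(start, renewed))
```

Nothing crashes and the status check is green. Only the second check, which starts from the spec rather than the implementation, exposes the gap.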

87% of defects get zero valid tests from AI agents. The edge cases your agent never thought of don’t have tests because they’re not in the code.

ASE 2024

100% / 4%

Test coverage vs faults caught. AI-generated suites execute every line but catch 4% of possible bugs. Green CI, zero confidence.

Wang et al.
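A toy illustration of how coverage and fault detection can come apart. It does not reproduce the study's numbers. The function, the three hand-written mutants, and both suites are hypothetical; the point is only that a suite can execute every line and still reject no buggy variant.

```python
def is_renewal_due(days_left: int) -> bool:
    """Hypothetical rule: a renewal is due when 3 or fewer days remain."""
    if days_left <= 3:
        return True
    return False


# Hand-written mutants: small, plausible bugs in the same rule.
MUTANTS = {
    "off_by_one": lambda d: d < 3,
    "wrong_threshold": lambda d: d <= 30,
    "inverted": lambda d: d > 3,
}


def weak_suite(fn) -> bool:
    """Executes both branches (full line coverage) but asserts almost nothing."""
    results = [fn(0), fn(100)]
    return all(isinstance(r, bool) for r in results)


def boundary_suite(fn) -> bool:
    """Checks values on each side of the boundary, plus the extremes."""
    return fn(3) is True and fn(4) is False and fn(0) is True and fn(100) is False


def killed(suite) -> int:
    """Count the mutants a suite rejects (the suite returns False for them)."""
    return sum(1 for mutant in MUTANTS.values() if not suite(mutant))


# Both suites are green on the correct implementation.
assert weak_suite(is_renewal_due) and boundary_suite(is_renewal_due)
print("weak suite kills", killed(weak_suite), "of", len(MUTANTS))
print("boundary suite kills", killed(boundary_suite), "of", len(MUTANTS))
```

Both suites cover every line of `is_renewal_due`. Only the one with assertions on the boundary tells the correct function apart from the broken ones.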

“We are replacing validation with transcription.”

— David Adamo Jr.

Why your coding agent shouldn’t verify its own work

Coding agents today vs. StoryProof

Coding agents today: General-purpose. Writes code, docs, tests, everything.
StoryProof: Purpose-built verification agent with a dedicated testing methodology.

Coding agents today: Derives tests from the code it just wrote: “replacing validation with transcription.”
StoryProof: Derives acceptance criteria from the spec, not the code. Different starting point, different blind spots.

Coding agents today: Same agent writes code and tests: “builder and inspector share the same blind spots.”
StoryProof: Independent agent that never saw the implementation. Catches what the builder assumed was obvious.

Coding agents today: Picks test layers by habit: “mocked every little thing, even the code it should test.”
StoryProof: Scores three risk dimensions per behavior. Unit for edges, integration for boundaries, E2E for UI. Picks the leanest proof at the right layer.

Coding agents today: Generates tests that pass. 100% coverage, 4% of faults caught.
StoryProof: Reads assertions, not test names. If the assertion doesn’t prove the behavior, it’s flagged as a gap.

Coding agents today: 87% of defects get zero valid tests. Edge cases the agent never considered don’t get tested.
StoryProof: Derives 16 behaviors from a one-line spec, including the 13 nobody asked for. Edge cases, error paths, boundary conditions.
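What “reads assertions, not test names” could look like in practice, as a crude sketch. This is not StoryProof’s implementation: the `REQUIRED` mapping and the sample tests are hypothetical, and the check only asks whether a test’s `assert` statements mention the field its behavior is about.

```python
import ast

TEST_SOURCE = '''
def test_end_date_extends_on_renewal():
    sub = renew(make_subscription())
    assert sub.status == "active"

def test_invalid_signature_returns_401():
    response = post_webhook(signature="bad")
    assert response.status_code == 401
'''


def asserted_names(func: ast.FunctionDef) -> set[str]:
    """Collect attribute and variable names that appear inside assert statements."""
    names: set[str] = set()
    for node in ast.walk(func):
        if isinstance(node, ast.Assert):
            for sub in ast.walk(node.test):
                if isinstance(sub, ast.Attribute):
                    names.add(sub.attr)
                elif isinstance(sub, ast.Name):
                    names.add(sub.id)
    return names


def find_gaps(source: str, required: dict[str, str]) -> list[str]:
    """Return tests whose assertions never mention the field they claim to prove."""
    gaps = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef) and node.name in required:
            if required[node.name] not in asserted_names(node):
                gaps.append(node.name)
    return gaps


# Hypothetical mapping: test name -> field its assertions must touch.
REQUIRED = {
    "test_end_date_extends_on_renewal": "end_date",
    "test_invalid_signature_returns_401": "status_code",
}
print(find_gaps(TEST_SOURCE, REQUIRED))
```

The first test is named after `end_date` but only asserts on `status`, so it is reported as a gap. The second asserts on the field it names and is left alone.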

Coding agents are extraordinary at building. Verification needs a different kind of thinking.

“The single biggest differentiator between agentic engineering and vibe coding is testing.”

— Addy Osmani, Google Chrome

Evidence

Evidence, not opinions

Coding agents say “looks correct.” StoryProof produces proof you can inspect, replay, and ship to CI.

Right layer for the risk

Unit test for edge cases and input validation. Integration test with a real database for boundary crossings. Playwright browser test for JavaScript-dependent UI.

No mocks where the risk is real. A test that mocks the database doesn’t prove the database works.
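A small sketch of why the mock proves nothing here. The repository function and its misspelled column are hypothetical, and an in-memory SQLite database stands in for whatever database your project actually uses.

```python
import sqlite3
from unittest.mock import MagicMock


def save_subscription(conn, customer_id: str, status: str) -> None:
    """Hypothetical repository function; note the misspelled column name."""
    conn.execute(
        "INSERT INTO subscriptions (customer_id, statuss) VALUES (?, ?)",
        (customer_id, status),
    )


def passes_with_mock() -> bool:
    """A mocked connection accepts any SQL, so the typo goes unnoticed."""
    conn = MagicMock()
    save_subscription(conn, "cus_123", "active")
    return conn.execute.called


def passes_with_real_db() -> bool:
    """An in-memory SQLite database actually parses and runs the statement."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE subscriptions (customer_id TEXT, status TEXT)")
    try:
        save_subscription(conn, "cus_123", "active")
    except sqlite3.OperationalError:
        return False  # the database rejects the unknown column
    finally:
        conn.close()
    return True


print("mocked test passes:", passes_with_mock())
print("real-database test passes:", passes_with_real_db())
```

The mocked test is green because the mock never looks at the SQL. The test against a real database fails on the first run, which is the result you want before shipping.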

Runtime proof, not static analysis

Every verdict is backed by a test that ran. Failed = confirmed bug. Passed = proven behavior. Not “the code looks right to me.”

Static inspection isn’t evidence. StoryProof runs the test and records the output.

Evidence that stays in your CI

Tests land in your repo and run on every future PR. Verification carries over from your local machine to your CI pipeline automatically.

Your agent’s confidence dies with the session. StoryProof’s evidence survives in your CI, so a regression fails the build instead of shipping.

One spec. 14 derived behaviors. The right test for each.

Step 1

Extract behaviors

One sentence: “Add Stripe webhook for subscription renewals.” StoryProof derives 14 acceptance criteria — happy path, signature verification, idempotency, edge cases. Your agent thought of 3.

Step 2

Score the risk

Each behavior gets three risk scores: edge-case density, boundary crossings, user-visible rendering. These scores determine which kind of test can settle the question.

Step 3

Pick the leanest proof

Unit test for input validation. Integration test with the real Stripe SDK for webhook signatures. Browser test for the checkout-to-subscription flow. No over-testing, no under-testing.
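Steps 2 and 3 can be sketched as a small decision rule. The 0-to-1 scale, the 0.5 threshold, the order of precedence, and the example scores are all assumptions made for illustration; StoryProof’s actual scoring is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskScores:
    """Three hypothetical 0-to-1 scores, named after the dimensions in Step 2."""
    edge_case_density: float
    boundary_crossings: float
    user_visible_rendering: float


def pick_layer(scores: RiskScores, threshold: float = 0.5) -> str:
    """Pick the cheapest layer that can still settle the dominant risk."""
    if scores.user_visible_rendering >= threshold:
        return "e2e"          # only a browser can prove what the user sees
    if scores.boundary_crossings >= threshold:
        return "integration"  # the risk lives at a real boundary
    return "unit"             # edge cases are cheapest to prove in isolation


behaviors = {
    "rejects malformed payload": RiskScores(0.9, 0.1, 0.0),
    "verifies webhook signature": RiskScores(0.4, 0.8, 0.0),
    "checkout shows active subscription": RiskScores(0.2, 0.7, 0.9),
}
for name, scores in behaviors.items():
    print(f"{name}: {pick_layer(scores)}")
```

With these made-up scores the three behaviors land on unit, integration, and e2e respectively. Each gets one test at the lowest layer that can still settle the question.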

Real verification — full cycle

Spec

“Add Stripe webhook handler for subscription renewals”

What your agent built

WebhookController, StripeService, SubscriptionRepository. 2 unit tests — both pass. CI green.

storyproof check

14 acceptance criteria derived:

✓ Webhook receives valid event (covered)
✗ Invalid signature → 401 (your agent never tested this)
✗ Subscription end_date actually extends (code updates status, not date — spec drift)
? Customer downgrades mid-renewal (edge case)
? Duplicate webhook / idempotency (edge case)
… 9 more behaviors

storyproof prove

8 unit tests — edge cases, validation, error paths
1 integration test — real webhook POST with Stripe signature verification
1 e2e test — checkout → payment → subscription active in UI

CAUGHT: end_date never updated. Status changes, date doesn’t. Renewals would expire silently.

fix → prove → ship

10 new tests stay in your repo and run in CI on every future PR. If the webhook bug comes back, the build fails.
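For the “Invalid signature → 401” behavior, here is a standard-library sketch of the check the handler was missing. It follows Stripe’s documented scheme: HMAC-SHA256 over `timestamp.payload`, sent as `t=...,v1=...` in the `Stripe-Signature` header. The secret, the payload, and the 300-second tolerance are placeholders. In real code you would normally use the verification helper in Stripe’s official SDK rather than rolling your own.

```python
import hashlib
import hmac
import time


def sign(payload: bytes, secret: str, timestamp: int) -> str:
    """Build a header value in the `t=...,v1=...` format."""
    signed = f"{timestamp}.".encode() + payload
    digest = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={digest}"


def verify(payload: bytes, header: str, secret: str, tolerance: int = 300) -> int:
    """Return the status a handler might send: 200 if valid, 401 otherwise."""
    # Simplification: if the header carries several v1 values, only the last is kept.
    parts = dict(item.split("=", 1) for item in header.split(",") if "=" in item)
    try:
        timestamp = int(parts["t"])
        received = parts["v1"]
    except (KeyError, ValueError):
        return 401
    expected = sign(payload, secret, timestamp).split("v1=", 1)[1]
    if not hmac.compare_digest(expected, received):
        return 401
    if abs(time.time() - timestamp) > tolerance:
        return 401  # stale timestamp: possible replay
    return 200


SECRET = "whsec_test_placeholder"  # hypothetical secret for this sketch
body = b'{"type": "invoice.paid"}'
good = sign(body, SECRET, int(time.time()))
print(verify(body, good, SECRET))                     # valid signature
print(verify(body, "t=1,v1=deadbeef", SECRET))        # forged signature
print(verify(b'{"type": "tampered"}', good, SECRET))  # body changed after signing
```

The three calls cover the happy path, a forged header, and a tampered body. An integration test for this behavior would assert on exactly those responses.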

This is what you see.

Real terminal output from storyproof check on a Stripe webhook PR.

StoryProof Check — DO NOT SHIP

## Likely defects — code analysis suggests these will break

[AC004] Webhook handler updates subscription.status but not subscription.end_date

Expected: end_date extends by billing period on successful renewal

Prove will: Run integration test against Stripe webhook endpoint

[AC007] No signature verification — accepts any POST to /webhooks/stripe

Expected: Returns 401 for invalid Stripe-Signature header

## Needs proof — no evidence covers these behaviors

[AC009] Duplicate webhook handling — no idempotency key check

[AC011] Customer downgrades mid-renewal — old plan webhook fires

## Already covered by existing evidence

8 behaviors verified

Next: storyproof prove — settle all 6 unproven behaviors
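The duplicate-webhook behavior flagged as [AC009] comes down to applying each event at most once. A minimal sketch, assuming a hypothetical `RenewalHandler` that keys on the event’s `id` field; a real system would persist the seen IDs rather than hold them in memory.

```python
class RenewalHandler:
    """Hypothetical handler that applies each webhook event at most once."""

    def __init__(self) -> None:
        self.seen_event_ids: set[str] = set()  # a real system would persist this
        self.renewals_applied = 0

    def handle(self, event: dict) -> str:
        event_id = event["id"]
        if event_id in self.seen_event_ids:
            return "duplicate_ignored"
        self.seen_event_ids.add(event_id)
        self.renewals_applied += 1
        return "applied"


handler = RenewalHandler()
event = {"id": "evt_123", "type": "invoice.paid"}
print(handler.handle(event))     # first delivery
print(handler.handle(event))     # retry of the same event
print(handler.renewals_applied)  # the renewal was applied once
```

A proof for this behavior delivers the same event twice and asserts that the renewal count stays at one.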

$ storyproof check --spec "your change" → finds the gaps

$ storyproof prove → fills them with evidence

check → prove → fix → prove → ship