# StoryProof Docs

## Using StoryProof with coding agents

StoryProof runs after any AI coding agent — Claude Code, Codex, Cursor, Windsurf, or anything else that writes code.
## Claude Code — Skill

Drop this file into your project. Type `/verify` after any code change.
---
name: verify
description: Verify the current change with StoryProof. Use after implementing
features, fixing bugs, or before committing — runs independent check and
prove cycle to find defects, coverage gaps, and lying tests.
---
# Verify with StoryProof
You are verifying that the code change actually does what was asked. StoryProof
is an independent auditor — it derives acceptance criteria from the spec, scans
existing tests, and produces runtime evidence.
## Step 1: Determine the spec
Ask the user what was built if not obvious. The spec should describe **behavior**,
not implementation:
- Good: "Add owner deletion with cascade to pets and visits"
- Bad: "Add deleteOwner method to OwnerController"
## Step 2: Check
```bash
storyproof check --spec "<the spec>"
```
Read the terminal output. It will show:
- **Likely defects** — probable bugs (exit code 2)
- **Needs proof** — behaviors with no test coverage
- **Covered** — behaviors with existing evidence
- **Coverage gaps** — tests that exist but don't prove the right thing
If exit code is 0, all behaviors are covered. You're done.
## Step 3: Prove
If check found gaps or blockers:
```bash
storyproof prove
```
This writes and runs tests for every unresolved behavior. Wait for it to complete.
## Step 4: Fix and re-prove
If prove found failing tests (real bugs):
1. Read the test failure output to understand what's wrong
2. Fix the code (not the test — the test is correct)
3. Run `storyproof prove` again
4. Repeat until exit code 0
## Step 5: Review artifacts
```bash
open .storyproof/runs/*/report.html
```
Show the user the verdict summary. New test files are in
`.storyproof/runs/*/checks/` and copied to the repo's test directory.
## Rules
- NEVER skip prove when check finds gaps — the point is runtime evidence
- NEVER modify StoryProof's test files to make them pass — fix the code instead
- If prove takes too long, use `--focus AC002` to prove one behavior at a time
- If `storyproof` is not installed, tell the user to run:
`curl -fsSL https://storyproof.pages.dev/install.sh | sh`
## OpenAI Codex

Add to your `AGENTS.md`:

```markdown
## Verification

After implementing changes, verify with StoryProof:

    storyproof check --spec "<feature description>"

If exit code is not 0:

    storyproof prove

Fix blockers and re-run prove. Do not mark a task complete until exit code 0.
```
## Cursor / Windsurf / Any agent

The pattern is the same regardless of tool:

- Agent writes code
- `storyproof check --spec "..."` — finds gaps and bugs
- `storyproof prove` — writes tests, runs them, records evidence
- Fix what broke — the agent or you fixes the code
- `storyproof prove` again — confirms the fix
StoryProof doesn't know or care which agent wrote the code. It verifies behavior, not authorship.
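Scripted, that loop is only a few lines. Here is a sketch in Python: it assumes `storyproof` is on your `PATH` and relies only on the `check`/`prove` commands and exit codes documented on this page; the `MAX_ROUNDS` cap and the interactive prompt are illustrative choices, not part of StoryProof.

```python
import subprocess

MAX_ROUNDS = 5  # safety valve so a stubborn blocker can't loop forever

def run(cmd):
    """Run a command and return its exit code."""
    return subprocess.run(cmd).returncode

def verify(spec, prompt=input):
    """Check first; if anything is unresolved, prove until exit code 0."""
    if run(["storyproof", "check", "--spec", spec]) == 0:
        return True  # all behaviors already covered, nothing to prove
    for _ in range(MAX_ROUNDS):
        if run(["storyproof", "prove"]) == 0:
            return True
        # Nonzero prove means failing tests, i.e. real bugs. Fix the
        # code (never the generated tests), then loop and re-prove.
        prompt("Fix the code, then press Enter to re-prove... ")
    return False
```

An agent harness would typically replace the `prompt` callback with a call back into the agent itself.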
## Install

```bash
$ curl -fsSL https://storyproof.pages.dev/install.sh | sh
```

Or from source:

```bash
$ git clone https://github.com/storyproof/storyproof.git
$ cd storyproof/cli && pip install -e .
```

Requires the Claude Code CLI.
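A quick preflight can confirm both CLIs are available before starting a verification loop. A sketch; the binary name `claude` for the Claude Code CLI is an assumption:

```python
import shutil

REQUIRED = ["storyproof", "claude"]  # "claude" = Claude Code CLI binary name (assumed)

def missing_tools(required=REQUIRED):
    """Return the required CLIs that are not on PATH."""
    return [tool for tool in required if shutil.which(tool) is None]

if missing_tools():
    print("missing:", ", ".join(missing_tools()))
    print("install StoryProof: curl -fsSL https://storyproof.pages.dev/install.sh | sh")
```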
## Check — find what's broken

```bash
$ storyproof check --spec "Add owner deletion with cascade"
```

StoryProof reads your code, derives acceptance criteria from your spec, scans existing tests, and tells you:

- **Likely defects** (`likely_blocker`) — code analysis found a probable bug
- **Needs proof** (`not_covered`) — no test covers this behavior
- **Covered** (`covered`) — an existing test addresses this
- **Coverage gaps** — a test exists but doesn't prove the right thing

~$0.50 per run. ~4 minutes.
## Prove — produce evidence

```bash
$ storyproof prove
```

Writes and runs tests for every unresolved behavior. Tests stay in your repo.

- **Verified** (`verified`) — a test ran and passed
- **Confirmed blocker** (`likely_blocker`) — a test ran and failed (a real bug)
- **New tests written** — files added to your test suite

~$1.00 per run. ~8 minutes.
## The loop
After fixing a blocker, run prove again. It re-runs everything to catch regressions.
## CLI Reference

### check

| Flag | Default | Description |
|---|---|---|
| `--spec` | (required) | Describe the change |
| `--spec-file` | — | Read the spec from a file |
| `--changes` | `last-commit` | What to diff: `last-commit`, `unstaged`, or a branch name |
| `--json` | `false` | Output the verdict as JSON |
| `--force` | `false` | Re-derive even if the diff is unchanged |
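The `--json` flag makes the verdict scriptable. This reference does not specify the schema, so the sketch below invents a minimal hypothetical shape (a list of ACs, each with an `id` and a `status`) purely to show the pattern:

```python
import json

# Hypothetical --json output shape; the real schema may differ.
sample = """
{
  "acs": [
    {"id": "AC001", "status": "covered"},
    {"id": "AC002", "status": "not_covered"},
    {"id": "AC003", "status": "likely_blocker"}
  ]
}
"""

def summarize(raw):
    """Count acceptance criteria per status from a JSON verdict."""
    counts = {}
    for ac in json.loads(raw)["acs"]:
        counts[ac["status"]] = counts.get(ac["status"], 0) + 1
    return counts

print(summarize(sample))  # {'covered': 1, 'not_covered': 1, 'likely_blocker': 1}
```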
### prove

| Flag | Default | Description |
|---|---|---|
| `--focus` | — | Prove a single AC by ID (e.g., `AC002`) |
### Exit codes

| Code | Meaning |
|---|---|
| 0 | All verified |
| 1 | Gaps remain |
| 2 | Blockers found |
## Output

Everything lives in `.storyproof/` at your repo root.

```
.storyproof/
├── project.json       # detected stack (auto, first run)
├── latest             # pointer to most recent run
└── runs/
    └── <run_id>/
        ├── verdict.json   # per-AC status
        ├── report.html    # open in browser
        ├── summary.md     # terminal summary
        ├── checks/        # test files (prove)
        └── evidence/      # runtime output (prove)
```

Open the report:

```bash
$ open .storyproof/runs/*/report.html
```
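To consume these artifacts programmatically, follow the `latest` pointer to the newest run. A sketch, assuming `latest` is either a symlink to the run directory or a plain file holding the run ID (its actual format is not specified here):

```python
import json
from pathlib import Path

def latest_verdict(root="."):
    """Resolve .storyproof/latest to the newest run's verdict.json."""
    sp = Path(root) / ".storyproof"
    latest = sp / "latest"
    if latest.is_symlink() or latest.is_dir():
        run_dir = latest.resolve()          # pointer is a symlink
    else:
        run_dir = sp / "runs" / latest.read_text().strip()  # pointer is a run ID
    return json.loads((run_dir / "verdict.json").read_text())
```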
## Verdict statuses

| Status | Meaning |
|---|---|
| `verified` | Test ran, behavior confirmed |
| `covered` | Existing test addresses this (check only) |
| `likely_blocker` | Probable bug found |
| `not_covered` | No evidence exists |
| `unproven` | Couldn't run a test |
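These statuses map naturally onto the exit codes: `likely_blocker` means blockers (2), while `not_covered` and `unproven` mean gaps (1). A sketch of that gate, useful when scripting against `verdict.json`; the flat list of statuses is an assumed input shape:

```python
def gate(statuses):
    """Map per-AC statuses to the documented exit codes:
    2 if any probable bug, 1 if any evidence gap, else 0."""
    if "likely_blocker" in statuses:
        return 2  # blockers found
    if any(s in ("not_covered", "unproven") for s in statuses):
        return 1  # gaps remain
    return 0      # all verified / covered

print(gate(["verified", "covered"]))         # 0
print(gate(["verified", "not_covered"]))     # 1
print(gate(["verified", "likely_blocker"]))  # 2
```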
## How StoryProof decides what to test
StoryProof doesn't just check "does a test exist." It asks: does the right kind of test exist for the actual risk?
Every acceptance criterion gets scored on three risk dimensions:
### `variation_risk` — "How many ways can this go wrong?"

Measures edge-case density. A function that takes a string has more ways to fail than one that takes a boolean.

- **Low:** Toggle a feature flag. One input, two states.
- **High:** Parse user-provided dates across timezones, locales, and formats.

When high → StoryProof writes unit tests covering edge cases and boundary values.
### `boundary_risk` — "Does this cross a real seam?"

Measures whether the behavior crosses a system boundary — database, API, service call, file system. Mocking these boundaries hides real failures.

- **Low:** Pure function that transforms data in memory.
- **High:** Service writes to two database tables in a transaction, then calls an external API.

When high → StoryProof writes integration tests that hit the real database or real HTTP endpoint. No mocks.
### `last_mile_risk` — "Does the user actually see the right thing?"

Measures whether the behavior depends on rendering, JavaScript, or browser interaction. Server-side logic can be correct while the UI is broken.

- **Low:** API returns JSON. The response is the product.
- **High:** User clicks "Delete," a confirmation modal appears, they confirm, the page redirects, a success toast shows.

When high → StoryProof bootstraps Playwright and runs browser tests. For server-rendered HTML without JavaScript, integration tests that assert on the rendered markup are sufficient.
### Why this matters
A common pattern: your repo has 90% test coverage, all green. But the tests are all unit tests that mock the database. The boundary_risk is high, but no test crosses the real boundary. StoryProof catches this — a unit test that mocks the DB doesn't settle boundary risk, regardless of coverage percentage.
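The dimension-to-test-type selection described above can be sketched as a small function. The numeric 0-to-1 scores and the 0.5 threshold are illustrative assumptions; this page only distinguishes low from high:

```python
def pick_test_types(variation_risk, boundary_risk, last_mile_risk,
                    threshold=0.5):
    """Choose test kinds from the three risk scores (0.0-1.0).
    Scale and threshold are illustrative, not StoryProof internals."""
    kinds = []
    if variation_risk >= threshold:
        kinds.append("unit")         # edge cases and boundary values
    if boundary_risk >= threshold:
        kinds.append("integration")  # real DB / real HTTP, no mocks
    if last_mile_risk >= threshold:
        kinds.append("browser")      # Playwright end-to-end
    return kinds or ["unit"]         # always produce some evidence

print(pick_test_types(0.2, 0.9, 0.1))  # ['integration']
```

This is why the 90%-coverage repo above still fails: its mocked unit tests never produce the "integration" evidence that a high `boundary_risk` demands.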
## CI Integration
Coming soon — GitHub Actions, GitLab CI, and webhook-triggered verification.