# StoryProof Docs
## Install

```bash
$ curl -fsSL https://storyproof.pages.dev/install.sh | sh
```
Then add the verification skill to your project and start verifying.
## Using with coding agents
StoryProof works after any AI coding agent — Claude Code, Codex, Cursor, Windsurf, or just you at a terminal.
### Add the verification skill
Drop this file into your project. Works with any agent that supports skills or instruction files.
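With Claude Code, for example, skills live at `.claude/skills/<name>/SKILL.md`; other agents have their own locations. A minimal sketch (the source filename `verify.md` is a placeholder for wherever you saved the file shown below):

```bash
# Claude Code convention; adjust the destination for your agent
mkdir -p .claude/skills/verify
cp verify.md .claude/skills/verify/SKILL.md
```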
````markdown
---
name: verify
description: Verify code changes with StoryProof. Use when the user asks to verify,
check, prove, or test their changes. Triggers include "verify this", "check my
changes", "prove it works", "did I break anything", or any request to validate
that code does what the spec says.
---
# Verify with StoryProof
You are running an independent verification loop. StoryProof derives acceptance
criteria from the spec, scans existing tests, and produces runtime evidence.
Your job is to drive the loop until the verdict is clean.
## When to run
- After implementing a feature or fix
- Before committing or creating a PR
- When the user says "verify", "check this", "prove it works", or "did I break anything"
## Step 1: Determine the spec and context
Ask the user what was built if not obvious. The spec should describe behavior:
- Good: "Add owner deletion with cascade to pets and visits"
- Bad: "Add deleteOwner method to OwnerController"
Also ask: **is the app running?** If yes, get the URL (e.g., `http://localhost:8080`)
— pass it with `--url` in the prove step. If not running, unit and integration
tests still work — E2E behaviors will be marked unproven.
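If you're unsure whether the app is up, probe it first (the URL is the example above):

```bash
# Succeeds only if something answers at the example URL
curl -sf http://localhost:8080/ >/dev/null && echo running || echo "not running"
```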
## Step 2: Check
```bash
storyproof check --spec "<the spec>"
```
Read the terminal output. It shows: likely defects, coverage gaps, and what prove
will do. The report opens automatically in the browser.
- **Exit code 0** → all covered. Done.
- **Exit code 1** → gaps exist. Go to Step 3.
- **Exit code 2** → likely bugs found. Go to Step 3.
## Step 3: Prove
```bash
storyproof prove --url <app-url-if-running>
```
This writes and runs tests — unit, integration, and E2E. Wait for completion.
The report opens automatically.
- **Exit code 0** → all verified with runtime evidence. Done.
- **Exit code 1** → some behaviors couldn't be tested. Tell the user what's
unproven and why.
- **Exit code 2** → tests failed. Real bugs confirmed. Go to Step 4.
## Step 4: Fix and re-prove
Read the failing test output to understand what's wrong.
For each `likely_blocker` in `evidence.json`:
- `ac_id` — which acceptance criterion failed
- `exit_status` — "fail" or "error"
- `output` — the actual error message
- `file` — full path to the test file
- `command` — what was run
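One way to surface those entries without reading the whole file (the exact JSON layout may vary, so this matches on field values rather than paths):

```bash
# Print any object in evidence.json whose exit_status is "fail" or "error"
jq '.. | objects | select(.exit_status? == "fail" or .exit_status? == "error")' \
  .storyproof/runs/*/evidence.json
```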
Fix the **code**, not the test. If the fix requires an architectural decision,
ask the user.
```bash
storyproof prove --url <app-url-if-running>
```
Repeat until exit code 0.
## Step 5: Report
Tell the user:
- How many behaviors were verified
- Any remaining unproven items and why
- The report opens automatically in the browser after each run
- Evidence is in `.storyproof/runs/*/evidence.json` — per-AC proof with test
file paths, exit codes, and output
## When to stop
- **Exit code 0 from prove** → stop, verification complete
- **3 consecutive prove failures on the same AC** → stop, tell the user the
fix isn't working and suggest manual investigation
- **Environment blocker** (can't install dependencies, app won't start) → stop,
tell the user what's blocking and mark ACs as unproven
## Rules
- NEVER skip prove when check finds gaps
- NEVER modify StoryProof's test files to make them pass — fix the code
- NEVER run check without a spec — ask the user
- Use `--focus AC002` if prove is slow — verify one behavior at a time
- If `storyproof` is not installed:
`curl -fsSL https://storyproof.pages.dev/install.sh | sh`
````

## The pattern
The pattern is the same regardless of tool:
- Agent writes code
- `storyproof check --spec "..."` — finds gaps and bugs
- `storyproof prove` — writes tests, runs them, opens report
- Fix what broke — the agent or you fixes the code
- `storyproof prove` again — confirms the fix
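As a shell session, one full pass of the loop looks like this (spec and URL are examples):

```bash
storyproof check --spec "Add owner deletion with cascade to pets and visits"
storyproof prove --url http://localhost:8080   # writes tests, runs them, opens the report
# ...fix the code, not the tests...
storyproof prove --url http://localhost:8080   # repeat until exit code 0
```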
## Check — find what's broken

```bash
$ storyproof check --spec "Add owner deletion with cascade"
```
Reads your code and diff, derives acceptance criteria, scans existing tests:
- **Likely defects** (`likely_blocker`) — code analysis found a probable bug
- **Needs proof** (`not_covered`) — no test covers this behavior
- **Covered** (`covered`) — an existing test addresses this
- **Coverage gaps** — a test exists but doesn't prove the right thing
~$0.50 per run. ~4 minutes.
### check flags

| Flag | Default | Description |
|---|---|---|
| `--spec` | — | Describe the change (required unless `--spec-file` given) |
| `--spec-file` | — | Read spec from a file |
| `--dir` | `.` | Project directory to verify |
| `--base` | auto-detect | Diff base: a branch name (`main`), a commit ref (`HEAD~3`), or omit for auto |
| `--model` | `claude-sonnet-4-6` | LLM model override |
| `--changes` | `last-commit` | What to diff: `last-commit`, `unstaged`, or a branch name |
| `--json` | `false` | Output verdict as JSON to stdout |
| `--force` | `false` | Re-derive criteria even if the diff hasn't changed |
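For example, to check unstaged work against a spec kept in a file and get a machine-readable verdict (the spec path is a placeholder):

```bash
storyproof check --spec-file specs/owner-deletion.md --changes unstaged --json
```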
### check exit codes

| Code | Meaning |
|---|---|
| 0 | All behaviors covered by existing tests |
| 1 | Coverage gaps — some behaviors have no evidence |
| 2 | Likely blockers — probable bugs detected in the code |
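The exit codes make `check` easy to script around. A minimal sketch:

```bash
storyproof check --spec "Add owner deletion with cascade"
case $? in
  0) echo "all behaviors covered" ;;
  1) echo "coverage gaps: run prove" ;;
  2) echo "likely blockers: run prove, then fix the code" ;;
esac
```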
## Prove — produce evidence

```bash
$ storyproof prove
```
Writes and runs tests for every unresolved behavior. Tests stay in your repo.
- **Verified** (`verified`) — test ran and passed
- **Confirmed blocker** (`likely_blocker`) — test ran and failed (real bug)
- **Unproven** (`unproven`) — couldn't run a test (environment issue)
~$1.00 per run. ~8 minutes.
### prove flags

| Flag | Default | Description |
|---|---|---|
| `--spec` | — | Task spec (auto-resolved from a prior check if omitted) |
| `--spec-file` | — | Read spec from a file |
| `--dir` | `.` | Project directory |
| `--base` | auto-detect | Diff base |
| `--model` | `claude-sonnet-4-6` | LLM model override |
| `--url` | — | Running app URL for E2E tests (e.g., `http://localhost:8080`) |
| `--focus` | — | Prove a single AC by ID (e.g., `AC002`) |
| `--no-install` | `false` | Disable all dependency installs (mark E2E as unproven instead) |
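For example, to re-prove a single slow behavior against a running app without installing anything:

```bash
storyproof prove --url http://localhost:8080 --focus AC002 --no-install
```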
### prove exit codes

| Code | Meaning |
|---|---|
| 0 | All behaviors verified with runtime evidence |
| 1 | Some behaviors unproven (environment issue, dependency missing) |
| 2 | Tests ran and failed — confirmed bugs in the code |
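In a script, exit code 1 usually warrants a warning while 2 should fail the run. A sketch (`APP_URL` is a placeholder):

```bash
storyproof prove --url "$APP_URL"
status=$?
if [ "$status" -eq 2 ]; then
  echo "confirmed bugs: fix the code and re-prove" >&2
  exit 1
elif [ "$status" -eq 1 ]; then
  echo "some behaviors unproven: see the report for why" >&2
fi
```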
## Output

Everything lives in `.storyproof/` at your repo root.

```
.storyproof/
├── project.json           # detected stack (auto, first run)
├── latest                 # pointer to most recent run
└── runs/
    └── <run_id>/
        ├── verdict.json   # per-AC status
        ├── report.html    # open in browser
        ├── summary.md     # terminal summary
        ├── checks/        # test files (prove)
        └── evidence/      # runtime output (prove)
```
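The `latest` pointer makes the newest run scriptable. A sketch, assuming it holds the run id as plain text (the format isn't documented here):

```bash
run=$(cat .storyproof/latest)
cat ".storyproof/runs/$run/summary.md"      # terminal summary
open ".storyproof/runs/$run/report.html"    # macOS; use xdg-open on Linux
```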
### Verdict statuses
| Status | Meaning |
|---|---|
| verified | Test ran, behavior confirmed |
| covered | Existing test addresses this (check only) |
| likely_blocker | Probable bug found |
| not_covered | No evidence exists |
| unproven | Couldn't run a test |
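To see how a run's statuses are distributed, you can tally them straight from `verdict.json`. The per-AC layout is assumed rather than documented, so this matches on status strings wherever they appear:

```bash
# Count each known status value anywhere in verdict.json
jq '[.. | strings | select(. == ("verified","covered","likely_blocker","not_covered","unproven"))]
    | group_by(.) | map({(.[0]): length}) | add' \
  .storyproof/runs/*/verdict.json
```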
## How StoryProof decides what to test
StoryProof doesn't just check "does a test exist." It asks: does the right kind of test exist for the actual risk?
Every acceptance criterion gets scored on three risk dimensions:
### `variation_risk` — "How many ways can this go wrong?"

Measures edge-case density. A function that takes a string has more ways to fail than one that takes a boolean.

- **Low:** Toggle a feature flag. One input, two states.
- **High:** Parse user-provided dates across timezones, locales, and formats.

When high → StoryProof writes unit tests covering edge cases and boundary values.
### `boundary_risk` — "Does this cross a real seam?"

Measures whether the behavior crosses a system boundary — database, API, service call, file system. Mocking these boundaries hides real failures.

- **Low:** Pure function that transforms data in memory.
- **High:** Service writes to two database tables in a transaction, then calls an external API.

When high → StoryProof writes integration tests that hit the real database or real HTTP endpoint. No mocks.
### `last_mile_risk` — "Does the user actually see the right thing?"

Measures whether the behavior depends on rendering, JavaScript, or browser interaction. Server-side logic can be correct while the UI is broken.

- **Low:** API returns JSON. The response is the product.
- **High:** User clicks "Delete," a confirmation modal appears, they confirm, the page redirects, a success toast shows.

When high → StoryProof bootstraps Playwright and runs browser tests. For server-rendered HTML without JavaScript, integration tests that assert the rendered markup are sufficient.
### Why this matters

A common pattern: your repo has 90% test coverage, all green. But the tests are all unit tests that mock the database. The `boundary_risk` is high, yet no test crosses the real boundary. StoryProof catches this — a unit test that mocks the DB doesn't settle boundary risk, regardless of coverage percentage.
## CI Integration
Coming soon — GitHub Actions, GitLab CI, and webhook-triggered verification.