# StoryProof Docs
## Install

```bash
$ curl -fsSL https://storyproof.pages.dev/install.sh | sh
```
Then add the verification skill to your project and start verifying.
## Using with coding agents
StoryProof works after any AI coding agent — Claude Code, Codex, Cursor, Windsurf, or just you at a terminal.
### Add the verification skill
Drop this file into your project. Works with any agent that supports skills or instruction files.
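With Claude Code, for example, skills live at `.claude/skills/<name>/SKILL.md`; other agents have their own locations. A minimal sketch (the source filename `verify.md` is a placeholder for wherever you saved the file shown below):

```bash
# Claude Code convention; adjust the destination for your agent
mkdir -p .claude/skills/verify
cp verify.md .claude/skills/verify/SKILL.md
```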
````markdown
---
name: verify
description: Verify code changes with StoryProof. Use when the user asks to verify,
check, prove, or test their changes. Triggers include "verify this", "check my
changes", "prove it works", "did I break anything", or any request to validate
that code does what the spec says.
---
# Verify with StoryProof
You are running an independent verification loop. StoryProof derives acceptance
criteria from the spec, scans existing tests, and produces runtime evidence.
Your job is to drive the loop until the verdict is clean.
## When to run
- After implementing a feature or fix
- Before committing or creating a PR
- When the user says "verify", "check this", "prove it works", or "did I break anything"
## Step 1: Determine the spec and context
Ask the user what was built if not obvious. The spec should describe behavior:
- Good: "Add owner deletion with cascade to pets and visits"
- Bad: "Add deleteOwner method to OwnerController"
Also ask: **is the app running?** If yes, get the URL (e.g., `http://localhost:8080`)
— pass it with `--url` in the prove step. If not running, unit and integration
tests still work — E2E behaviors will be marked unproven.
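If you're unsure whether the app is up, probe it first (the URL is the example above):

```bash
# Succeeds only if something answers at the example URL
curl -sf http://localhost:8080/ >/dev/null && echo running || echo "not running"
```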
## Step 2: Check
```bash
storyproof check --spec "<the spec>"
```
Read the terminal output. It shows: likely defects, coverage gaps, and what prove
will do. The report opens automatically in the browser.
- **Exit code 0** → all covered. Done.
- **Exit code 1** → gaps exist. Go to Step 3.
- **Exit code 2** → likely bugs found. Go to Step 3.
## Step 3: Prove
```bash
storyproof prove --url <app-url-if-running>
```
This writes and runs tests — unit, integration, and E2E. Wait for completion.
The report opens automatically.
- **Exit code 0** → all verified with runtime evidence. Done.
- **Exit code 1** → some behaviors couldn't be tested. Tell the user what's
unproven and why.
- **Exit code 2** → tests failed. Real bugs confirmed. Go to Step 4.
## Step 4: Fix and re-prove
Read the failing test output to understand what's wrong.
For each `likely_blocker` in `evidence.json`:
- `ac_id` — which acceptance criterion failed
- `exit_status` — "fail" or "error"
- `output` — the actual error message
- `file` — full path to the test file
- `command` — what was run
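One way to surface those entries without reading the whole file (the exact JSON layout may vary, so this matches on field values rather than paths):

```bash
# Print any object in evidence.json whose exit_status is "fail" or "error"
jq '.. | objects | select(.exit_status? == "fail" or .exit_status? == "error")' \
  .storyproof/runs/*/evidence.json
```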
Fix the **code**, not the test. If the fix requires an architectural decision,
ask the user.
```bash
storyproof prove --url <app-url-if-running>
```
Repeat until exit code 0.
## Step 5: Report
Tell the user:
- How many behaviors were verified
- Any remaining unproven items and why
- The report opens automatically in the browser after each run
- Evidence is in `.storyproof/runs/*/evidence.json` — per-AC proof with test
file paths, exit codes, and output
## When to stop
- **Exit code 0 from prove** → stop, verification complete
- **3 consecutive prove failures on the same AC** → stop, tell the user the
fix isn't working and suggest manual investigation
- **Environment blocker** (can't install dependencies, app won't start) → stop,
tell the user what's blocking and mark ACs as unproven
## Rules
- NEVER skip prove when check finds gaps
- NEVER modify StoryProof's test files to make them pass — fix the code
- NEVER run check without a spec — ask the user
- Use `--focus AC002` if prove is slow — verify one behavior at a time
- If `storyproof` is not installed:
`curl -fsSL https://storyproof.pages.dev/install.sh | sh`
````

## The pattern
The pattern is the same regardless of tool:
- Agent writes code
- `storyproof check --spec "..."` — finds gaps and bugs
- `storyproof prove` — writes tests, runs them, opens report
- Fix what broke — the agent or you fixes the code
- `storyproof prove` again — confirms the fix
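As a shell session, one full pass of the loop looks like this (spec and URL are examples):

```bash
storyproof check --spec "Add owner deletion with cascade to pets and visits"
storyproof prove --url http://localhost:8080   # writes tests, runs them, opens the report
# ...fix the code, not the tests...
storyproof prove --url http://localhost:8080   # repeat until exit code 0
```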
## Check — find what's broken

```bash
$ storyproof check --spec "Add owner deletion with cascade"
```
Reads your code and diff, derives acceptance criteria, scans existing tests:
- **Likely defects** (`likely_blocker`) — code analysis found a probable bug
- **Needs proof** (`not_covered`) — no test covers this behavior
- **Covered** (`covered`) — an existing test addresses this
- **Coverage gaps** — a test exists but doesn't prove the right thing
~$0.50 per run. ~4 minutes.
### check flags

| Flag | Default | Description |
|---|---|---|
| `--spec` | — | Describe the change (required unless `--spec-file` given) |
| `--spec-file` | — | Read spec from a file |
| `--dir` | `.` | Project directory to verify |
| `--base` | auto-detect | Diff base: a branch name (`main`), a commit ref (`HEAD~3`), or omit for auto |
| `--model` | `claude-sonnet-4-6` | LLM model override |
| `--changes` | `last-commit` | What to diff: `last-commit`, `unstaged`, or a branch name |
| `--json` | `false` | Output verdict as JSON to stdout |
| `--force` | `false` | Re-derive criteria even if the diff hasn't changed |
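For example, to check unstaged work against a spec kept in a file and get a machine-readable verdict (the spec path is a placeholder):

```bash
storyproof check --spec-file specs/owner-deletion.md --changes unstaged --json
```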
### check exit codes

| Code | Meaning |
|---|---|
| 0 | All behaviors covered by existing tests |
| 1 | Coverage gaps — some behaviors have no evidence |
| 2 | Likely blockers — probable bugs detected in the code |
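The exit codes make `check` easy to script around. A minimal sketch:

```bash
storyproof check --spec "Add owner deletion with cascade"
case $? in
  0) echo "all behaviors covered" ;;
  1) echo "coverage gaps: run prove" ;;
  2) echo "likely blockers: run prove, then fix the code" ;;
esac
```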
## Prove — produce evidence

```bash
$ storyproof prove
```
Writes and runs tests for every unresolved behavior. Tests stay in your repo.
- **Verified** (`verified`) — test ran and passed
- **Confirmed blocker** (`likely_blocker`) — test ran and failed (real bug)
- **Unproven** (`unproven`) — couldn't run a test (environment issue)
~$1.00 per run. ~8 minutes.
### prove flags

| Flag | Default | Description |
|---|---|---|
| `--spec` | — | Task spec (auto-resolved from a prior check if omitted) |
| `--spec-file` | — | Read spec from a file |
| `--dir` | `.` | Project directory |
| `--base` | auto-detect | Diff base |
| `--model` | `claude-sonnet-4-6` | LLM model override |
| `--url` | — | Running app URL for E2E tests (e.g., `http://localhost:8080`) |
| `--focus` | — | Prove a single AC by ID (e.g., `AC002`) |
| `--no-install` | `false` | Disable all dependency installs (mark E2E as unproven instead) |
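For example, to re-prove a single slow behavior against a running app without installing anything:

```bash
storyproof prove --url http://localhost:8080 --focus AC002 --no-install
```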
### prove exit codes

| Code | Meaning |
|---|---|
| 0 | All behaviors verified with runtime evidence |
| 1 | Some behaviors unproven (environment issue, dependency missing) |
| 2 | Tests ran and failed — confirmed bugs in the code |
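In a script, exit code 1 usually warrants a warning while 2 should fail the run. A sketch (`APP_URL` is a placeholder):

```bash
storyproof prove --url "$APP_URL"
status=$?
if [ "$status" -eq 2 ]; then
  echo "confirmed bugs: fix the code and re-prove" >&2
  exit 1
elif [ "$status" -eq 1 ]; then
  echo "some behaviors unproven: see the report for why" >&2
fi
```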
## Output

Everything lives in `.storyproof/` at your repo root.

```
.storyproof/
├── project.json           # detected stack (auto, first run)
├── latest                 # pointer to most recent run
└── runs/
    └── <run_id>/
        ├── verdict.json   # per-AC status
        ├── report.html    # open in browser
        ├── summary.md     # terminal summary
        ├── checks/        # test files (prove)
        └── evidence/      # runtime output (prove)
```
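The `latest` pointer makes the newest run scriptable. A sketch, assuming it holds the run id as plain text (the format isn't documented here):

```bash
run=$(cat .storyproof/latest)
cat ".storyproof/runs/$run/summary.md"      # terminal summary
open ".storyproof/runs/$run/report.html"    # macOS; use xdg-open on Linux
```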
### Verdict statuses
| Status | Meaning |
|---|---|
| verified | Test ran, behavior confirmed |
| covered | Existing test addresses this (check only) |
| likely_blocker | Probable bug found |
| not_covered | No evidence exists |
| unproven | Couldn't run a test |
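To see how a run's statuses are distributed, you can tally them straight from `verdict.json`. The per-AC layout is assumed rather than documented, so this matches on status strings wherever they appear:

```bash
# Count each known status value anywhere in verdict.json
jq '[.. | strings | select(. == ("verified","covered","likely_blocker","not_covered","unproven"))]
    | group_by(.) | map({(.[0]): length}) | add' \
  .storyproof/runs/*/verdict.json
```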
## How StoryProof decides what to test
StoryProof doesn't just check "does a test exist." It asks: does the right kind of test exist for the actual risk?
Every acceptance criterion gets scored on three risk dimensions:
### `variation_risk` — "How many ways can this go wrong?"

Measures edge-case density. A function that takes a string has more ways to fail than one that takes a boolean.

- **Low:** Toggle a feature flag. One input, two states.
- **High:** Parse user-provided dates across timezones, locales, and formats.

When high → StoryProof writes unit tests covering edge cases and boundary values.
### `boundary_risk` — "Does this cross a real seam?"

Measures whether the behavior crosses a system boundary — database, API, service call, file system. Mocking these boundaries hides real failures.

- **Low:** Pure function that transforms data in memory.
- **High:** Service writes to two database tables in a transaction, then calls an external API.

When high → StoryProof writes integration tests that hit the real database or real HTTP endpoint. No mocks.
### `last_mile_risk` — "Does the user actually see the right thing?"

Measures whether the behavior depends on rendering, JavaScript, or browser interaction. Server-side logic can be correct while the UI is broken.

- **Low:** API returns JSON. The response is the product.
- **High:** User clicks "Delete," a confirmation modal appears, they confirm, the page redirects, a success toast shows.

When high → StoryProof bootstraps Playwright and runs browser tests. For server-rendered HTML without JavaScript, integration tests that assert the rendered markup are sufficient.
### Why this matters

A common pattern: your repo has 90% test coverage, all green. But the tests are all unit tests that mock the database. The `boundary_risk` is high, yet no test crosses the real boundary. StoryProof catches this — a unit test that mocks the DB doesn't settle boundary risk, regardless of coverage percentage.
## CI Integration
Coming soon — GitHub Actions, GitLab CI, and webhook-triggered verification.