StoryProof Docs

Install

$ curl -fsSL https://storyproof.pages.dev/install.sh | sh

Then add the verification skill to your project and start verifying.

Using with coding agents

StoryProof runs after any AI coding agent — Claude Code, Codex, Cursor, Windsurf, or just you at a terminal.

Add the verification skill

Drop this file into your project. Works with any agent that supports skills or instruction files.

.claude/skills/verify.md
---
name: verify
description: Verify code changes with StoryProof. Use when the user asks to verify,
  check, prove, or test their changes. Triggers include "verify this", "check my
  changes", "prove it works", "did I break anything", or any request to validate
  that code does what the spec says.
---

# Verify with StoryProof

You are running an independent verification loop. StoryProof derives acceptance
criteria from the spec, scans existing tests, and produces runtime evidence.
Your job is to drive the loop until the verdict is clean.

## When to run

- After implementing a feature or fix
- Before committing or creating a PR
- When the user says "verify", "check this", "prove it works", or "did I break anything"

## Step 1: Determine the spec and context

Ask the user what was built if not obvious. The spec should describe behavior:

- Good: "Add owner deletion with cascade to pets and visits"
- Bad: "Add deleteOwner method to OwnerController"

Also ask: **is the app running?** If yes, get the URL (e.g., `http://localhost:8080`)
— pass it with `--url` in the prove step. If not running, unit and integration
tests still work — E2E behaviors will be marked unproven.

## Step 2: Check

```bash
storyproof check --spec "<the spec>"
```

Read the terminal output. It shows: likely defects, coverage gaps, and what prove
will do. The report opens automatically in the browser.

- **Exit code 0** → all covered. Done.
- **Exit code 1** → gaps exist. Go to Step 3.
- **Exit code 2** → likely bugs found. Go to Step 3.

## Step 3: Prove

```bash
storyproof prove --url <app-url-if-running>
```

This writes and runs tests — unit, integration, and E2E. Wait for completion.
The report opens automatically.

- **Exit code 0** → all verified with runtime evidence. Done.
- **Exit code 1** → some behaviors couldn't be tested. Tell the user what's
  unproven and why.
- **Exit code 2** → tests failed. Real bugs confirmed. Go to Step 4.

## Step 4: Fix and re-prove

Read the failing test output to understand what's wrong.

For each `likely_blocker` in `evidence.json`:
- `ac_id` — which acceptance criterion failed
- `exit_status` — "fail" or "error"
- `output` — the actual error message
- `file` — full path to the test file
- `command` — what was run
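If it helps to script this step, the blocker fields can be pulled out mechanically. A minimal sketch, assuming `evidence.json` carries a top-level `likely_blockers` array with the fields listed above (the sample data and layout here are illustrative, not real StoryProof output):

```shell
# Write an illustrative evidence.json (sample values, not real output)
cat > /tmp/evidence.json <<'EOF'
{"likely_blockers": [
  {"ac_id": "AC002", "exit_status": "fail",
   "output": "expected 204, got 500",
   "file": "/repo/.storyproof/runs/r1/checks/ac002.test.ts",
   "command": "npm test -- ac002"}
]}
EOF

# List each failing AC with its status and test file
python3 -c '
import json
for b in json.load(open("/tmp/evidence.json"))["likely_blockers"]:
    print(b["ac_id"], b["exit_status"], b["file"], sep=" | ")
'
```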

Fix the **code**, not the test. If the fix requires an architectural decision,
ask the user.

```bash
storyproof prove --url <app-url-if-running>
```

Repeat until exit code 0.

## Step 5: Report

Tell the user:
- How many behaviors were verified
- Any remaining unproven items and why
- The report opens automatically in the browser after each run
- Evidence is in `.storyproof/runs/*/evidence.json` — per-AC proof with test
  file paths, exit codes, and output

## When to stop

- **Exit code 0 from prove** → stop, verification complete
- **3 consecutive prove failures on the same AC** → stop, tell the user the
  fix isn't working and suggest manual investigation
- **Environment blocker** (can't install dependencies, app won't start) → stop,
  tell the user what's blocking and mark ACs as unproven

## Rules

- NEVER skip prove when check finds gaps
- NEVER modify StoryProof's test files to make them pass — fix the code
- NEVER run check without a spec — ask the user
- Use `--focus AC002` if prove is slow — verify one behavior at a time
- If `storyproof` is not installed:
  `curl -fsSL https://storyproof.pages.dev/install.sh | sh`

The pattern

The pattern is the same regardless of tool:

  1. Agent writes code
  2. storyproof check --spec "..." — finds gaps and bugs
  3. storyproof prove — writes tests, runs them, opens report
  4. Fix what broke — the agent or you fixes the code
  5. storyproof prove again — confirms the fix
check → prove → fix code → prove → ship
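In shell terms, the fix-and-re-prove loop might look like the sketch below. `run_prove` is a stub standing in for `storyproof prove` so the sketch is self-contained; the exit-code convention (0 means done) is as documented:

```shell
# Stub standing in for: storyproof prove --url http://localhost:8080
run_prove() { return 0; }

attempts=0
until run_prove; do
  attempts=$((attempts + 1))
  if [ "$attempts" -ge 3 ]; then
    echo "3 consecutive failures; stop and investigate manually"
    break
  fi
  echo "prove failed; fix the code, then re-run"
done
echo "attempts: $attempts"
```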

Check — find what's broken

$ storyproof check --spec "Add owner deletion with cascade"

Reads your code and diff, derives acceptance criteria, and scans existing tests.

~$0.50 per run. ~4 minutes.

check flags

Flag          Default             Description
--spec                            Describe the change (required unless --spec-file given)
--spec-file                       Read spec from a file
--dir         .                   Project directory to verify
--base        auto-detect         Diff base: branch name (main), commit ref (HEAD~3), or omit for auto
--model       claude-sonnet-4-6   LLM model override
--changes     last-commit         What to diff: last-commit, unstaged, or a branch name
--json        false               Output verdict as JSON to stdout
--force       false               Re-derive criteria even if diff hasn't changed

check exit codes

Code   Meaning
0      All behaviors covered by existing tests
1      Coverage gaps — some behaviors have no evidence
2      Likely blockers — probable bugs detected in the code
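A script or CI job can branch on these codes. A minimal sketch, assuming only the three documented codes are emitted:

```shell
# Map storyproof check exit codes to an action (codes as documented)
handle_check() {
  case "$1" in
    0) echo "covered: nothing to do" ;;
    1) echo "gaps: run storyproof prove" ;;
    2) echo "blockers: run storyproof prove, expect failures" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}

# In practice: storyproof check --spec "..."; handle_check $?
handle_check 1
```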

Prove — produce evidence

$ storyproof prove

Writes and runs tests for every unresolved behavior. Tests stay in your repo.

~$1.00 per run. ~8 minutes.

prove flags

Flag           Default             Description
--spec                             Task spec (auto-resolved from prior check if omitted)
--spec-file                        Read spec from a file
--dir          .                   Project directory
--base         auto-detect         Diff base
--model        claude-sonnet-4-6   LLM model override
--url                              Running app URL for E2E tests (e.g., http://localhost:8080)
--focus                            Prove a single AC by ID (e.g., AC002)
--no-install   false               Disable all dependency installs (mark E2E as unproven instead)
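A sketch of combining these flags into one invocation. The URL and AC ID are example values; substitute your own:

```shell
# Assemble a prove command from pieces (values are examples)
app_url="http://localhost:8080"
focus_ac="AC002"
cmd="storyproof prove --url $app_url --focus $focus_ac"
echo "$cmd"
```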

prove exit codes

Code   Meaning
0      All behaviors verified with runtime evidence
1      Some behaviors unproven (environment issue, dependency missing)
2      Tests ran and failed — confirmed bugs in the code

Output

Everything lives in .storyproof/ at your repo root.

.storyproof/
├── project.json        # detected stack (auto, first run)
├── latest              # pointer to most recent run
└── runs/
    └── <run_id>/
        ├── verdict.json  # per-AC status
        ├── report.html   # open in browser
        ├── summary.md    # terminal summary
        ├── checks/       # test files (prove)
        └── evidence/     # runtime output (prove)
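To script against this layout, resolve the `latest` pointer first. The sketch below fakes the directory structure in a temp dir so it is self-contained, and assumes `latest` is a plain file holding the run id (in practice it might instead be a symlink):

```shell
# Fake the .storyproof layout so the sketch is self-contained
root=$(mktemp -d)
mkdir -p "$root/.storyproof/runs/run_001"
printf 'run_001' > "$root/.storyproof/latest"

# Resolve the most recent run and point at its artifacts
run_id=$(cat "$root/.storyproof/latest")
report="$root/.storyproof/runs/$run_id/report.html"
echo "$report"
```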

Verdict statuses

Status           Meaning
verified         Test ran, behavior confirmed
covered          Existing test addresses this (check only)
likely_blocker   Probable bug found
not_covered      No evidence exists
unproven         Couldn't run a test

How StoryProof decides what to test

StoryProof doesn't just check "does a test exist." It asks: does the right kind of test exist for the actual risk?

Every acceptance criterion gets scored on three risk dimensions:

variation_risk — "How many ways can this go wrong?"

Measures edge-case density. A function that takes a string has more ways to fail than one that takes a boolean.

Low: Toggle a feature flag. One input, two states.

High: Parse user-provided dates across timezones, locales, and formats.

When high → StoryProof writes unit tests covering edge cases and boundary values.

boundary_risk — "Does this cross a real seam?"

Measures whether the behavior crosses a system boundary — database, API, service call, file system. Mocking these boundaries hides real failures.

Low: Pure function that transforms data in memory.

High: Service writes to two database tables in a transaction, then calls an external API.

When high → StoryProof writes integration tests that hit the real database or real HTTP endpoint. No mocks.

last_mile_risk — "Does the user actually see the right thing?"

Measures whether the behavior depends on rendering, JavaScript, or browser interaction. Server-side logic can be correct while the UI is broken.

Low: API returns JSON. The response is the product.

High: User clicks "Delete," a confirmation modal appears, they confirm, the page redirects, a success toast shows.

When high → StoryProof bootstraps Playwright and runs browser tests. For server-rendered HTML without JavaScript, integration tests that assert the rendered markup are sufficient.
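Put together, one criterion's scoring might look like the fragment below. This is a hypothetical illustration only; the actual field names and scale StoryProof records are not documented here:

```json
{
  "ac_id": "AC002",
  "description": "Deleting an owner cascades to pets and visits",
  "variation_risk": "low",
  "boundary_risk": "high",
  "last_mile_risk": "medium",
  "required_evidence": "integration test against the real database"
}
```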

Why this matters

A common pattern: your repo has 90% test coverage, all green. But the tests are all unit tests that mock the database. The boundary_risk is high, but no test crosses the real boundary. StoryProof catches this — a unit test that mocks the DB doesn't settle boundary risk, regardless of coverage percentage.

CI Integration

Coming soon — GitHub Actions, GitLab CI, and webhook-triggered verification.