# StoryProof Docs

## Using StoryProof with coding agents

StoryProof runs after any AI coding agent — Claude Code, Codex, Cursor, Windsurf, or anything else that writes code.
## Claude Code — Skill

Drop this file into your project. Type `/verify` after any code change.
---
name: verify
description: Verify the current change with StoryProof. Use after implementing
features, fixing bugs, or before committing — runs independent check and
prove cycle to find defects, coverage gaps, and lying tests.
---
# Verify with StoryProof
You are verifying that the code change actually does what was asked. StoryProof
is an independent auditor — it derives acceptance criteria from the spec, scans
existing tests, and produces runtime evidence.
## Step 1: Determine the spec
Ask the user what was built if not obvious. The spec should describe **behavior**,
not implementation:
- Good: "Add owner deletion with cascade to pets and visits"
- Bad: "Add deleteOwner method to OwnerController"
## Step 2: Check
```bash
storyproof check --spec "<the spec>"
```
Read the terminal output. It will show:
- **Likely defects** — probable bugs (exit code 2)
- **Needs proof** — behaviors with no test coverage
- **Covered** — behaviors with existing evidence
- **Coverage gaps** — tests that exist but don't prove the right thing
If exit code is 0, all behaviors are covered. You're done.
## Step 3: Prove
If check found gaps or blockers:
```bash
storyproof prove
```
This writes and runs tests for every unresolved behavior. Wait for it to complete.
## Step 4: Fix and re-prove
If prove found failing tests (real bugs):
1. Read the test failure output to understand what's wrong
2. Fix the code (not the test — the test is correct)
3. Run `storyproof prove` again
4. Repeat until exit code 0
## Step 5: Review artifacts
```bash
open .storyproof/runs/*/report.html
```
Show the user the verdict summary. New test files are in
`.storyproof/runs/*/checks/` and copied to the repo's test directory.
## Rules
- NEVER skip prove when check finds gaps — the point is runtime evidence
- NEVER modify StoryProof's test files to make them pass — fix the code instead
- If prove takes too long, use `--focus AC002` to prove one behavior at a time
- If `storyproof` is not installed, tell the user to run:
`curl -fsSL https://storyproof.pages.dev/install.sh | sh`
## OpenAI Codex

Add to your `AGENTS.md`:

```markdown
## Verification

After implementing changes, verify with StoryProof:

    storyproof check --spec "<feature description>"

If exit code is not 0:

    storyproof prove

Fix blockers and re-run prove. Do not mark a task complete until exit code 0.
```
## Cursor / Windsurf / Any agent

The pattern is the same regardless of tool:

- Agent writes code
- `storyproof check --spec "..."` — finds gaps and bugs
- `storyproof prove` — writes tests, runs them, records evidence
- Fix what broke — the agent or you fixes the code
- `storyproof prove` again — confirms the fix
StoryProof doesn't know or care which agent wrote the code. It verifies behavior, not authorship.
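Scripted, that loop is only a few lines. Here is a sketch in Python: it assumes `storyproof` is on your `PATH` and relies only on the `check`/`prove` commands and exit codes documented on this page; the `MAX_ROUNDS` cap and the interactive prompt are illustrative choices, not part of StoryProof.

```python
import subprocess

MAX_ROUNDS = 5  # safety valve so a stubborn blocker can't loop forever

def run(cmd):
    """Run a command and return its exit code."""
    return subprocess.run(cmd).returncode

def verify(spec, prompt=input):
    """Check first; if anything is unresolved, prove until exit code 0."""
    if run(["storyproof", "check", "--spec", spec]) == 0:
        return True  # all behaviors already covered, nothing to prove
    for _ in range(MAX_ROUNDS):
        if run(["storyproof", "prove"]) == 0:
            return True
        # Nonzero prove means failing tests, i.e. real bugs. Fix the
        # code (never the generated tests), then loop and re-prove.
        prompt("Fix the code, then press Enter to re-prove... ")
    return False
```

An agent harness would typically replace the `prompt` callback with a call back into the agent itself.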
## Install

```bash
$ curl -fsSL https://storyproof.pages.dev/install.sh | sh
```

Or from source:

```bash
$ git clone https://github.com/storyproof/storyproof.git
$ cd storyproof/cli && pip install -e .
```

Requires the Claude Code CLI.
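A quick preflight can confirm both CLIs are available before starting a verification loop. A sketch; the binary name `claude` for the Claude Code CLI is an assumption:

```python
import shutil

REQUIRED = ["storyproof", "claude"]  # "claude" = Claude Code CLI binary name (assumed)

def missing_tools(required=REQUIRED):
    """Return the required CLIs that are not on PATH."""
    return [tool for tool in required if shutil.which(tool) is None]

if missing_tools():
    print("missing:", ", ".join(missing_tools()))
    print("install StoryProof: curl -fsSL https://storyproof.pages.dev/install.sh | sh")
```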
## Check — find what's broken

```bash
$ storyproof check --spec "Add owner deletion with cascade"
```

StoryProof reads your code, derives acceptance criteria from your spec, scans existing tests, and tells you:

- **Likely defects** (`likely_blocker`) — code analysis found a probable bug
- **Needs proof** (`not_covered`) — no test covers this behavior
- **Covered** (`covered`) — an existing test addresses this
- **Coverage gaps** — a test exists but doesn't prove the right thing

~$0.50 per run. ~4 minutes.
## Prove — produce evidence

```bash
$ storyproof prove
```

Writes and runs tests for every unresolved behavior. Tests stay in your repo.

- **Verified** (`verified`) — a test ran and passed
- **Confirmed blocker** (`likely_blocker`) — a test ran and failed (a real bug)
- **New tests written** — files added to your test suite

~$1.00 per run. ~8 minutes.
## The loop
After fixing a blocker, run prove again. It re-runs everything to catch regressions.
## CLI Reference

### check

| Flag | Default | Description |
|---|---|---|
| `--spec` | (required) | Describe the change |
| `--spec-file` | — | Read the spec from a file |
| `--changes` | `last-commit` | What to diff: `last-commit`, `unstaged`, or a branch name |
| `--json` | `false` | Output the verdict as JSON |
| `--force` | `false` | Re-derive even if the diff is unchanged |
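The `--json` flag makes the verdict scriptable. This reference does not specify the schema, so the sketch below invents a minimal hypothetical shape (a list of ACs, each with an `id` and a `status`) purely to show the pattern:

```python
import json

# Hypothetical --json output shape; the real schema may differ.
sample = """
{
  "acs": [
    {"id": "AC001", "status": "covered"},
    {"id": "AC002", "status": "not_covered"},
    {"id": "AC003", "status": "likely_blocker"}
  ]
}
"""

def summarize(raw):
    """Count acceptance criteria per status from a JSON verdict."""
    counts = {}
    for ac in json.loads(raw)["acs"]:
        counts[ac["status"]] = counts.get(ac["status"], 0) + 1
    return counts

print(summarize(sample))  # {'covered': 1, 'not_covered': 1, 'likely_blocker': 1}
```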
### prove

| Flag | Default | Description |
|---|---|---|
| `--focus` | — | Prove a single AC by ID (e.g., `AC002`) |
### Exit codes

| Code | Meaning |
|---|---|
| 0 | All verified |
| 1 | Gaps remain |
| 2 | Blockers found |
## Output

Everything lives in `.storyproof/` at your repo root.

```
.storyproof/
├── project.json       # detected stack (auto, first run)
├── latest             # pointer to most recent run
└── runs/
    └── <run_id>/
        ├── verdict.json   # per-AC status
        ├── report.html    # open in browser
        ├── summary.md     # terminal summary
        ├── checks/        # test files (prove)
        └── evidence/      # runtime output (prove)
```

Open the report:

```bash
$ open .storyproof/runs/*/report.html
```
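To consume these artifacts programmatically, follow the `latest` pointer to the newest run. A sketch, assuming `latest` is either a symlink to the run directory or a plain file holding the run ID (its actual format is not specified here):

```python
import json
from pathlib import Path

def latest_verdict(root="."):
    """Resolve .storyproof/latest to the newest run's verdict.json."""
    sp = Path(root) / ".storyproof"
    latest = sp / "latest"
    if latest.is_symlink() or latest.is_dir():
        run_dir = latest.resolve()          # pointer is a symlink
    else:
        run_dir = sp / "runs" / latest.read_text().strip()  # pointer is a run ID
    return json.loads((run_dir / "verdict.json").read_text())
```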
## Verdict statuses

| Status | Meaning |
|---|---|
| `verified` | Test ran, behavior confirmed |
| `covered` | Existing test addresses this (check only) |
| `likely_blocker` | Probable bug found |
| `not_covered` | No evidence exists |
| `unproven` | Couldn't run a test |
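These statuses map naturally onto the exit codes: `likely_blocker` means blockers (2), while `not_covered` and `unproven` mean gaps (1). A sketch of that gate, useful when scripting against `verdict.json`; the flat list of statuses is an assumed input shape:

```python
def gate(statuses):
    """Map per-AC statuses to the documented exit codes:
    2 if any probable bug, 1 if any evidence gap, else 0."""
    if "likely_blocker" in statuses:
        return 2  # blockers found
    if any(s in ("not_covered", "unproven") for s in statuses):
        return 1  # gaps remain
    return 0      # all verified / covered

print(gate(["verified", "covered"]))         # 0
print(gate(["verified", "not_covered"]))     # 1
print(gate(["verified", "likely_blocker"]))  # 2
```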
## How StoryProof decides what to test
StoryProof doesn't just check "does a test exist." It asks: does the right kind of test exist for the actual risk?
Every acceptance criterion gets scored on three risk dimensions:
### `variation_risk` — "How many ways can this go wrong?"

Measures edge-case density. A function that takes a string has more ways to fail than one that takes a boolean.

- **Low:** Toggle a feature flag. One input, two states.
- **High:** Parse user-provided dates across timezones, locales, and formats.

When high → StoryProof writes unit tests covering edge cases and boundary values.
### `boundary_risk` — "Does this cross a real seam?"

Measures whether the behavior crosses a system boundary — database, API, service call, file system. Mocking these boundaries hides real failures.

- **Low:** Pure function that transforms data in memory.
- **High:** Service writes to two database tables in a transaction, then calls an external API.

When high → StoryProof writes integration tests that hit the real database or real HTTP endpoint. No mocks.
### `last_mile_risk` — "Does the user actually see the right thing?"

Measures whether the behavior depends on rendering, JavaScript, or browser interaction. Server-side logic can be correct while the UI is broken.

- **Low:** API returns JSON. The response is the product.
- **High:** User clicks "Delete," a confirmation modal appears, they confirm, the page redirects, a success toast shows.

When high → StoryProof bootstraps Playwright and runs browser tests. For server-rendered HTML without JavaScript, integration tests that assert on the rendered markup are sufficient.
### Why this matters
A common pattern: your repo has 90% test coverage, all green. But the tests are all unit tests that mock the database. The boundary_risk is high, but no test crosses the real boundary. StoryProof catches this — a unit test that mocks the DB doesn't settle boundary risk, regardless of coverage percentage.
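The dimension-to-test-type selection described above can be sketched as a small function. The numeric 0-to-1 scores and the 0.5 threshold are illustrative assumptions; this page only distinguishes low from high:

```python
def pick_test_types(variation_risk, boundary_risk, last_mile_risk,
                    threshold=0.5):
    """Choose test kinds from the three risk scores (0.0-1.0).
    Scale and threshold are illustrative, not StoryProof internals."""
    kinds = []
    if variation_risk >= threshold:
        kinds.append("unit")         # edge cases and boundary values
    if boundary_risk >= threshold:
        kinds.append("integration")  # real DB / real HTTP, no mocks
    if last_mile_risk >= threshold:
        kinds.append("browser")      # Playwright end-to-end
    return kinds or ["unit"]         # always produce some evidence

print(pick_test_types(0.2, 0.9, 0.1))  # ['integration']
```

This is why the 90%-coverage repo above still fails: its mocked unit tests never produce the "integration" evidence that a high `boundary_risk` demands.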
## CI Integration
Coming soon — GitHub Actions, GitLab CI, and webhook-triggered verification.