Cursor as Agent Eval — Hive, ninety scenarios, one skill (CLI + browser)

Most “agent demos” prove one happy path once. They do not survive the next refactor to intent routing, or the next MCP, or the next UI tweak.

Cursor as Agent Eval is the shorthand I use for the next step: Cursor stops being “chat that helps you code” and becomes the executor of the evaluation protocol — open the thread in the browser, run test-prompt.mjs, read the terminal, decide pass or fail, then (per the skill) patch and retest until the scenario is green. The eval harness lives in the repo (.cursor/skills/test-agent-prompt/); the agent in the IDE is what runs it reliably.

In agent-talk-v2 I ended up with something closer to a workshop: one long procedural skill file, then dozens of small scenario files — each with an ID (T1, T21, T92…), a prompt, pass criteria, and the exact command to run.

This post is about why that split exists — and why it is deliberately not a single markdown doc or a classic unit-test suite.

Two kinds of truth: terminal and UI

The skill is built around dual verification.

CLI (scripts/test-prompt.mjs) streams hard signals: [tool], [cw], [error], and a closing ✓ Done. That is the language you argue in when you disagree about whether a tool fired and in what order.
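A minimal sketch of how those signals could be tallied from captured output. The bracket tags and the ✓ Done line are the script's; the helper name `summarizeCliOutput`, the assumption that each tag starts its line, and the sample payloads are hypothetical.

```javascript
// Hypothetical helper: tally signal lines from captured test-prompt.mjs output.
// Assumption: each signal starts its line with a bracket tag, and a clean run ends with "✓ Done".
function summarizeCliOutput(text) {
  const summary = { tools: [], cw: [], errors: [], done: false };
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line.startsWith("[tool]")) summary.tools.push(line.slice("[tool]".length).trim());
    else if (line.startsWith("[cw]")) summary.cw.push(line.slice("[cw]".length).trim());
    else if (line.startsWith("[error]")) summary.errors.push(line.slice("[error]".length).trim());
    else if (line.startsWith("✓ Done")) summary.done = true;
  }
  return summary;
}
// Made-up sample lines, only to show the shape of the result.
const sampleOutput = ["[tool] example_tool_a", "[cw] started", "[tool] example_tool_b", "✓ Done"].join("\n");
console.log(summarizeCliOutput(sampleOutput));
```

Keeping the tools in an ordered array rather than a set is deliberate: the argument is usually about which tool fired first, not only whether it fired.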

Browser (via the in-IDE browser tools described in Cursor’s Agent browser tool docs) is the other half: hard reload, navigate to ?thread=, snapshot, screenshot before the prompt, run the script, screenshot after. That is the evidence you cannot fake by tailing logs: cards, layout, and whether the thread URL actually stuck in the SPA.

The skill is strict for a reason: CLI-only is incomplete, and UI-only is weak for auditing tool contracts. They answer different questions.

Why `SKILL.md` is not the test list

Open the main skill file and it is long on purpose. That file is not where ninety prompts live.

It holds the orchestration — the parts that would rot if duplicated into every scenario:

- server health on the right port before you trust any outcome
- how to create a valid test thread (solo Hive, fixed “Tests” folder so production history never bleeds in)
- the SPA navigation ritual so ?thread= is not lying
- how to read exit semantics (ws error after a drop is infrastructure invalid, not “the agent failed the scenario”)
- the Run → Evaluate → Fix → Retest loop when something fails, including “do not advance to the next ID until this one is green”
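The exit semantics and the "stay on this ID" rule can be sketched as a three-way classification. This is an illustration, not the skill's code: the function names are hypothetical, and it assumes a dropped socket shows up in the CLI text as "ws error" or "ws closed".

```javascript
// Hypothetical classifier for one run. Three outcomes, because an infrastructure
// drop must not be recorded as a scenario failure.
// Assumption: a dropped socket appears in the CLI text as "ws error" or "ws closed".
function classifyRun(cliText, scenarioChecksPassed) {
  if (/ws (error|closed)/i.test(cliText)) return "invalid";
  if (!cliText.includes("✓ Done")) return "fail";
  return scenarioChecksPassed ? "pass" : "fail";
}
// What the operator does next for each outcome.
function nextAction(outcome) {
  if (outcome === "pass") return "advance to the next ID";
  if (outcome === "invalid") return "rerun the same ID";
  return "diagnose, fix, retest the same ID";
}
console.log(nextAction(classifyRun("[tool] x\nws error", true)));
// prints: rerun the same ID
```

The point of the separate "invalid" outcome is bookkeeping: a red mark should always mean the agent misbehaved, never that the connection dropped.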

There is also session state (scripts/test-state.json): scenarios that already passed and are unchanged should not burn minutes and tokens every morning. Re-runs are for failures, new cases, stale windows, or code that touched the routing stack.
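A rough sketch of that re-run rule. Only the status values (pass, fail, pending, skip) come from the skill; the entry shape `{ status, lastRunMs }`, the option names, and the staleness threshold are assumptions made for illustration.

```javascript
// Hypothetical entry shape in scripts/test-state.json: { status, lastRunMs }.
// Assumption: "skip" means the scenario is deliberately parked for now.
function shouldRun(entry, { nowMs, staleAfterMs, routingTouched }) {
  if (!entry) return true; // new case, never recorded
  if (entry.status === "skip") return false;
  if (entry.status !== "pass") return true; // fail or pending
  if (routingTouched) return true; // code under test changed
  return nowMs - entry.lastRunMs > staleAfterMs; // stale window
}
// Example: a green scenario from one hour ago, one-day window, no routing changes.
const hourMs = 60 * 60 * 1000;
const options = { nowMs: 25 * hourMs, staleAfterMs: 24 * hourMs, routingTouched: false };
console.log(shouldRun({ status: "pass", lastRunMs: 24 * hourMs }, options));
// prints: false
```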

So SKILL.md answers: how do we trust a run? The per-ID files answer: what must be true for this slice of behavior?

`SKILL.md` compact — orchestration only

The real file is long and procedural (by design). Here is the same spine compressed so you can see what lives *outside* the per-test markdown leaves — still English, still the same rules as in Cursor.

Format note: This site splits article bodies on blank lines. A triple-backtick fenced block that contains its own blank lines accidentally becomes several paragraphs, and any line starting with ## then renders as a real section heading. So below is **one monospace block** (a single span of backticks in the source). The renderer turns long multiline backtick runs into a **block-level <code>** (see `renderPostInline` in the repo), which is what you actually want for a cheat sheet.

`test-agent-prompt: orchestration (compact)
DUAL VERIFICATION: both required (CLI + browser)
Track   | What                                                                             | Proves
CLI     | node scripts/test-prompt.mjs (events, tools, Done line)                          | Contracts + routing
Browser | Cursor Agent browser: root reload → ?thread= → snapshots + pre/post screenshots  | User-visible surface
Rule: CLI-only or browser-only → incomplete run for this skill.
PRECONDITIONS
- curl http://127.0.0.1:3012/threads → 200 before any scenario.
- POST /threads with soloAgentId "hive" + Tests folder id; no shared prod threads.
PER-ID RUN RITUAL
1. Create thread (solo Hive, Tests folder).
2. Browser: hard reload app root → short wait → navigate ?thread=THREAD → snapshot (URL + input) → screenshot pre.
3. CLI: node scripts/test-prompt.mjs --thread THREAD "<prompt from Txx file>".
4. Browser: screenshot post after CLI exits.
CLI CHECKLIST (E1–E6)
E1 Normal Done only; ws error / ws closed after disconnect → invalid run (not scenario fail).
E2 No [error] lines.
E3 Tools match scenario file.
E4 Agent text when expected.
E5 All required events present.
E6 No forbidden events.
UI CHECKLIST (EU1–EU3)
EU1 Hive bubble when WebSocket allows; if sandbox blocks live cards → mark UI limited, not auto-fail.
EU2 No crash / blank shell.
EU3 Coding-worker / artifact / notes UI only if scenario requires.
SESSION STATE
scripts/test-state.json → pass | fail | pending | skip per T-id (skip rerunning stable greens without cause).
ON FAIL (skill is strict)
Diagnose → minimal fix → restart server if backend touched → retest same ID; do not open the next T while red.
THROUGHPUT
Sequential runs only; respect tier order (lower tiers green before dependent higher tiers).`

If you only read one artifact before a session, read this digest plus the one Txx file you are about to run — not the entire registry table.
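The CLI checklist (E1–E6) from the digest can be sketched as one function over a parsed run. The object shapes here are hypothetical, the event name in the example is invented, and treating "tools match" as an exact, ordered comparison is my assumption rather than something the skill states.

```javascript
// Hypothetical shapes:
//   run      = { done, errors, tools, agentText, events }
//   scenario = { tools, expectText, required, forbidden }
// Assumption: "tools match" means same tools in the same order.
function checkCli(run, scenario) {
  const sameTools =
    run.tools.length === scenario.tools.length &&
    run.tools.every((tool, i) => tool === scenario.tools[i]);
  const results = {
    E1: run.done === true,
    E2: run.errors.length === 0,
    E3: sameTools,
    E4: !scenario.expectText || run.agentText.trim().length > 0,
    E5: scenario.required.every((e) => run.events.includes(e)),
    E6: scenario.forbidden.every((e) => !run.events.includes(e)),
  };
  const failed = Object.keys(results).filter((key) => !results[key]);
  return { results, failed, pass: failed.length === 0 };
}
// Example: a scenario that expects text and forbids a (made-up) "tool-start" event.
const exampleScenario = { tools: [], expectText: true, required: [], forbidden: ["tool-start"] };
const exampleRun = { done: true, errors: [], tools: [], agentText: "Which item?", events: [] };
console.log(checkCli(exampleRun, exampleScenario).pass);
// prints: true
```

Returning the list of failed check IDs, not just a boolean, keeps the diagnosis step short: the failing ID tells you which sentence of the scenario file to reread.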

The `tests/` folder as a readable registry

Each Txx-*.md file is a mini spec: layer, goal, preconditions, the exact user prompt, expected and forbidden events, pass criteria tied back to shared checklists, and a copy-paste run command (often curl to open the thread, then node scripts/test-prompt.mjs --thread …).
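As an illustration of the copy-paste run command, here is a small sketch that assembles it from a thread ID and a verbatim prompt. The script path and the --thread flag are from the skill; `shQuote` and `buildRunCommand` are hypothetical helpers, and the quoting targets a POSIX shell.

```javascript
// Hypothetical helper: quote a value for a POSIX shell using single quotes.
function shQuote(value) {
  return "'" + String(value).replace(/'/g, "'\\''") + "'";
}
// Assemble the run line for one scenario; the prompt is passed through verbatim.
function buildRunCommand(threadId, prompt) {
  return ["node", "scripts/test-prompt.mjs", "--thread", shQuote(threadId), shQuote(prompt)].join(" ");
}
console.log(buildRunCommand("abc123", "Tolong update itu."));
// prints: node scripts/test-prompt.mjs --thread 'abc123' 'Tolong update itu.'
```

Quoting matters more than it looks: the prompt is part of the assertion, so a shell that silently expands or splits it changes the test.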

Some scenarios are product-shaped (ambiguous intent should yield clarification, not silent tool use). Some are engineering-shaped (latency baselines split into sub-phases so you know whether Haiku intent or warm-path TTFT moved).

At roughly ninety files, the message is simple: agent behavior is a surface area, not a single demo clip.

Sample scenario file (what you open in Cursor)

In Cursor you do not hunt prompts inside SKILL.md. You open one markdown leaf under .cursor/skills/test-agent-prompt/tests/ — for example T21-ambiguous-intent.md — and treat it as the contract for that ID.

Below is a shortened version of the real shape (same intent as T21 in agent-talk-v2). Again one backtick block so it stays monospace <code> and never hijacks heading parsing.

`T21: Ambiguous Intent → Clarification (sample / shortened)
Layer: L1 Core Loop (Intent Analysis)
Goal: Hive asks for clarification on a vague line; no silent assumptions, no premature tools.
Pre-condition: Thread is new (no prior context).
Prompt (verbatim): Tolong update itu. (Indonesian: "Please update that.")
Expected: No tool call in this turn.
Forbidden: Any tool-start line before Hive has asked what "itu" ("that") refers to.
Pass: Normal CLI completion; clarification-style reply; zero tools for this prompt.
Run (after POST solo Hive thread in Tests folder + browser ritual):
node scripts/test-prompt.mjs --thread "$THREAD_ID" "Tolong update itu."`

Everything above is copy-pasteable intent: a human or the Agent can run the same thread-creation + CLI path without re-deriving steps from chat memory. That is what makes it a test case rather than a vibe check.
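To make the T21 pass line concrete, here is a hedged sketch of a mechanical pre-check. It assumes tool activity appears as lines starting with [tool], and it uses "the reply contains a question mark" as a crude stand-in for "clarification-style reply"; both are assumptions, and `checkT21` is a hypothetical name.

```javascript
// Hypothetical T21 pre-check. Assumptions: tool activity shows up as lines starting
// with "[tool]", and a clarification reply contains a question mark (a crude proxy).
function checkT21(cliText, agentReply) {
  const lines = cliText.split("\n").map((line) => line.trim());
  const toolCount = lines.filter((line) => line.startsWith("[tool]")).length;
  const completed = lines.some((line) => line.startsWith("✓ Done"));
  const asksBack = agentReply.includes("?");
  return { completed, toolCount, asksBack, pass: completed && toolCount === 0 && asksBack };
}
console.log(checkT21("✓ Done", "Which item should I update?").pass);
// prints: true
```

A check like this only filters out obvious failures. Whether the reply really asks what "itu" refers to is still a judgment for whoever reads the transcript.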

Cursor as Agent Eval — why the IDE is the operator

This section is the thesis behind the title: Cursor as Agent Eval means the same Cursor session that edits intent-analysis.ts can also enforce behavior — run the browser ritual, stream CLI events, update test-state.json, and loop diagnose → fix → restart → retest without treating “I ran it once” as enough.

The skill file is the contract (what counts as evidence); Cursor is the engine that applies it. That is a different object than ad-hoc prompt engineering or a green CI badge on code that never touched the agent path.

That pairing — spec in git, eval driven by the agent in the IDE — is what I mean when I say Cursor as Agent Eval out loud to other builders.

If you borrow only three ideas

1. Split “how we run and trust a check” from “what we assert” so neither file type collapses under the other.
2. Demand at least two observability channels when the product has both streaming events and a UI surface that can drift independently.
3. Persist state so slow behavioral checks do not become an all-or-nothing suite you dread running.

That is the whole architecture in one breath: one procedural spine, many scenario leaflets, two lenses on the same run — so tomorrow morning’s refactor still has somewhere to stand.