GAIA: The Agent That Improves the Agent System

GAIA stands for Goal Auto Improvement Agent.

The name is deliberate. This is not a script or a helper function. It's a system with enough autonomy that it deserves a proper name — one that describes what it actually does: take a goal and auto-improve the system toward it.

The problem it solves

Building agent systems is messy by default.

You ship a feature. You're not sure it works end-to-end. Two weeks later, you've forgotten what you were trying to prove. There's no closed loop between intention and evidence.

GAIA closes that loop.

You describe what you want the system to do. GAIA takes it from there.

What it does, step by step

The flow has two operations:

Operation A — Goal Creation. GAIA turns your description into a structured file: Goal-###-name.md. That file has clear success criteria, a list of affected files, and explicit test cases across all four test layers. It also registers the goal in goal-state.json with status pending.
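As a rough illustration of Operation A's bookkeeping, here is a minimal sketch of naming the goal file and registering it as pending. The post only gives the Goal-###-name.md pattern and the pending status; the three-digit zero padding, the lowercase slug, the `GoalEntry` shape, and the helper names are all assumptions made for this example, not GAIA's actual schema.

```typescript
// Hypothetical shapes: the post names only the file pattern and the "pending" status.
type GoalStatus = "pending" | "pass" | "fail";

interface GoalEntry {
  id: number;
  name: string;
  file: string;
  status: GoalStatus;
}

// Build a file name such as "Goal-003-fact-write-quality.md".
// Zero-padding to three digits and lowercasing the slug are assumptions.
function goalFileName(id: number, name: string): string {
  const slug = name
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric into one dash
    .replace(/^-+|-+$/g, "");    // trim leading and trailing dashes
  return `Goal-${String(id).padStart(3, "0")}-${slug}.md`;
}

// Return a new state list with the goal registered as pending.
// The input array is not mutated.
function registerGoal(state: GoalEntry[], id: number, name: string): GoalEntry[] {
  if (state.some((g) => g.id === id)) {
    throw new Error(`Goal ${id} is already registered`);
  }
  return [...state, { id, name, file: goalFileName(id, name), status: "pending" }];
}

const state = registerGoal([], 3, "Fact Write Quality");
console.log(JSON.stringify(state, null, 2));
```

Serialising the returned array is one plausible way to produce goal-state.json; the real file may well carry more fields, such as success criteria or dependencies.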

Operation B — Goal Run. GAIA executes the goal through a full pipeline:

```
B0: Pre-flight   dev server running, health check passes
B1: Audit        does the codebase already satisfy the goal?
B2: Code         implement only what's missing
B3: Test         Unit → API → Prompt → UI, fail-fast
B4: Evaluate     multi-layer evidence per test type
B5: Fix loop     if anything fails, fix and retest
```

The audit step (B1) is the most underrated part. GAIA reads the source first and never assumes the codebase is behind. If someone already shipped the change manually, or a previous goal covered it, GAIA skips coding and goes straight to tests. No wasted work.
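The audit-then-test part of the pipeline can be sketched as a single pass. This is an illustration under assumptions, not GAIA's implementation: `PipelineHooks`, `runOnce`, and the synchronous callbacks are hypothetical stand-ins, and the pre-flight, evaluation, and fix-loop steps are left out to keep the control flow visible.

```typescript
type Layer = "unit" | "api" | "prompt" | "ui";

// Test order from step B3.
const LAYER_ORDER: Layer[] = ["unit", "api", "prompt", "ui"];

// Hypothetical callbacks standing in for the real audit, coding, and test steps.
interface PipelineHooks {
  audit: () => boolean;                // true if the codebase already satisfies the goal
  implement: () => void;               // write only the missing code
  runLayer: (layer: Layer) => boolean; // true if every test in the layer passes
}

interface RunResult {
  coded: boolean;          // whether the coding step ran
  passed: Layer[];         // layers that passed before any failure
  failedAt: Layer | null;  // first failing layer, or null if all passed
}

function runOnce(hooks: PipelineHooks): RunResult {
  // B1: audit first; B2 runs only when the audit finds something missing.
  const alreadySatisfied = hooks.audit();
  if (!alreadySatisfied) hooks.implement();

  // B3: run layers in order and stop at the first failure (fail-fast).
  const passed: Layer[] = [];
  for (const layer of LAYER_ORDER) {
    if (!hooks.runLayer(layer)) {
      return { coded: !alreadySatisfied, passed, failedAt: layer };
    }
    passed.push(layer);
  }
  return { coded: !alreadySatisfied, passed, failedAt: null };
}

// Example: the audit says the feature already exists, and the prompt layer fails.
const calls = { implement: 0 };
const result = runOnce({
  audit: () => true,
  implement: () => { calls.implement += 1; },
  runLayer: (layer) => layer !== "prompt",
});
console.log(result, calls);
```

In this example the coding step never runs and the UI layer is never reached, because the prompt layer fails before it.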

Four test layers, not one

Every goal defines test cases across four layers:

Type	Tool	What it proves
Unit	Vitest	Pure function logic, edge cases, no regressions
API	curl	HTTP endpoints return correct shape and values
Prompt	test-prompt.mjs + Playwright	The agent behaves correctly in real conversation
UI	Playwright headed	The browser UI reflects the change correctly

The UI and Prompt tests always run headed — a real visible Playwright window. No headless shortcuts. The point is to see exactly what a user sees.

And the tests aren't shallow. A passing UI test must read actual element content, verify sort order, and cross-check against backend logic — not just confirm that elements exist on the page.
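To make "read content, verify order, cross-check against the backend" concrete, here is a small sketch of the comparison such a UI test might perform once the row texts have been read from the page (in Playwright that could come from something like `locator.allTextContents()`). The `BackendItem` shape, the descending-by-score ordering, and the `checkRenderedList` helper are assumptions chosen for illustration.

```typescript
// Hypothetical backend record: a label plus the score the UI is expected to sort by.
interface BackendItem {
  label: string;
  score: number;
}

// Compare the labels read from the page, in display order, with what the backend
// implies. Returns a list of problems; an empty list means the evidence holds.
function checkRenderedList(renderedLabels: string[], backend: BackendItem[]): string[] {
  const problems: string[] = [];

  // Assumed rule for this sketch: the UI shows items by score, highest first.
  const expected = [...backend].sort((a, b) => b.score - a.score).map((item) => item.label);

  if (renderedLabels.length !== expected.length) {
    problems.push(`expected ${expected.length} rows, found ${renderedLabels.length}`);
  }
  expected.forEach((label, i) => {
    if (renderedLabels[i] !== label) {
      problems.push(`row ${i}: expected "${label}", found "${renderedLabels[i] ?? "(missing)"}"`);
    }
  });
  return problems;
}

const backend: BackendItem[] = [
  { label: "alpha", score: 0.2 },
  { label: "beta", score: 0.9 },
  { label: "gamma", score: 0.5 },
];

console.log(checkRenderedList(["beta", "gamma", "alpha"], backend)); // correct order
console.log(checkRenderedList(["alpha", "beta", "gamma"], backend)); // wrong order
```

An element count alone would accept both renderings above; comparing content and order against the backend data accepts only the first.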

Design principles

Audit before coding. Read the source first. The feature might already exist.

Evidence over assertion. "218 elements found" is not a test. GAIA requires reading content, verifying order, and cross-checking across layers.

Fix loop with no iteration cap. If something fails, fix and retest. GAIA keeps going until the goal passes or there's a genuine external blocker. No artificial "try 3 times" ceiling.
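The uncapped fix loop has a simple shape: the only exits are a pass or an external blocker. The sketch below assumes hypothetical `runTests` and `applyFix` callbacks and an `Outcome` type invented for this example.

```typescript
// Hypothetical outcome of one test run.
type Outcome =
  | { kind: "pass" }
  | { kind: "fail"; reason: string }
  | { kind: "blocked"; reason: string }; // a genuine external blocker

interface LoopResult {
  status: "pass" | "blocked";
  attempts: number;
  reason?: string;
}

function fixLoop(runTests: () => Outcome, applyFix: (reason: string) => void): LoopResult {
  let attempts = 0;
  // Deliberately no max-attempts check: the loop ends only on pass or blocker.
  while (true) {
    attempts += 1;
    const outcome = runTests();
    if (outcome.kind === "pass") return { status: "pass", attempts };
    if (outcome.kind === "blocked") return { status: "blocked", attempts, reason: outcome.reason };
    applyFix(outcome.reason); // ordinary failure: fix, then retest
  }
}

// Example: four bugs, one fixed per iteration, so the fifth run passes.
let remainingBugs = 4;
const loopResult = fixLoop(
  () => (remainingBugs > 0 ? { kind: "fail", reason: `${remainingBugs} bugs left` } : { kind: "pass" }),
  () => { remainingBugs -= 1; },
);
console.log(loopResult);
```

A "try 3 times" ceiling would have given up on this example with one bug still open; distinguishing a blocker from an ordinary failure is what lets the loop stop without a counter.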

Dev mode mandatory. GAIA always runs against the local dev server (pnpm dev), never production. Hot reload means code changes are immediately testable.

Headed browser always. Every prompt and UI test opens a visible Playwright window. This is non-negotiable.

How you invoke it

The interface is intentionally close to plain language:

"create GAIA Goal" — Operation A (define a new goal)
"run GAIA Goal 3" — Operation B (execute goal 3)
"run all GAIA goals" — Operation B on all pending or failed goals
"rerun GAIA Goal 1" — force re-run even if previously passed
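The mapping from phrases to operations can be written down explicitly. This is only an illustration of that mapping: the `Command` type and `parseCommand` function are hypothetical, and a real agent would presumably interpret these phrases more loosely than a regular expression does.

```typescript
// Hypothetical command representation for the four phrases above.
type Command =
  | { op: "create" }                              // Operation A
  | { op: "run"; goal: number; force: boolean }   // Operation B on one goal
  | { op: "runAll" };                             // Operation B on all pending or failed goals

function parseCommand(input: string): Command | null {
  const text = input.trim().toLowerCase();
  if (text === "create gaia goal") return { op: "create" };
  if (text === "run all gaia goals") return { op: "runAll" };

  // "run GAIA Goal 3" or "rerun GAIA Goal 1"; "rerun" forces a run even after a pass.
  const match = /^(run|rerun) gaia goal (\d+)$/.exec(text);
  if (match) {
    return { op: "run", goal: Number(match[2]), force: match[1] === "rerun" };
  }
  return null; // not a recognised GAIA phrase
}

console.log(parseCommand("create GAIA Goal"));
console.log(parseCommand("run GAIA Goal 3"));
console.log(parseCommand("run all GAIA goals"));
console.log(parseCommand("rerun GAIA Goal 1"));
```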

Current goals (April 2026)

Six goals are in the system right now:

Goal	Name	Status
1	Memory Decay Scoring	❌ fail — UI test needs rerun with content inspection
2	Remove Brief Summary	⏳ pending
3	Fact Write Quality	⏳ pending
4	Episodic Memory Hooks	⏳ pending
5	Auto-Extraction	⏳ pending
6	Fact Relationship Graph	⏳ pending

Goals 5 and 6 depend on Goal 3 passing first. GAIA respects that dependency graph: a goal does not run until the goals it depends on have passed.
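One way to picture the dependency rule is a filter over the goal list: a goal is eligible when it has not passed yet and everything it depends on has. The `Goal` shape and `runnableGoals` helper below are assumptions for illustration. The statuses mirror the table above, and only the dependency of Goals 5 and 6 on Goal 3 is taken from the post; the other goals are assumed to have no dependencies.

```typescript
type Status = "pending" | "pass" | "fail";

// Hypothetical goal record; dependsOn lists the ids that must pass first.
interface Goal {
  id: number;
  status: Status;
  dependsOn: number[];
}

// Ids of goals that may run now: pending or failed, with every dependency passed.
function runnableGoals(goals: Goal[]): number[] {
  const passed = new Set(goals.filter((g) => g.status === "pass").map((g) => g.id));
  return goals
    .filter((g) => g.status !== "pass" && g.dependsOn.every((dep) => passed.has(dep)))
    .map((g) => g.id);
}

const goals: Goal[] = [
  { id: 1, status: "fail", dependsOn: [] },
  { id: 2, status: "pending", dependsOn: [] },
  { id: 3, status: "pending", dependsOn: [] },
  { id: 4, status: "pending", dependsOn: [] },
  { id: 5, status: "pending", dependsOn: [3] },
  { id: 6, status: "pending", dependsOn: [3] },
];

console.log(runnableGoals(goals)); // [1, 2, 3, 4]

// Once Goal 3 passes, Goals 5 and 6 become eligible.
const afterGoal3 = goals.map((g) => (g.id === 3 ? { ...g, status: "pass" as Status } : g));
console.log(runnableGoals(afterGoal3)); // [1, 2, 4, 5, 6]
```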

What this shifts

The normal cycle for agent development is slow: write → ship → forget → discover later it was broken → fix → repeat.

GAIA shortens that to: describe → run → evidence → done.

The human sets direction. GAIA handles the grind — the auditing, the coding, the testing, the re-testing. You get a goal file and a pass/fail state with real evidence, not a vibe.

That's what I want from an agent system: one that gets out of the way by doing the unglamorous work reliably.