GAIA stands for Goal Auto Improvement Agent.
The name is deliberate. This is not a script or a helper function. It's a system with enough autonomy that it deserves a proper name — one that describes what it actually does: take a goal and auto-improve the system toward it.
The problem it solves
Building agent systems is messy by default.
You ship a feature. You're not sure it works end-to-end. Two weeks later, you've forgotten what you were trying to prove. There's no closed loop between intention and evidence.
GAIA closes that loop.
You describe what you want the system to do. GAIA takes it from there.
What it does, step by step
The flow has two operations:
Operation A — Goal Creation. GAIA turns your description into a structured file: Goal-###-name.md. That file has clear success criteria, a list of affected files, and explicit test cases across all four test layers. It also registers the goal in goal-state.json with status pending.
Operation B — Goal Run. GAIA executes the goal through a full pipeline:
``
B0: Pre-flight — dev server running, health check pass
B1: Audit — does the codebase already satisfy the goal?
B2: Code — implement only what's missing
B3: Test — Unit → API → Prompt → UI, fail-fast
B4: Evaluate — multi-layer evidence per test type
B5: Fix loop — if anything fails, fix and retest
``
The audit step (B1) is the most underrated part. GAIA reads the source first — never assumes the codebase is behind. If someone already shipped the change manually, or a previous goal covered it, GAIA skips coding and goes straight to tests. No wasted work.
Four test layers, not one
Every goal defines test cases across four layers:
| Type | Tool | What it proves |
|---|---|---|
| Unit | Vitest | Pure function logic, edge cases, no regressions |
| API | curl | HTTP endpoints return correct shape and values |
| Prompt | test-prompt.mjs + Playwright | The agent behaves correctly in real conversation |
| UI | Playwright headed | The browser UI reflects the change correctly |
The UI and Prompt tests always run headed — a real visible Playwright window. No headless shortcuts. The point is to see exactly what a user sees.
And the tests aren't shallow. A passing UI test must read actual element content, verify sort order, and cross-check against backend logic — not just confirm that elements exist on the page.
Design principles
Audit before coding. Read the source first. The feature might already exist.
Evidence over assertion. "218 elements found" is not a test. GAIA requires reading content, verifying order, cross-checking layers.
Fix loop with no iteration cap. If something fails, fix and retest. GAIA keeps going until the goal passes or there's a genuine external blocker. No artificial "try 3 times" ceiling.
Dev mode mandatory. GAIA always runs against the local dev server (pnpm dev), never production. Hot reload means code changes are immediately testable.
Headed browser always. Every prompt and UI test opens a visible Playwright window. This is non-negotiable.
How you invoke it
The interface is intentionally close to plain language:
"create GAIA Goal"— Operation A (define a new goal)"run GAIA Goal 3"— Operation B (execute goal 3)"run all GAIA goals"— Operation B on all pending or failed goals"rerun GAIA Goal 1"— force re-run even if previously passed
Current goals (April 2026)
Six goals are in the system right now:
| Goal | Name | Status |
|---|---|---|
| 1 | Memory Decay Scoring | ❌ fail — UI test needs rerun with content inspection |
| 2 | Remove Brief Summary | ⏳ pending |
| 3 | Fact Write Quality | ⏳ pending |
| 4 | Episodic Memory Hooks | ⏳ pending |
| 5 | Auto-Extraction | ⏳ pending |
| 6 | Fact Relationship Graph | ⏳ pending |
Goals 5 and 6 depend on Goal 3 passing first. The dependency graph matters: GAIA respects it.
What this shifts
The normal cycle for agent development is slow: write → ship → forget → discover later it was broken → fix → repeat.
GAIA shortens that to: describe → run → evidence → done.
The human sets direction. GAIA handles the grind — the auditing, the coding, the testing, the re-testing. You get a goal file and a pass/fail state with real evidence, not a vibe.
That's what I want from an agent system: one that gets out of the way by doing the unglamorous work reliably.