AI Agent Levels on the Capability Axis: From Inference to Orchestration

Everyone says "agent." Few agree what tier they mean.

Capability axis here means one thing: how much autonomous agency the stack actually has — not team maturity, not “AI-native” branding. Moving along the axis adds concrete powers in order: acting (tools in a loop), fanning out work (multi-agent and events), changing how it works (self-improvement), owning fuzzy outcomes (goal hierarchy), and eventually modifying the core (recursive capability gain). Level numbers are just stops on that line.

This is a levels map on that axis: a shared coordinate system for products, research papers, and your own architecture. I keep a longer version with full link lists in Agent Talk v2; here is the distilled shape.

Level 0 — Basic inference

Also: completion, stateless LLM, prompt–response.

Flow: user → prompt → model → answer. No memory, no tools, no loop.

Each call stands alone. Cheap and fast — and the model can talk about actions but cannot take them.
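A minimal sketch of that flow, assuming a hypothetical `Model` callable that stands in for any hosted or local endpoint (`fake_model` is a toy so the snippet runs offline, not a real API):

```python
from typing import Callable

# Hypothetical stand-in for a model endpoint: prompt in, text out.
Model = Callable[[str], str]


def fake_model(prompt: str) -> str:
    """Toy model used only so the sketch runs without a network call."""
    return f"echo: {prompt}"


def level0_answer(model: Model, user_input: str) -> str:
    # One prompt, one completion. No history, no tools, no loop.
    prompt = f"Answer concisely.\n\nUser: {user_input}"
    return model(prompt)


first = level0_answer(fake_model, "My name is Ada.")
second = level0_answer(fake_model, "What is my name?")
# Nothing from the first call reaches the second: each call stands alone.
```

Any memory at this level would have to be pasted back into the prompt by the caller, which is exactly what the higher levels automate.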

Examples: OpenAI Chat Completions or Anthropic Messages called with no tools (one prompt in, one completion out); batch summarizers, static rewrite/FAQ bots, and any wrapper that never enters a tool loop; local or hosted LLM endpoints used the same way (e.g. Ollama one-shot generate).

Readings: Applied AI — agent complexity spectrum (single LLM), Agentic AI maturity model — Level 1 / stateless.

Level 1 — Tool-augmented agent (ReAct)

Also: agentic loop, function-calling agent.

The model reasons, acts (tool call), observes the result, and repeats until it can answer. Still one agent, one thread; the loop lives inside a single turn from the user's point of view.

This is where most "agents" in production actually live: plugins, file tools, APIs, search — predictable failure modes, debuggable traces.

Ceiling: no delegation. One brain, one queue.
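The reason, act, observe cycle can be sketched as below. Everything here is illustrative: `scripted_model` is a hard-coded policy standing in for an LLM, and the `TOOLS` registry and the `"tool:<name>"` convention are assumptions made for the example, not any vendor's function-calling format.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool registry: name -> function taking one string argument.
TOOLS: Dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
}


def scripted_model(transcript: List[str]) -> Tuple[str, str]:
    """Toy policy standing in for an LLM. Returns (kind, payload).

    kind is "tool:<name>" to request a tool call, or "final" to answer.
    """
    observations = [t for t in transcript if t.startswith("observation:")]
    if not observations:
        return ("tool:add", "2+3")
    result = observations[-1].split(":", 1)[1].strip()
    return ("final", f"The sum is {result}.")


def react_loop(question: str, max_steps: int = 5) -> Tuple[str, List[str]]:
    transcript = [f"question: {question}"]
    for _ in range(max_steps):
        kind, payload = scripted_model(transcript)         # reason
        if kind == "final":
            return payload, transcript
        tool_name = kind.split(":", 1)[1]
        observation = TOOLS[tool_name](payload)             # act
        transcript.append(f"action: {tool_name}({payload})")
        transcript.append(f"observation: {observation}")    # observe
    return "Stopped: step budget exhausted.", transcript


answer, trace = react_loop("What is 2 + 3?")
```

The `max_steps` budget and the flat `trace` list are what keep this level debuggable: one thread, one ordered log.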

Examples: ChatGPT with browsing, code execution, or connectors: the model picks a tool, sees the result, and keeps going inside one user turn. Claude on the web with file and tool use. Cursor agent in the IDE (a single thread of tool calls until the answer). Perplexity-style search-augmented Q&A where retrieval is a tool loop, not a hand-built orchestrator. The OpenAI Responses API with hosted tools fits the same pattern for app builders.

Readings: Yao et al. — ReAct, Vellum — L2 / tool use, Agentic AI maturity — Level 2 / tool-augmented.

Level 2 — Multi-agent orchestration

Also: orchestrator–executor, agent teams, workflow graphs.

An orchestrator plans, splits work, and dispatches workers, in parallel or in series. Each worker may run its own Level-1 loop. The orchestrator collects results, evaluates them, and either finishes or fires another batch.

Triggers are not only chat: cron, webhooks, events. Specialization and parallelism show up here — and so do compounding debug costs when something fails three hops in.
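A rough sketch of plan, fan out, collect, evaluate, and re-dispatch, using only the standard library. The names `plan`, `worker`, `orchestrate`, and `on_event` are hypothetical, and the worker is a stub where a real one would run its own tool loop.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def plan(task: str) -> List[str]:
    """Hypothetical planner: split a task into independent subtasks."""
    return [part.strip() for part in task.split(";") if part.strip()]


def worker(subtask: str) -> Dict[str, str]:
    """Stub for a worker that would run its own tool loop."""
    if "fail" in subtask:
        return {"subtask": subtask, "status": "error", "output": ""}
    return {"subtask": subtask, "status": "ok", "output": subtask.upper()}


def orchestrate(task: str, max_rounds: int = 2) -> List[Dict[str, str]]:
    pending = plan(task)
    report: List[Dict[str, str]] = []
    for _ in range(max_rounds):
        if not pending:
            break
        with ThreadPoolExecutor(max_workers=4) as pool:
            results = list(pool.map(worker, pending))  # fan out, then collect
        report.extend(r for r in results if r["status"] == "ok")
        # Evaluate: failed subtasks go into the next batch.
        pending = [r["subtask"] for r in results if r["status"] == "error"]
    # Anything still pending after the round budget is reported, not dropped.
    report.extend(
        {"subtask": s, "status": "gave_up", "output": ""} for s in pending
    )
    return report


def on_event(event: Dict[str, str]) -> List[Dict[str, str]]:
    # The trigger is an event payload (cron, webhook), not a chat message.
    return orchestrate(event["task"])


report = on_event({"source": "cron", "task": "lint repo; run tests; fail deploy"})
```

Note where the debug cost comes from: the failure surfaces in the orchestrator's report, one hop away from the worker that produced it.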

Examples: LangGraph and LangChain-style graphs with explicit branches. CrewAI role-based teams. Microsoft AutoGen / AG2 multi-agent conversations. OpenAI Agents SDK for handoffs and parallel tasks. Product-shaped stacks: multi-step coding in the style of the GitHub Copilot coding agent, Devin-class autonomous SWE loops, or any “orchestrator + worker” setup you run from a scheduler rather than from a single chat bubble.

Readings: Agentic AI maturity — Level 4 / multi-agent, Agno — Level 4 agent teams, Hopkins — Level 4 agentic software, Vellum — L4 / collaborative.

Level 3 — Self-improving / meta-agent

Also: reflective or recursive agent.

After execution, a critic (or equivalent) judges output. Failure is not just "retry" — the system changes the plan, tools, prompts, or even invents new tools. Persistent memory means learning can span sessions, not just context windows.

This is the frontier in many labs: powerful, hard to audit, easy to drift toward the wrong objective. Human oversight at the behavior and policy layer matters.
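To make the difference from a plain retry concrete, here is a toy sketch: a critic turns a failure into a lesson, the lesson changes how the next attempt runs, and a JSON file carries it into the next session. The executor, the critic, and the file layout are all assumptions for illustration, not how any system named in this section works.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def load_memory(path: Path) -> Dict[str, List[str]]:
    return json.loads(path.read_text()) if path.exists() else {"lessons": []}


def execute(task: str, lessons: List[str]) -> str:
    """Toy executor: behavior changes only if a relevant lesson is in memory."""
    return task.strip().lower() if "normalize input" in lessons else task


def critic(output: str) -> List[str]:
    """Toy critic: returns lessons to apply, or an empty list if output passes."""
    return [] if output == output.strip().lower() else ["normalize input"]


def run_session(task: str, memory_path: Path, max_attempts: int = 3) -> Dict[str, object]:
    memory = load_memory(memory_path)
    output = task
    for attempt in range(1, max_attempts + 1):
        output = execute(task, memory["lessons"])
        new_lessons = critic(output)
        if not new_lessons:
            return {"output": output, "attempts": attempt}
        # Not a plain retry: the failure changes how the next attempt runs.
        for lesson in new_lessons:
            if lesson not in memory["lessons"]:
                memory["lessons"].append(lesson)
        memory_path.write_text(json.dumps(memory))  # outlives this session
    return {"output": output, "attempts": max_attempts}


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "memory.json"
    first = run_session("  Hello World ", store)    # learns on attempt 1
    second = run_session("  Another TASK ", store)  # starts with the lesson
```

The audit problem is visible even here: after a few sessions, behavior depends on a memory file that nobody reviewed line by line.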

Examples: ADAS searches over better agent prompts/structures. Reflexion agents add self-critique and verbal “reinforcement” before the next attempt. Voyager (Minecraft) pairs curriculum + skill library + verification so behavior compounds across time—not just one ReAct trace. DSPy optimizers tune prompts/programs from outcome metrics (closer to ML than a chat loop). Vendors that market “eval + auto-fix” on live traces are aiming at this band—usually with heavy human policy gates.

Readings: Agentic AI maturity — Level 5 / self-correcting, Autonomous AI scale (AAI) — arXiv, Self-evolving agents survey, ADAS — automated design of agentic systems.

Level 4 — Goal-directed autonomous AI

Level 3 tunes how work gets done. Level 4 owns what to pursue.

The human sets a broad outcome (often fuzzy), boundaries, and governance — not step-by-step tasks. The system maintains a goal hierarchy, reflects on progress toward the objective (not mere task ticks), and can spawn or retire sub-agents as the situation shifts.

Core risk: goal misspecification. A relentless optimizer plus a slightly wrong KPI is Goodhart in slow motion.
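One way to picture a goal hierarchy with human-set boundaries, as a sketch only. The `Goal` tree, the `Boundaries` cap, and the `reflect` step are hypothetical names, and in a real system the stalled and proposed sub-goals would come from the system's own reflection rather than from arguments.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Goal:
    name: str
    progress: float = 0.0  # 0.0 to 1.0, measured against the outcome
    children: List["Goal"] = field(default_factory=list)
    active: bool = True

    def rollup(self) -> float:
        """Mean progress over active children, or own value if there are none."""
        live = [c for c in self.children if c.active]
        if not live:
            return self.progress
        return sum(c.rollup() for c in live) / len(live)


@dataclass
class Boundaries:
    max_subgoals: int  # governance: set by the human, not by the system


def reflect(root: Goal, limits: Boundaries,
            stalled: List[str], proposals: List[str]) -> None:
    """Retire stalled sub-goals and spawn proposed ones, within the boundaries."""
    for child in root.children:
        if child.name in stalled:
            child.active = False
    for name in proposals:
        if sum(c.active for c in root.children) >= limits.max_subgoals:
            break  # boundary reached: no further spawning
        root.children.append(Goal(name))


root = Goal("reduce churn", children=[
    Goal("fix onboarding", progress=0.5),
    Goal("send more emails", progress=0.0),
])
before = root.rollup()
reflect(root, Boundaries(max_subgoals=2),
        stalled=["send more emails"],
        proposals=["improve support", "add referral program"])
active = [c.name for c in root.children if c.active]
```

The `progress` numbers are where misspecification bites: if they measure the wrong thing, the hierarchy reorganizes itself faithfully around the wrong target.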

Examples: Sakana's The AI Scientist (and the Nature write-up) chases a broad research outcome (ideas, experiments, write-ups) over a long arc. OpenAI Operator / ChatGPT agent-class browser automation: you state an outcome (“book this,” “fill this form”) and the system works across pages; the capability is now folded into ChatGPT rather than shipped as a separate product shell. Anthropic's computer use handles long UI manipulation chains. Sierra positions CX agents around goals, guardrails, and outcomes (resolution, not just reply): outcome-priced, cross-channel, with persistent customer context. These sit at research or early production; none is safe to leave hands-off indefinitely.

Readings: Chedal — goal-directed AI maturity (Level 5 framing), DeepMind — levels of AGI, Hopkins — Level 5 agentic systems, Sakana — AI Scientist.

Level 5 — Recursive superintelligence (ASI)

The improvement target is not only behavior but capability — weights, architecture, reasoning depth. The Gödel Machine and related frames are the intuition: change the core when you can justify that the change helps.

Early signal: research systems reporting measurable gains from recursive self-modification (e.g. Darwin Gödel Machine moving SWE-bench numbers). Not something you "ship" like a CRUD app — but the loop is less hypothetical than it was.
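The "change the core only when you can justify it" idea reduces to a gate. The sketch below is a toy hill-climb over a one-field config with a made-up benchmark; it uses an empirical score as the justification, and it is not the procedure of the Gödel Machine or of the Darwin Gödel Machine. All names are hypothetical.

```python
import random
from typing import Dict, List

Agent = Dict[str, int]  # hypothetical: the modifiable "core" as a tiny config


def benchmark(agent: Agent) -> int:
    """Toy fixed benchmark: higher is better, best when depth == 7."""
    return -abs(agent["depth"] - 7)


def propose(agent: Agent, rng: random.Random) -> Agent:
    """The agent proposes a change to itself (here, a random nudge)."""
    return {"depth": agent["depth"] + rng.choice([-1, 1])}


def self_improve(seed: Agent, steps: int, rng: random.Random) -> List[Agent]:
    lineage = [seed]
    for _ in range(steps):
        current = lineage[-1]
        candidate = propose(current, rng)
        # Gate: accept only when the measured score strictly improves.
        if benchmark(candidate) > benchmark(current):
            lineage.append(candidate)
    return lineage


lineage = self_improve({"depth": 1}, steps=200, rng=random.Random(0))
scores = [benchmark(agent) for agent in lineage]
```

Everything interesting at this level hides inside `benchmark`: the gate is only as trustworthy as the metric it checks.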

Examples: Darwin Gödel Machine (Sakana) reports SWE-bench gains (roughly 20% to 50% in the paper) from rewriting its own agent code in a measured loop: evidence that self-modification can compound capability, not just outputs. Historical framing: Schmidhuber’s Gödel Machine (self-change only when a proof says it helps). Nothing here is “deployed ASI”; these are controlled lab setups with narrow metrics.

Readings: Schmidhuber — Gödel Machine, Darwin Gödel Machine — Sakana AI, AAI scale — superintelligence framing, Self-evolving agents — path to ASI, DeepMind AGI paper — ASI tier.

Beyond Level 5

Engineering stops; philosophy and alignment start — collective ASI, orthogonal goals, civilizational scale. Useful to name so we do not confuse product roadmaps with those trajectories.

Examples (not products): Collective ASI — meshes of many frontier systems with emergent coordination (economy- or internet-scale), not a SKU. Orthogonal goals — optima that are not “misaligned” so much as unrelated to human values (Bostrom). Resource-dominant intelligence — when control of physical infrastructure matters more than chat UX (Russell on scalable oversight). These are scenario labels for policy and safety work, not shipping checklists.

Readings: Bostrom — Superintelligence (Wikipedia), Russell — Human Compatible (Wikipedia), Vellum — L5 / fully autonomous.

Summary

Same capability axis as the sections above — each row is how far the system gets on loops, parallelism, memory span, self-modification, and who sets the objective.

Level	Name	Trigger	Loop	Parallel	Memory	Self-modifying	Human role
0	Basic inference	User	None	No	No	No	Operator
1	Tool-augmented (ReAct)	User	Within turn	No	Per turn	No	Director
2	Multi-agent orchestration	User or event	Multi-agent	Yes	Short-term	No	Manager
3	Self-improving	Autonomous	Recursive	Yes	Persistent	Behavior & tools	Supervisor
4	Goal-directed autonomous	Objective	Continuous, hierarchical	Yes	Deep	Strategy & sub-goals	Governor
5	Recursive ASI	Internal	Explosive / unbounded	Yes	Unbounded	The model itself	Alignment (pre-set)

Where things land in 2026 (rough)

Rough coordinates on the same capability axis — same labels as the level sections, applied to well-known stacks.

System	Level
Raw API call	0
ChatGPT, Claude chat	1
Cursor agent	1–2
Copilot coding agent, Devin-class, SWE-agent	2
AutoGen, CrewAI, LangGraph	2
ADAS-style self-modifying stacks	3 (experimental)
Long-running research agents (e.g. AI Scientist class)	3–4 (research)
Darwin Gödel Machine	4–5 (proof-of-concept)
Deployed ASI	Does not exist

Most production is Level 1. A lot of VC deck energy is Level 2. The research edge is Level 3. Levels 4–5 are not operational products you can buy off the shelf.

Design principle

Do not skip levels on the capability axis. Orchestration without reliable single-agent ReAct falls over. Self-improvement without solid orchestration becomes expensive chaos.

Start at the lowest level that solves the problem. Complexity is liability.
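As a sketch, "lowest level that solves the problem" is a short checklist read from the top down. The `Needs` fields are my own shorthand for the powers each level adds on this axis, not a standard taxonomy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    tools: bool = False               # must act, not just answer
    parallel_or_events: bool = False  # fan-out, cron, webhooks
    self_modifying: bool = False      # changes its own plan, prompts, or tools
    owns_objective: bool = False      # pursues a fuzzy outcome, not assigned tasks


def lowest_level(needs: Needs) -> int:
    """Return the lowest level (0 to 4) on this article's axis that covers the needs."""
    if needs.owns_objective:
        return 4
    if needs.self_modifying:
        return 3
    if needs.parallel_or_events:
        return 2
    if needs.tools:
        return 1
    return 0


faq_bot = lowest_level(Needs())
support_agent = lowest_level(Needs(tools=True))
nightly_pipeline = lowest_level(Needs(tools=True, parallel_or_events=True))
```

Level 5 is left out on purpose: nothing in a product requirements list should map to it.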