When an agent feels brilliant, it is tempting to credit “the model” and move on.
That is half the story at best. In practice, agent intelligence behaves like a product: several factors multiply, and one weak term drags the whole thing down.
Below is the checklist I use when I am debugging behavior or designing a harness. It lines up with how I think about layers in the execution model and memory shape, but zoomed out to “what levers exist before you blame the weights.”
Tools that exist and fit the job
More tools are not automatically better: the wrong tools create noise. Still, the floor is simple: without tools, an agent can only talk.
Give it file readers, a terminal, search, structured APIs, or domain-specific actions, and it can act. The capability jump is not subtle.
The design question is not “how many tools” but whether the affordances match the task — tight schemas, clear errors, and guardrails so the model is not guessing at side effects.
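To make "tight schemas, clear errors, and guardrails" concrete, here is a minimal sketch of a single tool. Everything in it is an assumption for illustration: the `read_file_tool` name, the `ToolResult` shape, and the `/workspace` sandbox root are hypothetical and not taken from any particular agent framework.

```python
from dataclasses import dataclass
from pathlib import Path

ALLOWED_ROOT = Path("/workspace")  # hypothetical sandbox root


@dataclass
class ToolResult:
    ok: bool
    output: str = ""
    error: str = ""


def read_file_tool(args: dict) -> ToolResult:
    """Hypothetical 'read_file' tool: validates input before touching disk."""
    # Tight schema: exactly one required string argument, nothing else.
    if set(args) != {"path"} or not isinstance(args["path"], str):
        return ToolResult(ok=False, error="expected exactly one string argument: 'path'")

    target = (ALLOWED_ROOT / args["path"]).resolve()
    # Guardrail: refuse anything that resolves outside the sandbox root.
    if not target.is_relative_to(ALLOWED_ROOT.resolve()):
        return ToolResult(ok=False, error=f"path escapes sandbox: {args['path']}")

    # Clear error: say what went wrong instead of raising a bare exception.
    if not target.is_file():
        return ToolResult(ok=False, error=f"no such file: {args['path']}")
    return ToolResult(ok=True, output=target.read_text())
```

The point of the shape is that every failure comes back as a message the model can read and react to, so it is not left guessing about what the call did.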
Context you actually feed it
This is the part people hand-wave as “just paste more stuff.”
Rich context is not length for its own sake. It is grounding: repo layout, constraints, prior decisions, definitions of done, and what *not* to touch. Garbage in, garbage out applies harder here than in chat, because the agent will act on what it believes is true.
I treat CLAUDE.md / AGENTS.md style notes and targeted codebase reads as part of the product, not documentation cosplay.
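One way to keep grounding deliberate rather than "paste more stuff" is to give each kind of context its own slot. The sketch below is only an illustration: `TaskContext`, its field names, and the rendered headings are hypothetical, chosen to mirror the list above.

```python
from dataclasses import dataclass, field


@dataclass
class TaskContext:
    """Hypothetical container for the grounding an agent gets before it acts."""
    repo_layout: str
    constraints: list[str] = field(default_factory=list)
    prior_decisions: list[str] = field(default_factory=list)
    definition_of_done: list[str] = field(default_factory=list)
    do_not_touch: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the non-empty sections as a plain-text preamble."""
        sections = [
            ("Repo layout", [self.repo_layout]),
            ("Constraints", self.constraints),
            ("Prior decisions", self.prior_decisions),
            ("Definition of done", self.definition_of_done),
            ("Do not touch", self.do_not_touch),
        ]
        lines = []
        for title, items in sections:
            if not items:
                continue  # skip empty sections instead of padding the prompt
            lines.append(f"## {title}")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)
```

An empty slot is then a visible gap you can notice and fill, instead of something the agent silently guesses at.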
System prompts and standing instructions
Think of this as the agent’s default operating procedure: role, priorities, tool policy, stop conditions, safety boundaries, and how aggressively to verify.
Vague instructions produce vague behavior — even with a strong model. Crisp instructions do not replace reasoning, but they reduce the search space for what “good” means on *your* system.
Planning before acting
A brittle pattern is: first idea → immediate tool blast → hope.
A stronger pattern is decomposition: name sub-goals, order dependencies, decide what evidence would change the plan, then execute. That can be implicit in the model, explicit in the prompt (“plan first, then act”), or structured in the harness.
Classic “reason then act” loops are one concrete version of this idea — see the ReAct framing in Yao et al., 2022. For “think longer on paper,” the chain-of-thought line of work is the conceptual ancestor — Wei et al., 2022.
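When the plan is structured in the harness, "name sub-goals, order dependencies" can be as small as a dependency map and a topological sort. This sketch uses Python's standard `graphlib`; the sub-goal names are hypothetical and stand in for whatever the task actually needs.

```python
from graphlib import TopologicalSorter

# Hypothetical sub-goals for a "fix a failing test" task.
# Each key maps to the set of sub-goals that must finish before it.
plan = {
    "reproduce_failure": set(),
    "locate_cause": {"reproduce_failure"},
    "write_fix": {"locate_cause"},
    "run_tests": {"write_fix"},
}

# static_order() yields the sub-goals with every dependency before its dependents.
order = list(TopologicalSorter(plan).static_order())
print(order)
# → ['reproduce_failure', 'locate_cause', 'write_fix', 'run_tests']
```

A cycle in the map raises `graphlib.CycleError`, which is a useful early signal that the plan itself is incoherent.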
Memory and state across turns
Single-turn cleverness is cheap. Long-horizon usefulness is state management: what to remember, where it lives, how to retrieve it, and how to mark progress without losing the thread.
That can be scratchpads, files, databases, temporal memory tools, or a vault — the implementation varies, but the requirement does not. If every turn starts from zero, you are paying model tax to re-derive facts you already had.
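As one of the simplest possible implementations, here is a sketch of a file-backed scratchpad. The `AgentState` class and its JSON layout are assumptions made for illustration, not a reference to any existing memory tool.

```python
import json
from pathlib import Path


class AgentState:
    """Hypothetical file-backed scratchpad that survives across turns."""

    def __init__(self, path: Path):
        self.path = path
        self.data = {"facts": {}, "done": []}
        if path.exists():
            # Resume from the previous turn instead of starting from zero.
            self.data = json.loads(path.read_text())

    def remember(self, key: str, value: str) -> None:
        """Store a fact so later turns do not have to re-derive it."""
        self.data["facts"][key] = value
        self._save()

    def mark_done(self, step: str) -> None:
        """Record progress; marking the same step twice is a no-op."""
        if step not in self.data["done"]:
            self.data["done"].append(step)
            self._save()

    def _save(self) -> None:
        self.path.write_text(json.dumps(self.data, indent=2))
```

A second `AgentState` opened on the same path sees the same facts and progress, which is the whole requirement: the next turn starts from what was already known.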
Self-correction and reflection
Without a feedback loop, one early mistake becomes a snowball: confident wrong assumptions drive more actions, which “confirm” the wrong world model.
Useful agents need a place to ask: Is this actually done? Does this output match the constraints? Did the tool return something plausible? If not, revise the plan or try a different path before the user has to intervene.
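In harness terms, that "place to ask" can be a check that runs after every attempt and feeds its complaint back into the next one. The sketch below assumes two hypothetical callables: `attempt`, which produces an output given optional feedback, and `check`, which returns `None` on success or a description of what is wrong.

```python
from typing import Callable, Optional


def run_with_checks(
    attempt: Callable[[Optional[str]], str],
    check: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> str:
    """Call `attempt`, validate with `check`, and retry with the feedback."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        output = attempt(feedback)
        feedback = check(output)  # None means the output passed
        if feedback is None:
            return output
    # Stop condition: fail loudly rather than loop forever.
    raise RuntimeError(f"still failing after {max_attempts} attempts: {feedback}")
```

The retry cap matters as much as the check: without it, a wrong world model can burn attempts indefinitely.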
Model quality as the floor
Tools and prompts cannot invent stable reasoning that the model cannot sustain. They amplify what is there.
If the underlying model struggles with multi-step logic or disciplined tool use, the ceiling drops — no matter how fancy the harness looks.
The multiplication trap
A useful mental model:
effective agent ≈ model × tools × context × instructions
You can extend the product with planning, memory, and reflection as separate multipliers in the same spirit.
If any critical term is near zero, you do not get “slightly worse.” You often get failure that looks like competence — fluent language wrapped around the wrong actions.
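A bit of arithmetic shows why a product punishes a weak term more than intuition suggests. The 0-to-1 scores here are made-up numbers for illustration, not measurements of anything.

```python
import math


def effective_score(factors: dict[str, float]) -> float:
    """Multiply the per-factor scores, following the mental model above."""
    return math.prod(factors.values())


# Hypothetical scores: a solid setup, and the same setup with starved context.
strong = {"model": 0.9, "tools": 0.9, "context": 0.9, "instructions": 0.9}
weak_context = {**strong, "context": 0.1}

print(round(effective_score(strong), 4))        # → 0.6561
print(round(effective_score(weak_context), 4))  # → 0.0729
```

If the factors simply averaged, the second setup would still score 0.7. Multiplied, it drops to roughly a ninth of the strong setup, even though three of the four terms are unchanged.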
So when something misbehaves, I do not start by swapping the model. I walk the list (tools, context, instructions, planning, memory, reflection), and only then, if the rest of the stack holds up, do I tune the foundation: the model itself.