Hive Prewarm — Subprocess Pooling for the First Token

The thing that made Hive feel “slow on easy questions” was never the model. It was startup — each fresh query() path paying several seconds to stand up a subprocess before the first token.

Prewarm fixes that by treating the subprocess as inventory: spawn early, hold a handle, consume it on the next Hive turn, then refill behind the user.

What we are actually warming

The Claude Agent SDK lets you call startup({ options }) instead of going straight to query(). That returns a small handle with query(prompt) on it: the same streaming shape as a normal call, but the expensive init happens before you type.

Hive wraps that in a warm pool (warm-pool.ts in agent-talk-v2): thread-scoped entries with a TTL, plus a global queue of anonymous handles anyone can take if their thread does not have a fresher one yet.
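A minimal sketch of that two-tier lookup, assuming hypothetical names throughout (WarmPool, WarmHandle, putForThread, and take are illustrative, not the actual exports of warm-pool.ts). Time is passed in explicitly so the TTL behaviour is easy to follow.

```typescript
// Hypothetical shapes for illustration; the real warm-pool.ts may differ.
interface WarmHandle {
  id: string;
}

interface ThreadEntry {
  handle: WarmHandle;
  expiresAt: number; // epoch ms after which the entry counts as stale
}

class WarmPool {
  private byThread = new Map<string, ThreadEntry>();
  private globalQueue: WarmHandle[] = [];
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  putForThread(threadId: string, handle: WarmHandle, now: number): void {
    this.byThread.set(threadId, { handle, expiresAt: now + this.ttlMs });
  }

  putGlobal(handle: WarmHandle): void {
    this.globalQueue.push(handle);
  }

  // Prefer a live thread-scoped handle, otherwise take an anonymous one.
  take(threadId: string, now: number): WarmHandle | undefined {
    const entry = this.byThread.get(threadId);
    if (entry) {
      this.byThread.delete(threadId); // a handle is consumed at most once
      if (entry.expiresAt > now) {
        return entry.handle;
      }
      // Expired: a real pool would also dispose of the subprocess here.
    }
    return this.globalQueue.shift();
  }
}
```

The thread entry is deleted whether or not it is still fresh, so a stale handle can never be handed out twice.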

Three moments that matter

1. Server boot: global pool

On launch we seed a small pool (default two) of Hive subprocesses. Each spawn runs through a factory that builds fresh mcpServers every time, because in-process MCP transports cannot safely attach to two subprocesses at once. Sharing them caused real “already connected” crashes in earlier experiments.

When a handle is taken, the pool spawns another so the next visitor still gets a head start.
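The seed-and-refill behaviour can be sketched as follows. This is an illustration under assumptions: GlobalPool, its methods, and the McpServers shape are hypothetical, and spawning is modelled as synchronous, whereas a real subprocess spawn would be asynchronous.

```typescript
// Hypothetical stand-ins; real MCP server configs and handles are richer.
type McpServers = Record<string, { label: string }>;

interface PooledHandle {
  id: number;
  mcpServers: McpServers;
}

class GlobalPool {
  private queue: PooledHandle[] = [];
  private nextId = 1;
  private readonly size: number;
  private readonly buildMcpServers: () => McpServers;

  constructor(size: number, buildMcpServers: () => McpServers) {
    this.size = size;
    this.buildMcpServers = buildMcpServers;
    this.refill(); // seed at boot
  }

  private spawn(): PooledHandle {
    // Call the factory on every spawn so no transport object is ever
    // shared between two subprocesses.
    return { id: this.nextId++, mcpServers: this.buildMcpServers() };
  }

  private refill(): void {
    while (this.queue.length < this.size) {
      this.queue.push(this.spawn());
    }
  }

  take(): PooledHandle | undefined {
    const handle = this.queue.shift();
    this.refill(); // top the pool back up behind the caller
    return handle;
  }

  available(): number {
    return this.queue.length;
  }
}
```

Passing a factory rather than a prebuilt mcpServers object is the design point: each handle gets its own instances by construction.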

2. New thread: first message

Opening a thread with Hive triggers preWarmHiveForNewThread: we build a standalone tool bundle (app UI, coding worker, Perplexity-style tools, Obsidian, etc.) and a pre-baked system prompt via buildHivePreWarmSystemPrompt.

That prompt includes static sections and durable disk state (problems, roadmap, goals, daily state). It defers the most volatile injections — live memory block, Bagus profile, active task label — because those are appended at real turn time if the warm handle is actually consumed. That keeps the subprocess honest without rebuilding the world on every keystroke.
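One way to picture the split between baked-in and deferred content is sketched below. The function names, field names, and section headings are all hypothetical, and this is not the signature of buildHivePreWarmSystemPrompt. How the real code attaches the volatile pieces at turn time is not specified here; the sketch assumes they travel with the turn's prompt text.

```typescript
// Hypothetical field names; the real state objects are not shown in this post.
interface DurableDiskState {
  problems: string;
  roadmap: string;
  goals: string;
  dailyState: string;
}

interface VolatileTurnContext {
  memoryBlock: string;
  profile: string;
  activeTaskLabel: string;
}

// Built once at prewarm time: static sections plus durable disk state only.
function sketchPreWarmPrompt(staticSections: string[], disk: DurableDiskState): string {
  return [
    ...staticSections,
    "## Problems\n" + disk.problems,
    "## Roadmap\n" + disk.roadmap,
    "## Goals\n" + disk.goals,
    "## Daily state\n" + disk.dailyState,
  ].join("\n\n");
}

// Built at real turn time, only when the warm handle is actually consumed.
function sketchTurnPrompt(userMessage: string, ctx: VolatileTurnContext): string {
  return [
    "## Live memory\n" + ctx.memoryBlock,
    "## Profile\n" + ctx.profile,
    "## Active task\n" + ctx.activeTaskLabel,
    userMessage,
  ].join("\n\n");
}
```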

3. After every Hive reply: next turn

When a turn finishes, we fire-and-forget another preWarm with the full MCP bundle so the next message matches normal capability. Casual chat stops feeling like you are waiting on a compiler between beats.

Tradeoffs that are deliberate

First-turn prewarm drops the heaviest external MCPs (browser CDP, Figma, that class of connector). Wiring them costs about one to two seconds and they are rare on the very first reply. They appear once the full post-turn prewarm runs, so we optimize the common path without permanently hiding tools.
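As a rough illustration of that choice, the bundle selection might look like the sketch below. McpSpec, the heavy flag, and the phase labels are assumptions made for the example, not names from the codebase.

```typescript
// Hypothetical descriptor: "heavy" marks connectors that are slow to wire up.
interface McpSpec {
  label: string;
  heavy: boolean;
}

type PreWarmPhase = "first_turn" | "post_turn";

// First-turn prewarm skips heavy connectors; post-turn prewarm keeps everything.
function selectMcpBundle(all: McpSpec[], phase: PreWarmPhase): McpSpec[] {
  return phase === "first_turn" ? all.filter((spec) => !spec.heavy) : all.slice();
}
```

Keeping the decision in one function means the "heavy" list lives in a single place, and the post-turn path needs no special cases.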

Some intents skip the warm handle altogether: when Haiku classifies a turn as needing a different MCP slice (site scrape, roadmap draft, vault writes, thread broadcast, inline notes), the freshly filtered mcpServers set wins. Prewarm must never override the correct tool inventory for the job.
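The gate itself can be very small. In this sketch the intent labels are hypothetical stand-ins for whatever the classifier actually returns:

```typescript
// Hypothetical intent labels; the real classifier output is not shown here.
type TurnIntent =
  | "chat"
  | "site_scrape"
  | "roadmap_draft"
  | "vault_write"
  | "thread_broadcast"
  | "inline_note";

// Intents that need their own filtered mcpServers and therefore a cold start.
const NEEDS_FILTERED_MCP: ReadonlySet<TurnIntent> = new Set<TurnIntent>([
  "site_scrape",
  "roadmap_draft",
  "vault_write",
  "thread_broadcast",
  "inline_note",
]);

function shouldConsumeWarmHandle(intent: TurnIntent, hasWarmHandle: boolean): boolean {
  // Correct tool inventory beats latency: skip the warm handle when the
  // intent calls for a different MCP slice.
  return hasWarmHandle && !NEEDS_FILTERED_MCP.has(intent);
}
```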

Side effects still land on the right thread

Global handles start life with a callback slot that is not yet bound to any thread. Right before handle.query(), we merge the real thread’s artifact, Obsidian, and browser callbacks into that slot, so pushes and previews do not disappear into a no-op when you have “borrowed” a global subprocess.
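A minimal sketch of the slot idea, with hypothetical names (CallbackSlot, bindSlot, and the callback signatures are illustrative). Tools capture the slot rather than the callbacks themselves, so rebinding the slot later redirects side effects without rebuilding the tools.

```typescript
// Hypothetical callback surface; the real one likely carries richer payloads.
interface ThreadCallbacks {
  pushArtifact(content: string): void;
  openInObsidian(path: string): void;
  previewInBrowser(url: string): void;
}

interface CallbackSlot {
  current: ThreadCallbacks;
}

// A global handle starts with callbacks that do nothing.
function createEmptySlot(): CallbackSlot {
  return {
    current: {
      pushArtifact: () => {},
      openInObsidian: () => {},
      previewInBrowser: () => {},
    },
  };
}

// Merge the real thread's callbacks in just before the handle is queried.
function bindSlot(slot: CallbackSlot, real: Partial<ThreadCallbacks>): void {
  slot.current = { ...slot.current, ...real };
}

// A tool built at spawn time reads the slot at call time, not at build time.
function makeArtifactTool(slot: CallbackSlot): (content: string) => void {
  return (content) => slot.current.pushArtifact(content);
}
```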

If a warm handle throws, we fall back to a normal cold query(). Prewarm is an optimization, not a correctness cliff.
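The fallback can be sketched as a thin wrapper. For brevity this models a query as a single Promise of text rather than a stream, and QueryFn and queryWithFallback are hypothetical names. A streaming version would also need to decide what to do with tokens already emitted before the failure.

```typescript
// Simplified: a query resolves to the full text instead of streaming it.
type QueryFn = (prompt: string) => Promise<string>;

interface QueryOutcome {
  text: string;
  path: "warm" | "cold";
}

async function queryWithFallback(
  prompt: string,
  warm: QueryFn | undefined,
  cold: QueryFn,
): Promise<QueryOutcome> {
  if (warm) {
    try {
      return { text: await warm(prompt), path: "warm" };
    } catch (_err) {
      // Warm handle failed: fall through to the cold path below.
    }
  }
  return { text: await cold(prompt), path: "cold" };
}
```

Errors from the cold path are deliberately not caught, so a genuine failure still surfaces to the caller.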

What you see in the client

The thread server emits pre_warm_start, pre_warm_ready, and pre_warm_failed — and replays them if you connect late — so latency work is visible instead of a black box.
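Replay for late connections can be sketched as an event log that hands new subscribers the backlog before any live events. The three event names come from the text above; the class, its methods, and the payload fields are hypothetical.

```typescript
type PreWarmEventType = "pre_warm_start" | "pre_warm_ready" | "pre_warm_failed";

// Hypothetical payload; the real events may carry more detail.
interface PreWarmEvent {
  type: PreWarmEventType;
  threadId: string;
}

type PreWarmListener = (ev: PreWarmEvent) => void;

class PreWarmEventLog {
  private backlog: PreWarmEvent[] = [];
  private listeners = new Set<PreWarmListener>();

  emit(ev: PreWarmEvent): void {
    this.backlog.push(ev);
    this.listeners.forEach((listener) => listener(ev));
  }

  // Late subscribers first receive everything they missed, in order.
  subscribe(listener: PreWarmListener): () => void {
    this.backlog.forEach((ev) => listener(ev));
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }
}
```

A production version would probably cap or expire the backlog per thread; the sketch keeps everything for clarity.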

One line

Prewarm is Hive’s subprocess inventory layer: pool at boot, warm on thread open, refill after every turn — so the first token on a simple message behaves like chat, not like booting a VM.

If you want the adjacent systems story, two-layer memory is how facts and vault stay separated; Ivy is how structured context lands in those layers in the first place.