Building Long-Running Agents

@dalexeenko | July 1, 2026

Several weeks ago I built Ants and wrote about it: the loop, SSE streaming, sandboxing, verifiers. That project was mostly an exploration of how to make agents work. Since then I've been thinking about how to make agents survive restarts and execute more reliably.

The Ants harness ran the loop in-process: you POST a prompt, a background task spins up a ReAct loop, and events stream back over SSE. It works fine until the process restarts: when that happens, the agent that was a few tool calls into a ten-minute task loses all of its work. For a background agent that's a deal breaker.

A few notes on how to deal with this.

Separate the control plane from execution

The first thing is to have a hard split between two layers: a control plane — an API that accepts prompts, starts and aborts runs, and streams events — and worker execution, the thing that actually grinds through LLM calls and tool batches.

The seam between them is a single interface, RunExecutor, with two implementations: InProcessRunExecutor and TemporalRunExecutor (this is the durable one).

Browser / CLI
     │
     ▼
Control plane (API) ──────▶ Postgres (runs · events)
     │                            ▲
     ▼                            │ append / replay
RunExecutor ──┬── InProcess       │
              └── Temporal ───────┘
                    │
                    ▼
               Tool batch ──▶ discovery ──▶ remote daemon(s)
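
To make the seam concrete, here is a minimal sketch in Python. Only the names RunExecutor and InProcessRunExecutor come from the post; the method names (start, abort), the ControlPlane class, and the placeholder loop are assumptions for illustration, not the actual harness code.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Protocol


class RunExecutor(Protocol):
    """The seam between control plane and execution (method names are assumed)."""

    async def start(self, run_id: str, prompt: str) -> None: ...
    async def abort(self, run_id: str) -> None: ...


@dataclass
class InProcessRunExecutor:
    """Runs the loop as a background task in the API process; not durable."""

    tasks: dict[str, asyncio.Task] = field(default_factory=dict)

    async def start(self, run_id: str, prompt: str) -> None:
        self.tasks[run_id] = asyncio.create_task(self._loop(run_id, prompt))

    async def abort(self, run_id: str) -> None:
        task = self.tasks.pop(run_id, None)
        if task is not None:
            task.cancel()

    async def _loop(self, run_id: str, prompt: str) -> None:
        # Placeholder for the ReAct loop; a real one would call the LLM and tools.
        await asyncio.sleep(3600)


class ControlPlane:
    """Hypothetical API layer: depends only on the interface, so executors swap freely."""

    def __init__(self, executor: RunExecutor) -> None:
        self.executor = executor

    async def post_prompt(self, run_id: str, prompt: str) -> None:
        await self.executor.start(run_id, prompt)

    async def abort_run(self, run_id: str) -> None:
        await self.executor.abort(run_id)
```

A durable implementation would satisfy the same two methods by starting or cancelling a workflow instead of a local task, so the control plane would not need to change.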

Durable execution is about shape

The lesson I underestimated: making a loop durable is less about the durability engine and more about how you shape the loop. To run the ReAct loop as a durable workflow, I had to break it into discrete, replayable steps:

appendUserMessage → runAssistantTurn → runToolBatch → repeat → finalizeRun

Each step becomes an activity; the workflow code that stitches them together has to be deterministic. Once you decompose the loop, everything becomes simpler: if a worker crashes, another one picks up the workflow and replays it to exactly where it was.
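
The replay idea can be sketched without any durability engine. The toy below is an assumption-heavy illustration, not how Temporal works internally: Journal stands in for persisted workflow history, Replayer.step returns a recorded result when one exists and otherwise runs the activity and records it. The llm and tools callables and the message shapes are hypothetical.

```python
from typing import Any, Callable


class Journal:
    """Stands in for the durable history an engine would persist."""

    def __init__(self) -> None:
        self.results: list[Any] = []


class Replayer:
    """Replays recorded step results in order, then runs new steps live."""

    def __init__(self, journal: Journal) -> None:
        self.journal = journal
        self.cursor = 0

    def step(self, activity: Callable[[], Any]) -> Any:
        if self.cursor < len(self.journal.results):
            # Already completed before the crash: reuse the result, skip the work.
            result = self.journal.results[self.cursor]
        else:
            result = activity()
            self.journal.results.append(result)
        self.cursor += 1
        return result


def run_workflow(ctx: Replayer, prompt: str, llm, tools) -> dict:
    """Deterministic glue: all side effects happen inside ctx.step(...)."""
    messages = ctx.step(lambda: [prompt])                      # appendUserMessage
    while True:
        turn = ctx.step(lambda: llm(messages))                 # runAssistantTurn
        messages = messages + [turn["text"]]
        if not turn["tool_calls"]:
            break
        results = ctx.step(lambda: tools(turn["tool_calls"]))  # runToolBatch
        messages = messages + results
    return ctx.step(lambda: {"status": "done", "messages": len(messages)})  # finalizeRun
```

If run_workflow raises partway through, calling it again with a fresh Replayer over the same Journal skips every step that already completed and resumes at the first one that did not. That only works because the glue code takes the same path given the same recorded results, which is the determinism requirement.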

This approach to looping is one of the key lessons, even without a durability engine. Forcing yourself to name every step and make it resumable is the kind of constraint that improves the whole design.

Streaming

In the previous harness, Ants, streaming and execution were mixed together: the loop pushed events straight to the SSE response. That can't work when the executor is a worker on another machine and the client is talking to a stateless API.

So I decoupled events from execution entirely. Activities append to a run event store in Postgres. The API's job is just to replay that store to whoever connects and tail new events as they land. If you close the tab and reopen it an hour later, you'll still get the full history plus the live stream — because the stream was never tied to the connection in the first place.
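
A rough sketch of the replay-then-tail pattern, using an in-memory list in place of the Postgres table. The class name, the event shapes, and the "run_finished" terminal event are assumptions made for the example; a real API would read rows by sequence number and would need its own way to learn about new ones.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class RunEventStore:
    """In-memory stand-in for an append-only, ordered event table."""

    events: list[dict] = field(default_factory=list)
    _changed: asyncio.Condition = field(default_factory=asyncio.Condition)

    async def append(self, event: dict) -> int:
        """Called by activities; returns the sequence number of the new event."""
        async with self._changed:
            self.events.append(event)
            self._changed.notify_all()
            return len(self.events)

    async def stream(self, after: int = 0):
        """Replay everything past `after`, then tail until a 'run_finished' event."""
        cursor = after
        while True:
            async with self._changed:
                # Returns immediately if history exists, otherwise waits for appends.
                await self._changed.wait_for(lambda: len(self.events) > cursor)
                batch = self.events[cursor:]
            for event in batch:
                cursor += 1
                yield cursor, event
                if event["type"] == "run_finished":
                    return
```

A client that reconnects passes the last sequence number it saw as `after` and receives only what it missed, followed by live events. Nothing in the producer knows or cares whether anyone is connected.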

Remote tools

The other piece is running tools somewhere other than the agent server, in a remote execution environment. Daemons connect back to the harness and receive tool calls over SSE. That means building sticky sessions, routing, and failover, and keeping them out of the durable-execution layer, because they're different concerns that fail in different ways.

If a daemon that runs the remote execution dies mid-batch, the natural step is to retry the tool call that was in progress. But that call might have already written a file, sent an email, or hit an API, so retrying it may repeat a side effect from an unknown intermediate state. So the harness restarts the whole tool batch on a replacement daemon instead of resuming a partial one. This is where I picked the at-least-once guarantee: make the unit of retry something you can afford to repeat.
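
The batch-level retry can be sketched as follows. This is an illustration under stated assumptions: DaemonDied, the callable-per-daemon shape, and the dict-based tool calls are all hypothetical, and real failover would involve the routing layer rather than a simple list of daemons.

```python
from typing import Callable, Sequence


class DaemonDied(Exception):
    """Raised by a (hypothetical) daemon client when the connection drops mid-call."""


ToolCall = dict
Daemon = Callable[[ToolCall], str]


def run_tool_batch(batch: Sequence[ToolCall], daemons: Sequence[Daemon]) -> list[str]:
    """Run the whole batch on one daemon; on failure, restart from call 0 on the next."""
    for daemon in daemons:
        try:
            return [daemon(call) for call in batch]
        except DaemonDied:
            # Discard partial results and move on; never resume mid-batch.
            continue
    raise RuntimeError("no daemon completed the batch")
```

Note the consequence: calls that finished on the dead daemon run again on the replacement. That is what at-least-once means here, and it is why the batch has to be something that tolerates being repeated.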