Most AI chat products stream well only as long as the original request stays alive. Refresh the page, switch networks, or lose the tab at the wrong moment, and the response disappears with the connection. Sometimes the model keeps running in the backend. Sometimes it does not. Either way, the user has no clear mental model of what survived.
We wanted a stricter contract for Limerence. A stream should belong to the chat turn itself, not to one fragile HTTP request. (A stream is the flow of response tokens the AI sends back as it generates an answer; a turn is one user message and the agent response it triggers; durable means the response is saved to disk, not held in memory that vanishes with the connection.) If the browser disconnects, the output that already arrived should still exist. If the user comes back, the client should resume from durable state. If the backend dies mid-run, recovery should make that failure explicit instead of leaving a hanging spinner forever.
One Turn, One Stream, Durable State
Limerence treats streaming as a background process with durable state. A request registers a stream, enqueues a chat run into a job queue, and watches that stream while a background worker (a separate process that runs the AI agent independently of the browser) persists chunks (the individual pieces of the response as they arrive) into a SQLite-backed stream store. On reconnect, the client asks the server for the active stream for that chat and resumes from there.
The design choice that matters is giving each user turn a stable stream identifier and letting the backend decide whether that stream should be watched, reused, reopened, cancelled, or failed. That single identity is what makes reconnects, idempotency, and recovery fit together. (Idempotency is the guarantee that submitting the same message twice does not run the work twice.)
Read the flow top to bottom. The same identity is what lets the system suppress duplicate work, preserve partial output, and recover honestly after interruption.
1. **Request binds the stream to the user turn.** The stream is created or reused before generation starts, so the identity belongs to the turn instead of the transport. A refresh after this point still has a real stream to reconnect to.
2. **Dispatcher enforces one active run per chat.** A duplicate submit watches the existing stream instead of launching another worker path: same turn, same stream, no accidental double execution.
3. **Worker continues independently of the request.** Model execution no longer depends on the browser keeping the original request alive. The transport can die while useful work keeps going.
4. **Chunks are written to durable delivery storage immediately.** Already-produced output survives reconnects because delivery state is saved as it arrives. Partial output is no longer trapped inside one socket.
5. **Reconnect replays saved chunks, then live-tails new ones.** The client asks for the active stream for the chat and resumes from durable state instead of reviving a dead session. Replay, idempotency, and recovery align on one identity.
Request Lifetime Is Not Generation Lifetime
Streaming gets messy when request lifetime and generation lifetime are treated as the same thing. They are not. The browser can go away while the model is still generating. The network can drop after half the answer has already been produced. A worker can crash after work started but before the user sees the final chunk.
For a data product, those failures are worse than a small UX glitch. If a user asks a long question against a large schema, we may spend real time assembling context, running the agent, and composing an answer. Throwing that state away because one request died is wasteful. Rerunning the same turn blindly is also dangerous because duplicate execution creates its own ambiguity: did the system retry, or did the user accidentally submit twice? We have been on the receiving end of that confusion.
We also had another constraint. Some conversations pause because the system needs a human decision or a missing piece of input. When that happens, the next assistant turn is not a brand-new conversation branch. It is a continuation of the same turn. (The backend enforces this explicitly: continuation is only valid when the conversation is actually waiting for user input, not whenever the client feels like sending another assistant message.) So the system needs a durable notion of "the stream for this turn" that survives across pauses.
**Request-bound streaming.** The original request implicitly owns the response lifecycle.

- refresh or disconnect can sever the only visible copy of the output
- retries can accidentally rerun the same turn
- failure semantics depend on transport timing instead of explicit state

**Durable streaming.** The user turn owns a stable stream identity and the backend owns the lifecycle.

- persisted chunks survive browser disconnects
- reconnect resumes from durable state instead of guessing
- cancellation, replay, and recovery follow explicit state transitions
The Solution: Register, Queue, Watch
The core flow is three steps:
1. Register a durable stream for the turn.
2. Queue the actual chat run in a background job system that allows only one active run per chat.
3. Watch the durable stream while a background worker persists chunks into it.
That split matters. The request does not own model execution. It owns registration and observation.
When a user message arrives, the backend derives the stream's identifier from the user turn itself. So the stream belongs to the turn, not to the request that happened to carry it. For continuations, the system reuses the identifier from the original user turn, because the assistant is still logically finishing the same piece of work.
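As a sketch of that idea (the function name and id scheme here are invented for illustration, not Limerence's actual derivation), a stable stream id can be a pure function of the turn, so every request carrying the same turn maps to the same stream:

```python
import hashlib

def stream_id_for_turn(chat_id: str, turn_id: str) -> str:
    """Derive a stable stream id from the user turn, not the request.

    A first submit, a duplicate submit, and a later assistant
    continuation of the same turn all resolve to the same identity.
    """
    digest = hashlib.sha256(f"{chat_id}:{turn_id}".encode()).hexdigest()
    return f"stream-{digest[:16]}"

# A retry of the same turn yields the same id; a new turn yields a new one.
assert stream_id_for_turn("chat-1", "turn-7") == stream_id_for_turn("chat-1", "turn-7")
assert stream_id_for_turn("chat-1", "turn-7") != stream_id_for_turn("chat-1", "turn-8")
```

Because the id is deterministic, idempotency checks reduce to a lookup: if a stream with this id already exists, the turn was already dispatched.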
At the queue layer, we enforce an "at most one active or queued run per chat" invariant. The dispatch mechanism enforces this directly, not as a polite convention between client and server.
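One way to make that invariant structural rather than conventional is to let the store enforce it, for example with a partial unique index in SQLite (a sketch under assumed table names, not Limerence's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, chat_id TEXT, status TEXT)")
# At most one 'queued' or 'running' row per chat, enforced by the database
# itself rather than by client-side politeness.
conn.execute(
    "CREATE UNIQUE INDEX one_active_run_per_chat ON runs (chat_id) "
    "WHERE status IN ('queued', 'running')"
)

conn.execute("INSERT INTO runs (chat_id, status) VALUES ('chat-1', 'queued')")
try:
    conn.execute("INSERT INTO runs (chat_id, status) VALUES ('chat-1', 'running')")
    collided = False
except sqlite3.IntegrityError:
    collided = True  # a second active run for the same chat is rejected
assert collided

# Terminal rows do not block new work for the chat.
conn.execute("UPDATE runs SET status = 'completed' WHERE chat_id = 'chat-1'")
conn.execute("INSERT INTO runs (chat_id, status) VALUES ('chat-1', 'queued')")
```

With this shape, a duplicate dispatch surfaces as a constraint violation the dispatcher can translate into "watch the existing stream" instead of silently running twice.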
From there, the dispatcher does one of four things:
- If the stream is new, it registers it, enqueues a background run, and returns a streaming response that watches the durable stream.
- If the stream is already `queued` or `running`, it watches the existing stream and does not enqueue a duplicate run.
- If the stream is terminal and the incoming message is another user submission, it watches the terminal stream and does not reopen it. (A stream is terminal when it has reached a final state: `completed`, `failed`, or `cancelled`.)
- If the stream is terminal and the incoming message is a valid assistant continuation, it reopens the stream and queues a new background run.
Input: an incoming message arrives for a chat. The dispatcher checks stream status and message type before it does anything else, then takes one of four actions:

- New stream: register the stream, enqueue work, return a watcher for that stream.
- Stream `queued` or `running`: watch the existing stream. Do not enqueue a duplicate run.
- Terminal stream, user submission: return the terminal stream as-is. Do not reopen finished work.
- Terminal stream, valid continuation: reopen the existing stream id for that turn and queue continuation work.
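The four branches condense into a small decision function. This is an illustrative sketch of just those branches; the real dispatcher works against live stream and queue state, not plain strings:

```python
TERMINAL = {"completed", "failed", "cancelled"}

def dispatch(stream_status, is_continuation: bool) -> str:
    """Return the dispatcher action for an incoming message.

    stream_status is None when no stream exists yet for the turn.
    Only the four documented dispatch branches are modeled here.
    """
    if stream_status is None:
        return "register_and_enqueue"   # new turn: create stream, queue run, watch
    if stream_status in ("queued", "running"):
        return "watch_existing"         # duplicate submit: no second run
    if stream_status in TERMINAL and not is_continuation:
        return "watch_terminal"         # finished work is returned as-is
    if stream_status in TERMINAL and is_continuation:
        return "reopen_and_enqueue"     # same turn resumes after a pause
    raise ValueError(f"unhandled state: {stream_status}")

assert dispatch(None, False) == "register_and_enqueue"
assert dispatch("running", False) == "watch_existing"
assert dispatch("completed", False) == "watch_terminal"
assert dispatch("completed", True) == "reopen_and_enqueue"
```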
**Key takeaway.** The request/worker split is only half the design. The other half is the rule that one user turn has one stable stream identity.
The worker then runs independently of the request lifecycle. It asks the agent to generate the response and persists chunks with an immediate strategy instead of buffering everything until the end. We store that stream state in a SQLite-backed stream store, separate from PostgreSQL chat history. (The separation is intentional: chat history is the durable conversation record, while stream storage is durable delivery state.)
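A minimal sketch of the immediate-persistence idea, using Python's sqlite3 directly (the table layout and class name are invented for illustration): each chunk is committed as it arrives, so partial output survives a dead connection or a dead worker.

```python
import sqlite3

class StreamStore:
    """Append-only chunk storage with a commit per chunk."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS chunks ("
            "stream_id TEXT, seq INTEGER, body TEXT, "
            "PRIMARY KEY (stream_id, seq))"
        )

    def append(self, stream_id: str, seq: int, body: str) -> None:
        with self.db:  # one transaction per chunk: the 'immediate' strategy
            self.db.execute(
                "INSERT INTO chunks (stream_id, seq, body) VALUES (?, ?, ?)",
                (stream_id, seq, body),
            )

    def replay(self, stream_id: str) -> list:
        rows = self.db.execute(
            "SELECT body FROM chunks WHERE stream_id = ? ORDER BY seq",
            (stream_id,),
        )
        return [body for (body,) in rows]

store = StreamStore()
for i, token in enumerate(["Dur", "able", " stream", "ing"]):
    store.append("stream-a", i, token)
assert "".join(store.replay("stream-a")) == "Durable streaming"
```

Per-chunk commits cost more than buffering, but they are exactly what makes "already-produced output survives" a property of the store rather than a hope about timing.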
Replay and Live-Tail on Reconnect
Reconnect behavior is chat-scoped on the client. When the page mounts again after a refresh or disconnect, the client asks the backend whether that chat still has an active stream and, if it does, resumes from persisted state.
Reconnect does not mean "continue the old HTTP request." The client asks the backend which stream is still active for this chat, replays what was already persisted, and live-tails whatever arrives next.
```
POST   /runs/{agentId}/chats/{chatId}/messages
GET    /runs/{agentId}/chats/{chatId}/streams/active/watch
DELETE /runs/{agentId}/chats/{chatId}/streams/{streamId}
```

On the backend, the active-watch endpoint is deliberately narrow:
- If the chat has no active non-terminal stream, it returns `204`.
- If the chat does have an active stream, it watches that stream and replays from durable state.
Nothing here reopens finished work. Reconnect resumes only what is still in flight.
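Replay-then-live-tail can be sketched as a generator over the durable store. The store and the worker simulation here are in-memory stand-ins with assumed interfaces, not the real implementation:

```python
class MemoryStore:
    """Minimal in-memory stand-in for the durable stream store."""
    def __init__(self):
        self.chunks = {}
        self.status = {}
    def append(self, sid, chunk):
        self.chunks.setdefault(sid, []).append(chunk)
    def chunks_after(self, sid, seq):
        return self.chunks.get(sid, [])[seq:]

def watch(store, stream_id, wait=lambda: None):
    """Replay already-persisted chunks, then live-tail until terminal."""
    seq = 0
    while True:
        new = store.chunks_after(stream_id, seq)
        for chunk in new:
            yield chunk
        seq += len(new)
        if store.status.get(stream_id) in {"completed", "failed", "cancelled"}:
            return
        wait()  # production code blocks on a notification, not a busy loop

store = MemoryStore()
store.append("s1", "first ")
store.append("s1", "half ")          # persisted before the reconnect
store.status["s1"] = "running"

def worker_tick():
    # Simulates the background worker finishing while the watcher tails.
    store.append("s1", "second half")
    store.status["s1"] = "completed"

assert "".join(watch(store, "s1", wait=worker_tick)) == "first half second half"
```

The second connection gets the persisted prefix immediately, then only the chunks that arrive after it attached. Nothing in the generator depends on the first connection ever having existed.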
**Connection one.** The first browser session receives and persists chunks before the network disappears.

**Connection two.** The browser asks for the active stream for the chat, replays persisted chunks, then keeps watching.
This keeps reconnect centralized. The chat UI only needs to know the chat itself, not some fragile transport session from the past. Content resumes through the stream itself. (Stream state is tracked separately so the UI can distinguish active generation from a conversation that is now waiting on the user instead of the model.)
Idempotency, Continuations, and Cancellation
Once the stream is durable, every operation that touches it needs clear rules: what happens on a duplicate submit, when can a stream reopen, and what does cancellation mean when two layers of the system are involved.
- `queued`: stream exists; work may not have started yet.
- `running`: worker is producing chunks and persisting them immediately.
- `completed`: terminal success; the stream is finished.
- `waiting for input`: the turn pauses without becoming a new conversation branch.
- `reopened continuation`: a valid assistant continuation reuses the same turn stream identity.
- `cancelled`: terminal sink; later errors must not rewrite this state to `failed`.
- `failed`: terminal failure, used for queue collisions, orphan recovery, and unrecoverable execution failures.
Idempotency comes first. If a user submits the same turn again while the stream already exists, the backend watches the existing stream instead of silently rerunning the turn. No duplicate work, no confusing double execution.
Continuation comes next. Assistant continuation is allowed only when two conditions hold:
- there is a prior user turn to continue from
- the conversation is actually waiting for user input
Without those checks, any client bug could reopen terminal streams at will. With them, continuation stays tied to the same turn that triggered the pause in the first place.
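Those two conditions make a small guard; the point is simply that both must hold before a terminal stream may reopen (names hypothetical, the real check runs against conversation state):

```python
def can_continue(has_prior_user_turn: bool, waiting_for_input: bool) -> bool:
    """Assistant continuation is valid only when both conditions hold:
    there is a user turn to anchor the stream id to, and the chat is
    genuinely paused waiting for input."""
    return has_prior_user_turn and waiting_for_input

assert can_continue(True, True)
assert not can_continue(True, False)   # chat is not paused: nothing to continue
assert not can_continue(False, True)   # no user turn to anchor the stream id to
```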
Cancellation has to cut through both layers of the system. The cancel endpoint cancels the durable stream and also cancels queued or active background jobs for that chat. The worker is cancellation-aware in both directions: it exits early if the stream is already cancelled before work starts, and it aborts generation if cancellation is detected mid-run.
One more rule matters here: later errors do not overwrite a cancelled stream to `failed`. Once a stream is cancelled, it stays cancelled. (Without this rule, the final state depends on race timing between the cancel signal and the error handler.)
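Treating cancelled as a terminal sink is easy to express as a transition guard (a sketch of the rule, not the production state machine):

```python
TERMINAL = {"completed", "failed", "cancelled"}

def transition(current: str, requested: str) -> str:
    """Apply a status transition with 'cancelled' as a terminal sink.

    A late error from an aborted worker must not rewrite a cancelled
    stream to 'failed'; otherwise the final state depends on whether
    the cancel signal or the error handler wins the race."""
    if current == "cancelled":
        return "cancelled"   # sink: ignore anything that arrives later
    if current in TERMINAL:
        return current       # other terminal states are also final
    return requested

# The race: cancel lands first, then the aborted worker reports an error.
state = transition("running", "cancelled")
state = transition(state, "failed")
assert state == "cancelled"
```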
From the client's point of view, the contract stays simple: it submits the turn once, watches the stream for that turn, and reconnects by asking for the chat's active stream.
Startup Recovery for Orphaned Streams
Decoupling stream lifecycle from job lifecycle gives us better reconnect behavior, but it creates a new failure mode: a stream can exist in queued state before a background job exists at all.
This can happen in the narrow gap between recording that a stream exists and successfully handing the work to the queue. A different version happens when a worker dies mid-execution, leaving a stream in running with no healthy job behind it.
Startup recovery handles those cases by reconciling non-terminal streams against in-flight background jobs:
- stale `active` jobs are force-failed
- `created` and `retry` jobs are treated as recoverable
- orphaned `queued` and `running` streams with no recoverable job are marked `failed`
This is intentionally conservative.
| Stream state | Job reality | Recovery action | Why |
|---|---|---|---|
| queued or running | recoverable job exists (`created` or `retry`) | keep stream alive | The queue still has enough truth to finish the work honestly. |
| queued | no job exists | mark failed | Registration happened, but dispatch never left behind recoverable work. |
| running | worker died or active job is stale | fail stale work, then mark stream failed if nothing recoverable remains | The system should surface the loss instead of pretending execution still exists. |
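The table above can be sketched as a pure reconciliation function over stream and job states. This is an illustrative model, not the actual recovery code; it simplifies row three by treating any surviving `active` job as stale:

```python
RECOVERABLE_JOB_STATES = {"created", "retry"}

def recover(streams: dict, jobs: dict) -> dict:
    """Reconcile non-terminal streams against the job queue at startup.

    streams/jobs map stream id -> state; a missing jobs entry means no
    job survived for that stream. Conservative by design: never
    auto-resume, keep only streams whose queue entry can honestly
    finish the work, and fail the rest visibly."""
    result = dict(streams)
    for sid, state in streams.items():
        if state not in ("queued", "running"):
            continue  # terminal streams need no recovery
        job = jobs.get(sid)
        if job == "active":
            job = None  # stale active job from a dead worker: force-fail it
        if job not in RECOVERABLE_JOB_STATES:
            result[sid] = "failed"  # surface the loss, don't fake progress
    return result

recovered = recover(
    streams={"s1": "queued", "s2": "running", "s3": "running", "s4": "completed"},
    jobs={"s2": "retry", "s3": "active"},
)
assert recovered == {"s1": "failed", "s2": "running", "s3": "failed", "s4": "completed"}
```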
Failure Semantics
If the browser disconnects or refreshes, the client reconnects and resumes from the chat's active stream. If the user submits the same turn twice, the backend watches the existing stream instead of rerunning work. If an assistant continuation arrives, the backend reopens the stream only when the chat is truly waiting for input.
Crashes and collisions get explicit outcomes. A queue singleton collision marks the stream failed and returns a 409. A crash before queue dispatch leaves the stream queued until startup recovery fails it. A crash during execution produces an orphaned running stream that recovery catches on the next startup. And a cancelled stream stays cancelled — later errors do not overwrite it.
The system still fails. But failure has a defined shape, and persisted output is not lost because one request disappeared.
Two Stores, Limited Recovery, Honest Gaps
This design buys clarity, but not for free.
First, we now operate two different persistence stories: PostgreSQL for chat history and SQLite for stream delivery state. We are comfortable with the split because the responsibilities are different, but it is still another moving piece. More importantly, "durable" only holds operationally if the stream store lives on persistent storage. Put it on ephemeral disk and the durability claim collapses with it.
Second, startup recovery is explicit but limited. We can preserve recoverable jobs and fail orphaned streams, but we do not auto-resume work when the job payload is gone. A real limitation, and one we accepted deliberately. Fake recovery is worse than explicit failure.
Third, the architecture is ahead of the test coverage. The product docs and internal semantics notes are consistent with the implementation, but we do not yet have the lifecycle test coverage we want around dispatch, stream persistence, worker execution, and recovery.
Still, the trade feels right. We chose a straightforward, inspectable design that works well in self-hosted environments and makes failure visible instead of magical.