Most AI chat products stream well only as long as the original request stays alive. Refresh the page, switch networks, or lose the tab at the wrong moment, and the response disappears with the connection. Sometimes the model keeps running in the backend. Sometimes it does not. Either way, the user has no clear mental model of what survived.
We wanted a stricter contract for Limerence. A stream should belong to the chat turn itself, not to one fragile HTTP request. (A stream is the flow of response tokens the AI sends back as it generates an answer; a turn is one user message and the agent response it triggers; durable means the response is saved to disk, not held in memory that vanishes with the connection.) If the browser disconnects, the output that already arrived should still exist. If the user comes back, the client should resume from durable state. If the backend dies mid-run, recovery should make that failure explicit instead of leaving a hanging spinner forever.
One Turn, One Stream, Durable State
Limerence treats streaming as a background process with durable state. A request registers a stream, enqueues a chat run into a job queue, and watches that stream while a background worker (a separate process that runs the AI agent independently of the browser) persists chunks (the individual pieces of the response as they arrive) into a SQLite-backed stream store. On reconnect, the client asks the server for the active stream for that chat and resumes from there.
The design choice that matters is giving each user turn a stable stream identifier and letting the backend decide whether that stream should be watched, reused, reopened, cancelled, or failed. That single identity is what makes reconnects, idempotency, and recovery fit together. (Idempotency is the guarantee that submitting the same message twice does not run the work twice.)
Read the flow top to bottom. The same identity is what lets the system suppress duplicate work, preserve partial output, and recover honestly after interruption.
1. **Request binds the stream to the user turn.** The stream is created or reused before generation starts, so the identity belongs to the turn instead of the transport. A refresh after this point still has a real stream to reconnect to.
2. **Dispatcher enforces one active run per chat.** A duplicate submit watches the existing stream instead of launching another worker path: same turn, same stream, no accidental double execution.
3. **Worker continues independently of the request.** Model execution no longer depends on the browser keeping the original request alive. The transport can die while useful work keeps going.
4. **Chunks are written to durable delivery storage immediately.** Already-produced output survives reconnects because delivery state is saved as it arrives. Partial output is no longer trapped inside one socket.
5. **Reconnect replays saved chunks, then live-tails new ones.** The client asks for the active stream for the chat and resumes from durable state instead of reviving a dead session. Replay, idempotency, and recovery align on one identity.
Request Lifetime Is Not Generation Lifetime
Streaming gets messy when request lifetime and generation lifetime are treated as the same thing. They are not. The browser can go away while the model is still generating. The network can drop after half the answer has already been produced. A worker can crash after work started but before the user sees the final chunk.
For a data product, those failures are worse than a small UX glitch. If a user asks a long question against a large schema, we may spend real time assembling context, running the agent, and composing an answer. Throwing that state away because one request died is wasteful. Rerunning the same turn blindly is also dangerous because duplicate execution creates its own ambiguity: did the system retry, or did the user accidentally submit twice? We have been on the receiving end of that confusion.
We also had another constraint. Some conversations pause because the system needs a human decision or a missing piece of input. When that happens, the next assistant turn is not a brand-new conversation branch. It is a continuation of the same turn. (The backend enforces this explicitly: continuation is only valid when the conversation is actually waiting for user input, not whenever the client feels like sending another assistant message.) So the system needs a durable notion of "the stream for this turn" that survives across pauses.
**Request-bound streaming.** The original request implicitly owns the response lifecycle.

- refresh or disconnect can sever the only visible copy of the output
- retries can accidentally rerun the same turn
- failure semantics depend on transport timing instead of explicit state

**Durable streaming.** The user turn owns a stable stream identity and the backend owns the lifecycle.

- persisted chunks survive browser disconnects
- reconnect resumes from durable state instead of guessing
- cancellation, replay, and recovery follow explicit state transitions
The Solution: Register, Queue, Watch
The core flow is three steps:
1. Register a durable stream for the turn.
2. Queue the actual chat run in a background job system that allows only one active run per chat.
3. Watch the durable stream while a background worker persists chunks into it.
That split matters. The request does not own model execution. It owns registration and observation.
When a user message arrives, the backend derives the stream's identifier from the user turn itself. So the stream belongs to the turn, not to the request that happened to carry it. For continuations, the system reuses the identifier from the original user turn, because the assistant is still logically finishing the same piece of work.
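As a sketch of that idea (the function name and id scheme here are invented for illustration, not Limerence's actual derivation), a stable stream id can be a pure function of the turn, so every request carrying the same turn maps to the same stream:

```python
import hashlib

def stream_id_for_turn(chat_id: str, turn_id: str) -> str:
    """Derive a stable stream id from the user turn, not the request.

    A first submit, a duplicate submit, and a later assistant
    continuation of the same turn all resolve to the same identity.
    """
    digest = hashlib.sha256(f"{chat_id}:{turn_id}".encode()).hexdigest()
    return f"stream-{digest[:16]}"

# A retry of the same turn yields the same id; a new turn yields a new one.
assert stream_id_for_turn("chat-1", "turn-7") == stream_id_for_turn("chat-1", "turn-7")
assert stream_id_for_turn("chat-1", "turn-7") != stream_id_for_turn("chat-1", "turn-8")
```

Because the id is deterministic, idempotency checks reduce to a lookup: if a stream with this id already exists, the turn was already dispatched.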
At the queue layer, we enforce an "at most one active or queued run per chat" invariant. The dispatch mechanism enforces this directly, not as a polite convention between client and server.
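One way to make that invariant structural rather than conventional is to let the store enforce it, for example with a partial unique index in SQLite (a sketch under assumed table names, not Limerence's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, chat_id TEXT, status TEXT)")
# At most one 'queued' or 'running' row per chat, enforced by the database
# itself rather than by client-side politeness.
conn.execute(
    "CREATE UNIQUE INDEX one_active_run_per_chat ON runs (chat_id) "
    "WHERE status IN ('queued', 'running')"
)

conn.execute("INSERT INTO runs (chat_id, status) VALUES ('chat-1', 'queued')")
try:
    conn.execute("INSERT INTO runs (chat_id, status) VALUES ('chat-1', 'running')")
    collided = False
except sqlite3.IntegrityError:
    collided = True  # a second active run for the same chat is rejected
assert collided

# Terminal rows do not block new work for the chat.
conn.execute("UPDATE runs SET status = 'completed' WHERE chat_id = 'chat-1'")
conn.execute("INSERT INTO runs (chat_id, status) VALUES ('chat-1', 'queued')")
```

With this shape, a duplicate dispatch surfaces as a constraint violation the dispatcher can translate into "watch the existing stream" instead of silently running twice.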
From there, the dispatcher does one of four things:
- If the stream is new, it registers it, enqueues a background run, and returns a streaming response that watches the durable stream.
- If the stream is already `queued` or `running`, it watches the existing stream and does not enqueue a duplicate run.
- If the stream is terminal and the incoming message is another user submission, it watches the terminal stream and does not reopen it. (A stream is terminal when it has reached a final state: `completed`, `failed`, or `cancelled`.)
- If the stream is terminal and the incoming message is a valid assistant continuation, it reopens the stream and queues a new background run.
Input: an incoming message arrives for a chat. The dispatcher checks stream status and message type before it does anything else, then takes one of four actions:

- New stream: register the stream, enqueue work, return a watcher for that stream.
- Stream `queued` or `running`: watch the existing stream. Do not enqueue a duplicate run.
- Terminal stream, user submission: return the terminal stream as-is. Do not reopen finished work.
- Terminal stream, valid continuation: reopen the existing stream id for that turn and queue continuation work.
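The four branches condense into a small decision function. This is an illustrative sketch of just those branches; the real dispatcher works against live stream and queue state, not plain strings:

```python
TERMINAL = {"completed", "failed", "cancelled"}

def dispatch(stream_status, is_continuation: bool) -> str:
    """Return the dispatcher action for an incoming message.

    stream_status is None when no stream exists yet for the turn.
    Only the four documented dispatch branches are modeled here.
    """
    if stream_status is None:
        return "register_and_enqueue"   # new turn: create stream, queue run, watch
    if stream_status in ("queued", "running"):
        return "watch_existing"         # duplicate submit: no second run
    if stream_status in TERMINAL and not is_continuation:
        return "watch_terminal"         # finished work is returned as-is
    if stream_status in TERMINAL and is_continuation:
        return "reopen_and_enqueue"     # same turn resumes after a pause
    raise ValueError(f"unhandled state: {stream_status}")

assert dispatch(None, False) == "register_and_enqueue"
assert dispatch("running", False) == "watch_existing"
assert dispatch("completed", False) == "watch_terminal"
assert dispatch("completed", True) == "reopen_and_enqueue"
```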
**Key takeaway.** The request/worker split is only half the design. The other half is the rule that one user turn has one stable stream identity.
The worker then runs independently of the request lifecycle. It asks the agent to generate the response and persists chunks with an immediate strategy instead of buffering everything until the end. We store that stream state in a SQLite-backed stream store, separate from PostgreSQL chat history. (The separation is intentional: chat history is the durable conversation record, while stream storage is durable delivery state.)
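A minimal sketch of the immediate-persistence idea, using Python's sqlite3 directly (the table layout and class name are invented for illustration): each chunk is committed as it arrives, so partial output survives a dead connection or a dead worker.

```python
import sqlite3

class StreamStore:
    """Append-only chunk storage with a commit per chunk."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS chunks ("
            "stream_id TEXT, seq INTEGER, body TEXT, "
            "PRIMARY KEY (stream_id, seq))"
        )

    def append(self, stream_id: str, seq: int, body: str) -> None:
        with self.db:  # one transaction per chunk: the 'immediate' strategy
            self.db.execute(
                "INSERT INTO chunks (stream_id, seq, body) VALUES (?, ?, ?)",
                (stream_id, seq, body),
            )

    def replay(self, stream_id: str) -> list:
        rows = self.db.execute(
            "SELECT body FROM chunks WHERE stream_id = ? ORDER BY seq",
            (stream_id,),
        )
        return [body for (body,) in rows]

store = StreamStore()
for i, token in enumerate(["Dur", "able", " stream", "ing"]):
    store.append("stream-a", i, token)
assert "".join(store.replay("stream-a")) == "Durable streaming"
```

Per-chunk commits cost more than buffering, but they are exactly what makes "already-produced output survives" a property of the store rather than a hope about timing.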
Replay and Live-Tail on Reconnect
Reconnect behavior is chat-scoped on the client. When the page mounts again after a refresh or disconnect, the client asks the backend whether that chat still has an active stream and, if it does, resumes from persisted state.
Reconnect does not mean "continue the old HTTP request." The client asks the backend which stream is still active for this chat, replays what was already persisted, and live-tails whatever arrives next.
```
POST   /runs/{agentId}/chats/{chatId}/messages
GET    /runs/{agentId}/chats/{chatId}/streams/active/watch
DELETE /runs/{agentId}/chats/{chatId}/streams/{streamId}
```

On the backend, the active-watch endpoint is deliberately narrow:
- If the chat has no active non-terminal stream, it returns `204`.
- If the chat does have an active stream, it watches that stream and replays from durable state.
Nothing here reopens finished work. Reconnect resumes only what is still in flight.
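Replay-then-live-tail can be sketched as a generator over the durable store. The store and the worker simulation here are in-memory stand-ins with assumed interfaces, not the real implementation:

```python
class MemoryStore:
    """Minimal in-memory stand-in for the durable stream store."""
    def __init__(self):
        self.chunks = {}
        self.status = {}
    def append(self, sid, chunk):
        self.chunks.setdefault(sid, []).append(chunk)
    def chunks_after(self, sid, seq):
        return self.chunks.get(sid, [])[seq:]

def watch(store, stream_id, wait=lambda: None):
    """Replay already-persisted chunks, then live-tail until terminal."""
    seq = 0
    while True:
        new = store.chunks_after(stream_id, seq)
        for chunk in new:
            yield chunk
        seq += len(new)
        if store.status.get(stream_id) in {"completed", "failed", "cancelled"}:
            return
        wait()  # production code blocks on a notification, not a busy loop

store = MemoryStore()
store.append("s1", "first ")
store.append("s1", "half ")          # persisted before the reconnect
store.status["s1"] = "running"

def worker_tick():
    # Simulates the background worker finishing while the watcher tails.
    store.append("s1", "second half")
    store.status["s1"] = "completed"

assert "".join(watch(store, "s1", wait=worker_tick)) == "first half second half"
```

The second connection gets the persisted prefix immediately, then only the chunks that arrive after it attached. Nothing in the generator depends on the first connection ever having existed.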
**Connection one.** The first browser session receives and persists chunks before the network disappears.

**Connection two.** The browser asks for the active stream for the chat, replays persisted chunks, then keeps watching.
This keeps reconnect centralized. The chat UI only needs to know the chat itself, not some fragile transport session from the past. Content resumes through the stream itself. (Stream state is tracked separately so the UI can distinguish active generation from a conversation that is now waiting on the user instead of the model.)
Idempotency, Continuations, and Cancellation
Once the stream is durable, every operation that touches it needs clear rules: what happens on a duplicate submit, when can a stream reopen, and what does cancellation mean when two layers of the system are involved.
- `queued`: stream exists; work may not have started yet.
- `running`: worker is producing chunks and persisting them immediately.
- `completed`: terminal success; the stream is finished.
- `waiting for input`: the turn pauses without becoming a new conversation branch.
- `reopened continuation`: a valid assistant continuation reuses the same turn stream identity.
- `cancelled`: terminal sink; later errors must not rewrite this state to `failed`.
- `failed`: terminal failure, used for queue collisions, orphan recovery, and unrecoverable execution failures.
Idempotency comes first. If a user submits the same turn again while the stream already exists, the backend watches the existing stream instead of silently rerunning the turn. No duplicate work, no confusing double execution.
Continuation comes next. Assistant continuation is allowed only when two conditions hold:
- there is a prior user turn to continue from
- the conversation is actually waiting for user input
Without those checks, any client bug could reopen terminal streams at will. With them, continuation stays tied to the same turn that triggered the pause in the first place.
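Those two conditions make a small guard; the point is simply that both must hold before a terminal stream may reopen (names hypothetical, the real check runs against conversation state):

```python
def can_continue(has_prior_user_turn: bool, waiting_for_input: bool) -> bool:
    """Assistant continuation is valid only when both conditions hold:
    there is a user turn to anchor the stream id to, and the chat is
    genuinely paused waiting for input."""
    return has_prior_user_turn and waiting_for_input

assert can_continue(True, True)
assert not can_continue(True, False)   # chat is not paused: nothing to continue
assert not can_continue(False, True)   # no user turn to anchor the stream id to
```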
Cancellation has to cut through both layers of the system. The cancel endpoint cancels the durable stream and also cancels queued or active background jobs for that chat. The worker is cancellation-aware in both directions: it exits early if the stream is already cancelled before work starts, and it aborts generation if cancellation is detected mid-run.
One more rule matters here: later errors do not overwrite a cancelled stream to `failed`. Once a stream is cancelled, it stays cancelled. (Without this rule, the final state depends on race timing between the cancel signal and the error handler.)
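Treating cancelled as a terminal sink is easy to express as a transition guard (a sketch of the rule, not the production state machine):

```python
TERMINAL = {"completed", "failed", "cancelled"}

def transition(current: str, requested: str) -> str:
    """Apply a status transition with 'cancelled' as a terminal sink.

    A late error from an aborted worker must not rewrite a cancelled
    stream to 'failed'; otherwise the final state depends on whether
    the cancel signal or the error handler wins the race."""
    if current == "cancelled":
        return "cancelled"   # sink: ignore anything that arrives later
    if current in TERMINAL:
        return current       # other terminal states are also final
    return requested

# The race: cancel lands first, then the aborted worker reports an error.
state = transition("running", "cancelled")
state = transition(state, "failed")
assert state == "cancelled"
```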
From the client's point of view, the contract stays simple: it submits the turn once, watches the stream for that turn, and reconnects by asking for the chat's active stream.
Startup Recovery for Orphaned Streams
Decoupling stream lifecycle from job lifecycle gives us better reconnect behavior, but it creates a new failure mode: a stream can exist in queued state before a background job exists at all.
This can happen in the narrow gap between recording that a stream exists and successfully handing the work to the queue. A different version happens when a worker dies mid-execution, leaving a stream in running with no healthy job behind it.
Startup recovery handles those cases by reconciling non-terminal streams against in-flight background jobs:
- stale `active` jobs are force-failed
- `created` and `retry` jobs are treated as recoverable
- orphaned `queued` and `running` streams with no recoverable job are marked `failed`
This is intentionally conservative.
| Stream state | Job reality | Recovery action | Why |
|---|---|---|---|
| queued or running | recoverable job exists (`created` or `retry`) | keep stream alive | The queue still has enough truth to finish the work honestly. |
| queued | no job exists | mark failed | Registration happened, but dispatch never left behind recoverable work. |
| running | worker died or active job is stale | fail stale work, then mark stream failed if nothing recoverable remains | The system should surface the loss instead of pretending execution still exists. |
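The table above can be sketched as a pure reconciliation function over stream and job states. This is an illustrative model, not the actual recovery code; it simplifies row three by treating any surviving `active` job as stale:

```python
RECOVERABLE_JOB_STATES = {"created", "retry"}

def recover(streams: dict, jobs: dict) -> dict:
    """Reconcile non-terminal streams against the job queue at startup.

    streams/jobs map stream id -> state; a missing jobs entry means no
    job survived for that stream. Conservative by design: never
    auto-resume, keep only streams whose queue entry can honestly
    finish the work, and fail the rest visibly."""
    result = dict(streams)
    for sid, state in streams.items():
        if state not in ("queued", "running"):
            continue  # terminal streams need no recovery
        job = jobs.get(sid)
        if job == "active":
            job = None  # stale active job from a dead worker: force-fail it
        if job not in RECOVERABLE_JOB_STATES:
            result[sid] = "failed"  # surface the loss, don't fake progress
    return result

recovered = recover(
    streams={"s1": "queued", "s2": "running", "s3": "running", "s4": "completed"},
    jobs={"s2": "retry", "s3": "active"},
)
assert recovered == {"s1": "failed", "s2": "running", "s3": "failed", "s4": "completed"}
```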
Failure Semantics
If the browser disconnects or refreshes, the client reconnects and resumes from the chat's active stream. If the user submits the same turn twice, the backend watches the existing stream instead of rerunning work. If an assistant continuation arrives, the backend reopens the stream only when the chat is truly waiting for input.
Crashes and collisions get explicit outcomes. A queue singleton collision marks the stream failed and returns a 409. A crash before queue dispatch leaves the stream queued until startup recovery fails it. A crash during execution produces an orphaned running stream that recovery catches on the next startup. And a cancelled stream stays cancelled — later errors do not overwrite it.
The system still fails. But failure has a defined shape, and persisted output is not lost because one request disappeared.
Two Stores, Limited Recovery, Honest Gaps
This design buys clarity, but not for free.
First, we now operate two different persistence stories: PostgreSQL for chat history and SQLite for stream delivery state. We are comfortable with the split because the responsibilities are different, but it is still another moving piece. More importantly, "durable" only holds operationally if the stream store lives on persistent storage. Put it on ephemeral disk and the durability claim collapses with it.
Second, startup recovery is explicit but limited. We can preserve recoverable jobs and fail orphaned streams, but we do not auto-resume work when the job payload is gone. A real limitation, and one we accepted deliberately. Fake recovery is worse than explicit failure.
Third, the architecture is ahead of the test coverage. The product docs and internal semantics notes are consistent with the implementation, but we do not yet have the lifecycle test coverage we want around dispatch, stream persistence, worker execution, and recovery.
Still, the trade feels right. We chose a straightforward, inspectable design that works well in self-hosted environments and makes failure visible instead of magical.