A user opens the automations page, picks an agent, types a prompt, types 0 9 * * MON-FRI, and clicks save. From that moment on, every weekday at 9am UTC, the agent runs that prompt and leaves a finished chat session behind. No process the user controls is involved. The system has to wake itself up.


What looks like one feature is actually a small lifecycle stretched across two pg-boss queues. One queue holds the schedule. A different queue runs the chat. The cron tick lives in the first; the LLM call lives in the second; nothing important crosses the boundary except a job ID and a chat session row. That split is the entire reason a tight cron does not stall the next tick on a long answer.

A Saved Cron Is Two Rows in Two Systems

When a user creates an automation, two systems get a row. The Postgres Automation table gets the canonical record — agent, prompt, schedule, enabled flag. pg-boss's own schedule table — which lives in the same database — gets the cron registration that will fire the job. Both rows have to exist for the automation to do anything.


The route layer is the only place that keeps them in sync. A POST inserts the automation row, then calls scheduleAutomation if enabled === true. A PATCH calls unscheduleAutomation first, updates the row, then re-calls scheduleAutomation if the new state is enabled. A DELETE calls unscheduleAutomation before removing the row. Three handlers, two systems, one mirror.
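Of the three, PATCH is the one worth sketching. A minimal version, assuming the handler shape (the three calls are from the prose above; the rest is illustrative, not the codebase's actual route code):

typescript
export async function patchAutomation(
  id: string,
  updates: { prompt?: string; schedule?: string; enabled?: boolean },
) {
  // 1. Drop any existing pg-boss schedule row first.
  await unscheduleAutomation(id);

  // 2. Update the canonical Postgres row.
  const automation = await prisma.automation.update({
    where: { id },
    data: updates,
  });

  // 3. Re-register only when the new state is enabled.
  if (automation.enabled && automation.schedule) {
    await scheduleAutomation(automation.id, automation.schedule);
  }

  return automation;
}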


The mirror itself is one function:

typescript
export async function scheduleAutomation(
  automationId: string,
  cron: string,
): Promise<void> {
  await boss.schedule(
    AUTOMATION_RUN_QUEUE,
    cron,
    { automationId } satisfies AutomationRunJobData,
  {
    key: automationId, // schedule identity: same key replaces, never duplicates
    singletonKey: automationId, // fire-time lock: overlapping ticks are dropped
    tz: 'UTC',
  },
  );
}

boss.schedule is pg-boss's native cron scheduler. It writes one row to the schedule table; an internal poller reads that table on its own cadence and inserts a job into the automation-run queue every time the cron is due. The application never has to wake up to a timer of its own. The database is the clock.


The cron string itself is validated at the API boundary, not by pg-boss. A double-refined Zod schema first checks the field count, then runs the expression through cron-parser:

typescript
import { z } from 'zod';
import { CronExpressionParser } from 'cron-parser';

export const automationScheduleSchema = z
  .string()
  .min(9)
  .max(100)
  .refine(
    (v) => v.trim().split(/\s+/).length === 5,
    'Must be a valid 5-field cron expression',
  )
  .refine((v) => {
    try {
      CronExpressionParser.parse(v);
      return true;
    } catch {
      return false;
    }
  }, 'Invalid cron expression');

A user who types 0 9 * * MONFRI (missing the dash) gets "Invalid cron expression" back on the request itself, not a silently never-firing automation discovered an hour later when nobody is watching. The two refinements catch shape and semantics independently — the first rejects four- or six-field strings before the parser even tries; the second rejects anything the parser cannot understand.
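Usage is the ordinary Zod flow; a quick illustration of the two failure modes against the schema above:

typescript
automationScheduleSchema.safeParse('0 9 * * MON-FRI'); // success: true
automationScheduleSchema.safeParse('0 9 * * * MON');   // first refine fails: six fields
automationScheduleSchema.safeParse('0 9 * * MONFRI');  // second refine fails: parser rejects MONFRI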

The Cron Tick Doesn't Run the Chat — It Enqueues One

When a tick fires, the worker registered against automation-run runs runScheduledAutomation. The intuitive thing to expect is that this function is where the chat happens. It is not. The chat is somewhere else.


runScheduledAutomation re-reads the automation row, generates a fresh chatId and messageId, registers a stream, creates a ChatSession row tied to the automation, and then hands off the actual LLM work by calling requestChatRun — which is boss.send against a second queue, chat-run. By the time the function returns, no model has been called and no token has been streamed. All it did was put a new job on a different conveyor belt.


Concretely, here is one tick of 0 9 * * MON-FRI:

1. Poller insert. pg-boss's internal schedule poller reads its schedule table at ~09:00 UTC and inserts a job into automation-run carrying {automationId} with singletonKey set to that same id.

2. Worker fetch. One of the automation-run worker slots picks up the job within roughly a second of insert.

3. Re-read the row. runScheduledAutomation calls prisma.automation.findUnique to read the current row — not the version captured when the schedule was registered.

4. Create the chat session. A new chatId is generated, a stream is registered, and a ChatSession row is inserted with the automationId foreign key.

5. Enqueue chat-run. requestChatRun calls boss.send against chat-run with the agent id, chat id, prompt, and stream id.

6. Return. The automation-run worker is done. Total time inside the function is on the order of tens to low-hundreds of milliseconds — DB work only. The slot is free for the next tick.
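Put together, a minimal sketch of runScheduledAutomation consistent with those six steps (registerStream and the exact field names are assumptions; prisma, boss, and requestChatRun follow the prose):

typescript
import { randomUUID } from 'node:crypto';

export async function runScheduledAutomation(
  automationId: string,
): Promise<void> {
  // Step 3: re-read the current row, not the one captured at registration.
  const automation = await prisma.automation.findUnique({
    where: { id: automationId },
  });
  // The early return also closes the disable race covered later.
  if (!automation?.enabled || !automation.schedule) return;

  // Step 4: fresh identifiers for this tick, then the session row.
  const chatId = randomUUID();
  const messageId = randomUUID(); // generated here per the prose; usage not shown
  const streamId = await registerStream(chatId); // hypothetical helper name
  await prisma.chatSession.create({
    data: { id: chatId, automationId: automation.id },
  });

  // Step 5: hand the LLM work to the chat-run queue and return immediately.
  await requestChatRun({
    agentId: automation.agentId,
    chatId,
    prompt: automation.prompt,
    streamId,
  });
}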


Steps one through six finish before the model has produced a single token. The chat that the user will eventually open and read is, at this moment, sitting in a different queue waiting for a different worker. That separation is what keeps the scheduler responsive.

Key Takeaway

The cron tick's only job is to enqueue. The model call lives on a second queue with its own worker, its own retry policy, and its own timeout. Every failure window in this system follows from that split.

Two Queues Because One Policy Cannot Cover Both Workloads

Once the work is split across two queues, the queues themselves are free to disagree about retry, timeout, and concurrency. They do.


automation-run queue

typescript
upsertQueue(AUTOMATION_RUN_QUEUE, {
  retryLimit: 2, // safe to retry: the work is enqueue-only
  expireInSeconds: 1800,
  policy: 'exclusive',
});

Fast and safe to retry — its only job is to enqueue another job. A failed attempt cleans up its partially-created ChatSession before rethrowing, so a retry starts fresh with no leftover state.
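The cleanup-before-rethrow shape, sketched as it would sit inside runScheduledAutomation (the exact try/catch placement is an assumption; the invariant is from the prose above):

typescript
try {
  await prisma.chatSession.create({ data: { id: chatId, automationId } });
  await requestChatRun({ agentId, chatId, prompt, streamId });
} catch (error) {
  // Remove the partially-created session so the retry starts with no leftovers.
  await prisma.chatSession.deleteMany({ where: { id: chatId } });
  throw error; // rethrow so pg-boss counts the attempt against retryLimit: 2
}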

chat-run queue

typescript
upsertQueue(CHAT_RUN_QUEUE, {
  retryLimit: 0, // never re-run a half-finished, non-idempotent chat
  expireInSeconds: 1800,
  heartbeatSeconds: 30, // keeps a long stream from being reclaimed as abandoned
  policy: 'exclusive',
});

Long, non-idempotent — runs an LLM stream that may take minutes and produces visible output. retryLimit: 0 because re-running a half-finished chat would generate duplicate tokens against the same stream. The 30-second heartbeat keeps pg-boss from reclaiming an in-flight stream as abandoned.


Both queues use the exclusive policy and both expire jobs at 30 minutes. Everything else diverges. The asymmetry is the point — if both queues had to share a single policy, the chat queue's "never retry" rule would also apply to the scheduler queue, and a transient DB blip during enqueue would silently swallow a tick. Splitting the queues lets each one carry the policy that matches its actual job.

One Schedule Per Automation, Enforced by Two Different Keys

The call to boss.schedule passes both key: automationId and singletonKey: automationId. They look like the same parameter twice. They are not.


key is the schedule's identity in pg-boss. Calling boss.schedule again with the same key replaces the prior schedule rather than registering a duplicate. That is what makes the PATCH path safe — unscheduleAutomation plus a fresh scheduleAutomation does not need to verify that nothing else slipped in between, because a second call with the same key would have overwritten any duplicate anyway.
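That replace-on-same-key behavior means re-registration is safe to call blindly:

typescript
await scheduleAutomation(automation.id, '0 9 * * MON-FRI');
await scheduleAutomation(automation.id, '0 17 * * MON-FRI');
// One schedule row survives, the 17:00 one. Same key: replaced, not duplicated.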


singletonKey is an active-job lock at fire time. When the cron is due and pg-boss tries to insert a job into automation-run, the singleton key blocks the insert if a job with that same key is still in created or active state. The same identifier, used at a different layer, doing a different job.

The chat-run queue also has a singleton, but it keys on chatId — and chatId is generated fresh every cron tick, so the chat-run singleton never collides between two cron-driven runs of the same automation.


The chat-run singleton still earns its keep elsewhere: any second enqueue against a chatId that already has an in-flight job is blocked, whether the second call came from a manual API trigger, a retry path, or any other caller. That guarantee just doesn't apply to the cron tick itself, because each tick brings its own fresh chatId.
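A sketch of requestChatRun under that keying scheme (ChatRunJobData and its field names are assumptions inferred from the prose; boss.send and singletonKey are pg-boss's actual API):

typescript
// Assumed payload shape: the prose lists agent id, chat id, prompt, stream id.
type ChatRunJobData = {
  agentId: string;
  chatId: string;
  prompt: string;
  streamId: string;
};

export async function requestChatRun(data: ChatRunJobData): Promise<void> {
  await boss.send(CHAT_RUN_QUEUE, data, {
    // Keyed on chatId, not automationId. Every cron tick generates a fresh
    // chatId, so ticks never collide here; a duplicate enqueue for the same
    // in-flight chat is blocked no matter which caller sent it.
    singletonKey: data.chatId,
  });
}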


Together, the two keys answer different questions. key answers "which schedule is this?" so re-registration is idempotent. singletonKey answers "is the prior tick still running?" so overlapping ticks are dropped instead of stacked.

Disable Is Checked Twice for a Reason

A user toggling an automation off should stop the next run. There is a race window where that toggle lands between a cron tick firing and the worker picking up the resulting job. Two layers close it.


The route layer is the first. POST only registers a schedule when enabled === true. PATCH always unschedules first, and only re-schedules if the new state is enabled. The pg-boss schedule row literally does not exist for a disabled automation, and no tick ever fires against it.


The worker layer is the second. Even when a tick has already fired and the job has been picked up, runScheduledAutomation re-reads the automation row before doing anything else and returns early if !automation.enabled || !automation.schedule. That re-read is the check that catches the user who disabled the automation a few hundred milliseconds after the cron fired but before the worker got there.


Belt and suspenders, deliberately. Either layer alone would have a race window the other one closes.

Tight Crons Silently Drop Ticks


The runScheduledAutomation function is fast precisely because it offloads to chat-run. A normal tick is in and out in tens of milliseconds. The singleton lock only matters when something blocks the function itself — DB latency, a Prisma reconnect, a slow chatSession.create under load. Then a tight cron starts dropping ticks, because the prior automation-run job is still completing when the next one tries to enter the queue.


The user-facing consequence is missing runs with no obvious cause. The audit trail at /automations/:id/runs reflects only ChatSession rows that survived past enqueue, so a dropped tick leaves no row to count. There is no notification. There is no event written to a dashboard. The honest current behavior is that a cron set to * * * * * paired with an unusually slow runScheduledAutomation will quietly under-deliver, and the only signal is that the expected number of sessions is short.

A Mid-Run Worker Crash Leaves an Orphan, Not a Retry


The retryLimit: 0 decision creates a different problem. If the worker process dies after a stream is marked running but before the chat completes, pg-boss will not re-enqueue the job, and the stream is left orphaned with no worker to finish it. The system has to recover from outside the queue.


It does, at next boot. recoverOrphanedRunningStreams runs during startup in two stages. First, any chat-run job still in active state is failed outright — at boot time those jobs are definitionally stale, since the process that held them just restarted. Then the sweep cross-references stream IDs against created and retry jobs still safely queued. A stream whose ID appears in that queued set is left alone; the new process will pick it up normally. Every other running or queued stream is marked failed. The user sees the failure on next page load — not in real time, not via a push, just the next time they open the chat. The trade is deliberate: accept a delayed failure signal in exchange for never producing a duplicate AI response into a live stream.
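A sketch of that two-stage sweep, assuming pg-boss's default pgboss.job table layout and hypothetical stream-side helpers (findRunningStreams, markStreamFailed); only the shape is asserted by the prose:

typescript
export async function recoverOrphanedRunningStreams(): Promise<void> {
  // Stage 1: any chat-run job still 'active' at boot belonged to the process
  // that just died. Fail it outright.
  const active = await prisma.$queryRaw<{ id: string }[]>`
    SELECT id FROM pgboss.job
    WHERE name = ${CHAT_RUN_QUEUE} AND state = 'active'`;
  for (const job of active) {
    await boss.fail(CHAT_RUN_QUEUE, job.id);
  }

  // Stage 2: stream IDs still referenced by safely queued jobs will be picked
  // up normally by the new process; leave those alone.
  const queued = await prisma.$queryRaw<{ data: { streamId: string } }[]>`
    SELECT data FROM pgboss.job
    WHERE name = ${CHAT_RUN_QUEUE} AND state IN ('created', 'retry')`;
  const stillQueued = new Set(queued.map((job) => job.data.streamId));

  // Every other running stream is an orphan: mark it failed so the user sees
  // the failure on next page load.
  for (const stream of await findRunningStreams()) { // hypothetical helper
    if (!stillQueued.has(stream.id)) {
      await markStreamFailed(stream.id); // hypothetical helper
    }
  }
}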

The Three-Step PATCH Is Not Transactional

Updating an automation runs three operations in sequence: unscheduleAutomation, then prisma.update, then scheduleAutomation if the new state is enabled. None of those calls share a transaction. They cannot — pg-boss writes to its own tables through its own client, and the Prisma update is a separate database round trip.


The window is small but real. If the process dies between step one and step three, the automation row exists with enabled=true and the prior pg-boss schedule row has already been deleted. The API will return success on the subsequent GET because the Postgres row is fine. No tick will ever fire because the pg-boss schedule row is missing.


There is no startup reconciliation that walks prisma.automation.findMany({ where: { enabled: true } }) and re-registers schedules that are missing from pg-boss. The drift would only surface as "this automation stopped firing" without an obvious cause, hours or days later. It is the gap that surprises operators most, and it is genuinely a gap — not a designed-around behavior.
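The missing piece would be small. A hypothetical reconciliation, leaning on the same-key replacement guarantee from earlier (this code does not exist in the codebase today):

typescript
// Hypothetical: run once at boot, after boss.start().
export async function reconcileSchedules(): Promise<void> {
  const enabled = await prisma.automation.findMany({
    where: { enabled: true },
  });
  for (const automation of enabled) {
    if (!automation.schedule) continue;
    // Same-key registration replaces rather than duplicates, so this is
    // idempotent and safe to run on every startup.
    await scheduleAutomation(automation.id, automation.schedule);
  }
}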

UTC Cron, No Per-Automation Timezone

The tz argument passed to boss.schedule is hard-coded to 'UTC'. The Automation Prisma model has no timezone column. There is nowhere for a per-automation timezone to live.


The product surface today asks the user to think in UTC. A São Paulo user who wants "every weekday at 9am local" must pre-convert their cron — and that conversion drifts twice a year on DST transitions, because UTC does not observe DST and São Paulo, when its government decides to observe it, does. The same cron expression that fires at 9am local in March can fire at 8am local in November.


This is a known limitation, not a subtle bug. Adding a per-automation timezone is a timezone String column on the model and one extra argument threaded into scheduleAutomation. The mechanism is in place; the schema and the surface area are not.
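A sketch of that threading (the timezone parameter and the column backing it are hypothetical; the tz option is the one already passed above):

typescript
export async function scheduleAutomation(
  automationId: string,
  cron: string,
  timezone: string = 'UTC', // hypothetical: read from a new Automation.timezone column
): Promise<void> {
  await boss.schedule(
    AUTOMATION_RUN_QUEUE,
    cron,
    { automationId } satisfies AutomationRunJobData,
    { key: automationId, singletonKey: automationId, tz: timezone },
  );
}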

What's Honest About This Design Today

The two-queue split earns its complexity. It keeps cron ticks responsive under long LLM calls, lets each queue carry the retry and heartbeat policy its workload actually needs, and keeps a non-idempotent chat from being silently re-run. The mechanism is small — one schedule call, one re-read, one boss.send — and the invariants it enforces are real.


What the design does not do today is just as worth saying. There is no boot-time reconciliation between Automation rows and pg-boss schedule rows, so a crashed PATCH can leave an automation that looks live but never fires. Singleton drops are not surfaced anywhere the user can see, so an over-tight cron under-delivers in silence. There is no per-automation timezone, so DST quietly shifts schedules for users outside UTC. There is no dry-run or "next 5 fire times" preview, even though the cron parser used at validation time would make it a small change.
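The preview gap in particular is a few lines against the parser the validator already imports. A hypothetical helper:

typescript
import { CronExpressionParser } from 'cron-parser';

// Hypothetical: the next `count` fire times for a cron string, in UTC.
export function nextFireTimes(cron: string, count = 5): Date[] {
  const interval = CronExpressionParser.parse(cron, { tz: 'UTC' });
  const times: Date[] = [];
  for (let i = 0; i < count; i++) {
    times.push(interval.next().toDate());
  }
  return times;
}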


Each of those gaps has a specific code path that would need to change. Naming them is the honest version of "the design is good." The design is good for the cases it was built for; the cases it does not yet cover are knowable, and they are the next things to build.