Custom Tags Beat JSON Tool-Calls for Inline Visuals
There is a moment in every LLM product where the model has to render something that is not text. A bar chart inside an answer. A KPI card next to a sentence. A small dashboard the user can re-parameterize. The default reach is a JSON tool-call: define a schema, stream a structured object, render it on the client. It works, and it is also wrong for visuals interleaved with prose.
The reason is streaming. A long JSON object remains partial for the entire generation, so the renderer has to either hide the chart until the closing brace lands or guess at incomplete fields. The model also cannot easily mix three small charts into a paragraph and end with a closing sentence — the surface forces a single object per turn, or a wrapper schema that is itself another thing to maintain.
Limerence took the other path. The model emits kebab-case HTML tags inline in its streaming markdown response, and the frontend dispatches each tag to a real React component. Every data-bound tag — KPIs, charts, tables — carries a sql= attribute that the browser re-executes against the agent's database. Layout and parameter tags (<grid>, <param-select>, <dashboard-controls>) carry only presentational props. There is no inline JSON, no data prop, no escape hatch for the model to ship arbitrary blobs.
JSON tool-call surface
The model emits a single structured tool-call per turn. The chart cannot coexist with prose; the renderer waits for the closing brace; partial state leaks during streaming.
```json
{
  "type": "bar_chart",
  "title": "Revenue by region",
  "x_key": "month",
  "y_key": "revenue",
  "data": [
    { "month": "Jan", "revenue": 12400 },
    { "month": "Feb", "revenue": 15800 }
  ]
}
```
Inline kebab-case tag
The model writes prose and components in the same response. The tag carries
only presentational props plus sql=; the browser fetches the data
independently.
```markdown
Revenue rose for the second month in a row.

<bar-chart title="Revenue by region"
  sql="SELECT region AS month, SUM(amount) AS revenue
       FROM sales GROUP BY region"
  x-key="month" y-key="revenue" />

Most of the lift came from EMEA.
```

The trade-off is not free. HTML in a streaming markdown channel is a wider attack surface than JSON. CommonMark wraps things it shouldn't. A token can arrive split across a tag boundary. A sanitizer that is too permissive turns the chat into an XSS vector; a sanitizer that is too strict eats the tags the product just emitted. The rest of this post is how that surface gets paid for.
A Token's Path: Stream Chunk to Mounted Chart
Picture a single token leaving the model. By the time it lands as a pixel inside a <BarChart>, it has passed through eight stages, and five of them are gates that can refuse to forward it. The lifecycle is the post's spine — every later section returns to a stage by number.
LLM emits kebab-case tags inline with markdown
The model writes prose and components in the same response. A bar chart appears as <bar-chart sql="..." x-key="month" /> flowing inside a paragraph, never as a separate JSON tool-call.
Backend chunker holds tokens until the tag closes
A streaming HTML parser tracks open-tag depth and only flushes when the matching close brings depth to zero. Half-open tags never reach the browser.
AI SDK forwards completed elements to the client
The chat hook receives complete fragments, not character-by-character tokens. Reconnect, error handling, and message identity are inherited from the SDK.
Pre-processor wraps tags in `<div>` and escapes `*`
CommonMark would otherwise eat the tag inside a paragraph or bold the rest of the page when SQL contains SELECT *. The pre-processor collapses multi-line tags, escapes asterisks inside attributes, and rewrites \" as &quot; inside SQL props.
Streamdown runs remark, rehype-raw, and rehype-sanitize
Markdown becomes an HTML tree, raw HTML is allowed in, and the sanitizer enforces the allowlist. Anything not registered as a tag, or any attribute not declared, is stripped.
Registry maps each tag to its React component
A small array of { name, component, allowedAttributes } records becomes both the rehype component map and the sanitizer schema. One source of truth, two consumers.
Browser re-validates the SQL string before executing
Every data-bound component runs the same read-only check on the SQL attribute that the agent ran in its sandbox. The model is trusted to compose; the browser is trusted to refuse writes.
Throttle caps re-parse to once every 50ms
Even at full streaming speed, the markdown tree is rebuilt at most twenty times per second. The slide-up reveal stays smooth and Recharts mounts without jitter.
Key Takeaway
The protocol's safety lives in five enforcement sites along the token's path, not in the LLM's good behavior. Each gate refuses a specific malformation; a prompt change cannot weaken any of them.
A useful way to read the diagram: the model is the only thing in the pipeline that is allowed to be wrong. Every later stage assumes its input is hostile. That framing is worth holding while reading the rest, because the design choices stop looking defensive and start looking like the only correct posture for an LLM-emitted UI.
The Backend Chunker Refuses to Flush a Half-Open Tag
The first gate sits before the network. The model's stream is piped through a chunker that buffers tokens until a complete top-level element exists, then flushes the entire element at once. Without this, the browser would see characters like <bar-cha as visible text, the markdown parser would treat the half-tag as inline literal, and the eventual closing of the tag would leave a permanently broken paragraph.
The mechanism is depth tracking. A streaming HTML parser walks the buffer, counts onopentag and onclosetag, and only marks the element complete when the matching close brings depth back to zero. There is a manual zero-dependency twin to the htmlparser2 path that does the same walk by hand, tracking quote state and escape sequences. It exists because the chunker has to run in environments that can't pull in the full parser, and because reviewing a 200-line state machine is easier than auditing a transitive dependency tree. Self-closing tags are recognized explicitly, so <param-date-range /> flushes on the slash without waiting for a phantom close.
The flush condition is unambiguous and easy to reason about:
```typescript
import { Parser } from "htmlparser2";

// (wrapper function and state declarations added for context)
function completeElementPrefix(buffer: string): string | null {
  let depth = 0;
  let rootTagName: string | null = null;
  let elementEnd = -1;

  const parser = new Parser(
    {
      onopentag(name) {
        if (rootTagName === null) rootTagName = name;
        depth++;
      },
      onclosetag(name) {
        depth--;
        // Flush only when the close matching the first open
        // brings nesting back to zero.
        if (depth === 0 && name === rootTagName) {
          elementEnd = parser.endIndex + 1;
        }
      },
    },
    { recognizeSelfClosing: true, lowerCaseTags: true },
  );
  parser.write(buffer);
  return elementEnd !== -1 ? buffer.slice(0, elementEnd) : null;
}
```

The trade-off is latency. A long opening tag delays delivery of every later token in the same root element. In practice, tags average under 200 characters, the model emits them quickly, and the user perceives a single visual bloom as the chart appears at once rather than character-by-character growth that would never look right anyway.
The chunker also assumes well-formed open tags. The htmlparser2 path tolerates almost anything because the underlying parser is liberal, but the manual zero-dependency twin tracks inQuote by hand — an unescaped inner quote inside a SQL attribute (sql="SELECT * FROM \"users\"" arriving without proper escaping) can flip the state machine and either flush early or never flush. The pre-processor in the next stage repairs most of this before markdown ever sees it; the chunker's contract is "well-formed in, complete element out," not "broken in, fixed out."
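To make that failure mode concrete, the by-hand walk can be sketched roughly like this — a sketch, not the production chunker; the function name and the omission of comment and CDATA handling are mine:

```typescript
// Sketch of the zero-dependency depth/quote walk. Returns the buffer
// prefix ending at the first complete top-level element, or null while
// the element is still open.
function findCompleteElement(buffer: string): string | null {
  let depth = 0;
  let inTag = false;               // between '<' and its matching '>'
  let isClose = false;             // the tag being scanned is </...>
  let quote: string | null = null; // open attribute-quote char, if any

  for (let i = 0; i < buffer.length; i++) {
    const ch = buffer[i];
    if (!inTag) {
      if (ch === "<") {
        inTag = true;
        isClose = buffer[i + 1] === "/";
      }
      continue;
    }
    if (quote) {
      // Inside an attribute value a '>' is data, not a terminator.
      // An unescaped matching quote leaves the value — this is the state
      // a malformed inner quote in a SQL attribute can corrupt.
      if (ch === quote && buffer[i - 1] !== "\\") quote = null;
    } else if (ch === '"' || ch === "'") {
      quote = ch;
    } else if (ch === ">") {
      inTag = false;
      const selfClosing = buffer[i - 1] === "/"; // e.g. <param-date-range />
      if (isClose) depth--;
      else if (!selfClosing) depth++;
      if (depth === 0) return buffer.slice(0, i + 1); // complete element: flush
    }
  }
  return null; // still buffering
}
```

The interesting branch is the quote one: a `>` inside `sql="a > b"` must not terminate the tag, which is exactly why the state machine has to track quoting at all.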
CommonMark Eats Custom Tags Unless You Wrap Them in <div>
The second gate is the strangest one in the system, because it exists to defend against a rule that is correct in the spec and inconvenient in practice. CommonMark treats raw HTML as a block element only when it sits at the top of a line and matches certain shapes. A bare <bar-chart> line is treated as inline HTML inside an implicit paragraph, which produces invalid markup the moment the chart renders a <div> of its own.
The pre-processor performs three load-bearing transformations on the streamed text before the markdown parser sees it. The first collapses multi-line opening tags into a single line so the parser sees one <bar-chart …> instead of several broken pieces. The second escapes * to \* inside attribute values, because a SELECT * would otherwise turn the rest of the page bold. The third wraps every top-level custom tag in a <div>…</div> so CommonMark treats it as an HTML block.
Streamdown input
Multi-line tag, asterisk in SQL, escaped quote — every one of these is a CommonMark or attribute-string hazard.
```markdown
<bar-chart
  title="Revenue"
  sql="SELECT * FROM sales WHERE region = \"emea\"" />
```

After normalize
Tag collapsed onto one line, * escaped to \* so it cannot trigger bold, \" rewritten as &quot;, and the whole element wrapped in a <div> so the markdown parser hands it to rehype-raw as a block.

```html
<div><bar-chart title="Revenue" sql="SELECT \* FROM sales WHERE region = &quot;emea&quot;" /></div>
```

A useful detail for anyone hitting the same wall: the heuristic for "is this a custom tag" is the presence of a hyphen in the tag name. That is the same rule the HTML spec uses for custom elements, and it doubles as a debug aid — an unregistered tag with a hyphen still flows through the pipeline far enough to be visible in the DOM, which means a typo surfaces as a broken render rather than total silence.
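Under those rules, the normalization pass can be sketched for the self-closing case. The function name, the regex, and the &quot; target are illustrative assumptions; a real implementation needs quote-aware scanning (a `>` inside a quoted attribute would break this regex) and must also handle paired tags:

```typescript
// Illustrative sketch only: collapses, escapes, and wraps self-closing
// custom tags (hyphenated names), per the three transformations above.
function normalizeCustomTags(text: string): string {
  return text.replace(
    /<([a-z][a-z0-9]*(?:-[a-z0-9]+)+)([\s\S]*?)\/>/g,
    (_match, name: string, attrs: string) => {
      const flat = attrs
        .replace(/\s*\n\s*/g, " ") // 1. collapse multi-line tags onto one line
        .replace(/\\"/g, "&quot;") // 2. rewrite \" so the attribute survives HTML parsing
        .replace(/\*/g, "\\*")     // 3. escape * so markdown cannot read it as emphasis
        .trim();
      return `<div><${name} ${flat} /></div>`; // 4. wrap as a CommonMark HTML block
    },
  );
}
```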
The Sanitizer's Allowlist Is the Schema
The third gate is the load-bearing one for security. After the markdown parser produces an HTML tree, rehype-sanitize walks that tree against an allowlist of tag names and per-tag allowed attributes. The allowlist is not hand-written; it is built from the same registry that supplies the React components.
The registry is three fields:
```typescript
export type GenAIInteractiveElement = {
  name: string;
  component: ComponentType<any>;
  allowedAttributes: string[];
};
```

A small build step walks the array and produces two parallel maps the markdown renderer consumes — a component map ({ kpi: KPI, … }) and the sanitizer's allowlist ({ kpi: ['title', 'sql', 'variant', …] }). One source of truth, two consumers. There is no inline-JSON escape hatch. A hypothetical <kpi data="…"> from the model loses the data attribute before the sanitizer hands the tree to React, because data is not in the allowlist and the sanitizer does not negotiate.
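A minimal sketch of that fan-out, with the component field loosened so the example stands alone (the real field is a ComponentType, the real attribute map is merged into the rehype-sanitize schema):

```typescript
// One registry array, two derived consumers. `component` is typed as
// unknown here so the sketch runs without React.
type RegistryEntry = {
  name: string;
  component: unknown;
  allowedAttributes: string[];
};

function buildMaps(registry: RegistryEntry[]) {
  const components: Record<string, unknown> = {};  // rehype component map
  const attributes: Record<string, string[]> = {}; // sanitizer allowlist
  for (const el of registry) {
    components[el.name] = el.component;
    attributes[el.name] = el.allowedAttributes;
  }
  return { components, attributes };
}
```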
The same build step unconditionally overrides the <p> renderer. This is not optional: a markdown paragraph that wraps a custom block element would otherwise produce <p><div>…</div></p>, which is invalid HTML and triggers React hydration warnings. Hijacking <p> lets the renderer emit a div that styles like a paragraph but legally contains block children.
The Browser Re-Validates SQL the Model Already Validated
The fourth gate is the one that surprises reviewers. Every data-bound component runs the same read-only check on the SQL attribute that the agent already ran in its sandbox before emitting the tag. Two validations of the same string, in two different processes, on the same machine.
The reason is the threat model. The SQL string ships verbatim from the model into the user's browser; nothing in between is trusted to constrain it. The agent's pre-emission validation is a behavioural nudge — a guardrail against the model emitting a write query — but the frontend gate is the load-bearing one because it runs in the only process that has the user's database credentials.
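The post does not show the validator itself; a gate of the same shape can be sketched as a keyword screen, with the caveat that this is illustrative only and a production check should parse the statement rather than pattern-match it:

```typescript
// Illustrative read-only screen, not the product's validator. A keyword
// screen is easy to fool with dialect-specific syntax; a real gate should
// parse the SQL.
const WRITE_KEYWORDS =
  /\b(insert|update|delete|drop|alter|create|truncate|grant|revoke|vacuum|attach|pragma|copy)\b/i;

function isReadOnlySql(sql: string): boolean {
  const stripped = sql
    .replace(/'(?:[^']|'')*'/g, "''")   // blank string literals so data can't trip the check
    .replace(/--[^\n]*/g, " ")          // drop line comments
    .replace(/\/\*[\s\S]*?\*\//g, " "); // drop block comments
  return /^\s*(select|with)\b/i.test(stripped) && !WRITE_KEYWORDS.test(stripped);
}
```

Blanking string literals first matters: a query like `SELECT 'drop table sales' AS note` is read-only and should pass.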
The data hook keys cached results by the SQL string itself, so two charts in the same response with identical SQL share a single fetch. That dedupe is intentional and shows up most often when a <grid> of three KPIs uses the same window function with different WHERE clauses — the planner sees three separate queries, the cache sees three separate keys, and the user sees one round-trip per unique query.
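The dedupe reduces to caching the in-flight promise under the SQL text. A sketch keyed on SQL alone (the real key also folds in parameter values and agent ID; `runQuery` is an illustrative stand-in for the actual query transport):

```typescript
// Identical sql attributes share one in-flight request.
const queryCache = new Map<string, Promise<unknown>>();

function fetchRows(
  sql: string,
  runQuery: (sql: string) => Promise<unknown>,
): Promise<unknown> {
  let inFlight = queryCache.get(sql);
  if (!inFlight) {
    inFlight = runQuery(sql);      // first caller issues the round-trip
    queryCache.set(sql, inFlight);
  }
  return inFlight;                 // later callers share the same promise
}
```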
The 50ms Throttle Bounds the Cost of Re-Parsing
The fifth gate is the only one that protects cost rather than correctness. Without it, every streamed token would trigger a full markdown re-parse — remark walks the buffer, rehype-raw rebuilds the HTML tree, rehype-sanitize scrubs the allowlist against the registry, and React diffs the resulting tree. At LLM token rates that is dozens of full re-parses per second on a buffer that grows on every step, and the work is roughly linear in message length per pass.
useAgentChatSetup configures experimental_throttle: 50, which caps the renderer at twenty markdown rebuilds per second regardless of how fast tokens arrive. Fifty milliseconds is below the threshold a user reads as "delayed," so there is no perceived latency cost; the win is that long messages stay cheap to render even as the conversation grows. The smooth slide-up entrance for charts is a downstream effect of the same budget — every paint frame has the headroom to actually finish.
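The effect of that budget can be illustrated with a generic leading-plus-trailing-edge throttle — this sketch is mine, not the SDK's internals:

```typescript
// Calls arriving faster than intervalMs are coalesced; the first call in a
// window fires immediately, and the latest value always renders on the
// trailing edge, so the final buffer is never dropped.
function throttle<T>(fn: (value: T) => void, intervalMs: number): (value: T) => void {
  let last = 0;
  let pending: T | undefined;
  let timer: ReturnType<typeof setTimeout> | null = null;
  return (value: T) => {
    const now = Date.now();
    if (now - last >= intervalMs) {
      last = now;
      fn(value); // leading edge: render immediately
    } else {
      pending = value;
      if (!timer) {
        timer = setTimeout(() => {
          timer = null;
          last = Date.now();
          fn(pending as T); // trailing edge: latest buffer wins
        }, intervalMs - (now - last));
      }
    }
  };
}
```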
Where the Pipeline Fails Quietly
The honest section. Five failure windows are easy to hit and worth knowing about, because none of them throws a visible error.
Param races chart mount. A <param-select> registers itself into a parameter store on mount; a sibling <bar-chart sql="… WHERE region = '{{region}}'"> that mounts in the same render frame may run with params.region === undefined for one tick before re-querying. The interpolation path tolerates the unresolved placeholder and substitutes empty, which produces a noisy zero-row return that briefly renders "No data returned" before the second query lands.

The cache keys by SQL plus params plus agent ID, which means two charts in the same message with identical SQL share a fetch — intended dedupe. Two charts in different messages with identical SQL also share, which can flicker through cached-then-fresh transitions when the second message lands. The behaviour is correct; the visual is occasionally surprising.
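The tolerant substitution at the heart of the param race behaves roughly like this (the {{param}} syntax is the post's; the function is an illustrative sketch, not the actual interpolation path):

```typescript
// Unresolved params substitute as empty strings rather than throwing,
// which is exactly what produces the transient zero-row query.
function interpolate(sql: string, params: Record<string, string | undefined>): string {
  return sql.replace(/\{\{(\w+)\}\}/g, (_match, key: string) => params[key] ?? "");
}
```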
Sanitizer eats unknown tags. The protocol's biggest debugging trap is forgetting to add a new component to the registry array. The model emits the tag, the sanitizer strips it because it is not in the allowlist, and the user sees nothing. There is no console warning and no fallback "I tried to render <sankey> but it isn't registered" message — the tag simply disappears. Until the developer thinks to view-source on the assistant message, there is no signal that anything went wrong.
Required attributes are not enforced. The allowlist is exactly that: a list of attributes that may pass through. It does not enforce that any of them must be present. A chart whose component hard-reads a required x-key from props can throw on render when the model forgets to emit it. The throw is caught by an error boundary that wraps the entire <Streamdown> instance for that assistant message — the recovery surface is per-message, not per-element. The message's text is replaced wholesale by a small red "Failed to render content"; earlier messages, later messages, and the input continue rendering unaffected. One bad chart costs the whole assistant turn, which is the right trade against a stale React tree, but it is worth knowing the granularity is coarser than it looks.
Backend stream cuts mid-element. If the AI SDK stream errors while the chunker is buffering an unclosed <bar-chart … tag, the buffered prefix is lost — it never reaches the client. The frontend sees a clean cut at the last completed element, and the assistant message simply ends short. There is no resume protocol for partial elements and no marker that something was discarded. In practice this is a quiet failure of provider connectivity rather than a bug in the protocol, but operators running this in production should know that a half-streamed chart never makes it to the browser as a half-rendered chart — it disappears.
Compare is not a registry consumer. The current Compare route renders assistant messages without passing dashboardComponents down through the messages context, which means custom tags emitted on that surface are stripped by the same sanitizer-eats-unknown-tags path described above. Whether that is intentional (Compare is a side-by-side model evaluator and may be deliberately text-only) or a regression is itself the open question — and it is exactly the route-level shape of the registry-drift class the next section is about. Surfaces have to opt into the registry; forgetting is silent.
Three Files Per Tag, and the Drift That Lives Between Them
The protocol is small enough that the entire contract for one tag fits in a developer's head. It is also large enough that the contract has to live in three places at once, with no codegen and no runtime bridge between them.
The component author writes a normal React component and a typed props interface. This is the only place TypeScript can check; it covers nothing that crosses a process boundary.
```typescript
interface KPIProps {
  title: string;
  sql: string;
  variant?: 'default' | 'trend';
  trendSql?: string;
  format?: 'number' | 'currency' | 'percent';
}
export function KPI(props: KPIProps) { /* … */ }
```

The drift is asymmetric. Adding a prop to the TS interface and the allowlist but forgetting the reference doc means the model never emits the prop. Adding a prop to the reference doc and the TS interface but forgetting the allowlist means the model emits it and the sanitizer strips it before the component runs. Adding a prop to the doc and the allowlist but forgetting the TS interface means the prop arrives untyped — a string the component has to coerce by hand.
There is a fix-shaped hole here that no one has filled. A build-time check that diffs the three artifacts against one another would catch every drift class above with no behavioural change. A Zod or Valibot schema at the registry boundary would let mismatched attribute strings surface as labeled fallbacks instead of undefined props. Neither has been built, and that is itself a useful trade-off to name out loud: the simplicity of three small files is worth more, today, than the centralization a generated layer would buy.
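For concreteness, the diff half of such a check could look something like this — illustrative only, and it assumes the reference doc's attribute lists have already been extracted into a map:

```typescript
// Compare the sanitizer allowlist against what the model-facing reference
// doc declares; each mismatch is one of the drift classes described above.
function findDrift(
  allowed: Record<string, string[]>,
  documented: Record<string, string[]>,
): string[] {
  const problems: string[] = [];
  for (const [tag, docAttrs] of Object.entries(documented)) {
    const allow = new Set(allowed[tag] ?? []);
    for (const attr of docAttrs) {
      if (!allow.has(attr)) {
        problems.push(`${tag}: "${attr}" documented but not allowlisted (sanitizer will strip it)`);
      }
    }
    for (const attr of allowed[tag] ?? []) {
      if (!docAttrs.includes(attr)) {
        problems.push(`${tag}: "${attr}" allowlisted but undocumented (model will never emit it)`);
      }
    }
  }
  return problems;
}
```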
The deeper lesson, for anyone considering this protocol: an LLM that emits HTML is exactly as dangerous as the schema that validates it, and the schema lives wherever the developer last edited it. Five gates in the pipeline cannot save a sixth gate that does not exist. Knowing which gate you are missing is the actual skill.
The chat surface where these inline tags actually render — built on the same streaming and read-only safety primitives the rest of the engineering posts cover.