Generated SQL Returns Rows That Look Right
A bar chart renders, the numbers carry one decimal place, the color of the bars is right. It is also wrong. The query the agent generated reads from a column that does not exist on the table you actually meant. The model invented a plausible name, the database returned an empty result for that branch of the query, and the surrounding aggregation hid the empty branch. The chart looks complete because nothing on the page knows it isn't.
That is the failure shape this post is about. Not "the model is bad at SQL" — the model is fine, given enough context. The failure is that the surrounding system trusted the SQL string the model produced without forcing it through the same invariants every other query has to satisfy. The fix is not a better prompt. The fix is a pipeline that the next query, and the next, and the next, all share.
The rest of this post walks the pipeline that Limerence forces every generated SQL string through, the invariants pinned along the way, what happens when the model is wrong, and where the story is still rules in prose rather than rules in code.
One Pipeline, Every Invariant Per Query
Every SQL string that wants to touch a database in Limerence — model-generated, tool-called, skill-emitted, internal — passes through one funnel. Each adapter (Postgres, SQLite, BigQuery, MySQL, SQL Server) declares an ordered list of preprocessors and an ordered list of recoverers. The funnel runs them in the same order, every time, and routes the result the same way.
The four-beat lifecycle is small enough to hold in your head:
- Preprocess. Walk the adapter's preprocessor list against the SQL string. AST validation runs first; identifier-quoting fixes run after. A preprocessor either returns a transformed string or throws.
- Run. Hand the preprocessed string to the driver. Postgres uses pg, SQLite uses better-sqlite3, BigQuery uses the official client. Connection-level read-only is already on by this point.
- Walk recoverers on error. If the driver throws, walk the dialect's recoverer list. The first recoverer that produces a non-null repaired string wins. The repaired string re-enters preprocessing — not just the driver — so the AST check fires on the rewrite.
- Return rows or throw RepairExhausted. Success returns the row set. Exhaustion throws. The pipeline does not silently degrade.
The same funnel, stage by stage:

- Generated SQL string. Model output, tool call, or skill-emitted query. Whatever the source, the next four stages are the same.
- Preprocess. Each adapter declares an ordered list. validateReadOnly fires first — AST check, dialect-locked. Postgres adds quoteIdentifiers to preserve mixed-case columns.
- Run. DatabaseAdapter.execute() hits the driver. Connection-level read-only is already on by this point — defense in depth on top of the AST check.
- Walk recoverers. Each adapter declares its own list. SQLite rewrites correlated-CTE shapes; BigQuery rewrites cross-project references. Postgres, MySQL, SQL Server have no entries — failures are surfaced unchanged.
- Rows returned. The execute path returns the result set. The same SQL string that survived preprocessing is what produced the rows.
- RepairExhausted. Every registered recoverer for this dialect failed (or none were registered). The pipeline throws rather than presenting.
The shape matters more than any single stage. The preprocessor list is what guarantees the read-only invariant. The recoverer loop is what makes per-dialect repair surgical instead of speculative. The fact that the repaired query goes back through preprocessing is what stops a recoverer from accidentally introducing a write.
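A minimal sketch of that shape in TypeScript; the names here (runFunnel, Preprocessor, Recoverer) are stand-ins, not the actual Limerence API:

```typescript
// Each preprocessor transforms the SQL or throws; each recoverer inspects a
// driver error and either returns a repaired string or null ("not my failure").
type Preprocessor = (sql: string) => string;
type Recoverer = (error: Error, sql: string) => string | null;

async function runFunnel(
  sql: string,
  preprocessors: Preprocessor[],
  recoverers: Recoverer[],
  execute: (sql: string) => Promise<unknown[]>,
  depth = 0,
): Promise<unknown[]> {
  // Preprocess: every stage runs, in order, on every query.
  const prepared = preprocessors.reduce((s, pre) => pre(s), sql);
  try {
    return await execute(prepared); // hand the string to the driver
  } catch (err) {
    if (depth > 0) throw err; // one repair round per failure in this sketch
    for (const recover of recoverers) {
      const repaired = recover(err as Error, prepared);
      // A repaired query re-enters preprocessing, so the AST check fires on
      // the rewrite and a recoverer cannot smuggle in a write.
      if (repaired !== null) {
        return runFunnel(repaired, preprocessors, recoverers, execute, depth + 1);
      }
    }
    throw new Error('RepairExhausted', { cause: err });
  }
}
```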
Key Takeaway
Accuracy is what the pipeline enforces on every query, not what the prompt asks the model to do. The model is one stage. The funnel is the mechanism.
Read-Only Enforced Twice: AST and Connection
The first invariant on the pipeline is read-only. It is enforced twice, on purpose, in two completely different places.
The first enforcement is AST-based. validateReadOnly(dialect) is the first preprocessor on every adapter. It parses the SQL with a dialect-aware grammar and refuses any tree that contains a write — INSERT, UPDATE, DELETE, MERGE, data-modifying CTEs, INTO OUTFILE, INTO DUMPFILE. EXPLAIN is allowed; EXPLAIN ANALYZE is not, because on Postgres EXPLAIN ANALYZE actually executes the underlying statement. A separate AST check rejects multiple statements separated by ;, with a narrow exception for T-SQL SET PARSEONLY ON … OFF blocks that recurses into the inner query. A regex fallback catches anything the parser missed.
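For concreteness, here is the skeleton of an AST-gated read-only preprocessor, sketched with the open-source node-sql-parser package. The choice of parser is an assumption, and the full rule set (write-CTEs, EXPLAIN ANALYZE, the T-SQL exception, the regex fallback) is deliberately omitted:

```typescript
import { Parser } from 'node-sql-parser';

// Statement types rejected outright. The real check also walks CTEs for
// data-modifying clauses and rejects EXPLAIN ANALYZE; this sketch does not.
const WRITE_TYPES = new Set<string>([
  'insert', 'update', 'delete', 'replace', 'create', 'drop', 'alter', 'truncate',
]);

function validateReadOnly(sql: string, dialect: string = 'PostgresQL'): string {
  const parser = new Parser();
  const ast = parser.astify(sql, { database: dialect });
  const statements = Array.isArray(ast) ? ast : [ast];
  if (statements.length > 1) {
    throw new Error('multiple statements rejected');
  }
  for (const stmt of statements) {
    if (WRITE_TYPES.has(stmt.type)) {
      throw new Error(`write statement rejected: ${stmt.type}`);
    }
  }
  return sql; // preprocessors return the (possibly transformed) string
}
```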
The second enforcement is the connection itself. Postgres clients are checked out with default_transaction_read_only = on set on every connection. SQLite databases are opened with mode=ro&immutable=1. BigQuery uses dryRun: true for validation. Even if a clever query slipped past the AST check, the driver itself would refuse to write.
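The driver-side options are real features of each driver; whether Limerence configures them exactly this way is an assumption:

```typescript
import { Pool } from 'pg';
import Database from 'better-sqlite3';

// Postgres: every session the pool hands out starts read-only.
const pgPool = new Pool({
  connectionString: process.env.DATABASE_URL,
  options: '-c default_transaction_read_only=on',
});

// SQLite: open the file read-only, and fail rather than create an empty db.
const sqlite = new Database('analytics.db', { readonly: true, fileMustExist: true });
```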
Parser-only check. A regex blocklist or a single AST check sits between the model and the database.

- a clever shape the parser does not recognize reaches the driver
- an EXPLAIN ANALYZE survives because it parses as EXPLAIN
- a future SQL feature the validator was not updated for is a hole
- the parser is the only thing the read-only claim depends on

Parser plus connection-level enforcement. The AST check is one of two. The driver session itself refuses to write.

- the AST check catches the syntactic cases at zero driver cost
- default_transaction_read_only = on catches anything that slipped past
- a future model upgrade does not make the connection writable
- read-only survives if either layer is wrong, not only when both are right
This is defense in depth, not redundancy. The two layers fail in different ways and protect against different mistakes — one in the validator, one in the driver. Either layer alone is a single point of failure.
Schema Grounding Is a Configured Surface, Not a Default
The second invariant: the model is not asked to guess. Before any SQL is generated, the system injects schema introspection into the prompt — table names, column names and types, sample values for low-cardinality enums, optional foreign-key traversal. This is what the codebase calls grounding.
The default for Postgres is info() + tables() + columnStats() — server version, table-and-column listing, and per-column statistics. Indexes, constraints, row counts, column-value samples, and foreign-key traversal are all opt-in booleans. Join-heavy schemas with no forward or backward FK depth enabled will see hallucinated joins, because the model will compose them from naming patterns instead of declared relationships.
The four most-used facets:

- info(): server identity and version. One row, cheap. Tells the model the dialect quirks to expect — Postgres 14 vs 16, SQLite 3.45, BigQuery standard SQL. Without it, the model defaults to ANSI SQL and trips on dialect features.
- tables(): the table-and-column listing, the names and types the model is allowed to reference.
- columnStats(): per-column statistics.
- Foreign-key traversal: the declared relationships, forward and backward, that ground joins in the schema instead of in naming patterns.
The decision the agent owner makes once, at agent-creation time, is the floor on accuracy for every query that agent ever runs. Defaults are safe for trivial queries. Domain-specific or join-heavy workloads need a deliberate config.
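What a deliberate config for a join-heavy schema might look like; the field names are inferred from the facets above, not confirmed against the codebase:

```typescript
// Hypothetical per-agent grounding config. The three defaults stay on;
// everything below them is the opt-in surface the owner enables deliberately.
const grounding = {
  info: true,        // server identity and version
  tables: true,      // table-and-column listing
  columnStats: true, // per-column statistics
  indexes: true,
  constraints: true,
  rowCounts: true,
  columnValueSamples: true, // sample values for low-cardinality enums
  foreignKeys: { forwardDepth: 2, backwardDepth: 1 }, // ground joins in declared FKs
};
```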
Seventeen Fragment Types, Not One System Prompt
The third invariant: instructions are typed, not freeform. The agent's system context is not a single text blob the owner edits. It is an array of typed records, each one validated against a discriminated union schema before it can be saved.
A handful of clarification fragments are built into every agent before any custom instructions are added — including the recency-without-timeframe ask ("which timeframe should I use?") and the ranking-without-metric ask ("what should top mean — revenue, count, duration?"). Custom fragments stack on top of those. A quirk fragment captures "the region_code column is two-letter ISO except for legacy rows where it's null." A correction captures "we tried this query last week and the join was wrong; here is the right shape." The agent owner does not write a long paragraph and hope the model honors it. They author a record with a type, and the system installs it into the right section of the prompt.
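A sketch of the record shape, using Zod's discriminated unions. The discriminated-union validation is real; these three fragment types, their field names, and the choice of Zod are illustrative:

```typescript
import { z } from 'zod';

// Three of the seventeen fragment types, reconstructed from the examples above.
const Fragment = z.discriminatedUnion('type', [
  z.object({
    type: z.literal('clarification'),
    trigger: z.string(), // e.g. "recency without timeframe"
    ask: z.string(),     // e.g. "which timeframe should I use?"
  }),
  z.object({
    type: z.literal('quirk'),
    column: z.string(),  // e.g. "region_code"
    note: z.string(),    // "two-letter ISO except legacy rows where it's null"
  }),
  z.object({
    type: z.literal('correction'),
    wrongSql: z.string(), // the query that was tried and failed
    rightSql: z.string(), // the shape that is actually correct
    note: z.string(),
  }),
]);

type Fragment = z.infer<typeof Fragment>;
```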
Typed beats untyped because the owner can see what the agent knows by category, dedupe by type, and refactor without breaking the prompt. A wall of system-prompt text gives none of those properties.
Agent Edits Invalidate the Cache, Not Schema Drift
The fourth invariant: the same schema produces the same prompt. Limerence stamps every text-to-SQL adapter instance with a version hash derived from the agent's grounding configuration, instructions array, and connected data source IDs. The hash covers grounding + instructions + dataSourceIds only. The live database schema is not in the hash — by design, because hashing the introspection on every chat would defeat the cache it is built to support. Cosmetic edits to the agent — a typo in a description, a renamed display label — do not bust the cache. Substantive edits — a new fragment, a re-enabled FK depth — do.
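Under those stated inputs, the hash is a few lines. This sketch uses node:crypto; the function name and the canonicalization step are assumptions:

```typescript
import { createHash } from 'node:crypto';

// Deliberately excludes the live database schema: hashing fresh introspection
// on every chat would defeat the cache this hash exists to support.
// A production version would canonicalize nested key order before hashing.
function agentVersionHash(
  grounding: object,
  instructions: object[],
  dataSourceIds: string[],
): string {
  const canonical = JSON.stringify({
    grounding,
    instructions,
    dataSourceIds: [...dataSourceIds].sort(), // order-insensitive
  });
  return createHash('sha256').update(canonical).digest('hex');
}
```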
This is the right trade for chat latency, and it is also the source of the one schema-related failure mode the system does not yet detect. If the database itself changes — a column is dropped, a table is renamed — and nobody re-introspects, the cached grounding is stale. The agent will hallucinate against introspection that no longer matches the database. The hash will not change, because the hash inputs did not change. The first sign of drift is a query that fails at the driver with a column does not exist or relation does not exist error — the driver rejects the broken plan, but nothing in the loop tells the user why it broke.
A drift detector — schema-diff against the cached introspection, surfaced as a freshness indicator on the agent — would close this gap. The codebase does not have one yet.
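One possible shape for that detector. To be clear, this is a sketch of the missing piece, not code that exists:

```typescript
// Compare the cached introspection against a fresh one; any hit means the
// grounding is stale and the agent should surface a freshness warning.
type ColumnMap = Record<string, string[]>; // table name -> column names

function schemaDrift(cached: ColumnMap, live: ColumnMap): string[] {
  const drift: string[] = [];
  for (const [table, columns] of Object.entries(cached)) {
    if (!(table in live)) {
      drift.push(`table missing or renamed: ${table}`);
      continue;
    }
    for (const column of columns) {
      if (!live[table].includes(column)) {
        drift.push(`column missing: ${table}.${column}`);
      }
    }
  }
  return drift;
}
```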
When the Model Is Wrong: Per-Dialect Repair, With Holes
The recovery half of the lifecycle. When the driver throws, the pipeline walks the dialect's recoverer list before surfacing the error. A recoverer is a small, opinionated function: it inspects the error and the SQL, and it either returns a repaired SQL string or returns null. Null means "not my failure mode." A returned string means "try this," and the pipeline re-enters preprocessing with it.
Two recoverers exist today. sqliteRepair catches no such column errors with correlated-CTE shapes that SQLite's optimizer rejects, and rewrites them into the equivalent shape SQLite accepts. A separate documented caveat: SQLite table-valued-function shapes like json_each(...) can fail the upstream scope analyzer and be rejected before reaching the driver. Known leak, not yet closed. bigqueryProjectRepair catches Not found: Table and Not found: Dataset errors, and rewrites the cross-project reference using the configured dataset map.
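In the spirit of bigqueryProjectRepair, a recoverer is small enough to sketch whole. The error-message matching and the rewrite below are assumptions about its internals, not its actual source:

```typescript
type Recoverer = (error: Error, sql: string) => string | null;

// datasetMap maps a dataset name to the project that actually owns it.
const bigqueryProjectRepair =
  (datasetMap: Record<string, string>): Recoverer =>
  (error, sql) => {
    const match = /Not found: (?:Table|Dataset) (\S+)/.exec(error.message);
    if (!match) return null; // not my failure mode: next recoverer's turn
    const dataset = match[1].split('.')[0];
    const project = datasetMap[dataset];
    if (!project) return null; // no configured home for this dataset
    // Qualify the bare reference; the rewrite re-enters preprocessing upstream.
    return sql.replaceAll(`\`${dataset}.`, `\`${project}.${dataset}.`);
  };
```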
The empty slots are real. Postgres, MySQL, MariaDB, and SQL Server have no registered recoverers. Common fixes — column-typo correction, ambiguous-column qualification, GROUP-BY auto-fix — are not in scope today. A failed Postgres query surfaces unchanged. The shallow coverage is honest; the funnel exists, the slots exist, the registrations are empty.
The Rules That Are Prose, Not Code
The hardest invariant to enforce is the one that connects SQL execution to UI rendering. The system prompt installs a guardrail fragment that says, in effect: do not present a chart, table, or dashboard unless the underlying SQL has executed in this turn and the rows have been inspected. The model is asked to honor the rule. The rule is also duplicated inside the BI skill's prompt, but that skill is currently disabled: true in its frontmatter — so today the rule fires from the default fragment catalog only. Either way, it is prose.
And the rendering tools do not refuse to draw a <bar-chart sql="..." /> for a SQL string that was never executed in this turn. No runtime check says "this chart references a query that has no execution receipt — abort the render." The invariant is real, the enforcement is rhetorical. Code beats prose; the gap is the prose layer.
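What runtime enforcement could look like. Again, this check does not exist in the codebase today, and every name here is hypothetical:

```typescript
// An "execution receipt": the render layer refuses any SQL string that the
// execute path has not actually run and returned rows for in this turn.
const executedThisTurn = new Map<string, number>(); // sql -> row count

function recordExecution(sql: string, rows: unknown[]): void {
  executedThisTurn.set(sql, rows.length);
}

function assertExecutionReceipt(sql: string): void {
  if (!executedThisTurn.has(sql)) {
    throw new Error('render refused: no execution receipt for this SQL in this turn');
  }
}
```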
Datasets Are Evaluation Fuel, Not Training Data
The accuracy lever the reader probably wants most is the one that is least built. Limerence has the surface area for evaluation: a Dataset is scoped to one agent, contains DatasetEntry rows of (input, expected, origin, actual?), and is populated by tooling — schema-driven synthesis with persona and complexity dimensions, breadth and depth evolution from seed entries, extraction from past chat turns. Entries can be exported as JSONL in the canonical fine-tuning shape: { question, sql, origin, comments }.
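The export itself is mechanical. Here is a sketch of a JSONL writer for the canonical shape quoted above; any DatasetEntry field name beyond (input, expected, origin) is an assumption:

```typescript
interface DatasetEntry {
  input: string;    // the natural-language question
  expected: string; // the golden SQL
  origin: string;   // synthesis, evolution, or chat extraction
  comments?: string;
}

// One JSON object per line: { question, sql, origin, comments }.
function toJsonl(entries: DatasetEntry[]): string {
  return entries
    .map((e) =>
      JSON.stringify({ question: e.input, sql: e.expected, origin: e.origin, comments: e.comments ?? '' }),
    )
    .join('\n');
}
```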
What does not exist yet: a runner that takes a dataset, replays each entry through the live agent loop, scores the regenerated SQL against the expected SQL, and reports a per-model accuracy number. The evaluations UI is wired to mock data. The dataset-validate endpoint only checks that each expected SQL parses against the schema — not that the model regenerates it. And nothing in-process consumes the JSONL export for fine-tuning. Output is the boundary.
That is the honest gap. The dataset surface is real and useful for hand-building golden sets. The closed-loop evaluation and fine-tuning consumers are not built. A reader looking for "we A/B tested four models against our dataset and here is the curve" should know the curve does not exist yet.
What Moves Accuracy Today
The levers that are real, in code, and copyable into another stack:
- AST read-only validation as the first preprocessor. Not a regex, not a connection flag alone — a dialect-aware parser that runs before the driver sees the SQL.
- Connection-level read-only on top of the AST check. Defense in depth means each layer protects against a different class of mistake.
- Typed instruction fragments instead of one freeform system prompt. The reviewability is worth the schema cost.
- FK-aware grounding, opt-in per agent. The default is the safe minimum; join-heavy schemas need explicit forward and backward depth.
- Version-hashed introspection cache. Cosmetic edits do not bust the cache; substantive ones do. Live schema drift is still on the to-do list.
- One pipeline for every query. The recoverer that fixes a SQLite failure runs in the same funnel that fires the read-only check. The funnel is the reason invariants compose instead of conflict.
The post-funnel work — runtime enforcement of execute-before-render, recoverers for Postgres and MySQL, a real evaluation runner, a fine-tuning consumer for the JSONL export, a drift detector on cached grounding — is named, scoped, and not yet built. That is the honest version of the accuracy story today.