BigQuery and a natural-language layer should be a clean fit. Schemas live in INFORMATION_SCHEMA, the dialect is documented, the public-data corpus is enormous. In practice, two things keep getting in the way.
The first is the cross-project tax. The most useful BigQuery data — bigquery-public-data.*, partner shares, that one warehouse a sibling team built — lives in projects the LLM has never been told about. The model writes FROM fdic_banks.institutions, BigQuery returns "Access Denied," and you go back to copy-pasting fully-qualified table names into prompts.
The second is the cost-of-being-wrong. Every iteration on a half-correct query costs slot time. Tools that "just run the SQL" punish exploration, especially when the query was syntactically doomed before it ever scanned a byte.
◆Key Takeaway
Both failures are infrastructure problems, not prompt problems. Limerence's BigQuery adapter handles the first with a syntactic repair pass against the configured cross-project map, and the second with BigQuery's own dry-run planner gating every execute.
Connecting a Warehouse: Project, Datasets, Done
Open the Data Sources page, click New connection, and pick BigQuery — Google Cloud data warehouse from the database type list. Two fields matter.
The first is the GCP Project ID. This is the billing project — the one whose slot quota pays for query compute. The second is one or more Datasets. You can list bare names like analytics_events, which resolve against the billing project, or fully-qualified references like bigquery-public-data.fdic_banks for cross-project data. They mix freely on the same connection.
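Conceptually, that mixed list resolves into a map from dataset name to owning project. A minimal Python sketch of the idea (the function name and shapes here are illustrative, not Limerence's internals):

```python
def build_dataset_map(billing_project: str, specs: list) -> dict:
    """Resolve a mixed dataset list into {dataset_name: owning_project}.

    Bare names resolve against the billing project; fully-qualified
    project.dataset references carry their own project.
    """
    mapping = {}
    for spec in specs:
        if "." in spec:
            # Fully qualified: everything before the last dot is the project.
            project, dataset = spec.rsplit(".", 1)
        else:
            # Bare name: lives in the billing project.
            project, dataset = billing_project, spec
        mapping[dataset] = project
    return mapping
```

A map like this, keyed by bare dataset name, is the natural input for the cross-project repair described later: `build_dataset_map("my-proj", ["analytics_events", "bigquery-public-data.fdic_banks"])` yields `{"analytics_events": "my-proj", "fdic_banks": "bigquery-public-data"}`.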
Once the project ID is filled, a Load button appears. It hits an introspection endpoint that lists every dataset the service account can see in the billing project and drops them into a dropdown. Large orgs will see hundreds of datasets here — there is no filter on the listing today. The + button next to the dropdown lets you type a fully-qualified project.dataset name for anything outside the billing project.
1. Pick BigQuery in the New Connection dialog. The form swaps to the BigQuery-specific fields.
2. Enter the GCP Project ID. This is the billing project, where query jobs run and slot time is charged.
3. Click Load to list datasets in the billing project, or type a fully-qualified project.dataset reference for cross-project data like bigquery-public-data.fdic_banks.
4. Save. The agent now grounds on those datasets — same chat, output styles, and dataset gating you already use for Postgres or SQLite.
After saving, the data source detail page shows the project, the attached datasets, the last-tested timestamp, and a connection status. A Test connection button runs a SELECT 1 against the warehouse, and the schema panel pulls real table and view lists from INFORMATION_SCHEMA.
Why FROM fdic_banks.institutions Stops Failing
The schema context the agent sees presents tables as dataset.table. That is fine when the dataset lives in the billing project. It breaks the moment the LLM emits a reference to a foreign-project dataset, because BigQuery resolves unqualified dataset.table against the billing project and returns "Access Denied" or "Not found."
Limerence catches that error class and rewrites the query against the cross-project map built from your configured datasets. The retried query runs against the right project and bills against yours.
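That detection can be sketched as a narrow pattern match on the message BigQuery returns, something like "Access Denied: Dataset project:dataset" or "Not found: Dataset project:dataset". A simplified Python sketch; the exact message formats and the function name are assumptions, not the adapter's code:

```python
import re
from typing import Optional

# Matches the repairable error class: the project segment may contain
# hyphens (e.g. january-9f554), the dataset name is a bare identifier.
_REPAIRABLE = re.compile(
    r"(?:Access Denied|Not found): Dataset [\w-]+:(?P<dataset>\w+)"
)

def repairable_dataset(error_message: str) -> Optional[str]:
    """Return the dataset name if the error is in the repairable class,
    else None (quota, location, and other errors fall through)."""
    m = _REPAIRABLE.search(error_message)
    return m.group("dataset") if m else None
```

Only a hit here triggers the rewrite-and-retry path; every other error class is surfaced to the user unchanged.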
Without repair
The LLM emits an unqualified reference. BigQuery resolves it against the billing project, where the dataset does not exist.
```sql
SELECT name, city, state
FROM fdic_banks.institutions
WHERE state = 'CA'
LIMIT 10
```

BigQuery responds with Access Denied: Dataset january-9f554:fdic_banks and the query never reads a row. The user is back to pasting fully-qualified table names into the prompt.
With repair
The recoverer detects the error class, looks up fdic_banks in the
cross-project map built from the configured datasets, and rewrites the SQL
in place.
```sql
SELECT name, city, state
FROM `bigquery-public-data`.fdic_banks.institutions
WHERE state = 'CA'
LIMIT 10
```

The base adapter retries. The query runs against bigquery-public-data, bills against the configured billing project, and returns rows.
The repair is syntactic, not semantic. It is a textual rewrite against a map built from the datasets you already configured, not a model-driven correction or a guess at what the user meant. Three quoting forms are handled — unqualified dataset.table, double-backtick `dataset`.`table`, and the single-backtick-pair shape `dataset.table` — across both single tables and multi-table joins. References that were already fully qualified are left alone, and string literals that happen to contain a dataset.table substring are skipped.

The skip on string literals matters because LLMs sometimes emit WHERE name = 'fdic_banks.institutions' as a filter against an audit-log table. Rewriting that would break the query.
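Under those constraints, the pass can be sketched as a literal-aware regex substitution: split out string literals, qualify bare references found in the map, and leave everything else alone. This is a simplified Python illustration of the technique, not the adapter's actual implementation; the names and regexes are assumptions:

```python
import re

# Matches the three quoting forms (dataset.table, `dataset`.`table`,
# `dataset.table`) while refusing to match inside an already-qualified
# project.dataset.table path.
_REF = re.compile(
    r"(?<![\w.`])"          # not preceded by a project qualifier
    r"`?(?P<ds>\w+)`?"      # dataset, optionally backticked
    r"\s*\.\s*"
    r"`?(?P<tbl>\w+)`?"     # table, optionally backticked
    r"(?![\w.])"            # not itself the prefix of a longer path
)

def repair_cross_project(sql: str, dataset_map: dict) -> str:
    """Qualify bare dataset.table references against the cross-project map."""
    # Split out single-quoted literals so they are never rewritten.
    parts = re.split(r"('(?:[^'\\]|\\.)*')", sql)

    def fix(segment: str) -> str:
        def sub(m):
            project = dataset_map.get(m.group("ds"))
            if project is None:
                return m.group(0)  # unknown dataset (or a table alias): leave it
            return f"`{project}`.{m.group('ds')}.{m.group('tbl')}"
        return _REF.sub(sub, segment)

    # Even indices are SQL text; odd indices are the captured literals.
    return "".join(p if i % 2 else fix(p) for i, p in enumerate(parts))
```

The map guard doubles as the "leave it alone" rule: a table alias like t.name or a dataset you never configured falls through untouched.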
Catching Bad Queries Before They Cost Anything
Every query Limerence sends to BigQuery passes through a dry-run before it ever runs. The adapter calls createQueryJob with dryRun: true, BigQuery returns either an OK with the planned output schema or a structured error, and the executor only proceeds on OK. Parse errors, missing columns, missing tables, and permission failures all surface for free.
For an exploratory workflow this matters more than it sounds. A failed dry-run still costs a planning round-trip, but it does not move data, charge for slots, or scan a partition. The cost-of-being-wrong drops to "one extra request to the planner" — a budget the team can afford to spend, repeatedly, while iterating on a half-formed question.
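The gating pattern itself is small. The adapter uses the Node client's createQueryJob with dryRun: true; the Python client's equivalent is shown in the comments below. In this sketch the cloud call is isolated behind a plan callable so the logic reads offline, and the names are illustrative rather than Limerence's API:

```python
from typing import Callable, Optional, Tuple

def gate(sql: str, plan: Callable[[str], int]) -> Tuple[bool, Optional[str], int]:
    """Dry-run first; only report ok=True when the planner accepts the query.

    `plan` stands in for the dry run. In google-cloud-bigquery it would be:
        job = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True))
        planned = job.total_bytes_processed
    It returns the bytes the query would scan, or raises on a planner error.
    """
    try:
        return True, None, plan(sql)
    except Exception as exc:  # parse error, missing table, permissions, ...
        return False, str(exc), 0

def estimated_cost_usd(planned_bytes: int, usd_per_tib: float = 6.25) -> float:
    """Rough on-demand estimate; assumes the published $6.25/TiB list price."""
    return planned_bytes / 2**40 * usd_per_tib
```

A failed gate returns the planner's structured error without spending slot time, which is exactly the "one extra request" budget described above.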
What This Unlocks, and What It Deliberately Doesn't
A few things follow once the cross-project mechanic is solved.
Public datasets become first-class. A team can attach bigquery-public-data.fdic_banks and bigquery-public-data.austin_bikeshare next to their internal warehouse and ask questions across both. Multi-project warehouses stop being a manual-qualification chore — datasets in sibling-team projects, partner-share projects, or your own historical projects all sit on the same connection. And the same agent workflow that already worked on Postgres or SQLite carries straight over: chat grounding, output styles, dataset scoping, the existing review surfaces.
A few limits are worth naming.
There is no row-limit or cost-cap enforcement yet. A misphrased question against a partitioned billion-row table will plan, dry-run cleanly, and then scan everything it asked for. The in-app guidance to "ask about partitioned columns to keep costs down" is a tip, not a guard. A cost-cap preprocessor is on the roadmap.
The repair pipeline only fires on access-denied and not-found errors. Other BigQuery error classes — quota exceeded, location mismatch, partition-required — fall through unrecovered. The recoverer is intentionally narrow; the failure mode it solves is the one that was actually hurting people.
The dataset list is stored as a comma-separated string. There is no per-dataset metadata, location flag, or independent permission state. If you need that today, model it outside the connection.
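Of the limits above, the missing cost cap is the easiest to approximate outside the tool: compare the dry-run byte estimate against a budget before executing. An illustrative sketch, not a shipped feature; the names and the 10 GiB default are arbitrary assumptions:

```python
DEFAULT_CAP_BYTES = 10 * 2**30  # illustrative 10 GiB budget per query

def within_cap(planned_bytes: int, cap_bytes: int = DEFAULT_CAP_BYTES) -> bool:
    """Gate execution on the dry-run estimate staying under the budget."""
    return planned_bytes <= cap_bytes
```

Wired between the dry run and the execute, a check like this turns the "tip, not a guard" advice into an actual guard.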
Setting It Up
A short checklist for an admin wiring this up.
The billing project needs the BigQuery API enabled. The credentials the backend uses need bigquery.jobs.create on that project. For each dataset attached to the connection — including any cross-project ones — the credentials need bigquery.dataViewer (or an equivalent role) on the dataset's home project. Cross-project datasets bill against your project but read from theirs, so both grants matter.
That is the whole setup. Pick a project, attach datasets, point the agent at it. The cross-project map is built from the datasets you configured, the dry-run gate runs on every query, and the rest is the agent workflow you already know.