DATASETS

BETA

Build training data from your own schema.

◇

Generate From Schema

Turn your schema into question-SQL pairs. No manual labeling. Choose complexity levels and generate realistic user personas.

schema → questions

INPUT SCHEMA

customers

id, name, email

created_at, status

orders

id, customer_id

total, status

GENERATED PAIR

"Who are our top customers by order volume?"

SELECT name, COUNT(*)

FROM customers c

JOIN orders o ON...

GROUP BY name

Complexity:

Count: 50 ▼

Complexity Levels

• Low - Simple SELECTs, basic filters
• Medium - JOINs, GROUP BY, aggregations
• Hard - Subqueries, HAVING, complex conditions
• Window - Window functions, CTEs, analytical queries
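To make the four levels concrete, here is a minimal Python sketch that buckets a query by the SQL features named in the list above. The function name `classify_complexity` and its keyword rules are assumptions for illustration only; they are not the product's actual logic.

```python
import re

# Hypothetical heuristic (an assumption, not the product's implementation):
# bucket a query into the four levels using the features listed above.
def classify_complexity(sql: str) -> str:
    s = sql.upper()
    # Window: window functions (OVER (...)) or a leading CTE (WITH ...)
    if re.search(r"\bOVER\s*\(", s) or re.match(r"\s*WITH\b", s):
        return "window"
    # Hard: HAVING clauses or subqueries
    if re.search(r"\bHAVING\b", s) or re.search(r"\(\s*SELECT\b", s):
        return "hard"
    # Medium: JOINs, GROUP BY, or aggregate functions
    if re.search(r"\bJOIN\b|\bGROUP\s+BY\b", s) or re.search(
        r"\b(COUNT|SUM|AVG|MIN|MAX)\s*\(", s
    ):
        return "medium"
    # Low: everything else (simple SELECTs, basic filters)
    return "low"


if __name__ == "__main__":
    print(classify_complexity("SELECT name FROM customers WHERE status = 'active'"))
```

A keyword check like this misreads keywords inside string literals; a real classifier would more likely work from a parsed query.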

PERSONAS - Generate realistic user personas to create diverse questions, for example "Sales Manager", "Data Analyst", or "Customer Support Lead".

◇

Import From SQL

Already have queries? Import existing SQL files or paste queries directly. AI generates the natural language questions.

sql → questions

Paste your SQL:

SELECT product_name, SUM(quantity)

FROM order_items

GROUP BY product_name

ORDER BY SUM(quantity) DESC

LIMIT 10;

Generated question:

"What are the top 10 products by quantity sold?"

◇

Import From Chat History

Extract question-SQL pairs from real conversations. Auto-deduplicate as you import.

history → dataset

Select sessions to import:

Sales Agent (23 sessions)

☑ Revenue analysis - 47 messages - 12 queries

☑ Customer segmentation - 31 messages - 8 queries

☐ Product performance - 22 messages - 5 queries

Support Agent (15 sessions)

☐ Ticket analysis - 18 messages - 4 queries

☐ Response time report - 25 messages - 6 queries

• Group sessions by agent
• See message and query counts
• Bulk select multiple sessions
• Auto-dedupe by question

◇

Augment & Evolve

Expand your dataset automatically. Add complexity (depth) or create variations (breadth).

evolvers

DEPTH (Evolve)

Add complexity with 5 techniques:

☑ Add filters - WHERE clauses, date ranges

☑ Add joins - Multi-table queries

☑ Add aggregations - GROUP BY, COUNT, SUM, AVG

☑ Add reasoning - Multi-step calculations

☑ Hypothetical - "What if" scenarios

"Top customers" → "Top customers by revenue this quarter where status is active, grouped by region with YoY growth"

BREADTH (Spread)

Generate variations with synonyms and rephrasing

"Top customers" → "Best customers"

→ "Highest spending customers"

→ "Leading customers"

→ "Top performing accounts"

Count per entry: 5 ▼
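As a rough illustration of breadth variation, the sketch below swaps words for synonyms and caps the number of variants per entry. The `spread` function and the synonym table are hypothetical; the page does not say how variations are actually produced, and rephrasings such as "Highest spending customers" need more than word substitution.

```python
# Hypothetical sketch of synonym-based variation (an assumption, not the
# product's method): replace one word at a time using a synonym table.
def spread(question, synonyms, count=5):
    words = question.split()
    variants = []
    for i, word in enumerate(words):
        for alt in synonyms.get(word.lower(), []):
            # Keep the capitalization of the word being replaced
            if word[:1].isupper():
                alt = alt.capitalize()
            variant = " ".join(words[:i] + [alt] + words[i + 1:])
            if variant != question and variant not in variants:
                variants.append(variant)
    # "count" plays the role of the "Count per entry" setting
    return variants[:count]


if __name__ == "__main__":
    table = {"top": ["best", "leading"]}  # illustrative synonym table
    for variant in spread("Top customers", table):
        print(variant)
```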

◇

Validate & Dedupe

Ensure quality before training. Validate SQL, remove duplicates, and track entry lineage.

quality checks

VALIDATION

☑ SQL Validation - EXPLAIN-based syntax check

☑ Execution Test - Run queries to verify results
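An EXPLAIN-based check can be sketched with Python's built-in `sqlite3` module: prefixing a statement with `EXPLAIN` makes the database compile it without running it, so syntax errors and unknown tables or columns surface as exceptions. The column types in `SCHEMA` and the helper name `validate_sql` are assumptions; the page lists only column names and does not say which database engine is used.

```python
import sqlite3

# Column names come from the example schema; the types are assumed.
SCHEMA = """
CREATE TABLE customers (
    id INTEGER PRIMARY KEY, name TEXT, email TEXT, created_at TEXT, status TEXT
);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL, status TEXT
);
"""

def validate_sql(conn, sql):
    """Return (ok, error). EXPLAIN compiles the statement without running it."""
    try:
        conn.execute("EXPLAIN " + sql)
        return True, None
    except sqlite3.Error as exc:
        return False, str(exc)

# An empty in-memory database carrying only the schema is enough for the check.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

if __name__ == "__main__":
    print(validate_sql(conn, "SELECT name FROM customers WHERE status = 'active'"))
    print(validate_sql(conn, "SELECT nme FROM customers"))
```

The execution test would go one step further and run the query against real or sample data, which this sketch does not do.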

DEDUPLICATION MODES

○ Exact - Match on both question AND SQL

● SQL-only - Match on SQL only (ignore question)

○ Question - Match on question only (ignore SQL)
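The three modes differ only in which fields form the comparison key. The sketch below shows that idea in Python; the entry shape (dicts with `question` and `sql` keys), the normalization rules, and the `dedupe` function are assumptions made for illustration.

```python
import re

def _norm_question(text):
    # Assumed normalization: case-insensitive, whitespace-insensitive
    return re.sub(r"\s+", " ", text).strip().lower()

def _norm_sql(text):
    # Assumed normalization: collapse whitespace and drop a trailing semicolon.
    # Case is kept so that string literals are not conflated.
    return re.sub(r"\s+", " ", text).strip().rstrip(";").strip()

# One key function per deduplication mode
KEY_FUNCS = {
    "exact": lambda e: (_norm_question(e["question"]), _norm_sql(e["sql"])),
    "sql": lambda e: _norm_sql(e["sql"]),
    "question": lambda e: _norm_question(e["question"]),
}

def dedupe(entries, mode="sql"):
    """Keep the first entry for each key; later matches count as duplicates."""
    seen = set()
    kept = []
    for entry in entries:
        key = KEY_FUNCS[mode](entry)
        if key not in seen:
            seen.add(key)
            kept.append(entry)
    return kept
```

With SQL-only matching, two differently worded questions that share the same query collapse into one entry, so that mode suits cases where query coverage matters more than phrasing variety.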

Dataset: sales_queries_v2

Total entries: 847

Valid: 812

Duplicates removed: 23

Invalid: 12

Entry Lineage

Track where each entry came from and what it evolved into.

entry tree

📄 "Top customers by revenue" (MANUAL)

├─ 📄 "Top customers by revenue this quarter" (AUGMENTED)

│  └─ 📄 "Top customers by revenue this quarter in CA" (AUGMENTED)

├─ 📄 "Best customers by revenue" (AUGMENTED)

└─ 📄 "Highest revenue customers" (AUGMENTED)

ORIGIN TRACKING - Every entry tagged with origin: MANUAL, FEEDBACK, IMPORT, AUGMENTED
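One way to picture lineage is a tree in which every entry carries an origin tag and a link to the entry it was derived from. The `Entry` class and `render` helper below are a hypothetical sketch of that data structure, not the product's data model; only the four origin values come from the page.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ORIGINS = {"MANUAL", "FEEDBACK", "IMPORT", "AUGMENTED"}

# eq=False and repr=False avoid infinite recursion through parent/children links
@dataclass(eq=False)
class Entry:
    question: str
    origin: str
    parent: Optional["Entry"] = field(default=None, repr=False)
    children: List["Entry"] = field(default_factory=list, repr=False)

    def __post_init__(self):
        if self.origin not in ORIGINS:
            raise ValueError(f"unknown origin: {self.origin}")
        # Register with the parent so the tree can be walked top-down
        if self.parent is not None:
            self.parent.children.append(self)

    def root(self):
        """Walk parent links back to the entry this one ultimately came from."""
        node = self
        while node.parent is not None:
            node = node.parent
        return node

def render(entry, depth=0):
    """Return the subtree as indented lines, one per entry."""
    lines = ["  " * depth + f'"{entry.question}" ({entry.origin})']
    for child in entry.children:
        lines.extend(render(child, depth + 1))
    return lines
```

Storing the parent link on each entry makes it cheap to answer both questions from the description above: where an entry came from (`root`) and what it evolved into (`children`).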

◇

Export For Training

Export clean datasets in formats ready for fine-tuning.

export formats

JSONL

OpenAI format

CSV

Generic tabular

Parquet

Coming Soon

Sample JSONL output:

{"messages": [

{"role": "user", "content": "Top customers?"},

{"role": "assistant", "content": "SELECT..."}

]}
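For reference, records in the shape of the sample above can be produced with a few lines of Python. The `to_jsonl` helper is a hypothetical sketch, not part of the product; it assumes each pair is a (question, SQL) tuple and writes one JSON object per line, as JSONL requires.

```python
import json

def to_jsonl(pairs):
    """Serialize (question, sql) pairs as chat-style JSONL, one record per line."""
    lines = []
    for question, sql in pairs:
        record = {
            "messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": sql},
            ]
        }
        # json.dumps escapes embedded newlines, so each record stays on one line
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    print(to_jsonl([("Top customers?", "SELECT name FROM customers")]), end="")
```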

READY?

$ curl -fsSL https://limerence.sh/install.sh | bash

Request a demo