DATASETS

BETA

Build training data from your own schema.

Generate From Schema

Turn your schema into question-SQL pairs. No manual labeling. Choose complexity levels and generate realistic user personas.

schema → questions
INPUT SCHEMA
customers
id, name, email
created_at, status
orders
id, customer_id
total, status
GENERATED PAIR
"Who are our top customers by order volume?"
SELECT name, COUNT(*)
FROM customers c
JOIN orders o ON...
GROUP BY name
Complexity:
Count: 50
Complexity Levels
  • Low - Simple SELECTs, basic filters
  • Medium - JOINs, GROUP BY, aggregations
  • Hard - Subqueries, HAVING, complex conditions
  • Window - Window functions, CTEs, analytical queries
PERSONAS - Generate realistic user personas to create diverse questions, for example "Sales Manager", "Data Analyst", or "Customer Support Lead".
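To make the four complexity levels concrete, here is a minimal sketch that sorts a query into a level using keyword heuristics. The function name `classify_complexity` and the rules themselves are assumptions for illustration; they are not the product's actual classifier.

```python
import re

def classify_complexity(sql: str) -> str:
    """Guess a complexity level from SQL keywords (hypothetical heuristic)."""
    s = sql.upper()
    # Window: window functions (OVER (...)) or a leading CTE (WITH ...)
    if re.search(r"\bOVER\s*\(", s) or re.match(r"\s*WITH\b", s):
        return "Window"
    # Hard: HAVING, or more than one SELECT (treated as a subquery)
    if re.search(r"\bHAVING\b", s) or len(re.findall(r"\bSELECT\b", s)) > 1:
        return "Hard"
    # Medium: JOINs, GROUP BY, or an aggregate function call
    if re.search(r"\b(JOIN|GROUP\s+BY)\b", s) or re.search(
        r"\b(COUNT|SUM|AVG|MIN|MAX)\s*\(", s
    ):
        return "Medium"
    # Low: plain SELECTs with basic filters
    return "Low"
```

Keyword matching like this is easy to fool (a keyword inside a string literal, for instance), so treat it as a rough labeling aid rather than a parser.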

Import From SQL

Already have queries? Import existing SQL files or paste queries directly. AI generates the natural language questions.

sql → questions
Paste your SQL:
SELECT product_name, SUM(quantity)
FROM order_items
GROUP BY product_name
ORDER BY SUM(quantity) DESC
LIMIT 10;
Generated question:
"What are the top 10 products by quantity sold?"

Import From Chat History

Extract question-SQL pairs from real conversations. Auto-deduplicate as you import.

history → dataset
Select sessions to import:
Sales Agent (23 sessions)
Revenue analysis - 47 messages - 12 queries
Customer segmentation - 31 messages - 8 queries
Product performance - 22 messages - 5 queries
Support Agent (15 sessions)
Ticket analysis - 18 messages - 4 queries
Response time report - 25 messages - 6 queries
  • Group sessions by agent
  • See message and query counts
  • Bulk select multiple sessions
  • Auto-dedupe by question
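As a rough illustration of extracting pairs from a session and deduping by question, the sketch below pairs each user message with the next assistant message that carries SQL. The message shape (`role`, `content`, and an optional `sql` field) and the name `extract_pairs` are assumptions, not the product's export format.

```python
def extract_pairs(messages):
    """Collect question-SQL pairs from a chat session, skipping repeat questions."""
    pairs, seen, pending = [], set(), None
    for msg in messages:
        if msg["role"] == "user":
            # Remember the latest user question until SQL shows up
            pending = msg["content"].strip()
        elif msg["role"] == "assistant" and msg.get("sql") and pending:
            # Dedupe key: lowercase with whitespace collapsed
            key = " ".join(pending.lower().split())
            if key not in seen:
                seen.add(key)
                pairs.append({"question": pending, "sql": msg["sql"]})
            pending = None
    return pairs
```

Messages without SQL (greetings, clarifications) simply never produce a pair, which is why a session's query count can be much lower than its message count.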

Augment & Evolve

Expand your dataset automatically. Add complexity (depth) or create variations (breadth).

evolvers
DEPTH (Evolve)
Add complexity with 5 techniques:
Add filters - WHERE clauses, date ranges
Add joins - Multi-table queries
Add aggregations - GROUP BY, COUNT, SUM, AVG
Add reasoning - Multi-step calculations
Hypothetical - "What if" scenarios
"Top customers" → "Top customers by revenue this quarter where status is active, grouped by region with YoY growth"
BREADTH (Spread)
Generate variations with synonyms and rephrasing
"Top customers" → "Best customers"
→ "Highest spending customers"
→ "Leading customers"
→ "Top performing accounts"
Count per entry: 5
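One plausible way to drive the two evolvers is with instruction templates handed to a language model. The sketch below only builds the prompt text. The template wording and the names `DEPTH_TECHNIQUES`, `depth_prompt`, and `breadth_prompt` are hypothetical, and no model call is made.

```python
# Hypothetical instruction for each of the five depth techniques
DEPTH_TECHNIQUES = {
    "add_filters": "Add a WHERE-style condition or a date range.",
    "add_joins": "Require data from more than one table.",
    "add_aggregations": "Introduce grouping with COUNT, SUM, or AVG.",
    "add_reasoning": "Require a multi-step calculation.",
    "hypothetical": 'Rephrase it as a "what if" scenario.',
}

def depth_prompt(question: str, technique: str) -> str:
    """Build a prompt that asks for a more complex version of the question."""
    instruction = DEPTH_TECHNIQUES[technique]
    return f"Make this question more complex. {instruction}\nQuestion: {question}"

def breadth_prompt(question: str, count: int = 5) -> str:
    """Build a prompt that asks for `count` rephrasings with the same meaning."""
    return (
        f"Write {count} rephrasings of this question using synonyms, "
        f"keeping the meaning unchanged.\nQuestion: {question}"
    )
```

Depth changes what the question asks, so the paired SQL has to be regenerated. Breadth keeps the meaning, so the original SQL can usually be reused.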

Validate & Dedupe

Ensure quality before training. Validate SQL, remove duplicates, and track entry lineage.

quality checks
VALIDATION
SQL Validation - EXPLAIN-based syntax check
Execution Test - Run queries to verify results
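An EXPLAIN-based check can be sketched with Python's built-in `sqlite3` module, which stands in here for whatever database the dataset targets. The helper name `validate_sql` and the in-memory table are assumptions for illustration.

```python
import sqlite3

def validate_sql(conn, sql):
    """Return (ok, error). EXPLAIN compiles the statement without running the query."""
    try:
        conn.execute("EXPLAIN " + sql)
        return True, None
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return False, str(exc)

# Illustrative schema; table and column names follow the sample query on this page
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_items (product_name TEXT, quantity INTEGER)")

ok, error = validate_sql(
    conn,
    "SELECT product_name, SUM(quantity) FROM order_items GROUP BY product_name",
)
```

In SQLite, this catches unknown tables and columns as well as syntax errors, because the statement is compiled against the live schema. An execution test would go one step further and run the query.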
DEDUPLICATION MODES
Exact - Match on both question AND SQL
SQL-only - Match on SQL only (ignore question)
Question - Match on question only (ignore SQL)
Dataset: sales_queries_v2
Total entries: 847
Valid: 812
Duplicates removed: 23
Invalid: 12
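The three deduplication modes differ only in which fields form the comparison key. Here is a minimal sketch, assuming entries are dicts with `question` and `sql` keys. The normalization rules (case-insensitive questions, whitespace-insensitive SQL) are assumptions too.

```python
def dedupe(entries, mode="exact"):
    """Keep the first entry for each key; mode is 'exact', 'sql', or 'question'."""
    def norm_question(text):
        return " ".join(text.lower().split())

    def norm_sql(text):
        # Collapse whitespace only; case is kept so string literals stay distinct
        return " ".join(text.split())

    key_funcs = {
        "exact": lambda e: (norm_question(e["question"]), norm_sql(e["sql"])),
        "sql": lambda e: norm_sql(e["sql"]),
        "question": lambda e: norm_question(e["question"]),
    }
    key_of = key_funcs[mode]
    seen, kept = set(), []
    for entry in entries:
        key = key_of(entry)
        if key not in seen:
            seen.add(key)
            kept.append(entry)
    return kept
```

Exact mode is the most conservative. SQL-only mode also removes rephrased questions that share a query, which may be unwanted right after generating breadth variations.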
Entry Lineage

Track where each entry came from and what it evolved into.

entry tree
📄 "Top customers by revenue" (MANUAL)
├─ 📄 "Top customers by revenue this quarter" (AUGMENTED)
│  └─ 📄 "Top customers by revenue this quarter in CA" (AUGMENTED)
├─ 📄 "Best customers by revenue" (AUGMENTED)
└─ 📄 "Highest revenue customers" (AUGMENTED)
ORIGIN TRACKING - Every entry tagged with origin: MANUAL, FEEDBACK, IMPORT, AUGMENTED
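Lineage with origin tags maps naturally onto a small tree structure. The sketch below uses the four origin values listed above. The `Entry` class, its `derive` method, and the `render` helper are hypothetical names, not the product's data model.

```python
from dataclasses import dataclass, field
from enum import Enum

class Origin(Enum):
    MANUAL = "MANUAL"
    FEEDBACK = "FEEDBACK"
    IMPORT = "IMPORT"
    AUGMENTED = "AUGMENTED"

@dataclass
class Entry:
    question: str
    origin: Origin
    children: list["Entry"] = field(default_factory=list)

    def derive(self, question: str) -> "Entry":
        """Attach an augmented child so the parent link is never lost."""
        child = Entry(question, Origin.AUGMENTED)
        self.children.append(child)
        return child

def render(entry: Entry, depth: int = 0) -> list[str]:
    """Return one indented line per entry, parents before children."""
    lines = ["  " * depth + f'"{entry.question}" ({entry.origin.value})']
    for child in entry.children:
        lines.extend(render(child, depth + 1))
    return lines
```

Storing children on the parent makes it easy to answer "what did this entry evolve into". A parent reference on each child would answer "where did this come from" just as directly.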

Export For Training

Export clean datasets in formats ready for fine-tuning.

export formats
JSONL - OpenAI format
CSV - Generic tabular
Parquet - Coming Soon
Sample JSONL output:
{"messages": [
{"role": "user", "content": "Top customers?"},
{"role": "assistant", "content": "SELECT..."}
]}
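For reference, records in the shape of the sample above can be produced with a few lines of Python. This is a sketch of the output format only. The function name `to_jsonl` and the input dict keys (`question`, `sql`) are assumptions.

```python
import json

def to_jsonl(pairs):
    """Serialize question-SQL pairs as chat-style JSONL, one record per line."""
    lines = []
    for pair in pairs:
        record = {
            "messages": [
                {"role": "user", "content": pair["question"]},
                {"role": "assistant", "content": pair["sql"]},
            ]
        }
        # ensure_ascii=False keeps non-ASCII question text readable
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)
```

Each record sits on a single line in the real file. The sample above is wrapped across lines only for readability.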
READY?
$ curl -fsSL https://limerence.sh/install.sh | bash