DATASETS
BETA
Build training data from your own schema.
◇
Generate From Schema
Turn your schema into question-SQL pairs. No manual labeling. Choose complexity levels and generate realistic user personas.
schema → questions
INPUT SCHEMA
customers
id, name, email
created_at, status
orders
id, customer_id
total, status
GENERATED PAIR
"Who are our top customers by order volume?"
SELECT name, COUNT(*)
FROM customers c
JOIN orders o ON...
GROUP BY name
Complexity:
Count: 50
Complexity Levels
- Low - Simple SELECTs, basic filters
- Medium - JOINs, GROUP BY, aggregations
- Hard - Subqueries, HAVING, complex conditions
- Window - Window functions, CTEs, analytical queries
PERSONAS - Generate realistic user personas to create diverse questions, for example "Sales Manager", "Data Analyst", or "Customer Support Lead".
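As a rough illustration of how the four complexity levels differ, the sketch below sorts a query into a level by looking for the SQL features listed above. The function name `classify_complexity` and the keyword rules are assumptions made for this example; they are not the product's actual classifier.

```python
import re

def classify_complexity(sql: str) -> str:
    """Hypothetical keyword heuristic mapping a query to a complexity level."""
    s = sql.upper()
    # Window: window functions (OVER (...)) or a leading CTE (WITH ...)
    if re.search(r"\bOVER\s*\(", s) or re.match(r"\s*WITH\b", s):
        return "window"
    # Hard: HAVING clauses or subqueries
    if re.search(r"\bHAVING\b", s) or re.search(r"\(\s*SELECT\b", s):
        return "hard"
    # Medium: JOINs, GROUP BY, or aggregate functions
    if re.search(r"\b(JOIN|GROUP\s+BY)\b", s) or re.search(r"\b(COUNT|SUM|AVG|MIN|MAX)\s*\(", s):
        return "medium"
    # Low: plain SELECTs with basic filters
    return "low"
```

A keyword check like this can misjudge keywords that appear inside string literals, so treat it as a way to read the levels rather than a parser.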
◇
Import From SQL
Already have queries? Import existing SQL files or paste queries directly. AI generates the natural language questions.
sql → questions
Paste your SQL:
SELECT product_name, SUM(quantity)
FROM order_items
GROUP BY product_name
ORDER BY SUM(quantity) DESC
LIMIT 10;
Generated question:
"What are the top 10 products by quantity sold?"
◇
Import From Chat History
Extract question-SQL pairs from real conversations. Auto-deduplicate as you import.
history → dataset
Select sessions to import:
Sales Agent (23 sessions)
☑ Revenue analysis - 47 messages - 12 queries
☑ Customer segmentation - 31 messages - 8 queries
☐ Product performance - 22 messages - 5 queries
Support Agent (15 sessions)
☐ Ticket analysis - 18 messages - 4 queries
☐ Response time report - 25 messages - 6 queries
- Group sessions by agent
- See message and query counts
- Bulk select multiple sessions
- Auto-dedupe by question
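One way to picture the extraction step: walk a session's messages, pair each user question with the next assistant reply that contains a query, and skip questions already seen. The message shape (`role`, `content`, and an optional `sql` field) and the name `extract_pairs` are assumptions for this sketch, not the product's real data model.

```python
def extract_pairs(messages):
    """Pair each user question with the next assistant message that carries SQL.

    Assumes each message is a dict with "role" and "content" keys, and that
    assistant messages may include a "sql" key (hypothetical format).
    """
    pairs = []
    seen_questions = set()
    pending_question = None
    for message in messages:
        if message["role"] == "user":
            pending_question = message["content"].strip()
        elif message["role"] == "assistant" and message.get("sql") and pending_question:
            # Dedupe on a case- and whitespace-insensitive form of the question
            key = " ".join(pending_question.lower().split())
            if key not in seen_questions:
                seen_questions.add(key)
                pairs.append({"question": pending_question, "sql": message["sql"]})
            pending_question = None
    return pairs
```

This also shows why a session's query count is lower than its message count: only replies that include SQL produce a pair.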
◇
Augment & Evolve
Expand your dataset automatically. Add complexity (depth) or create variations (breadth).
evolvers
DEPTH (Evolve)
Add complexity with 5 techniques:
☑ Add filters - WHERE clauses, date ranges
☑ Add joins - Multi-table queries
☑ Add aggregations - GROUP BY, COUNT, SUM, AVG
☑ Add reasoning - Multi-step calculations
☑ Hypothetical - "What if" scenarios
"Top customers" → "Top customers by revenue this quarter where status is active, grouped by region with YoY growth"
BREADTH (Spread)
Generate variations with synonyms and rephrasing
"Top customers" → "Best customers"
→ "Highest spending customers"
→ "Leading customers"
→ "Top performing accounts"
Count per entry: 5
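To make the breadth idea concrete, here is a minimal sketch that produces variations by swapping single words for synonyms, capped at a count per entry. The `SYNONYMS` table and the `spread` function are hypothetical stand-ins; the product's own rephrasing is not documented here and is likely richer than a lookup table.

```python
# Hypothetical synonym table, used only for this illustration
SYNONYMS = {
    "top": ["best", "leading"],
    "customers": ["accounts", "clients"],
}

def spread(question: str, count: int = 5) -> list:
    """Return up to `count` variations, each replacing one word with a synonym."""
    words = question.split()
    variants = []
    for i, word in enumerate(words):
        for alt in SYNONYMS.get(word.lower(), []):
            # Keep the capitalization of the word being replaced
            if word[:1].isupper():
                alt = alt.capitalize()
            candidate = " ".join(words[:i] + [alt] + words[i + 1:])
            if candidate not in variants:
                variants.append(candidate)
    return variants[:count]
```

Every variation keeps the original entry's SQL, since only the wording of the question changes.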
◇
Validate & Dedupe
Ensure quality before training. Validate SQL, remove duplicates, and track entry lineage.
quality checks
VALIDATION
☑ SQL Validation - EXPLAIN-based syntax check
☑ Execution Test - Run queries to verify results
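An EXPLAIN-based check asks the database to plan a query without running it, so syntax errors and unknown tables or columns surface early. The sketch below uses Python's built-in `sqlite3` as a stand-in engine; the column types in `SCHEMA` and the helper names are assumptions, and your own database may report errors differently.

```python
import sqlite3

# Tables from the sample schema above; column types are assumed
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT,
                        created_at TEXT, status TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                     total REAL, status TEXT);
"""

def make_connection():
    """Create an in-memory database holding the empty sample tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

def validate_sql(conn, sql):
    """Return (True, None) if a single statement compiles, else (False, error)."""
    try:
        # EXPLAIN compiles the statement but does not execute it
        conn.execute("EXPLAIN " + sql.strip().rstrip(";"))
        return True, None
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return False, str(exc)
```

This covers only the validation step; an execution test would run the query against real or sample data and inspect the rows.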
DEDUPLICATION MODES
○ Exact - Match on both question AND SQL
● SQL-only - Match on SQL only (ignore question)
○ Question - Match on question only (ignore SQL)
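The three modes differ only in which fields form the comparison key. A minimal sketch, assuming entries are dicts with `question` and `sql` keys and that light normalization (whitespace, question case, trailing semicolon) is acceptable; the names and normalization rules are illustrative, not the product's documented behavior.

```python
def _norm_question(question):
    # Case- and whitespace-insensitive comparison for questions
    return " ".join(question.lower().split())

def _norm_sql(sql):
    # Collapse whitespace and drop a trailing semicolon; case is preserved
    # so that string literals are not altered
    return " ".join(sql.split()).rstrip("; ")

KEY_FUNCS = {
    "exact": lambda e: (_norm_question(e["question"]), _norm_sql(e["sql"])),
    "sql": lambda e: _norm_sql(e["sql"]),
    "question": lambda e: _norm_question(e["question"]),
}

def dedupe(entries, mode="exact"):
    """Keep the first entry for each key; `mode` selects the key fields."""
    key_func = KEY_FUNCS[mode]
    seen = set()
    kept = []
    for entry in entries:
        key = key_func(entry)
        if key not in seen:
            seen.add(key)
            kept.append(entry)
    return kept
```

SQL-only mode is the most aggressive of the three for augmented data, because breadth variations share one query and collapse to a single entry.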
Dataset: sales_queries_v2
Total entries: 847
Valid: 812
Duplicates removed: 23
Invalid: 12
Entry Lineage
Track where each entry came from and what it evolved into.
entry tree
📄 "Top customers by revenue" (MANUAL)
├─ 📄 "Top customers by revenue this quarter" (AUGMENTED)
│  └─ 📄 "Top customers by revenue this quarter in CA" (AUGMENTED)
├─ 📄 "Best customers by revenue" (AUGMENTED)
└─ 📄 "Highest revenue customers" (AUGMENTED)
ORIGIN TRACKING - Every entry tagged with origin: MANUAL, FEEDBACK, IMPORT, AUGMENTED
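Lineage of this kind can be modeled by giving each entry an origin tag and a pointer to the entry it evolved from. The sketch below assumes that design; the `Entry` fields and the `lineage` helper are hypothetical names, and it assumes parent links never form a cycle.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

ORIGINS = ("MANUAL", "FEEDBACK", "IMPORT", "AUGMENTED")

@dataclass
class Entry:
    entry_id: int
    question: str
    origin: str
    parent_id: Optional[int] = None  # None marks a root entry

    def __post_init__(self):
        if self.origin not in ORIGINS:
            raise ValueError(f"unknown origin: {self.origin}")

def lineage(entries: Dict[int, Entry], entry_id: int) -> List[str]:
    """Return the questions from the root entry down to `entry_id`."""
    chain = []
    current = entries.get(entry_id)
    while current is not None:
        chain.append(current.question)
        if current.parent_id is None:
            break
        current = entries.get(current.parent_id)
    return list(reversed(chain))

# Part of the entry tree above, expressed as parent links
entries = {
    1: Entry(1, "Top customers by revenue", "MANUAL"),
    2: Entry(2, "Top customers by revenue this quarter", "AUGMENTED", parent_id=1),
    3: Entry(3, "Top customers by revenue this quarter in CA", "AUGMENTED", parent_id=2),
}
```

Calling `lineage(entries, 3)` walks from the "in CA" entry back to the manual root, which is the path the tree view displays.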
◇
Export For Training
Export clean datasets in formats ready for fine-tuning.
export formats
JSONL - OpenAI format
CSV - Generic tabular
Parquet - Coming Soon
Sample JSONL output:
{"messages": [
{"role": "user", "content": "Top customers?"},
{"role": "assistant", "content": "SELECT..."}
]}
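In a JSONL file each record sits on a single line; the sample above is wrapped for readability. A minimal sketch of writing question-SQL pairs in that chat-style shape, using only the standard library (the function name `to_jsonl` and the pair format are assumptions for illustration):

```python
import json

def to_jsonl(pairs):
    """Serialize (question, sql) pairs as one chat-format JSON object per line."""
    lines = []
    for question, sql in pairs:
        record = {
            "messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": sql},
            ]
        }
        # json.dumps emits no raw newlines, so each record stays on one line
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)
```

Writing the returned string to a `.jsonl` file gives one training example per line, and newlines inside a query are escaped as `\n` rather than breaking the record.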