EVALS
Compare models. Measure accuracy.
◇
Compare Models
Available Now
See how different models respond to the same question. Find the right accuracy-cost tradeoff for your use case.
model comparison
Question: "What are the top 5 customers by revenue?"
OpenAI
SELECT name, SUM(amount) as rev FROM customers c JOIN orders o ON... GROUP BY c.id ORDER BY rev DESC LIMIT 5;
✓ 0.8s
Anthropic
SELECT c.name, SUM(o.amount) AS total_revenue FROM customers c INNER JOIN orders... GROUP BY c.id, c.name ORDER BY total... LIMIT 5;
✓ 1.2s
Groq
SELECT name, total FROM ( SELECT c.name, SUM(o.amount)... ) subquery LIMIT 5;
✓ 0.6s
Gemini
SELECT name, revenue_sum FROM (...) ORDER BY ...
✗ Error: missing col
How It Works
- Select 2-4 models from your configured providers
- Ask any question - all models respond simultaneously
- Compare SQL quality, response time, and correctness
- Multi-turn conversations - follow up and compare again
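As a rough illustration of the "all models respond simultaneously" step, the sketch below sends one question to several models in parallel and records each response time. The `compare_models` function and the stub model callables are hypothetical stand-ins, not the product's actual API; real provider clients would replace the stubs.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def compare_models(question, models):
    """Send one question to every model in parallel and time each reply.

    `models` maps a display name to a callable that takes the question and
    returns SQL text. The callables are hypothetical stand-ins for real
    provider clients.
    """
    def run_one(name):
        start = time.perf_counter()
        try:
            sql, error = models[name](question), None
        except Exception as exc:  # one failing model must not sink the run
            sql, error = None, str(exc)
        return {"model": name, "sql": sql, "error": error,
                "seconds": time.perf_counter() - start}

    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        # Iterating a dict yields its keys, so results keep insertion order.
        return list(pool.map(run_one, models))


# Hypothetical stub models, for illustration only.
def stub_ok(question):
    return "SELECT 1;"


def stub_broken(question):
    raise ValueError("missing col")


results = compare_models(
    "What are the top 5 customers by revenue?",
    {"model_a": stub_ok, "model_b": stub_broken},
)
```

Each result carries either the generated SQL or an error message, so a failing model shows up as one failed row instead of aborting the comparison.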
Use Cases
- Find the fastest model for simple queries
- Identify which model handles complex JOINs best
- Test before switching providers
- Validate responses across different architectures
◇
Batch Evaluation
Coming Soon
Run your curated datasets against agents automatically. Measure accuracy at scale.
eval run
Dataset: sales_queries_v2 (847 entries)
Agent: Sales Analytics Bot
Models: OpenAI, Anthropic, Groq
Progress: 423/847
Results so far:
✓ Exact match: 312 (73.8%)
~ Semantic match: 89 (21.0%)
✗ Failed: 22 (5.2%)
Planned Features
- Run full dataset or random sample
- Compare multiple models in a single run
- Track progress in real-time
- Drill into individual failures
- Export results as CSV/JSON
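Since this feature is not yet released, the following is only a sketch of how sampling, tallying, and export might fit together. All function names are hypothetical; the counts come from the mock run shown above (312 exact, 89 semantic, 22 failed out of 423 processed).

```python
import csv
import io
import json
import random
from collections import Counter


def sample_entries(entries, k, seed=0):
    """Return a reproducible random sample of k dataset entries."""
    return random.Random(seed).sample(entries, k)


def summarize(outcomes):
    """Count outcomes and express each as a percentage of the total."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {name: {"count": n, "percent": round(100 * n / total, 1)}
            for name, n in counts.items()}


def to_csv(summary):
    """Render the summary as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["outcome", "count", "percent"])
    for name, row in summary.items():
        writer.writerow([name, row["count"], row["percent"]])
    return buf.getvalue()


# Counts from the mock run: 312 exact, 89 semantic, 22 failed.
outcomes = ["exact"] * 312 + ["semantic"] * 89 + ["failed"] * 22
summary = summarize(outcomes)
summary_json = json.dumps(summary, indent=2)
summary_csv = to_csv(summary)
```

Seeding the sampler keeps a random-sample run repeatable, which matters when comparing accuracy between two agent versions on the same subset.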
Match Types
- Exact - SQL output is character-for-character identical
- Semantic - Different SQL syntax, but same query results
- Failed - Wrong results, syntax error, or timeout
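One way the three match types could be told apart is to compare SQL text first and query results second. The sketch below does this against an in-memory SQLite database; the `classify` function, the `sales` table, and the choice to ignore row order are assumptions made for illustration, and timeouts are not handled.

```python
import sqlite3


def classify(expected_sql, actual_sql, conn):
    """Return 'exact', 'semantic', or 'failed' for one generated query."""
    if actual_sql == expected_sql:
        return "exact"
    try:
        expected_rows = conn.execute(expected_sql).fetchall()
        actual_rows = conn.execute(actual_sql).fetchall()
    except sqlite3.Error:
        return "failed"  # syntax error, missing column, and similar
    # Assumption: row order is ignored when comparing results.
    same = sorted(expected_rows, key=repr) == sorted(actual_rows, key=repr)
    return "semantic" if same else "failed"


# Tiny illustrative dataset (hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10.0), ("west", 5.0), ("east", 2.5)])

expected = "SELECT region, SUM(amount) FROM sales GROUP BY region"
aliased = ("SELECT s.region, SUM(s.amount) AS total "
           "FROM sales s GROUP BY s.region")
wrong_column = "SELECT region, revenue_sum FROM sales"

labels = [classify(expected, sql, conn)
          for sql in (expected, aliased, wrong_column)]
```

Ignoring row order suits aggregate queries like this one; a question that asks for a ranking would need an order-sensitive comparison instead.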
◇
Failure Analysis
Coming Soon
Understand why queries fail. Detect patterns. Improve your agent with one click.
failure analysis
Question: "Revenue by region last fiscal year"
Expected:
SELECT region, SUM(amount) FROM sales
WHERE date >= '2024-04-01' GROUP BY region
Got:
SELECT region, SUM(amount) FROM sales
WHERE YEAR(date) = 2024 GROUP BY region
Issue: Fiscal year not understood
Common failure patterns:
- Fiscal year terminology (8 failures)
- Custom status mapping (5 failures)
- Multi-table joins (4 failures)
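To make the fiscal-year failure concrete: the generated query filtered on the calendar year, while the expected query starts on April 1. The sketch below computes fiscal-year boundaries under the assumption that the fiscal year begins on the first day of a given month (April here, inferred from the `'2024-04-01'` filter). The `fiscal_year_bounds` helper is hypothetical, and the exclusive upper bound is an addition of this sketch, since the expected query above shows only a lower bound.

```python
from datetime import date


def fiscal_year_bounds(start_year, start_month=4):
    """Return (start, end) dates for the fiscal year beginning in start_year.

    Assumption: the fiscal year starts on the first day of `start_month`.
    `end` is exclusive: it is the first day of the following fiscal year.
    """
    start = date(start_year, start_month, 1)
    end = date(start_year + 1, start_month, 1)
    return start, end


start, end = fiscal_year_bounds(2024)
where_clause = (
    f"date >= '{start.isoformat()}' AND date < '{end.isoformat()}'"
)
```

A term definition of this kind is what a "fiscal year" teaching would need to convey to the agent.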
Planned Features
- Pattern detection - Group failures by root cause
- One-click fix - Add term/hint/example directly to agent
- Mark as correct - When the "wrong" answer was actually right
- Bulk operations - Fix similar failures together
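Pattern detection could, in its simplest form, amount to grouping failures by an assigned root cause and ranking the groups by size. The sketch below assumes each failure record already carries a `cause` label; how that label gets assigned is not described here, and the records are made up for illustration.

```python
from collections import defaultdict


def group_failures(failures):
    """Group failed questions by root cause, largest group first."""
    groups = defaultdict(list)
    for failure in failures:
        groups[failure["cause"]].append(failure["question"])
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)


# Hypothetical failure records.
failures = [
    {"question": "Revenue by region last fiscal year", "cause": "fiscal year"},
    {"question": "Open orders by status", "cause": "status mapping"},
    {"question": "Fiscal Q2 revenue", "cause": "fiscal year"},
    {"question": "Fiscal year-to-date sales", "cause": "fiscal year"},
    {"question": "Closed tickets per rep", "cause": "status mapping"},
    {"question": "Revenue per customer segment", "cause": "multi-table join"},
]

patterns = group_failures(failures)
top_cause, top_questions = patterns[0]
```

Grouping this way is also what makes bulk operations possible: one teaching can be applied to every question in a group.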
◇
Accuracy Tracking
Coming Soon
Track how accuracy improves over time. See the impact of each change.
accuracy trend
[Chart: accuracy (y-axis, 60% to 100%) by agent version (x-axis, v1.0 to v1.6), rising from the 70% range to the 90% range]
Latest improvement: +3.2% after adding fiscal year terms
CHANGELOG
v1.6 - Added fiscal year term definition (+3.2%)
v1.5 - Switched to OpenAI (+2.1%)
v1.4 - Added 50 new examples (+4.5%)
v1.3 - Fixed status mapping quirk (+1.8%)
Planned Features
- Version-by-version accuracy comparison
- Improvement attribution (which change helped most)
- Regression detection (when accuracy drops)
- Target accuracy goals with alerts
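A minimal sketch of attribution and regression detection, assuming accuracy is recorded once per version. The changelog deltas are the ones listed above; the `history` series is hypothetical and exists only to show a drop being flagged. Function names are illustrative.

```python
def accuracy_deltas(history):
    """Return (version, change) pairs for consecutive versions.

    `history` is a list of (version, accuracy_percent) in release order.
    """
    return [(version, round(acc - prev_acc, 1))
            for (_, prev_acc), (version, acc) in zip(history, history[1:])]


def regressions(history, tolerance=0.0):
    """Return versions whose accuracy fell by more than `tolerance` points."""
    return [(version, delta)
            for version, delta in accuracy_deltas(history)
            if delta < -tolerance]


# Deltas from the changelog above (percentage points).
changelog = {"v1.3": 1.8, "v1.4": 4.5, "v1.5": 2.1, "v1.6": 3.2}
best_version = max(changelog, key=changelog.get)

# Hypothetical accuracy history containing one drop.
history = [("v2.0", 90.0), ("v2.1", 91.5), ("v2.2", 89.0)]
drops = regressions(history)
```

On the changelog data, the largest single gain is v1.4 (adding 50 new examples). A nonzero `tolerance` would keep small run-to-run noise from triggering alerts.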
◇
The Improvement Loop
Evaluation isn't a one-time task. It's a continuous cycle.
improvement cycle
TEACH Agent
│
▼
RUN Evaluations → ANALYZE Failures → FIX (Add teachings) → TRACK Accuracy → back to RUN Evaluations
Integration with Datasets
- Import chat history to build test sets
- Augment datasets to increase coverage
- Deduplicate to keep tests clean
- Export for fine-tuning when accuracy plateaus
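Deduplication can be as simple as normalizing each question and keeping the first occurrence. The sketch below is one possible approach, not the product's method; the normalization rules (case, whitespace, trailing punctuation) are assumptions, and near-duplicates with different wording would need something stronger.

```python
import re


def normalize(question):
    """Lowercase, collapse whitespace, and strip trailing punctuation."""
    text = re.sub(r"\s+", " ", question.strip().lower())
    return text.rstrip("?.! ")


def deduplicate(questions):
    """Keep the first occurrence of each normalized question, in order."""
    seen = set()
    unique = []
    for question in questions:
        key = normalize(question)
        if key not in seen:
            seen.add(key)
            unique.append(question)
    return unique


# Hypothetical questions imported from chat history.
imported = [
    "Revenue by region?",
    "revenue  by region",
    "Top 5 customers by revenue",
]
test_set = deduplicate(imported)
```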