EVALS

Compare models. Measure accuracy.

Compare Models

Available Now

See how different models respond to the same question. Find the right accuracy-cost tradeoff for your use case.

model comparison
Question: "What are the top 5 customers by revenue?"
OpenAI
SELECT name,
  SUM(amount) as rev
FROM customers c
JOIN orders o ON...
GROUP BY c.id
ORDER BY rev DESC
LIMIT 5;
0.8s
Anthropic
SELECT c.name,
  SUM(o.amount)
  AS total_revenue
FROM customers c
INNER JOIN orders...
GROUP BY c.id, c.name
ORDER BY total...
LIMIT 5;
1.2s
Groq
SELECT name, total
FROM (
  SELECT c.name,
  SUM(o.amount)...
) subquery
LIMIT 5;
0.6s
Gemini
SELECT name,
  revenue_sum
FROM (...)
ORDER BY ...
✗ Error: missing col
How It Works
  • Select 2-4 models from your configured providers
  • Ask any question - all models respond simultaneously
  • Compare SQL quality, response time, and correctness
  • Multi-turn conversations - follow up and compare again
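The steps above can be sketched as a small fan-out: one question goes to every selected model at once, and each response is timed separately. This is a minimal illustration, not the product's implementation. `compare_models` and the two stand-in model functions are hypothetical names; real provider calls would replace them.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def compare_models(question, models):
    """Send one question to every model at once.

    `models` maps a display name to a callable that returns SQL text.
    Returns {name: (sql, error, seconds)}.
    """
    if not 2 <= len(models) <= 4:
        raise ValueError("select 2-4 models")

    def ask(item):
        name, generate_sql = item
        start = time.perf_counter()
        try:
            sql, error = generate_sql(question), None
        except Exception as exc:  # one failing model must not block the others
            sql, error = None, str(exc)
        return name, (sql, error, time.perf_counter() - start)

    # One worker per model, so all requests are in flight together.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        return dict(pool.map(ask, models.items()))


# Hypothetical stand-ins for real provider calls.
def fake_model_a(question):
    return "SELECT name FROM customers LIMIT 5;"


def fake_model_b(question):
    raise RuntimeError("missing col")


results = compare_models(
    "What are the top 5 customers by revenue?",
    {"model_a": fake_model_a, "model_b": fake_model_b},
)
for name, (sql, error, seconds) in results.items():
    print(name, sql or f"Error: {error}", f"{seconds:.3f}s")
```

Catching the exception per model keeps one provider's error from hiding the other responses.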
Use Cases
  • Find the fastest model for simple queries
  • Identify which model handles complex JOINs best
  • Test before switching providers
  • Validate responses across different architectures

Batch Evaluation

Coming Soon

Run your curated datasets against agents automatically. Measure accuracy at scale.

eval run
Dataset: sales_queries_v2 (847 entries)
Agent: Sales Analytics Bot
Models: OpenAI, Anthropic, Groq
Progress: 423/847
Results so far:
Exact match: 312 (73.8%)
~Semantic match: 89 (21.0%)
Failed: 22 (5.2%)
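The percentages in the run above are each tally divided by the 423 entries processed so far, not by the full 847. A short sketch of that arithmetic follows. `summarize` is a hypothetical helper name.

```python
def summarize(counts):
    """Return each outcome's share of the entries processed so far, in percent."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Tallies from the run above (423 of 847 entries processed).
shares = summarize({"exact": 312, "semantic": 89, "failed": 22})
print(shares)
```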
Planned Features
  • Run full dataset or random sample
  • Compare multiple models in a single run
  • Track progress in real-time
  • Drill into individual failures
  • Export results as CSV/JSON
Match Types
  • Exact - Generated SQL is character-for-character identical to the expected SQL
  • Semantic - Different SQL syntax, but the same query results
  • Failed - Wrong results, syntax error, or timeout
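One way to picture the three match types is a small classifier that runs both queries against the same database. This is a sketch under stated assumptions, not the product's matching logic: it uses an in-memory SQLite database, ignores row order when comparing results, and does not handle timeouts. `classify_match` and the sample `sales` rows are hypothetical.

```python
import sqlite3


def classify_match(conn, expected_sql, actual_sql):
    """Return 'exact', 'semantic', or 'failed' for one generated query."""
    if actual_sql == expected_sql:
        return "exact"
    try:
        expected_rows = conn.execute(expected_sql).fetchall()
        actual_rows = conn.execute(actual_sql).fetchall()
    except sqlite3.Error:
        return "failed"  # syntax error, unknown column, and similar
    # Assumption: two result sets match if they hold the same rows in any order.
    if sorted(expected_rows, key=repr) == sorted(actual_rows, key=repr):
        return "semantic"
    return "failed"


# Hypothetical sample data for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('east', 10), ('west', 5), ('east', 2);
    """
)

expected = "SELECT region, SUM(amount) FROM sales GROUP BY region"
reworded = (
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY total"
)
bad_column = "SELECT region, SUM(amt) FROM sales GROUP BY region"

print(classify_match(conn, expected, expected))
print(classify_match(conn, expected, reworded))
print(classify_match(conn, expected, bad_column))
```

Comparing result rows rather than SQL text is what lets differently worded queries count as a semantic match.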

Failure Analysis

Coming Soon

Understand why queries fail. Detect patterns. Improve your agent with one click.

failure analysis
Question: "Revenue by region last fiscal year"
Expected:
SELECT region, SUM(amount) FROM sales
WHERE date >= '2024-04-01' GROUP BY region
Got:
SELECT region, SUM(amount) FROM sales
WHERE YEAR(date) = 2024 GROUP BY region
Issue: Fiscal year not understood
Common failure patterns:
  • Fiscal year terminology (8 failures)
  • Custom status mapping (5 failures)
  • Multi-table joins (4 failures)
Planned Features
  • Pattern detection - Group failures by root cause
  • One-click fix - Add term/hint/example directly to agent
  • Mark as correct - When the "wrong" answer was actually right
  • Bulk operations - Fix similar failures together
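Pattern detection, as described above, amounts to grouping failures by root cause and ranking the groups by size. The sketch below assumes each failure already carries a root-cause label. The records, the field names, and `group_by_root_cause` are hypothetical.

```python
from collections import defaultdict

# Hypothetical failure records; labels are assumed to be assigned upstream.
failures = [
    {"question": "Revenue by region last fiscal year", "root_cause": "Fiscal year terminology"},
    {"question": "Orders in fiscal Q1", "root_cause": "Fiscal year terminology"},
    {"question": "Fiscal year to date revenue", "root_cause": "Fiscal year terminology"},
    {"question": "How many open tickets?", "root_cause": "Custom status mapping"},
    {"question": "Churned accounts this month", "root_cause": "Custom status mapping"},
    {"question": "Revenue per rep per product", "root_cause": "Multi-table joins"},
]


def group_by_root_cause(failures):
    """Group failed questions by root cause, largest group first."""
    groups = defaultdict(list)
    for failure in failures:
        groups[failure["root_cause"]].append(failure["question"])
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)


patterns = group_by_root_cause(failures)
for cause, questions in patterns:
    print(f"{cause} ({len(questions)} failures)")
```

Ranking by group size puts the fix that clears the most failures at the top of the list.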

Accuracy Tracking

Coming Soon

Track how accuracy improves over time. See the impact of each change.

accuracy trend
[Chart: accuracy by version, v1.0 through v1.6; y-axis from 60% to 100%]
Latest improvement: +3.2% after adding fiscal year terms
CHANGELOG
v1.6 - Added fiscal year term definition (+3.2%)
v1.5 - Switched to OpenAI (+2.1%)
v1.4 - Added 50 new examples (+4.5%)
v1.3 - Fixed status mapping quirk (+1.8%)
Planned Features
  • Version-by-version accuracy comparison
  • Improvement attribution (which change helped most)
  • Regression detection (when accuracy drops)
  • Target accuracy goals with alerts
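Version comparison, attribution, and regression detection all reduce to per-version accuracy deltas. A minimal sketch follows. The accuracy values are made up for illustration and are not measurements from this page, and the function names are hypothetical.

```python
def version_deltas(history):
    """history: list of (version, accuracy_percent) in release order.

    Returns [(version, change_vs_previous_version), ...].
    """
    return [
        (cur_version, round(cur_acc - prev_acc, 1))
        for (_, prev_acc), (cur_version, cur_acc) in zip(history, history[1:])
    ]


def regressions(history):
    """Versions whose accuracy dropped compared with the previous version."""
    return [version for version, delta in version_deltas(history) if delta < 0]


def biggest_improvement(history):
    """The version with the largest accuracy gain, with its delta."""
    return max(version_deltas(history), key=lambda vd: vd[1])


# Hypothetical accuracy values, for illustration only.
history = [("v1.0", 70.0), ("v1.1", 74.5), ("v1.2", 73.0), ("v1.3", 78.2)]

print(version_deltas(history))
print(regressions(history))
print(biggest_improvement(history))
```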

The Improvement Loop

Evaluation isn't a one-time task. It's a continuous cycle.

improvement cycle
TEACH Agent      RUN Evaluations ──────▶ ANALYZE Failures
TRACK Accuracy ◀────── FIX Add teachings
Integration with Datasets
  • Import chat history to build test sets
  • Augment datasets to increase coverage
  • Deduplicate to keep tests clean
  • Export for fine-tuning when accuracy plateaus
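Deduplication can be as simple as comparing questions after normalizing case, punctuation, and whitespace. The sketch below shows one possible approach, not necessarily how the product does it. `normalize` and `deduplicate` are hypothetical helpers.

```python
import re


def normalize(question):
    """Lowercase, drop punctuation, and collapse whitespace."""
    words = re.sub(r"[^\w\s]", "", question.lower()).split()
    return " ".join(words)


def deduplicate(questions):
    """Keep the first occurrence of each question, compared after normalizing."""
    seen, unique = set(), []
    for question in questions:
        key = normalize(question)
        if key not in seen:
            seen.add(key)
            unique.append(question)
    return unique


imported = [
    "Top 5 customers by revenue?",
    "top 5 customers  by revenue",
    "Revenue by region",
]
print(deduplicate(imported))
```

This only catches near-identical wording. Questions that are phrased differently but mean the same thing would need a semantic comparison instead.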
READY?
bash
$ curl -fsSL https://limerence.sh/install.sh | bash