EVALS

Compare models. Measure accuracy.

◇

Compare Models

Available Now

See how different models respond to the same question. Find the right accuracy-cost tradeoff for your use case.

model comparison

Question: "What are the top 5 customers by revenue?"

OpenAI

SELECT name,
  SUM(amount) as rev
FROM customers c
JOIN orders o ON...
GROUP BY c.id
ORDER BY rev DESC
LIMIT 5;

✓ 0.8s

Anthropic

SELECT c.name,
  SUM(o.amount)
  AS total_revenue
FROM customers c
INNER JOIN orders...
GROUP BY c.id, c.name
ORDER BY total...
LIMIT 5;

✓ 1.2s

Groq

SELECT name, total
FROM (
  SELECT c.name,
  SUM(o.amount)...
) subquery
LIMIT 5;

✓ 0.6s

Gemini

SELECT name,
  revenue_sum
FROM (...)
ORDER BY ...

✗ Error: missing col

How It Works

•Select 2-4 models from your configured providers
•Ask any question - all models respond simultaneously
•Compare SQL quality, response time, and correctness
•Multi-turn conversations - follow up and compare again
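
To make the "all models respond simultaneously" step concrete, here is a minimal sketch of a concurrent fan-out that times each model. It is an illustration under assumptions, not the product's implementation: `ModelFn`, `ModelResult`, `ask_one`, `compare`, and the two stand-in models are hypothetical names, and real provider clients would replace the stubs.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List, Optional

# Hypothetical signature: a model client takes a question and returns SQL text.
ModelFn = Callable[[str], Awaitable[str]]


@dataclass
class ModelResult:
    model: str
    sql: Optional[str]
    seconds: float
    error: Optional[str] = None


async def ask_one(name: str, fn: ModelFn, question: str) -> ModelResult:
    """Call one model and record its latency, capturing any error."""
    start = time.perf_counter()
    try:
        sql = await fn(question)
        return ModelResult(name, sql, time.perf_counter() - start)
    except Exception as exc:  # one failing model should not sink the others
        return ModelResult(name, None, time.perf_counter() - start, str(exc))


async def compare(models: Dict[str, ModelFn], question: str) -> List[ModelResult]:
    """Send the same question to every selected model at once."""
    if not 2 <= len(models) <= 4:
        raise ValueError("select 2-4 models")
    tasks = [ask_one(name, fn, question) for name, fn in models.items()]
    # gather() returns results in the same order the tasks were passed in
    return await asyncio.gather(*tasks)


# Stand-in models for illustration only.
async def fake_fast(question: str) -> str:
    await asyncio.sleep(0.01)
    return "SELECT 1;"


async def fake_broken(question: str) -> str:
    raise RuntimeError("missing col")


def run_demo() -> List[ModelResult]:
    models = {"fast": fake_fast, "broken": fake_broken}
    return asyncio.run(compare(models, "What are the top 5 customers by revenue?"))
```

Each result carries the SQL (or the error) plus the elapsed seconds, which is enough to render a side-by-side panel like the one above.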

Use Cases

•Find the fastest model for simple queries
•Identify which model handles complex JOINs best
•Test before switching providers
•Validate responses across different architectures

◇

Batch Evaluation

Coming Soon

Run your curated datasets against agents automatically. Measure accuracy at scale.

eval run

Dataset: sales_queries_v2 (847 entries)

Agent: Sales Analytics Bot

Models: OpenAI, Anthropic, Groq

Progress: 423/847

Results so far:

✓ Exact match: 312 (73.8%)

~ Semantic match: 89 (21.0%)

✗ Failed: 22 (5.2%)

Planned Features

•Run full dataset or random sample
•Compare multiple models in a single run
•Track progress in real-time
•Drill into individual failures
•Export results as CSV/JSON

Match Types

•Exact - Generated SQL is character-for-character identical to the expected SQL
•Semantic - Different SQL syntax, but same query results
•Failed - Wrong results, syntax error, or timeout
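
The three match types can be sketched as a small classifier. This is one possible reading, assuming the expected and generated SQL are both run against the same database; `classify` and `demo` are hypothetical names, SQLite stands in for whatever database the agent targets, and timeouts are not handled here.

```python
import sqlite3
from collections import Counter


def classify(conn, expected_sql, generated_sql):
    """Return 'exact', 'semantic', or 'failed' for one generated query."""
    if generated_sql == expected_sql:
        return "exact"
    try:
        expected_rows = conn.execute(expected_sql).fetchall()
        generated_rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return "failed"  # syntax error, missing column, and similar
    # Compare as multisets so row order does not matter. This is an
    # assumption: queries with ORDER BY may need an ordered comparison.
    if Counter(expected_rows) == Counter(generated_rows):
        return "semantic"
    return "failed"


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 10.0), ("west", 5.0), ("east", 2.5)],
    )
    expected = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return [
        classify(conn, expected, expected),
        classify(conn, expected,
                 "SELECT s.region, SUM(s.amount) AS total FROM sales s GROUP BY s.region"),
        classify(conn, expected, "SELECT region, SUM(amt) FROM sales GROUP BY region"),
    ]
```

In this sketch `demo()` returns `['exact', 'semantic', 'failed']`: the second query differs only in aliases, and the third references a column that does not exist.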

◇

Failure Analysis

Coming Soon

Understand why queries fail. Detect patterns. Improve your agent with one click.

failure analysis

Question: "Revenue by region last fiscal year"

Expected:

SELECT region, SUM(amount) FROM sales

WHERE date >= '2024-04-01' GROUP BY region

Got:

SELECT region, SUM(amount) FROM sales

WHERE YEAR(date) = 2024 GROUP BY region

Issue: Fiscal year not understood

Common failure patterns:

•Fiscal year terminology (8 failures)
•Custom status mapping (5 failures)
•Multi-table joins (4 failures)
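
The fiscal year mismatch in the example above comes down to a date range that does not follow the calendar year. A minimal sketch of that range, assuming a fiscal year that starts on April 1 (inferred from the `'2024-04-01'` bound in the expected query); the function names are hypothetical, and the exclusive upper bound is an addition of this sketch, since the expected query shows only the lower bound.

```python
from datetime import date


def fiscal_year_bounds(start_year: int, start_month: int = 4):
    """Half-open [start, end) range for a fiscal year beginning in start_month."""
    start = date(start_year, start_month, 1)
    end = date(start_year + 1, start_month, 1)
    return start, end


def fiscal_where_clause(start_year: int, column: str = "date") -> str:
    """Build a WHERE fragment covering one fiscal year (assumed April start)."""
    start, end = fiscal_year_bounds(start_year)
    return f"{column} >= '{start.isoformat()}' AND {column} < '{end.isoformat()}'"
```

`fiscal_where_clause(2024)` yields `date >= '2024-04-01' AND date < '2025-04-01'`, whereas `YEAR(date) = 2024` covers January through December 2024.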

Planned Features

•Pattern detection - Group failures by root cause
•One-click fix - Add term/hint/example directly to agent
•Mark as correct - When the "wrong" answer was actually right
•Bulk operations - Fix similar failures together
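
Grouping by root cause can be pictured as a simple bucketing step. The sketch below assumes each failure has already been labeled with a `pattern` string; how that label gets assigned is not described here, and `group_failures` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_failures(failures: List[dict]) -> List[Tuple[str, List[dict]]]:
    """Bucket failures by their 'pattern' label, largest bucket first."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for failure in failures:
        groups[failure["pattern"]].append(failure)
    # Sort by descending bucket size, then alphabetically for stable output
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))


# Illustrative records only, not real evaluation output.
SAMPLE = [
    {"question": "Revenue by region last fiscal year", "pattern": "fiscal year"},
    {"question": "Orders this fiscal quarter", "pattern": "fiscal year"},
    {"question": "Open tickets by owner", "pattern": "status mapping"},
]
```

With buckets like these, a bulk fix amounts to applying one term, hint, or example to every failure in the same group.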

◇

Accuracy Tracking

Coming Soon

Track how accuracy improves over time. See the impact of each change.

accuracy trend

[Chart: accuracy (y-axis, 60% to 100%) by agent version (x-axis, v1.0 to v1.6), climbing from the 70% band through 80% to the 90% band]

Latest improvement: +3.2% after adding fiscal year terms

CHANGELOG

v1.6 - Added fiscal year term definition (+3.2%)

v1.5 - Switched to OpenAI (+2.1%)

v1.4 - Added 50 new examples (+4.5%)

v1.3 - Fixed status mapping quirk (+1.8%)

Planned Features

•Version-by-version accuracy comparison
•Improvement attribution (which change helped most)
•Regression detection (when accuracy drops)
•Target accuracy goals with alerts
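
Improvement attribution and regression detection can both be read off per-version deltas. A minimal sketch using the four changelog entries above, assuming each figure is the accuracy change attributed to that version; the function names are hypothetical.

```python
# (version, change, accuracy delta) taken from the changelog above
CHANGELOG = [
    ("v1.3", "Fixed status mapping quirk", 1.8),
    ("v1.4", "Added 50 new examples", 4.5),
    ("v1.5", "Switched to OpenAI", 2.1),
    ("v1.6", "Added fiscal year term definition", 3.2),
]


def biggest_improvement(entries):
    """Attribution: the entry with the largest positive delta."""
    return max(entries, key=lambda entry: entry[2])


def regressions(entries):
    """Regression detection: any version whose delta is negative."""
    return [entry for entry in entries if entry[2] < 0]


def cumulative_gain(entries):
    """Total change across the listed versions, rounded to one decimal."""
    return round(sum(entry[2] for entry in entries), 1)
```

For the entries shown, the largest gain is v1.4 (+4.5), no version regressed, and the listed gains sum to 11.6.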

◇

The Improvement Loop

Evaluation isn't a one-time task. It's a continuous cycle.

improvement cycle

TEACH Agent
     │
     ▼
RUN Evaluations ──────▶ ANALYZE Failures
     ▲                         │
     │                         ▼
TRACK Accuracy  ◀────── FIX Add teachings

Integration with Datasets

•Import chat history to build test sets
•Augment datasets to increase coverage
•Deduplicate to keep tests clean
•Export for fine-tuning when accuracy plateaus
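
Deduplication can be as simple as comparing normalized questions. The sketch below is one hedged approach, assuming near-duplicates differ only in case, spacing, or trailing punctuation; `normalize` and `dedupe` are hypothetical names, and paraphrases would need a fuzzier comparison.

```python
import re
from typing import List


def normalize(question: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    collapsed = re.sub(r"\s+", " ", question.strip().lower())
    return collapsed.rstrip("?.! ")


def dedupe(questions: List[str]) -> List[str]:
    """Keep the first occurrence of each normalized question, in order."""
    seen = set()
    kept = []
    for question in questions:
        key = normalize(question)
        if key not in seen:
            seen.add(key)
            kept.append(question)
    return kept
```

For example, `dedupe(["Top 5 customers by revenue?", "top 5  customers by revenue", "Revenue by region"])` keeps the first and third entries.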

READY?

bash

$ curl -fsSL https://limerence.sh/install.sh | bash

Request a demo