Evals for Agentic AI: The Missing 48% of Your Production Stack
89% of organisations running AI agents in production have observability. Only 52% have evals.
That number comes from LangChain’s State of Agent Engineering survey, and it should bother you. It means nearly half the teams shipping agents to real users have no systematic way of knowing whether those agents are actually good. They have logs. They have traces. They can see what happened. But they cannot tell you whether what happened was correct.
At Jelifish, we build agentic systems on AWS for clients across financial services, logistics, and healthcare. We have shipped agents that route customer queries, generate compliance reports, and orchestrate multi-step data pipelines. And I can tell you from direct experience: the gap between “agent works in demo” and “agent works in production” is not about model selection or prompt engineering. It’s about evaluation.
Eval-Driven Development: TDD for Agents
Test-Driven Development changed how we write software. You write the test first, watch it fail, then make it pass. Evaluation-Driven Development (EDD) applies the same principle to agents, with one difference: outcomes are scored, not binary.
An agent that answers a customer question might be 0.7 relevant, 0.9 unbiased, and 0.4 complete. That granularity matters. Binary pass/fail hides the nuance that separates a useful agent from a liability.
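Those per-dimension scores typically get rolled up into a single number when you need a gate. A minimal sketch of a weighted roll-up — the dimension names and weights here are my own illustration, not a framework API:

```typescript
// Roll per-dimension scores (each 0-1) into one weighted composite.
// Dimensions and weights are illustrative, not a Mastra API.
type Scores = Record<string, number>;

function weightedScore(scores: Scores, weights: Scores): number {
  let total = 0;
  let weightSum = 0;
  for (const [dim, w] of Object.entries(weights)) {
    total += (scores[dim] ?? 0) * w; // missing dimension counts as 0
    weightSum += w;
  }
  return weightSum === 0 ? 0 : total / weightSum;
}

const scores = { relevance: 0.7, bias: 0.9, completeness: 0.4 };
const weights = { relevance: 0.5, bias: 0.3, completeness: 0.2 };
console.log(weightedScore(scores, weights)); // ≈ 0.70
```

Weighting lets you encode that bias matters more than completeness for a given endpoint, while the per-dimension scores stay available for debugging.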
The loop looks like this:
```mermaid
graph TD
    A[Define Dataset] --> B[Run Experiment]
    B --> C[Score with Evaluators]
    C --> D[Compare Against Baseline]
    D --> E{Meets Threshold?}
    E -- Yes --> F[CI Gate Passes]
    E -- No --> G[Iterate on Agent]
    G --> B
    F --> H[Deploy to Production]
    H --> I[Live Scoring + Sampling]
    I --> J[Detect Drift]
    J --> A
```

The feedback loop is the part most teams skip. Production data flows back into your dataset, surfacing failure modes you never anticipated. I have seen agents score 0.95 on hand-crafted test sets and 0.6 on real user inputs within the first week. Without live scoring, you would never know.
Mastra’s Scorer Pipeline
Mastra is the framework I reach for when building evaluated agents in TypeScript. At version 1.14.0, it has moved from a class-based “Evals” API to a functional pipeline-based “Scorers” API, and the new design is significantly better.
Here’s what surprised me about Mastra’s approach: it splits the scoring process into four discrete steps, each with a clear responsibility. This matters because the biggest problem with LLM-as-judge evaluation is inconsistency. Ask Claude to score a response on a 0-1 scale and you will get different numbers on consecutive runs. Mastra’s design insight is to use the LLM for what it is good at (extracting structured information) and then apply deterministic logic for the actual scoring.
A custom scorer follows a four-step pipeline: preprocess, analyze, generateScore, and generateReason.
```typescript
import { createScorer } from "@mastra/core/scores";

const qualityScorer = createScorer({
  name: "Quality Scorer",
  description: "Evaluates response quality",
})
  .preprocess(({ run }) => {
    return { wordCount: run.output.split(" ").length };
  })
  .analyze(({ run, results }) => {
    const hasSubstance = results.preprocessStepResult.wordCount > 10;
    return { hasSubstance };
  })
  .generateScore(({ results }) => {
    return results.analyzeStepResult.hasSubstance ? 1.0 : 0.0;
  })
  .generateReason(({ score, results }) => {
    const wordCount = results.preprocessStepResult.wordCount;
    return `Score: ${score}. Response has ${wordCount} words.`;
  });
```

The preprocess step extracts raw data. The analyze step can call an LLM to make qualitative judgements and return structured results. The generateScore step applies deterministic logic to those structured results. The generateReason step produces a human-readable explanation. This separation means your scores are reproducible even when the LLM's phrasing varies between runs.
Mastra also ships built-in scorers for common patterns. Answer relevancy, bias detection, faithfulness, hallucination, and toxicity are all available out of the box via @mastra/evals/scorers/llm.
Wiring Scorers to Agents
Attaching scorers to an agent is straightforward. You configure them with a sampling rate so they run asynchronously without blocking responses.
```typescript
import { Agent } from "@mastra/core/agent";
import { anthropic } from "@ai-sdk/anthropic";
import { createAnswerRelevancyScorer, createBiasScorer } from "@mastra/evals/scorers/llm";

export const customerSupportAgent = new Agent({
  name: "CustomerSupport",
  instructions: "You are a helpful customer support agent",
  model: anthropic("claude-sonnet-4-5-20250929"),
  scorers: {
    relevancy: {
      scorer: createAnswerRelevancyScorer({ model: anthropic("claude-sonnet-4-5-20250929") }),
      sampling: { type: "ratio", rate: 0.5 },
    },
    bias: {
      scorer: createBiasScorer({ model: anthropic("claude-sonnet-4-5-20250929") }),
      sampling: { type: "ratio", rate: 1 },
    },
  },
});
```

A sampling rate of 0.5 means half of all responses get scored. For bias, I run at 1.0 (every response) because the cost of a biased answer reaching a customer is high. For relevancy on a high-throughput endpoint, 0.1 or 0.2 is often enough to detect drift. Scores persist to the mastra_scorers table and are visible in Mastra Studio.
Datasets and Experiments
Scorers tell you how good a single response is. Experiments tell you how good an agent is across a representative set of inputs.
Mastra’s dataset system uses SCD-2 (Slowly Changing Dimension Type 2) versioning. Every item in a dataset is versioned, so when you update a test case, the old version is preserved. This is not just good data hygiene; it means you can re-run old experiments against the exact data they originally used.
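The mechanics of SCD-2 are worth a quick sketch. Conceptually, an update never overwrites a row; it closes out the current version and appends a new one. This is an illustrative model of the idea in plain TypeScript, not Mastra's actual schema:

```typescript
// Illustrative SCD-2 style versioning for dataset items -- not Mastra's
// actual schema, just the idea: updates close the current version and
// append a new one, so old experiments can still resolve the exact data
// they originally ran against.
interface DatasetItemVersion {
  itemId: string;
  version: number;
  input: string;
  groundTruth: string;
  validFrom: Date;
  validTo: Date | null; // null = current version
}

function updateItem(
  history: DatasetItemVersion[],
  itemId: string,
  changes: Partial<Pick<DatasetItemVersion, "input" | "groundTruth">>
): DatasetItemVersion[] {
  const now = new Date();
  const current = history.find((v) => v.itemId === itemId && v.validTo === null);
  if (!current) throw new Error(`no current version for ${itemId}`);
  current.validTo = now; // close out the old version...
  return [
    ...history,
    // ...and append the new one as version N+1.
    { ...current, ...changes, version: current.version + 1, validFrom: now, validTo: null },
  ];
}
```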
Running an experiment against a dataset looks like this:
```typescript
const dataset = await mastra.datasets.get({ id: 'translation-dataset-id' });

const summary = await dataset.startExperiment({
  name: 'sonnet-4.5-baseline',
  targetType: 'agent',
  targetId: 'translation-agent',
  scorers: ['accuracy', 'fluency'],
  maxConcurrency: 10,
  itemTimeout: 30_000,
  maxRetries: 2,
});

for (const item of summary.results) {
  for (const score of item.scores) {
    console.log(`${score.scorerName}: ${score.score} -- ${score.reason}`);
  }
}
```

Mastra Studio provides side-by-side experiment comparison, which is where the real value sits. You run sonnet-4.5-baseline, then swap to a different model or tweak the system prompt, run sonnet-4.5-revised-prompt, and compare scores across every test case. No spreadsheets. No squinting at log files.
CI/CD Eval Gates
Here’s the uncomfortable truth: evals that don’t block deployments are just dashboards. Dashboards get ignored.
I wire Mastra experiments into Vitest so they run as part of our CI pipeline. If scores drop below a threshold, the build fails. This is the single most effective practice I have adopted for agent quality.
```typescript
import { describe, it, expect } from 'vitest';
import { runEvals } from '@mastra/core/evals';
import { weatherAgent } from './weather-agent';
import { locationScorer } from '../scorers/location-scorer';

describe('Weather Agent Tests', () => {
  it('should correctly extract locations from queries', async () => {
    const result = await runEvals({
      data: [
        {
          input: 'weather in Berlin',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'DE' },
        },
      ],
      target: weatherAgent,
      scorers: [locationScorer],
    });
    expect(result.scores['location-accuracy']).toBe(1);
  });
});
```

Note: API examples in this post are based on Mastra 1.14.0. Import paths and method signatures may change between releases. Check the docs at mastra.ai for the latest signatures.
This runs in the same CI pipeline as your unit tests. The agent gets a real input, produces a real output, and the scorer evaluates it against ground truth. If someone changes the system prompt and location extraction breaks, the build fails before it reaches staging.
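The gating logic itself is simple enough to keep in plain TypeScript next to your tests. A sketch of a threshold check that fails the build when the mean score drops more than a tolerated margin below a recorded baseline — the names here are illustrative helpers, not a Mastra API:

```typescript
// Minimal CI gate helper: fail the build when the mean eval score drops
// more than `margin` below a recorded baseline. Illustrative helper
// functions, not a Mastra API.
function meanScore(scores: number[]): number {
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}

function passesGate(scores: number[], baseline: number, margin = 0.1): boolean {
  return meanScore(scores) >= baseline * (1 - margin);
}

// With a 0.9 baseline and 10% margin, the gate fails below 0.81:
console.log(passesGate([0.85, 0.9, 0.8], 0.9)); // mean 0.85 -> passes
console.log(passesGate([0.6, 0.7, 0.8], 0.9)); // mean 0.7 -> fails
```

Wrap the result in an `expect(...).toBe(true)` inside Vitest and a prompt regression becomes a red build, not a production incident.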
Live Scoring in Production
CI evals catch regressions before deployment. Live scoring catches drift after deployment.
Model providers update weights. User behaviour shifts. Data distributions change. An agent that scored 0.92 on relevancy last month might be at 0.78 today, and without live scoring, you will not know until a customer complains.
```mermaid
graph TD
    subgraph "AWS Production Architecture"
        A[API Gateway] --> B[Lambda: Agent Handler]
        B --> C[Amazon Bedrock]
        B --> D[Mastra Scorers - Async]
        D --> E[mastra_scorers Table]
        E --> F[CloudWatch Metrics]
        F --> G{Score Below Threshold?}
        G -- Yes --> H[CloudWatch Alarm]
        H --> I[SNS: On-Call Alert]
        G -- No --> J[Dashboard]
    end
    subgraph "CI/CD Pipeline"
        K[GitHub Push] --> L[Vitest Eval Suite]
        L --> M{Scores Pass?}
        M -- Yes --> N[CDK Deploy]
        M -- No --> O[Build Fails]
        N --> B
    end
```

At Jelifish, we run agents on Lambda with Bedrock as the model provider. Mastra scorers execute asynchronously after the response is returned to the user, so latency is unaffected. We push scorer results as CloudWatch custom metrics and set alarms on rolling averages. If the 1-hour average relevancy score drops below 0.8, someone gets paged.
The sampling rate is your cost lever. At 10,000 requests per day, scoring every response with an LLM-based scorer gets expensive. We typically sample at 0.1 for high-volume endpoints and 1.0 for high-risk ones (anything touching financial data or compliance).
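A quick back-of-envelope makes the lever concrete. Assuming a flat cost per judge call — the price below is a placeholder, not actual Bedrock pricing:

```typescript
// Back-of-envelope: daily LLM-judge cost as a function of sampling rate.
// The per-score cost is a placeholder assumption, not Bedrock pricing.
function dailyScoringCost(
  requestsPerDay: number,
  samplingRate: number,
  costPerScore: number // $ per scored response (judge prompt + completion)
): number {
  return requestsPerDay * samplingRate * costPerScore;
}

// At 10,000 requests/day and $0.01 per judge call:
console.log(dailyScoringCost(10_000, 1.0, 0.01)); // ≈ $100/day at full sampling
console.log(dailyScoringCost(10_000, 0.1, 0.01)); // ≈ $10/day at 0.1 sampling
```

The same arithmetic, run per endpoint, is how we justify 1.0 sampling on the compliance agent and 0.1 everywhere else.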
How Other Frameworks Handle Evals
Mastra is not the only framework thinking about this, but it is the furthest along in TypeScript. If your team is Python-first, look at Strands or LangSmith instead. Mastra’s value proposition is strongest when your agent code is already TypeScript and you want evals wired into the same codebase without switching languages.
Here is how the field compares.
| | Mastra | Strands | Bedrock AgentCore | LangSmith |
|---|---|---|---|---|
| Language | TypeScript | Python | Language-agnostic (API) | Python (TS SDK limited) |
| Eval approach | Functional scorer pipeline | Output + trace-based evaluators | 13 built-in evaluators | Annotation queues + custom evaluators |
| CI integration | Vitest / any JS test runner | pytest | API-driven | pytest / custom |
| Live scoring | Agent-level scorer sampling | OpenTelemetry traces | Online evaluation mode | Tracing + feedback |
Strands Agents SDK (AWS, Python) supports two categories of evaluator: output-based evaluators that score the final response, and trace-based evaluators that operate at TOOL_LEVEL or TRACE_LEVEL to inspect intermediate steps. It uses OpenTelemetry for instrumentation, which is a solid architectural choice. The experiment API is clean:
```shell
pip install strands-agents-evals
```

```python
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator, HelpfulnessEvaluator

def get_response(case: Case) -> str:
    agent = Agent(system_prompt="You are a helpful assistant.", callback_handler=None)
    response = agent(case.input)
    return str(response)

evaluator = OutputEvaluator(rubric="Evaluate accuracy, completeness, and clarity")
experiment = Experiment(cases=test_cases, evaluators=[evaluator])
reports = experiment.run_evaluations(get_response)
```

I like Strands' trace-level evaluation. Scoring the final output is not enough when your agent makes five tool calls to get there. A correct answer arrived at by calling the wrong tools in the wrong order is a time bomb.
Amazon Bedrock AgentCore Evaluations (currently in Preview) offers 13 built-in evaluators with online and on-demand modes. It is available in us-east-1, us-west-2, ap-southeast-2, and eu-central-1. If you are already running Bedrock Agents, this is the path of least resistance for getting evaluation coverage without adding another framework.
LangGraph / LangSmith has the most mature evaluation platform in the Python ecosystem. LangSmith’s annotation queues and dataset management are excellent. But if your stack is TypeScript on AWS, the integration story is weaker.
CrewAI is focused on multi-agent orchestration and has added evaluation hooks, but evals are not a first-class concern in the way they are in Mastra or Strands. For teams building complex multi-agent pipelines, I would pair CrewAI with an external evaluation framework rather than relying on its built-in capabilities.
Production Pitfalls: The Swiss Cheese Model
Anthropic’s eval guide identifies three grader types: code-based (fast, deterministic), model-based (flexible, inconsistent), and human (gold-standard, expensive). Effective evaluation systems layer all three. Think of it like the Swiss cheese model in aviation safety: each layer has holes, but the holes do not align.
```mermaid
graph TD
    subgraph "Three-Layer Evaluation"
        direction TB
        A["Layer 1: Code-Based Graders<br/>Regex, JSON schema, word count<br/>Fast, deterministic, narrow"]
        B["Layer 2: LLM-Based Scorers<br/>Relevancy, bias, faithfulness<br/>Flexible, probabilistic"]
        C["Layer 3: Human Review<br/>Annotation queues, spot checks<br/>Slow, expensive, definitive"]
    end
    A --> B --> C
    D[Agent Response] --> A
    C --> E[Production Confidence]
```

Here's what we've learned, mostly the hard way:
Do not evaluate only the final output. An agent that returns the right answer after calling the wrong API three times is going to fail unpredictably. Strands’ TRACE_LEVEL evaluation exists for this reason. In Mastra, build custom scorers that inspect tool call sequences, not just the final string.
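One way to do that, sketched here with a hypothetical trace shape (not Mastra's actual trace format), is to score properties of the call sequence rather than an exact order:

```typescript
// Illustrative tool-trace check: score properties of the call sequence
// (required tools used, nothing forbidden, bounded call count, no
// swallowed failures) with partial credit. The trace shape is a
// hypothetical example, not Mastra's actual trace format.
interface ToolCall {
  tool: string;
  ok: boolean;
}

function scoreToolTrace(
  calls: ToolCall[],
  required: string[],
  disallowed: string[] = [],
  maxCalls = 10
): number {
  const used = new Set(calls.map((c) => c.tool));
  const checks = [
    required.every((t) => used.has(t)), // every required tool was used
    !calls.some((c) => disallowed.includes(c.tool)), // nothing forbidden
    calls.length <= maxCalls, // no runaway loops
    calls.every((c) => c.ok), // no silently swallowed failures
  ];
  return checks.filter(Boolean).length / checks.length; // partial credit
}
```

Because this checks properties rather than a fixed call order, it stays robust when the agent finds a valid alternative path.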
Do not specify paths; specify outcomes. If your eval checks “did the agent call function X then function Y”, you are testing implementation, not behaviour. Agents find valid alternative paths. Score the outcome.
Do not grade your own tests. Using the same model to evaluate its own responses means your evaluator shares the same blind spots as your generator. Anthropic found that Opus 4.5 scored just 42% on CORE-Bench when run through a generic CORE-Agent scaffold. Switching to Claude Code as the scaffold lifted that to 78%. Manual review of grading inconsistencies and underspecified tasks then brought the effective score to 95%. The scaffold and evaluation harness mattered as much as the model. Use a different model family for evaluation, and combine LLM grading with deterministic checks.
Do not expect the last 15% to come cheap. ZenML’s data across 1,200 deployments shows that 80% quality happens quickly. Getting from 80% to 95% takes the majority of your development time. Budget accordingly.
Structure graders for partial credit. Binary pass/fail on agentic tasks is almost always wrong. An agent that gets 4 out of 5 fields correct is not the same as one that gets 0. Your scoring should reflect that.
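As a concrete example of partial credit, a deterministic field-level grader — the ground-truth shape below is hypothetical:

```typescript
// Partial-credit field grader: 4 of 5 fields correct scores 0.8, not 0.
// Deterministic and code-based -- no LLM involved. Field names are a
// hypothetical example.
function fieldAccuracy(
  expected: Record<string, unknown>,
  actual: Record<string, unknown>
): number {
  const fields = Object.keys(expected);
  if (fields.length === 0) return 1;
  const correct = fields.filter((f) => actual[f] === expected[f]).length;
  return correct / fields.length;
}

const truth = { city: "Berlin", country: "DE", unit: "celsius", days: 3, lang: "en" };
const output = { city: "Berlin", country: "DE", unit: "celsius", days: 5, lang: "en" };
console.log(fieldAccuracy(truth, output)); // 0.8 -- one field wrong, not total failure
```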
Getting Started
If you have agents in production with no evals, here is a practical starting point:
1. Pick your highest-risk agent. The one where a bad answer has real consequences. Start there, not with the easy one.
2. Build a dataset of 50 real inputs. Not synthetic. Pull from production logs. Include the weird edge cases your users actually send.
3. Write three scorers. One code-based (does the output parse as valid JSON, does it contain required fields), one LLM-based (relevancy or faithfulness), one domain-specific (does the extracted entity match ground truth).
4. Run your first experiment. Use Mastra's startExperiment to get baseline scores. You now have a number. That number will be lower than you expect.
5. Wire it into CI. Use Vitest. Set the threshold 10% below your current baseline. Tighten it as you improve.
6. Turn on live scoring. Start with a 0.1 sampling rate. Push scores to CloudWatch. Set an alarm.
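The code-based scorer mentioned above can be as small as a JSON-validity check with per-field partial credit — a sketch with no framework dependency:

```typescript
// Code-based grader: does the output parse as JSON and contain the
// required fields? Deterministic, fast, zero LLM cost. Plug the returned
// 0-1 value into whatever scorer pipeline you use.
function jsonFieldGrader(output: string, requiredFields: string[]): number {
  let parsed: unknown;
  try {
    parsed = JSON.parse(output);
  } catch {
    return 0; // not even valid JSON
  }
  if (typeof parsed !== "object" || parsed === null) return 0;
  if (requiredFields.length === 0) return 1;
  const obj = parsed as Record<string, unknown>;
  const present = requiredFields.filter((f) => f in obj).length;
  return present / requiredFields.length; // partial credit per field
}

console.log(jsonFieldGrader('{"city":"Berlin","temp":21}', ["city", "temp"])); // 1
console.log(jsonFieldGrader("not json", ["city"])); // 0
```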
The gap between 89% observability and 52% evals is not a tooling problem. The frameworks exist. Mastra, Strands, Bedrock AgentCore, and LangSmith all provide what you need. The gap is a practice problem, and closing it is the single best investment you can make in your agent quality today.
Owen from the Jelifish team. We build evaluated agentic AI systems on AWS for organisations that cannot afford to guess whether their agents are working. If you’re shipping agents to production and want confidence in their quality, we’d love to help.