Agentic AI

Agentic AI Testing & Evaluation: A Complete Developer Guide for 2026

Testing AI agents is fundamentally different from testing deterministic software. Your agent's output changes with every run, tools can fail mid-chain, and a hallucination 3 steps deep can corrupt an entire workflow. This guide gives you a battle-tested testing pyramid — from fast mock-LLM unit tests to production monitoring — so you can ship AI agents with the same confidence you ship any other system.

Md Sanwar Hossain April 6, 2026 20 min read AI Agent Evaluation
Agentic AI testing and evaluation framework for developers 2026

TL;DR — The Testing Hierarchy

"Start with golden dataset regression tests for every agent feature. Add LLM-as-judge evaluation for open-ended outputs. Track per-tool accuracy, chain success rates, and hallucination rates in CI/CD to catch regressions before they reach production."

Table of Contents

  1. Why Testing AI Agents Is Different
  2. Unit Testing AI Agents with Mock LLMs
  3. Integration Testing Agent Tool Calls
  4. Golden Datasets & Regression Testing
  5. LLM-as-Judge Evaluation
  6. RAGAS Metrics for RAG-Based Agents
  7. Simulation & Adversarial Testing
  8. Evaluation Frameworks Compared
  9. Integrating Evals into CI/CD Pipelines
  10. Monitoring Agent Quality in Production

1. Why Testing AI Agents Is Different

Traditional software testing operates on deterministic input/output contracts. You assert assertEquals("Alice", user.getName()) and it either passes or fails, every single time. AI agents break this contract in four fundamental ways:

These constraints demand a layered testing strategy. The AI agent testing pyramid has three tiers: fast and cheap unit tests at the base, integration tests in the middle, and expensive eval runs at the top — run less frequently but with higher coverage of semantic quality.

2. Unit Testing AI Agents with Mock LLMs

Unit tests for AI agents mock the LLM entirely — you're testing your agent's orchestration logic, not the model's intelligence. This lets you run thousands of tests in seconds with zero API cost.

What to Unit Test

# Unit testing an AI agent with a mock LLM (Python/pytest)
import pytest
from unittest.mock import Mock, patch
from my_agent import ResearchAgent, ToolRegistry

class MockLLM:
    """Deterministic mock LLM for unit testing"""
    def __init__(self, responses):
        self.responses = iter(responses)

    def invoke(self, messages):
        return next(self.responses)

def test_agent_selects_search_tool_for_factual_query():
    """Agent should invoke search when LLM signals tool_use"""
    mock_responses = [
        # Turn 1: LLM decides to search
        {"role": "assistant", "content": None,
         "tool_calls": [{"name": "web_search", "args": {"query": "Python asyncio tutorial"}}]},
        # Turn 2: LLM generates final answer after seeing results
        {"role": "assistant", "content": "Python asyncio uses an event loop to..."}
    ]

    mock_llm = MockLLM(mock_responses)
    mock_search_tool = Mock(return_value={"results": [{"title": "AsyncIO Docs", "snippet": "..."}]})

    registry = ToolRegistry()
    registry.register("web_search", mock_search_tool)

    agent = ResearchAgent(llm=mock_llm, tools=registry)
    result = agent.run("Explain Python asyncio")

    # Assert tool was called with correct args
    mock_search_tool.assert_called_once_with(query="Python asyncio tutorial")
    assert "event loop" in result.output

def test_agent_retries_on_tool_failure():
    """Agent should retry failed tool calls up to max_retries"""
    mock_responses = [
        {"tool_calls": [{"name": "database_query", "args": {"sql": "SELECT * FROM users"}}]},
        {"tool_calls": [{"name": "database_query", "args": {"sql": "SELECT * FROM users"}}]},
        {"content": "I was unable to retrieve user data due to a database error."}
    ]

    mock_llm = MockLLM(mock_responses)
    mock_db = Mock(side_effect=[Exception("Connection timeout"), Exception("Connection timeout"), None])

    agent = ResearchAgent(llm=mock_llm, tools=ToolRegistry())
    agent.tools.register("database_query", mock_db)
    result = agent.run("Get all users", max_retries=2)

    assert mock_db.call_count == 2  # Retried once
    assert "unable" in result.output.lower()

def test_agent_detects_infinite_loop():
    """Agent should break out of loops where the same tool is called 3+ times"""
    # Mock LLM that keeps calling the same tool forever
    infinite_mock = Mock(return_value={
        "tool_calls": [{"name": "web_search", "args": {"query": "same query"}}]
    })
    agent = ResearchAgent(llm=infinite_mock, tools=ToolRegistry(), max_steps=10)
    result = agent.run("Find something")
    assert agent.loop_detected is True
AI agent testing pyramid: unit tests with mock LLMs, integration tests with real models, evaluation frameworks
AI Agent Testing Pyramid — layered strategy from fast mock-LLM unit tests to expensive semantic evals. Source: mdsanwarhossain.me

3. Integration Testing Agent Tool Calls

Integration tests use a real LLM but mock external dependencies (databases, APIs, file systems). They verify that your agent's full loop — LLM reasoning + tool execution + context management — works end-to-end without being at the mercy of third-party API availability.

Integration Test Design Principles

# Integration test with real LLM + mocked tools
import pytest
from unittest.mock import patch
from my_agent import CustomerSupportAgent

@pytest.fixture
def mock_crm():
    """Mock CRM database responses"""
    with patch("my_agent.tools.crm_lookup") as mock:
        mock.return_value = {
            "customer_id": "CUST-001",
            "name": "Alice Johnson",
            "subscription": "Pro",
            "open_tickets": 2
        }
        yield mock

@pytest.fixture
def mock_ticket_system():
    with patch("my_agent.tools.create_ticket") as mock:
        mock.return_value = {"ticket_id": "TKT-99821", "status": "created"}
        yield mock

@pytest.mark.integration
def test_support_agent_creates_ticket_for_billing_issue(mock_crm, mock_ticket_system):
    """Agent should look up customer and create a billing ticket"""
    agent = CustomerSupportAgent(model="gpt-4o-mini")

    result = agent.handle(
        message="I was charged twice this month, my account is CUST-001",
        max_tokens=1500
    )

    # Structural assertions — not exact strings
    mock_crm.assert_called_once_with(customer_id="CUST-001")
    mock_ticket_system.assert_called_once()
    call_args = mock_ticket_system.call_args[1]
    assert call_args.get("category") == "billing"
    assert "CUST-001" in call_args.get("description", "")
    assert result.tokens_used < 1500  # Budget check

@pytest.mark.integration
def test_agent_escalates_to_human_for_fraud():
    """Agent must NOT autonomously process fraud claims — escalate always"""
    agent = CustomerSupportAgent(model="gpt-4o-mini")
    result = agent.handle("Someone used my card without permission")

    assert result.escalated_to_human is True
    assert result.ticket_created is False  # No autonomous action on fraud

4. Golden Datasets & Regression Testing

A golden dataset is a curated set of input/output pairs where you've manually verified the correct behavior. It's the AI equivalent of a regression test suite — you run your agent against it before every release to detect quality regressions.

Building Your Golden Dataset

A production-ready golden dataset has four components:

Golden Dataset Size Guidelines

  • 🟢 MVP (launch minimum): 50–100 examples covering all major use cases and P0 regressions
  • 🟡 Mature product: 200–500 examples with edge case coverage and adversarial inputs
  • 🔵 Enterprise/regulated: 1,000+ examples with balanced category distribution and human-reviewed labels
  • 📌 Rule of thumb: Add a new golden example every time you fix a bug. Your dataset grows with your bug history.
# golden_dataset.jsonl (one example per line)
{
  "id": "GD-0042",
  "input": "Summarize Q1 2026 sales report and highlight any anomalies",
  "expected_tool_calls": ["fetch_document", "analyze_data"],
  "expected_output_criteria": {
    "contains_keywords": ["Q1", "revenue", "anomaly"],
    "max_length": 300,
    "format": "bullet_points",
    "must_not_contain": ["Q2", "competitor names"]
  },
  "category": "happy_path",
  "priority": "P0"
}

# Running golden dataset regression
from ragas.metrics import answer_correctness, faithfulness
from my_agent import SalesAgent
import json

def run_regression(dataset_path: str, threshold: float = 0.85):
    agent = SalesAgent()
    failures = []

    with open(dataset_path) as f:
        for line in f:
            example = json.loads(line)
            result = agent.run(example["input"])

            # Rule-based checks
            for keyword in example["expected_output_criteria"]["contains_keywords"]:
                if keyword.lower() not in result.output.lower():
                    failures.append(f"GD-{example['id']}: missing keyword '{keyword}'")

            if len(result.output.split()) > example["expected_output_criteria"]["max_length"]:
                failures.append(f"GD-{example['id']}: output too long")

    if failures:
        raise AssertionError(f"Regression failures:\n" + "\n".join(failures))
    print(f"✅ All {dataset_path} examples passed")

5. LLM-as-Judge Evaluation

LLM-as-judge is the technique of using a powerful LLM (typically GPT-4o or Claude Sonnet) to evaluate the output of your agent. It's especially valuable for open-ended outputs where rule-based checks are insufficient — summaries, explanations, code reviews, and creative content.

Designing a Reliable Judge Prompt

A poorly designed judge prompt is biased and useless. Follow these principles:

# LLM-as-judge implementation
from openai import OpenAI

JUDGE_PROMPT = """You are an expert evaluator for AI agent responses.
Score the following response on three dimensions (0-10 each):
- Accuracy: Is all factual information correct based on the reference?
- Completeness: Does it fully answer the question?
- Conciseness: Is it appropriately brief without omitting key info?

Question: {question}
Reference answer: {reference}
Agent response: {response}

Think step-by-step about each dimension, then provide your scores.
Output ONLY valid JSON: {{"accuracy": N, "completeness": N, "conciseness": N, "reasoning": "..."}}"""

def judge_response(question: str, reference: str, response: str) -> dict:
    client = OpenAI()
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, response=response
        )}],
        response_format={"type": "json_object"},
        temperature=0  # Deterministic judge
    )
    scores = json.loads(result.choices[0].message.content)
    scores["overall"] = sum([scores["accuracy"], scores["completeness"], scores["conciseness"]]) / 30
    return scores

# Batch evaluation
def evaluate_dataset(examples):
    results = [judge_response(e["question"], e["reference"], e["response"]) for e in examples]
    avg_overall = sum(r["overall"] for r in results) / len(results)
    print(f"Average quality score: {avg_overall:.2%}")
    if avg_overall < 0.80:
        raise ValueError(f"Quality below threshold: {avg_overall:.2%} < 80%")
LLM evaluation pipeline: golden datasets, LLM-as-judge evaluation, RAGAS metrics, CI/CD quality gate
LLM Evaluation Pipeline — from golden datasets through LLM-as-judge scoring to RAGAS metrics and CI/CD gates. Source: mdsanwarhossain.me

6. RAGAS Metrics for RAG-Based Agents

RAGAS (Retrieval-Augmented Generation Assessment) is the standard evaluation framework for agents that use retrieval. It provides five objective metrics that measure both the quality of retrieval and the quality of generation.

RAGAS Metric What It Measures Target Score Failure Mode
Faithfulness Are all claims in the answer supported by the retrieved context? ≥ 0.85 Hallucination (inventing facts)
Answer Relevancy Is the answer relevant and focused on the question? ≥ 0.80 Off-topic, padded responses
Context Precision How much of the retrieved context is actually useful? ≥ 0.75 Noisy retrieval, irrelevant chunks
Context Recall Was all needed context retrieved? (requires ground truth) ≥ 0.80 Missing critical documents
Answer Correctness Factual + semantic similarity to ground truth answer ≥ 0.80 Wrong facts, missed key points
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

# Prepare evaluation dataset
data = {
    "question": ["What is our refund policy?", "How do I reset my password?"],
    "answer": [agent_response_1, agent_response_2],
    "contexts": [[retrieved_chunks_1], [retrieved_chunks_2]],
    "ground_truth": ["Refunds within 30 days...", "Go to Settings > Security..."]
}

dataset = Dataset.from_dict(data)
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

print(result)
# {'faithfulness': 0.92, 'answer_relevancy': 0.87,
#  'context_precision': 0.79, 'context_recall': 0.84}

# CI gate
if result["faithfulness"] < 0.85:
    raise SystemExit("❌ Faithfulness score below 0.85 — blocking deploy")

7. Simulation & Adversarial Testing

Standard tests verify the happy path. Adversarial and simulation tests verify how your agent behaves when things go wrong — which is the majority of real production scenarios.

Adversarial Test Categories

8. Evaluation Frameworks Compared

Framework Best For LLM-as-Judge CI/CD Integration Price
RAGAS RAG pipelines, retrieval quality ✅ Built-in ✅ Easy Open source
DeepEval General agent eval, rich metrics ✅ Built-in ✅ pytest plugin Free/paid
LangSmith LangChain apps, tracing + eval ✅ Built-in ✅ Easy $39+/mo
Promptfoo Prompt A/B testing, red-teaming ✅ Built-in ✅ CLI-based Open source
Arize Phoenix Observability + eval combined ✅ Built-in ✅ OTEL-based Open source

9. Integrating Evals into CI/CD Pipelines

Running evals in CI/CD requires balancing thoroughness against cost and speed. The key insight: not all evals need to run on every commit. Structure your eval pipeline by frequency and trigger.

CI/CD Eval Trigger Strategy

# .github/workflows/ai-eval.yml
name: AI Agent Evaluation

on:
  push:
    branches: [main]
  pull_request:
  schedule:
    - cron: '0 2 * * *'  # Nightly at 2am UTC

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run agent unit tests (mock LLM)
        run: pytest tests/unit/ -v --timeout=30
        env:
          USE_MOCK_LLM: "true"

  eval-pr:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
      - name: Run eval on PR (50 critical examples)
        run: |
          python eval/run_eval.py \
            --dataset golden_dataset_critical.jsonl \
            --model gpt-4o-mini \
            --threshold 0.80
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

  eval-nightly:
    runs-on: ubuntu-latest
    if: github.event_name == 'schedule'
    steps:
      - uses: actions/checkout@v4
      - name: Run full nightly eval
        run: |
          python eval/run_eval.py \
            --dataset golden_dataset_full.jsonl \
            --model gpt-4o \
            --judge-model gpt-4o \
            --threshold 0.82 \
            --report-slack

10. Monitoring Agent Quality in Production

Evals before deployment catch known failure modes. Production monitoring catches the unknown unknowns — user behaviors, data distributions, and failure patterns you never anticipated in your test suite.

Production Quality Metrics to Track

AI Agent Testing Checklist

  • ☐ Unit tests with mock LLM covering tool selection, retry logic, and loop detection
  • ☐ Integration tests for each major agent workflow using cheap model + mocked tools
  • ☐ Golden dataset with 100+ examples covering happy paths and edge cases
  • ☐ LLM-as-judge eval with explicit rubric (not vague "rate this" prompts)
  • ☐ RAGAS evaluation for all RAG-based agent features
  • ☐ Adversarial test suite: prompt injection, tool failures, jailbreaks
  • ☐ CI/CD gates: unit tests on every commit, RAGAS on every PR, full eval nightly
  • ☐ Production monitoring: completion rate, escalation rate, cost/session, P95 latency
  • ☐ Online sampling eval: run judge on 1–5% of production traffic

Leave a Comment

Related Posts

Md Sanwar Hossain - Software Engineer
Md Sanwar Hossain

Software Engineer · Java · Spring Boot · Microservices · AI/LLM Systems

All Posts
Last updated: April 6, 2026