Batch Evaluation

Batch evaluation runs evaluators across all of an agent's sessions in CloudWatch, producing per-session scores and aggregate metrics. Use it to measure agent quality over time, compare results before and after a prompt change, or check sessions against ground truth expectations.

Quick Start

# Run a single evaluator across all sessions
agentcore run batch-evaluation -r MyAgent -e Builtin.Correctness

# Multiple evaluators
agentcore run batch-evaluation -r MyAgent -e Builtin.Correctness Builtin.Helpfulness Builtin.Faithfulness

# Reference evaluators by ARN (custom or cross-account)
agentcore run batch-evaluation -r MyAgent --evaluator-arn arn:aws:bedrock-agentcore:us-west-2:123456789012:evaluator/MyCustomEval

# JSON output for scripting
agentcore run batch-evaluation -r MyAgent -e Builtin.Helpfulness --json

Available Evaluators

Built-in evaluators provided by AgentCore:

| Evaluator | What it measures |
| --- | --- |
| `Builtin.Correctness` | Factual accuracy of responses |
| `Builtin.Helpfulness` | How well responses address the user's goal |
| `Builtin.Faithfulness` | Grounding in tool results / provided context |
| `Builtin.GoalSuccessRate` | Whether the agent achieved the user's goal |
| `Builtin.ToolSelectionAccuracy` | Correct tool chosen for the task |
| `Builtin.Completeness` | Whether all parts of the request were handled |
| `Builtin.TrajectoryExactOrderMatch` | Tool call sequence matches expected trajectory |

Custom evaluators defined in your project (via agentcore add evaluator) can also be used.

Filtering Sessions

By time window

# Only sessions from the last 3 days
agentcore run batch-evaluation -r MyAgent -e Builtin.Helpfulness --lookback-days 3

By session ID

agentcore run batch-evaluation -r MyAgent -e Builtin.Correctness -s <session-id-1> <session-id-2>
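
When batch evaluations are launched from a script, the flags shown so far can be assembled programmatically. The following is a minimal Python sketch, not part of the CLI: `build_batch_eval_argv` is a hypothetical helper that uses only flags appearing in this document, and it assumes the time-window and session filters may be combined in one invocation, which is not confirmed here.

```python
import shlex
from typing import List, Optional, Sequence


def build_batch_eval_argv(
    agent: str,
    evaluators: Sequence[str],
    lookback_days: Optional[int] = None,
    session_ids: Sequence[str] = (),
    json_output: bool = False,
) -> List[str]:
    """Compose an argv list using only flags shown in this document.

    Hypothetical helper: it does not validate values against the CLI.
    """
    if not evaluators:
        raise ValueError("at least one evaluator is required")
    argv = ["agentcore", "run", "batch-evaluation", "-r", agent, "-e", *evaluators]
    if lookback_days is not None:
        argv += ["--lookback-days", str(lookback_days)]
    if session_ids:
        # Assumption: -s accepts several space-separated IDs, as in the example above.
        argv += ["-s", *session_ids]
    if json_output:
        argv.append("--json")
    return argv


argv = build_batch_eval_argv(
    "MyAgent", ["Builtin.Correctness", "Builtin.Helpfulness"], lookback_days=3
)
# Print the command in a copy-pasteable, shell-quoted form.
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv, check=True)` would execute the command without going through a shell, which avoids quoting problems with session IDs.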

Ground Truth

Provide expected responses, assertions, or expected tool trajectories for specific sessions:

agentcore run batch-evaluation \
  -r MyAgent \
  -e Builtin.Correctness Builtin.GoalSuccessRate \
  -s <session-id> \
  --ground-truth ./ground_truth.json

Ground truth file format

[
  {
    "sessionId": "<session-id>",
    "groundTruth": {
      "inline": {
        "assertions": [{ "text": "Agent should use the lookup_order tool" }],
        "expectedTrajectory": {
          "toolNames": ["lookup_order"]
        },
        "turns": [
          {
            "input": "What's the status of order ORD-1001?",
            "expectedResponse": { "text": "Order ORD-1001 has been delivered" }
          }
        ]
      }
    }
  }
]

All fields inside `inline` are optional; include only what's relevant:

- `assertions`: free-text expectations evaluated by `Builtin.GoalSuccessRate`
- `expectedTrajectory`: tool call sequence evaluated by `Builtin.TrajectoryExactOrderMatch`
- `turns`: input/expected-response pairs evaluated by `Builtin.Correctness`
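
For more than a handful of sessions, the ground truth file can be generated rather than written by hand. The sketch below is one possible approach in Python, assuming the file format shown above; `ground_truth_entry` and its parameter names are hypothetical and not part of AgentCore. It emits only the `inline` fields that were actually supplied.

```python
import json
from typing import Any, Dict, List, Optional


def ground_truth_entry(
    session_id: str,
    assertions: Optional[List[str]] = None,
    tool_names: Optional[List[str]] = None,
    turns: Optional[List[Dict[str, str]]] = None,
) -> Dict[str, Any]:
    """Build one ground truth entry, omitting inline fields that were not given."""
    inline: Dict[str, Any] = {}
    if assertions:
        inline["assertions"] = [{"text": text} for text in assertions]
    if tool_names:
        inline["expectedTrajectory"] = {"toolNames": list(tool_names)}
    if turns:
        # Each turn dict uses the hypothetical keys "input" and "expected".
        inline["turns"] = [
            {"input": t["input"], "expectedResponse": {"text": t["expected"]}}
            for t in turns
        ]
    return {"sessionId": session_id, "groundTruth": {"inline": inline}}


entries = [
    ground_truth_entry(
        "<session-id>",
        tool_names=["lookup_order"],
        turns=[
            {
                "input": "What's the status of order ORD-1001?",
                "expected": "Order ORD-1001 has been delivered",
            }
        ],
    )
]

# Print the JSON array; redirect it to a file to pass with --ground-truth.
print(json.dumps(entries, indent=2))
```

Because `assertions` was not supplied in this example, the resulting entry contains only `expectedTrajectory` and `turns`.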

Dataset-Driven Evaluation

Instead of scoring historical CloudWatch traces, you can drive the evaluation from a dataset. The CLI invokes the agent with each dataset scenario, then scores the results:

# Use the local DRAFT dataset file
agentcore run batch-evaluation -r MyAgent -e Builtin.Correctness --dataset MyScenarios

# Use a published dataset version
agentcore run batch-evaluation -r MyAgent -e Builtin.Correctness --dataset MyScenarios --dataset-version 1

| Flag | Description |
| --- | --- |
| `--dataset <name>` | Dataset name; the agent is invoked with its scenarios instead of scoring existing traces |
| `--dataset-version <version>` | Dataset version (`N` or `DRAFT`; omit to use the local file) |

Create datasets with agentcore add dataset and publish versions with agentcore dataset publish-version. The number of scored sessions equals the number of scenarios in the dataset.

Custom Name

agentcore run batch-evaluation -r MyAgent -e Builtin.Helpfulness -n "weekly_quality_check"

Names must start with a letter and contain only letters, digits, and underscores (max 48 characters).
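
The naming rule can be checked before submitting a run. The following Python sketch encodes the rule as stated above in a regular expression; `is_valid_run_name` is a hypothetical helper, and it assumes "letters" means ASCII letters, which the rule does not spell out.

```python
import re

# One leading letter, then up to 47 more letters, digits, or underscores (48 total).
# Assumption: only ASCII letters count as letters.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,47}")


def is_valid_run_name(name: str) -> bool:
    """Return True if the name satisfies the documented naming rule."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["weekly_quality_check", "2024_check", "weekly-check"]:
    print(candidate, is_valid_run_name(candidate))
```

Of the three candidates, only `weekly_quality_check` passes: the second starts with a digit and the third contains a hyphen.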

Encrypting Results with KMS

By default, batch evaluation results are encrypted with an AWS-managed key. To encrypt them with your own customer managed key (CMK), pass its ARN with --kms-key:

agentcore run batch-evaluation \
  -r MyAgent \
  -e Builtin.Correctness \
  --kms-key arn:aws:kms:us-west-2:111122223333:key/12345678-1234-1234-1234-123456789012

The key must be in the same Region as the evaluation. Both the calling principal and the AgentCore service must have kms:Encrypt/kms:GenerateDataKey permissions on it. Omit the flag to use the AWS-managed key.
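
A Region mismatch can be caught before the run is submitted, because the Region is the fourth colon-separated field of a key ARN. The Python sketch below is illustrative only: `kms_key_region` and `key_matches_region` are hypothetical helpers, and the evaluation Region is assumed to be known to the caller (for example, from the active AWS configuration).

```python
def kms_key_region(key_arn: str) -> str:
    """Return the Region field of a KMS key ARN.

    Expected shape: arn:<partition>:kms:<region>:<account-id>:key/<key-id>
    """
    parts = key_arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "kms":
        raise ValueError(f"not a KMS ARN: {key_arn!r}")
    if not parts[5].startswith("key/"):
        raise ValueError(f"not a KMS key ARN: {key_arn!r}")
    return parts[3]


def key_matches_region(key_arn: str, evaluation_region: str) -> bool:
    """Check that the key lives in the Region where the evaluation runs."""
    return kms_key_region(key_arn) == evaluation_region


arn = "arn:aws:kms:us-west-2:111122223333:key/12345678-1234-1234-1234-123456789012"
print(key_matches_region(arn, "us-west-2"))
```

This check covers only the Region requirement; it does not verify the key permissions, which would need a call to AWS.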

Stopping a Running Evaluation

agentcore stop batch-evaluation -i <batch-evaluation-id>

Viewing Results

CLI output

After the run completes, the CLI prints scores grouped by evaluator, along with each evaluator's average score.

Local history

Job records are saved in .cli/jobs/batch-eval-results/. View past runs via the TUI:

agentcore
# Navigate to: Run → Batch Evaluation History   (or View → Batch Evaluation)

JSON output

agentcore run batch-evaluation -r MyAgent -e Builtin.Helpfulness --json

The output includes batchEvaluationId, evaluationResults (with numberOfSessionsCompleted), and evaluatorSummaries (with a per-evaluator averageScore).
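
The JSON output lends itself to a quality gate in a script or CI job. The Python sketch below flags evaluators whose average score falls below a threshold. It rests on several assumptions not confirmed by this document: that `evaluatorSummaries` is a list located either at the top level or inside `evaluationResults`, that each summary identifies its evaluator under a hypothetical `evaluatorId` key, and that higher scores are better. The sample report is hand-written for illustration and is not real CLI output.

```python
import json
from typing import Any, Dict, List


def failing_evaluators(report: Dict[str, Any], min_score: float) -> List[str]:
    """Return the evaluators whose averageScore is below min_score."""
    results = report.get("evaluationResults", {})
    # Assumption: summaries may sit at the top level or under evaluationResults.
    summaries = report.get("evaluatorSummaries") or results.get("evaluatorSummaries", [])
    return [
        s.get("evaluatorId", "<unknown>")  # "evaluatorId" is a hypothetical key
        for s in summaries
        if s["averageScore"] < min_score
    ]


# Illustrative report shape only; real output may differ.
sample = json.loads("""
{
  "batchEvaluationId": "<batch-evaluation-id>",
  "evaluationResults": {"numberOfSessionsCompleted": 2},
  "evaluatorSummaries": [
    {"evaluatorId": "Builtin.Helpfulness", "averageScore": 0.5},
    {"evaluatorId": "Builtin.Correctness", "averageScore": 0.9}
  ]
}
""")

print(failing_evaluators(sample, min_score=0.7))  # prints ['Builtin.Helpfulness']
```

Before relying on a gate like this, inspect one real `--json` result and adjust the key names and nesting to match.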

TUI Wizard

Run agentcore, then choose Run → Batch Evaluation for a guided flow:

1. Select the agent
2. Select one or more evaluators
3. Set the lookback days
4. Optionally select specific sessions
5. Optionally add ground truth
6. Optionally name the run
7. Confirm and run

The TUI shows real-time progress with elapsed time and step indicators.
