Evaluation

Overview

The evaluation system is the central mechanism for monitoring and improving agent behavior. It uses G-Eval, an LLM-as-judge framework, to score every agent response against domain-specific criteria like legal correctness, behavioral compliance, and pedagogical value. Every evaluation call is traced in Langfuse under the `evals/automated` trace name.

How G-Eval Works

G-Eval sends the agent’s response, its context, and a set of evaluation criteria to a high-reasoning model (GPT-4o by default). The evaluator returns:

- A score (0 to 1) for each metric.
- A rationale explaining why it scored the response that way.
- A pass/fail determination based on a predefined threshold.

Each metric evaluation is traced as a generation in Langfuse with the name `g-eval-{MetricName}`, so you can inspect the evaluator’s reasoning for any score.
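As a rough illustration of the three outputs listed above, the sketch below models a single metric result and derives the pass/fail flag from the score and its threshold. The class name `MetricResult`, its field names, and the `>=` comparison are assumptions made for illustration: the documentation does not say whether a score exactly equal to the threshold passes, nor whether `{MetricName}` keeps spaces. Only the `g-eval-{MetricName}` naming pattern comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricResult:
    """Hypothetical container for one G-Eval metric evaluation."""

    metric_name: str
    score: float      # evaluator score in the range 0 to 1
    rationale: str    # evaluator's explanation for the score
    threshold: float  # predefined pass threshold for this metric

    @property
    def passed(self) -> bool:
        # Assumption: a score at or above the threshold counts as a pass.
        return self.score >= self.threshold

    @property
    def generation_name(self) -> str:
        # Documented naming pattern for the Langfuse generation.
        # Assumption: the metric's display name is used unchanged.
        return f"g-eval-{self.metric_name}"


result = MetricResult(
    metric_name="Affidavit Faithfulness",
    score=0.82,
    rationale="All stated facts appear in the affidavit.",
    threshold=0.7,
)
print(result.passed, result.generation_name)
```

Keeping the threshold on the result itself makes it possible to re-derive pass/fail later if a threshold is tuned, without re-running the evaluator.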

Agent-Specific Metrics

Witness Metrics

| Metric | Threshold | What It Measures |
| --- | --- | --- |
| Affidavit Faithfulness | 0.7 | Does the response contain only facts from the sworn affidavit? Penalizes fabrications, embellishments, and contradictions. |
| Behavioral Compliance | 0.6 | Does the response match the witness’s configured personality (cooperativeness, verbosity, memory quality)? |
| Response Authenticity | 0.6 | Does the output sound like a real human witness, not AI? Checks for first-person perspective and natural speech patterns. |
| Witness Answer Relevancy | 0.6 | Does the witness directly address the question without unnecessary tangents? |

Judge Metrics

| Metric | Threshold | What It Measures |
| --- | --- | --- |
| Ruling Correctness | 0.7 | Is the sustain/overrule decision legally sound under the Federal Rules of Evidence? |
| Rule Citation | 0.5 | Does the judge cite the correct FRE rule number and apply it properly? |
| Judicial Demeanor | 0.5 | Is the tone professional, impartial, and appropriate for a courtroom? |
| Judge Response Format | 0.5 | Does the output follow the required JSON schema for the orchestrator? |

Opposing Counsel Metrics

| Metric | Threshold | What It Measures |
| --- | --- | --- |
| OC Task Completion | 0.6 | Did the agent make a clear decision (objection vs. no objection) or ask a valid question? |
| Strategic Quality | 0.5 | Does the action advance the case theory, or is it a wasted move? |
| Pedagogical Value | 0.5 | Does the action teach the student something? Includes intentional errors. |
| OC Fact Grounding | 0.6 | Does the agent only reference facts already established in the transcript? |
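To make the per-metric thresholds concrete, the sketch below copies the values from the three tables above into a lookup and reports which metrics a response fails. The `THRESHOLDS` layout and the `failing_metrics` helper are hypothetical, and treating a score below the threshold as a failure is an assumption; only the metric names, agent type names, and threshold values come from the documentation.

```python
# Thresholds copied from the metric tables, keyed by agent type.
# The dictionary layout itself is illustrative, not the system's real config.
THRESHOLDS: dict[str, dict[str, float]] = {
    "witness": {
        "Affidavit Faithfulness": 0.7,
        "Behavioral Compliance": 0.6,
        "Response Authenticity": 0.6,
        "Witness Answer Relevancy": 0.6,
    },
    "judge": {
        "Ruling Correctness": 0.7,
        "Rule Citation": 0.5,
        "Judicial Demeanor": 0.5,
        "Judge Response Format": 0.5,
    },
    "opposing_counsel": {
        "OC Task Completion": 0.6,
        "Strategic Quality": 0.5,
        "Pedagogical Value": 0.5,
        "OC Fact Grounding": 0.6,
    },
}


def failing_metrics(agent_type: str, scores: dict[str, float]) -> list[str]:
    """Return the metrics whose score is below the documented threshold."""
    thresholds = THRESHOLDS[agent_type]
    # Assumption: a score strictly below the threshold is a failure.
    return [name for name, score in scores.items() if score < thresholds[name]]


judge_scores = {
    "Ruling Correctness": 0.65,
    "Rule Citation": 0.55,
    "Judicial Demeanor": 0.9,
    "Judge Response Format": 1.0,
}
print(failing_metrics("judge", judge_scores))
```

In this example a 0.65 fails Ruling Correctness (threshold 0.7) while a lower 0.55 passes Rule Citation (threshold 0.5), which shows why raw scores are not comparable across metrics.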

Running Evaluation Batches

Batches can be triggered via the Eval Dashboard or the API. The system supports both ad-hoc runs for testing prompt changes and scheduled monitoring for detecting performance drift.

`POST /api/evals/automated`

Configuration

| Parameter | Description |
| --- | --- |
| `agentTypes` | Which agents to evaluate: `witness`, `judge`, `opposing_counsel` |
| `daysBack` | Only include sessions from the last N days |
| `maxTestCases` | Cap on total evaluations to run |
| `evaluatorModel` | Model for the evaluator (default: `openai/gpt-4o-mini`) |
| `includeHumanRated` | Include cases that have human ratings for correlation analysis |
| `sessionIds` | Limit to specific sessions |
| `scenarioIds` | Limit to specific scenarios |
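A batch request built from these parameters might look like the sketch below. It assumes the endpoint accepts a JSON body and that the values have the types shown (a list of strings, integers, a boolean); the documentation lists only the parameter names. The host in `BASE_URL` is a placeholder, and any authentication the API requires is not covered here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000"  # hypothetical host; replace with your deployment


def build_batch_request(config: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for an evaluation batch."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/evals/automated",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed JSON body
        method="POST",
    )


# Value types below are assumptions; only the parameter names are documented.
config = {
    "agentTypes": ["witness", "judge"],
    "daysBack": 7,
    "maxTestCases": 50,
    "evaluatorModel": "openai/gpt-4o-mini",
    "includeHumanRated": True,
}
request = build_batch_request(config)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out so the example runs without a server.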

What Happens

1. Dataset building: The system extracts test cases from session transcripts. Each test case is one agent response with its triggering context, the preceding conversation, and (for witnesses) the relevant affidavit.
2. Metric evaluation: Each test case is scored against its agent-specific metric suite (4-5 metrics per response). Every metric call is traced individually in Langfuse.
3. Results storage: Individual results are stored in the `automatedEvalResults` collection. Batch summaries go into `automatedEvalBatches`.
4. Summary calculation: The system computes per-agent averages, per-metric pass rates, and an overall pass rate across all evaluations.
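The summary calculation step can be sketched as a simple aggregation over individual results. The record shape used here (`agentType`, `metric`, `score`, `passed`) and the output key names are assumptions for illustration; the actual schema of stored results is not documented in this section.

```python
from collections import defaultdict
from statistics import mean


def summarize(results: list[dict]) -> dict:
    """Compute per-agent averages, per-metric pass rates, and overall pass rate."""
    if not results:
        # Nothing was evaluated, so there is nothing to average.
        return {"agentAverages": {}, "metricPassRates": {}, "overallPassRate": 0.0}

    scores_by_agent = defaultdict(list)
    passes_by_metric = defaultdict(list)
    for r in results:
        scores_by_agent[r["agentType"]].append(r["score"])
        passes_by_metric[r["metric"]].append(r["passed"])

    return {
        "agentAverages": {a: mean(s) for a, s in scores_by_agent.items()},
        # True counts as 1 and False as 0, so sum/len is the pass fraction.
        "metricPassRates": {m: sum(p) / len(p) for m, p in passes_by_metric.items()},
        "overallPassRate": sum(r["passed"] for r in results) / len(results),
    }


# Hypothetical result records for two agents.
sample = [
    {"agentType": "witness", "metric": "Affidavit Faithfulness", "score": 0.9, "passed": True},
    {"agentType": "witness", "metric": "Behavioral Compliance", "score": 0.5, "passed": False},
    {"agentType": "judge", "metric": "Rule Citation", "score": 0.6, "passed": True},
    {"agentType": "judge", "metric": "Rule Citation", "score": 0.4, "passed": False},
]
summary = summarize(sample)
print(summary["overallPassRate"], summary["metricPassRates"]["Rule Citation"])
```

Note that the overall pass rate here is taken over individual metric evaluations, not over responses; whether the real system counts a response as passing only when all of its metrics pass is not stated.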

Dashboard Metrics

The Eval Dashboard surfaces these top-level indicators:

| Metric | Interpretation |
| --- | --- |
| Overall Pass Rate | Percentage of agent responses meeting the quality threshold across all metrics. |
| Average Score | Aggregate grade (0-100%) of agent performance. |
| Human Correlation | Alignment between AI and Human scoring. >= 0.7 means automated metrics are reliable. |
| Divergent Cases | Outliers where AI and Human scores differ by >= 30%; these are the most valuable data points for tuning. |

Trend charts track performance over time. A sudden dip in “Rule Citation” for the Judge agent might indicate a regression from a model update or a particularly tricky new scenario.
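The Human Correlation and Divergent Cases indicators can be approximated as follows. This sketch assumes a Pearson correlation over paired scores and interprets "differ by >= 30%" as an absolute gap of 0.30 on the 0 to 1 scale; the documentation does not state which correlation coefficient the dashboard uses or how the percentage is computed. The function names and the `(case_id, ai_score, human_score)` tuple shape are hypothetical.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


def divergent_cases(pairs: list[tuple[str, float, float]], min_gap: float = 0.30):
    """Return (case_id, ai, human) tuples whose scores differ by at least min_gap."""
    return [p for p in pairs if abs(p[1] - p[2]) >= min_gap]


# Hypothetical paired scores: (case id, automated score, human score).
pairs = [
    ("case-1", 0.90, 0.85),
    ("case-2", 0.70, 0.65),
    ("case-3", 0.90, 0.40),  # large disagreement
    ("case-4", 0.30, 0.35),
]
r = pearson([p[1] for p in pairs], [p[2] for p in pairs])
print(round(r, 3), [p[0] for p in divergent_cases(pairs)])
```

A single strongly divergent case can pull the correlation down noticeably on a small sample, which is one reason to review divergent cases before concluding that a metric is unreliable.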

Viewing Results in Langfuse

Batch Traces

Each batch evaluation creates a trace named `evals/automated` that contains:

- The batch configuration as input.
- Summary statistics as output.
- Individual `g-eval-*` generations nested inside for each metric evaluation.

Inspecting a Score

To understand why a specific response received a low score:

1. Find the batch trace in Langfuse.
2. Locate the `g-eval-{MetricName}` generation for that test case.
3. Read the evaluator’s rationale; it explains which criteria the response failed and why.

Practical Workflows

Regression Testing

After updating an agent prompt, run a batch evaluation on a standard scenario. Compare metric scores to the previous batch. If Affidavit Faithfulness drops, the prompt change may need revision.
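The batch comparison described above can be sketched as a diff of per-metric averages. The `find_regressions` helper and the 0.05 tolerance are hypothetical choices for illustration; the documentation does not define how large a drop should count as a regression.

```python
def find_regressions(
    previous: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,  # hypothetical: ignore drops this small as noise
) -> dict[str, float]:
    """Return metrics whose average score dropped by more than `tolerance`."""
    drops = {}
    for metric, old_score in previous.items():
        new_score = current.get(metric)
        # Skip metrics that are missing from the new batch.
        if new_score is not None and old_score - new_score > tolerance:
            drops[metric] = round(old_score - new_score, 4)
    return drops


# Hypothetical per-metric averages from two batches on the same scenario.
before = {"Affidavit Faithfulness": 0.85, "Behavioral Compliance": 0.72}
after = {"Affidavit Faithfulness": 0.70, "Behavioral Compliance": 0.71}
print(find_regressions(before, after))
```

Because LLM-judged scores vary somewhat between runs, comparing against a tolerance rather than flagging any decrease helps avoid chasing noise.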

Tuning the Evaluator

If human raters consistently disagree with automated scores on a specific metric, the evaluation criteria may be too strict or missing context. Inspect the `g-eval-*` generation to see exactly what the evaluator was told, and adjust the metric definition accordingly.

Scaling Quality Across Scenarios

Seed a new scenario with 5-10 human ratings. Once correlation between human and automated scores is established, run automated evals on the remaining responses to identify weak spots without manual review.
