Overview
The evaluation system is the central mechanism for monitoring and improving agent behavior. It uses G-Eval, an LLM-as-judge framework, to score every agent response against domain-specific criteria like legal correctness, behavioral compliance, and pedagogical value. Every evaluation call is traced in Langfuse under the `evals/automated` trace name.
How G-Eval Works
G-Eval sends the agent’s response, its context, and a set of evaluation criteria to a high-reasoning model (GPT-4o by default). The evaluator returns:
- A score (0 to 1) for each metric.
- A rationale explaining why it scored the response that way.
- A pass/fail determination based on a predefined threshold.

Each metric evaluation is recorded in Langfuse as a generation named `g-eval-{MetricName}`, so you can inspect the evaluator’s reasoning for any score.
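
The three parts of an evaluator result can be pictured with a small data structure. This is a minimal sketch, not the system's actual schema: the class name `MetricResult` and its field names are hypothetical, and treating a score equal to the threshold as a pass is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricResult:
    """Hypothetical container for one G-Eval metric result."""

    metric: str       # e.g. "Affidavit Faithfulness"
    score: float      # evaluator score, expected in [0, 1]
    rationale: str    # evaluator's explanation for the score
    threshold: float  # predefined pass threshold for this metric

    def __post_init__(self) -> None:
        # Reject scores outside the documented 0-to-1 range.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score must be in [0, 1], got {self.score}")

    @property
    def passed(self) -> bool:
        # Assumption: a score equal to the threshold counts as a pass.
        return self.score >= self.threshold


result = MetricResult(
    metric="Affidavit Faithfulness",
    score=0.65,
    rationale="The response adds a detail that is not in the affidavit.",
    threshold=0.7,
)
print(result.metric, "passed" if result.passed else "failed")
```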
Agent-Specific Metrics
Witness Metrics
| Metric | Threshold | What It Measures |
|---|---|---|
| Affidavit Faithfulness | 0.7 | Does the response contain only facts from the sworn affidavit? Penalizes fabrications, embellishments, and contradictions. |
| Behavioral Compliance | 0.6 | Does the response match the witness’s configured personality — cooperativeness, verbosity, memory quality? |
| Response Authenticity | 0.6 | Does the output sound like a real human witness, not AI? Checks for first-person perspective and natural speech patterns. |
| Witness Answer Relevancy | 0.6 | Does the witness directly address the question without unnecessary tangents? |
Judge Metrics
| Metric | Threshold | What It Measures |
|---|---|---|
| Ruling Correctness | 0.7 | Is the sustain/overrule decision legally sound under the Federal Rules of Evidence? |
| Rule Citation | 0.5 | Does the judge cite the correct FRE rule number and apply it properly? |
| Judicial Demeanor | 0.5 | Is the tone professional, impartial, and appropriate for a courtroom? |
| Judge Response Format | 0.5 | Does the output follow the required JSON schema for the orchestrator? |
Opposing Counsel Metrics
| Metric | Threshold | What It Measures |
|---|---|---|
| OC Task Completion | 0.6 | Did the agent make a clear decision (objection vs. no objection) or ask a valid question? |
| Strategic Quality | 0.5 | Does the action advance the case theory, or is it a wasted move? |
| Pedagogical Value | 0.5 | Does the action teach the student something? Includes intentional errors. |
| OC Fact Grounding | 0.6 | Does the agent only reference facts already established in the transcript? |
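
The tables above can be read as a per-agent metric suite. The sketch below encodes the documented thresholds in a plain dictionary and reports which metrics a response falls short on. The names `THRESHOLDS` and `failing_metrics` are hypothetical and do not reflect how the system stores its metric definitions; the inclusive comparison is an assumption.

```python
# Thresholds copied from the metric tables; the structure itself is illustrative.
THRESHOLDS = {
    "witness": {
        "Affidavit Faithfulness": 0.7,
        "Behavioral Compliance": 0.6,
        "Response Authenticity": 0.6,
        "Witness Answer Relevancy": 0.6,
    },
    "judge": {
        "Ruling Correctness": 0.7,
        "Rule Citation": 0.5,
        "Judicial Demeanor": 0.5,
        "Judge Response Format": 0.5,
    },
    "opposing_counsel": {
        "OC Task Completion": 0.6,
        "Strategic Quality": 0.5,
        "Pedagogical Value": 0.5,
        "OC Fact Grounding": 0.6,
    },
}


def failing_metrics(agent_type: str, scores: dict) -> list:
    """Return the metrics whose score is below the agent's threshold."""
    suite = THRESHOLDS[agent_type]
    # Assumption: a score equal to the threshold passes.
    return [name for name, score in scores.items() if score < suite[name]]


judge_scores = {
    "Ruling Correctness": 0.65,
    "Rule Citation": 0.8,
    "Judicial Demeanor": 0.9,
    "Judge Response Format": 0.5,
}
print(failing_metrics("judge", judge_scores))
```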
Running Evaluation Batches
Batches can be triggered via the Eval Dashboard or the API. The system supports both ad-hoc runs for testing prompt changes and scheduled monitoring for detecting performance drift.
Configuration
| Parameter | Description |
|---|---|
| `agentTypes` | Which agents to evaluate: `witness`, `judge`, `opposing_counsel` |
| `daysBack` | Only include sessions from the last N days |
| `maxTestCases` | Cap on total evaluations to run |
| `evaluatorModel` | Model for the evaluator (default: `openai/gpt-4o-mini`) |
| `includeHumanRated` | Include cases that have human ratings for correlation analysis |
| `sessionIds` | Limit to specific sessions |
| `scenarioIds` | Limit to specific scenarios |
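
As an illustration, a batch configuration using these parameters might be assembled as follows. The parameter names and the `evaluatorModel` default come from the table; the helper `build_batch_config`, the value types, the `False` default for `includeHumanRated`, and the choice to omit unset filters are assumptions. The API endpoint is not shown because it is not specified here.

```python
import json

ALLOWED_AGENT_TYPES = {"witness", "judge", "opposing_counsel"}


def build_batch_config(
    agent_types,
    days_back=None,
    max_test_cases=None,
    evaluator_model="openai/gpt-4o-mini",
    include_human_rated=False,
    session_ids=None,
    scenario_ids=None,
):
    """Build a batch configuration dict (hypothetical helper)."""
    unknown = set(agent_types) - ALLOWED_AGENT_TYPES
    if unknown:
        raise ValueError(f"unknown agent types: {sorted(unknown)}")

    config = {
        "agentTypes": list(agent_types),
        "evaluatorModel": evaluator_model,
        "includeHumanRated": include_human_rated,
    }
    # Assumption: optional filters are left out of the payload when unset.
    optional = {
        "daysBack": days_back,
        "maxTestCases": max_test_cases,
        "sessionIds": session_ids,
        "scenarioIds": scenario_ids,
    }
    config.update({key: value for key, value in optional.items() if value is not None})
    return config


config = build_batch_config(["witness", "judge"], days_back=7, max_test_cases=200)
print(json.dumps(config, indent=2))
```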
What Happens
- Dataset building: The system extracts test cases from session transcripts. Each test case is one agent response with its triggering context, the preceding conversation, and (for witnesses) the relevant affidavit.
- Metric evaluation: Each test case is scored against its agent-specific metric suite (4-5 metrics per response). Every metric call is traced individually in Langfuse.
- Results storage: Individual results are stored in the `automatedEvalResults` collection. Batch summaries go into `automatedEvalBatches`.
- Summary calculation: The system computes per-agent averages, per-metric pass rates, and an overall pass rate across all evaluations.
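
The summary calculation step can be sketched in a few lines. This is an illustration under stated assumptions, not the system's code: the record keys (`agent`, `metric`, `score`, `passed`) and the output key names are hypothetical and are not the schema of the stored results.

```python
from collections import defaultdict
from statistics import mean


def summarize(results):
    """Compute per-agent averages, per-metric pass rates, and overall pass rate."""
    scores_by_agent = defaultdict(list)
    passes_by_metric = defaultdict(list)
    for record in results:
        scores_by_agent[record["agent"]].append(record["score"])
        passes_by_metric[record["metric"]].append(record["passed"])

    total_passed = sum(record["passed"] for record in results)
    return {
        "agentAverages": {a: mean(s) for a, s in scores_by_agent.items()},
        "metricPassRates": {m: sum(p) / len(p) for m, p in passes_by_metric.items()},
        # An empty batch is reported as 0.0 here; that choice is an assumption.
        "overallPassRate": total_passed / len(results) if results else 0.0,
    }


sample = [
    {"agent": "witness", "metric": "Affidavit Faithfulness", "score": 0.8, "passed": True},
    {"agent": "witness", "metric": "Affidavit Faithfulness", "score": 0.6, "passed": False},
    {"agent": "judge", "metric": "Ruling Correctness", "score": 0.9, "passed": True},
    {"agent": "judge", "metric": "Rule Citation", "score": 0.5, "passed": True},
]
summary = summarize(sample)
print(summary)
```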
Dashboard Metrics
The Eval Dashboard surfaces these top-level indicators:

| Metric | Interpretation |
|---|---|
| Overall Pass Rate | Percentage of agent responses meeting the quality threshold across all metrics. |
| Average Score | Aggregate grade (0-100%) of agent performance. |
| Human Correlation | Correlation between automated and human scores. A value of 0.7 or higher indicates that the automated metrics are reliable. |
| Divergent Cases | Cases where automated and human scores differ by 30% or more. These are the most valuable data points for tuning. |
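
The last two indicators can be illustrated with a short sketch. It assumes that Human Correlation is a Pearson correlation and that both score sets are on a 0 to 1 scale, so a 30% difference is an absolute gap of 0.30. Neither assumption is confirmed here, and the function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def divergent_cases(pairs, min_gap=0.30):
    """Return (case_id, automated, human) tuples whose gap is at least min_gap."""
    # Rounding guards against float noise such as 0.7 - 0.4 != 0.3 exactly.
    return [
        (case_id, auto, human)
        for case_id, auto, human in pairs
        if round(abs(auto - human), 9) >= min_gap
    ]


pairs = [("c1", 0.9, 0.85), ("c2", 0.4, 0.8), ("c3", 0.7, 0.4)]
automated = [auto for _, auto, _ in pairs]
human = [h for _, _, h in pairs]
print(pearson(automated, human))
print(divergent_cases(pairs))
```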
Viewing Results in Langfuse
Batch Traces
Each batch evaluation creates a trace named `evals/automated` that contains:
- The batch configuration as input.
- Summary statistics as output.
- Individual `g-eval-*` generations nested inside, one for each metric evaluation.
Inspecting a Score
To understand why a specific response received a low score:
- Find the batch trace in Langfuse.
- Locate the `g-eval-{MetricName}` generation for that test case.
- Read the evaluator’s rationale, which explains which criteria the response failed and why.
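
If the generations of a batch trace have been exported for offline review, the same lookup can be scripted. The sketch below works on plain dictionaries and does not call the Langfuse SDK. The observation shape (`name`, and an `output` holding `score` and `rationale`) is a hypothetical stand-in for whatever the export actually contains, and the generation name used in the example is illustrative.

```python
def low_score_rationales(observations, generation_name, below):
    """Collect (score, rationale) pairs for one metric, lowest score first."""
    hits = []
    for obs in observations:
        if obs.get("name") != generation_name:
            continue
        output = obs.get("output") or {}
        score = output.get("score")
        if score is not None and score < below:
            hits.append((score, output.get("rationale", "")))
    return sorted(hits)


# Hypothetical exported observations from one batch trace.
observations = [
    {"name": "g-eval-RulingCorrectness", "output": {"score": 0.4, "rationale": "Misapplies hearsay."}},
    {"name": "g-eval-RulingCorrectness", "output": {"score": 0.9, "rationale": "Sound ruling."}},
    {"name": "g-eval-RuleCitation", "output": {"score": 0.3, "rationale": "Wrong rule number."}},
]
for score, rationale in low_score_rationales(observations, "g-eval-RulingCorrectness", below=0.7):
    print(f"{score:.2f}: {rationale}")
```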
Practical Workflows
Regression Testing
After updating an agent prompt, run a batch evaluation on a standard scenario. Compare metric scores to the previous batch. If Affidavit Faithfulness drops, the prompt change may need revision.
Tuning the Evaluator
If human raters consistently disagree with automated scores on a specific metric, the evaluation criteria may be too strict or missing context. Inspect the `g-eval-*` generation to see exactly what the evaluator was told, and adjust the metric definition accordingly.
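
For the regression-testing workflow described under Practical Workflows, comparing two batches comes down to a per-metric difference. This is a minimal sketch: the function name `regressions`, the input shape (metric name mapped to average score), and the 0.05 tolerance are all assumptions, not system defaults.

```python
def regressions(previous, current, tolerance=0.05):
    """Return metrics whose average score dropped by more than `tolerance`."""
    drops = {}
    for metric, old_score in previous.items():
        new_score = current.get(metric)
        # Metrics missing from the current batch are skipped, not flagged.
        if new_score is not None and old_score - new_score > tolerance:
            drops[metric] = round(new_score - old_score, 4)
    return drops


previous_batch = {"Affidavit Faithfulness": 0.82, "Behavioral Compliance": 0.71}
current_batch = {"Affidavit Faithfulness": 0.70, "Behavioral Compliance": 0.72}
print(regressions(previous_batch, current_batch))
```

A non-empty result is a prompt to look at the affected metric's rationales before keeping the prompt change.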