Documentation Index
Fetch the complete documentation index at: https://docs.litigationlabs.io/llms.txt
Use this file to discover all available pages before exploring further.
Overview
The evaluation system is the central mechanism for monitoring and improving agent behavior. It uses G-Eval — an LLM-as-judge framework — to score every agent response against domain-specific criteria like legal correctness, behavioral compliance, and pedagogical value. Every evaluation call is traced in Langfuse under the evals/automated trace name.
How G-Eval Works
G-Eval sends the agent’s response, its context, and a set of evaluation criteria to a high-reasoning model (GPT-4o by default). The evaluator returns:
- A score (0 to 1) for each metric.
- A rationale explaining why it scored the response that way.
- A pass/fail determination based on a predefined threshold.
Each metric evaluation is traced as a generation in Langfuse with the name g-eval-{MetricName}, so you can inspect the evaluator’s reasoning for any score.
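As a rough sketch, a single metric evaluation can be modeled as a score, a rationale, and a pass/fail flag derived from the metric’s threshold. The class and field names below are illustrative, not the system’s actual types:

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    """Hypothetical shape of one G-Eval metric evaluation."""
    metric: str
    score: float       # 0.0 to 1.0, returned by the evaluator model
    rationale: str     # the evaluator's explanation for the score
    threshold: float   # predefined per-metric pass threshold

    @property
    def passed(self) -> bool:
        # Pass/fail is a simple comparison against the threshold
        return self.score >= self.threshold

# Example: a Witness response scored against Affidavit Faithfulness (threshold 0.7)
result = MetricResult(
    metric="Affidavit Faithfulness",
    score=0.65,
    rationale="Response adds a detail not present in the sworn affidavit.",
    threshold=0.7,
)
# result.passed is False: 0.65 < 0.7
```

The rationale is what you would read in the corresponding g-eval-{MetricName} generation in Langfuse when diagnosing a low score.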
Agent-Specific Metrics
Witness Metrics
| Metric | Threshold | What It Measures |
|---|---|---|
| Affidavit Faithfulness | 0.7 | Does the response contain only facts from the sworn affidavit? Penalizes fabrications, embellishments, and contradictions. |
| Behavioral Compliance | 0.6 | Does the response match the witness’s configured personality — cooperativeness, verbosity, memory quality? |
| Response Authenticity | 0.6 | Does the output sound like a real human witness, not AI? Checks for first-person perspective and natural speech patterns. |
| Witness Answer Relevancy | 0.6 | Does the witness directly address the question without unnecessary tangents? |
Judge Metrics
| Metric | Threshold | What It Measures |
|---|---|---|
| Ruling Correctness | 0.7 | Is the sustain/overrule decision legally sound under the Federal Rules of Evidence? |
| Rule Citation | 0.5 | Does the judge cite the correct FRE rule number and apply it properly? |
| Judicial Demeanor | 0.5 | Is the tone professional, impartial, and appropriate for a courtroom? |
| Judge Response Format | 0.5 | Does the output follow the required JSON schema for the orchestrator? |
Opposing Counsel Metrics
| Metric | Threshold | What It Measures |
|---|---|---|
| OC Task Completion | 0.6 | Did the agent make a clear decision (objection vs. no objection) or ask a valid question? |
| Strategic Quality | 0.5 | Does the action advance the case theory, or is it a wasted move? |
| Pedagogical Value | 0.5 | Does the action teach the student something? Includes intentional errors. |
| OC Fact Grounding | 0.6 | Does the agent only reference facts already established in the transcript? |
Running Evaluation Batches
Batches can be triggered via the Eval Dashboard or the API. The system supports both ad-hoc runs for testing prompt changes and scheduled monitoring for detecting performance drift.
POST /api/evals/automated
Configuration
| Parameter | Description |
|---|---|
| agentTypes | Which agents to evaluate: witness, judge, opposing_counsel |
| daysBack | Only include sessions from the last N days |
| maxTestCases | Cap on total evaluations to run |
| evaluatorModel | Model for the evaluator (default: openai/gpt-4o-mini) |
| includeHumanRated | Include cases that have human ratings for correlation analysis |
| sessionIds | Limit to specific sessions |
| scenarioIds | Limit to specific scenarios |
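A plausible request body built from the parameters above might look like the following. The exact accepted values and payload shape should be confirmed against the API; this is a sketch, not a verified request:

```python
import json

# Hypothetical request body for POST /api/evals/automated,
# assembled from the documented configuration parameters.
payload = {
    "agentTypes": ["witness", "judge"],      # evaluate only these agents
    "daysBack": 7,                            # sessions from the last 7 days
    "maxTestCases": 200,                      # cap on total evaluations
    "evaluatorModel": "openai/gpt-4o-mini",   # documented default
    "includeHumanRated": True,                # enable correlation analysis
}

body = json.dumps(payload)
```

Omitted optional parameters (sessionIds, scenarioIds) would presumably leave the batch unrestricted by session or scenario.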
What Happens
1. Dataset building — The system extracts test cases from session transcripts. Each test case is one agent response with its triggering context, the preceding conversation, and (for witnesses) the relevant affidavit.
2. Metric evaluation — Each test case is scored against its agent-specific metric suite (4-5 metrics per response). Every metric call is traced individually in Langfuse.
3. Results storage — Individual results are stored in the automatedEvalResults collection. Batch summaries go into automatedEvalBatches.
4. Summary calculation — The system computes per-agent averages, per-metric pass rates, and an overall pass rate across all evaluations.
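The summary-calculation step can be sketched as follows. The result shape (agent, metric, score, passed) and the output field names are assumptions for illustration, not the system’s actual storage schema:

```python
from collections import defaultdict

def summarize(results):
    """Compute per-agent averages, per-metric pass rates, and an overall pass rate.

    `results` is a list of dicts with 'agent', 'metric', 'score', and 'passed'
    keys -- an assumed shape for this sketch.
    """
    by_agent = defaultdict(list)
    by_metric = defaultdict(list)
    for r in results:
        by_agent[r["agent"]].append(r["score"])
        by_metric[r["metric"]].append(r["passed"])

    return {
        "agentAverages": {a: sum(s) / len(s) for a, s in by_agent.items()},
        "metricPassRates": {m: sum(p) / len(p) for m, p in by_metric.items()},
        "overallPassRate": sum(r["passed"] for r in results) / len(results),
    }

results = [
    {"agent": "witness", "metric": "Affidavit Faithfulness", "score": 0.8, "passed": True},
    {"agent": "witness", "metric": "Behavioral Compliance", "score": 0.5, "passed": False},
    {"agent": "judge", "metric": "Ruling Correctness", "score": 0.9, "passed": True},
]
summary = summarize(results)
# overallPassRate here is 2/3: two of three metric evaluations passed
```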
Dashboard Metrics
The Eval Dashboard surfaces these top-level indicators:
| Metric | Interpretation |
|---|---|
| Overall Pass Rate | Percentage of agent responses meeting the quality threshold across all metrics. |
| Average Score | Aggregate grade (0-100%) of agent performance. |
| Human Correlation | Alignment between AI and human scores. A value >= 0.7 means the automated metrics are reliable. |
| Divergent Cases | Outliers where AI and human scores differ by >= 30% — the most valuable data points for tuning. |
Trend charts track performance over time. A sudden dip in “Rule Citation” for the Judge agent might indicate a regression from a model update or a particularly tricky new scenario.
Viewing Results in Langfuse
Batch Traces
Each batch evaluation creates a trace named evals/automated that contains:
- The batch configuration as input.
- Summary statistics as output.
- Individual g-eval-* generations nested inside for each metric evaluation.
Inspecting a Score
To understand why a specific response received a low score:
1. Find the batch trace in Langfuse.
2. Locate the g-eval-{MetricName} generation for that test case.
3. Read the evaluator’s rationale — it explains which criteria the response failed and why.
Practical Workflows
Regression Testing
After updating an agent prompt, run a batch evaluation on a standard scenario. Compare metric scores to the previous batch. If Affidavit Faithfulness drops, the prompt change may need revision.
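The batch-to-batch comparison can be automated with a small helper. The summary shape (a metric-to-average-score map) and the tolerance value are assumptions for this sketch:

```python
def find_regressions(previous, current, tolerance=0.05):
    """Flag metrics whose average score dropped by more than `tolerance`
    between two batch summaries (assumed metric -> average-score maps)."""
    return {
        metric: (previous[metric], current[metric])
        for metric in previous
        if metric in current and previous[metric] - current[metric] > tolerance
    }

prev = {"Affidavit Faithfulness": 0.82, "Behavioral Compliance": 0.70}
curr = {"Affidavit Faithfulness": 0.71, "Behavioral Compliance": 0.69}
regressions = find_regressions(prev, curr)
# Affidavit Faithfulness dropped by 0.11 (> 0.05), so it is flagged;
# Behavioral Compliance moved only 0.01 and is not.
```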
Tuning the Evaluator
If human raters consistently disagree with automated scores on a specific metric, the evaluation criteria may be too strict or missing context. Inspect the g-eval-* generation to see exactly what the evaluator was told, and adjust the metric definition accordingly.
Scaling Quality Across Scenarios
Seed a new scenario with 5-10 human ratings. Once correlation is established, run automated evals on the rest to identify weak spots without manual review.