

Overview

The evaluation system is the central mechanism for monitoring and improving agent behavior. It uses G-Eval — an LLM-as-judge framework — to score every agent response against domain-specific criteria like legal correctness, behavioral compliance, and pedagogical value. Every evaluation call is traced in Langfuse under the evals/automated trace name.

How G-Eval Works

G-Eval sends the agent’s response, its context, and a set of evaluation criteria to a high-reasoning evaluator model, selected per batch via the evaluatorModel parameter. The evaluator returns:
  • A score (0 to 1) for each metric.
  • A rationale explaining why it scored the response that way.
  • A pass/fail determination based on the metric’s threshold.
Each metric evaluation is traced as a generation in Langfuse with the name g-eval-{MetricName}, so you can inspect the evaluator’s reasoning for any score.
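
To make that concrete, here is a minimal sketch of one metric evaluation recorded as a traced generation, assuming the Langfuse JS SDK. The MetricResult shape, the runMetric helper, and the way {MetricName} is mangled into the generation name are illustrative, not the system’s actual code:

```typescript
import { Langfuse } from "langfuse";

// Shape of a single G-Eval metric result (field names are illustrative).
interface MetricResult {
  metric: string;    // e.g. "Affidavit Faithfulness"
  score: number;     // 0 to 1
  rationale: string; // evaluator's explanation for the score
  passed: boolean;   // score >= the metric's threshold
}

const langfuse = new Langfuse(); // reads LANGFUSE_* keys from the environment

// Hypothetical wrapper: runs one metric evaluation inside a generation named
// g-eval-{MetricName}, so the rationale is inspectable in Langfuse afterward.
async function runMetric(
  metricName: string,
  threshold: number,
  evaluate: () => Promise<{ score: number; rationale: string }>,
): Promise<MetricResult> {
  const trace = langfuse.trace({ name: "evals/automated" });
  const generation = trace.generation({
    name: `g-eval-${metricName.replace(/\s+/g, "")}`,
    input: { metric: metricName, threshold },
  });
  const { score, rationale } = await evaluate();
  generation.end({ output: { score, rationale } });
  return { metric: metricName, score, rationale, passed: score >= threshold };
}
```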

Agent-Specific Metrics

Witness Metrics

| Metric | Threshold | What It Measures |
| --- | --- | --- |
| Affidavit Faithfulness | 0.7 | Does the response contain only facts from the sworn affidavit? Penalizes fabrications, embellishments, and contradictions. |
| Behavioral Compliance | 0.6 | Does the response match the witness’s configured personality: cooperativeness, verbosity, memory quality? |
| Response Authenticity | 0.6 | Does the output sound like a real human witness, not an AI? Checks for first-person perspective and natural speech patterns. |
| Witness Answer Relevancy | 0.6 | Does the witness directly address the question without unnecessary tangents? |

Judge Metrics

| Metric | Threshold | What It Measures |
| --- | --- | --- |
| Ruling Correctness | 0.7 | Is the sustain/overrule decision legally sound under the Federal Rules of Evidence? |
| Rule Citation | 0.5 | Does the judge cite the correct FRE rule number and apply it properly? |
| Judicial Demeanor | 0.5 | Is the tone professional, impartial, and appropriate for a courtroom? |
| Judge Response Format | 0.5 | Does the output follow the required JSON schema for the orchestrator? |

Opposing Counsel Metrics

| Metric | Threshold | What It Measures |
| --- | --- | --- |
| OC Task Completion | 0.6 | Did the agent make a clear decision (objection vs. no objection) or ask a valid question? |
| Strategic Quality | 0.5 | Does the action advance the case theory, or is it a wasted move? |
| Pedagogical Value | 0.5 | Does the action teach the student something? Includes intentional errors. |
| OC Fact Grounding | 0.6 | Does the agent only reference facts already established in the transcript? |

Running Evaluation Batches

Batches can be triggered via the Eval Dashboard or the API. The system supports both ad-hoc runs for testing prompt changes and scheduled monitoring for detecting performance drift.
POST /api/evals/automated

Configuration

| Parameter | Description |
| --- | --- |
| agentTypes | Which agents to evaluate: witness, judge, opposing_counsel |
| daysBack | Only include sessions from the last N days |
| maxTestCases | Cap on the total number of evaluations to run |
| evaluatorModel | Model used as the evaluator (default: openai/gpt-4o-mini) |
| includeHumanRated | Include cases that have human ratings, for correlation analysis |
| sessionIds | Limit to specific sessions |
| scenarioIds | Limit to specific scenarios |
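
A minimal request sketch, assuming JSON over HTTPS. The base URL is a placeholder and any auth headers are omitted; the parameter names come from the table above, and the values are illustrative:

```typescript
// Trigger an ad-hoc batch over the last week of witness sessions.
const response = await fetch("https://app.example.com/api/evals/automated", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    agentTypes: ["witness"],
    daysBack: 7,
    maxTestCases: 100,
    evaluatorModel: "openai/gpt-4o-mini",
    includeHumanRated: true,
  }),
});
const batch = await response.json();
console.log(batch);
```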

What Happens

  1. Dataset building — The system extracts test cases from session transcripts. Each test case is one agent response with its triggering context, the preceding conversation, and (for witnesses) the relevant affidavit.
  2. Metric evaluation — Each test case is scored against its agent-specific metric suite (4-5 metrics per response). Every metric call is traced individually in Langfuse.
  3. Results storage — Individual results are stored in the automatedEvalResults collection. Batch summaries go into automatedEvalBatches.
  4. Summary calculation — The system computes per-agent averages, per-metric pass rates, and an overall pass rate across all evaluations.
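
The summary math in step 4 is straightforward. The sketch below computes per-metric pass rates and the overall pass rate, assuming each stored result carries a metric name and pass flag (field names are illustrative; per-agent averages are computed the same way over score):

```typescript
// Illustrative shape of a document in the automatedEvalResults collection.
interface StoredResult {
  agentType: "witness" | "judge" | "opposing_counsel";
  metric: string;
  score: number;   // 0 to 1
  passed: boolean; // score >= the metric's threshold
}

function summarize(results: StoredResult[]) {
  // Tally pass/total per metric.
  const byMetric = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const entry = byMetric.get(r.metric) ?? { passed: 0, total: 0 };
    entry.total += 1;
    if (r.passed) entry.passed += 1;
    byMetric.set(r.metric, entry);
  }
  const perMetricPassRate = Object.fromEntries(
    [...byMetric].map(([m, { passed, total }]) => [m, passed / total]),
  );
  // Overall pass rate across every evaluation in the batch.
  const overallPassRate =
    results.filter((r) => r.passed).length / results.length;
  return { perMetricPassRate, overallPassRate };
}
```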

Dashboard Metrics

The Eval Dashboard surfaces these top-level indicators:
| Metric | Interpretation |
| --- | --- |
| Overall Pass Rate | Percentage of agent responses meeting the quality threshold across all metrics. |
| Average Score | Aggregate grade (0-100%) of agent performance. |
| Human Correlation | Alignment between AI and human scoring. A correlation >= 0.7 means the automated metrics are reliable. |
| Divergent Cases | Outliers where AI and human scores differ by >= 30%; these are the most valuable data points for tuning. |

Trend charts track performance over time. A sudden dip in “Rule Citation” for the Judge agent might indicate a regression from a model update or a particularly tricky new scenario.
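
For reference, Human Correlation is a standard correlation over paired AI/human scores, and Divergent Cases is a simple threshold filter. A sketch, assuming Pearson correlation on a 0-1 scale (the dashboard may compute these differently):

```typescript
// A test case that has both an automated score and a human rating.
interface RatedCase { id: string; aiScore: number; humanScore: number }

// Pearson correlation between AI and human scores.
function humanCorrelation(cases: RatedCase[]): number {
  const n = cases.length;
  const meanAi = cases.reduce((s, c) => s + c.aiScore, 0) / n;
  const meanHuman = cases.reduce((s, c) => s + c.humanScore, 0) / n;
  let cov = 0, varAi = 0, varHuman = 0;
  for (const c of cases) {
    const da = c.aiScore - meanAi;
    const dh = c.humanScore - meanHuman;
    cov += da * dh;
    varAi += da * da;
    varHuman += dh * dh;
  }
  return cov / Math.sqrt(varAi * varHuman);
}

// Cases where AI and human scores differ by >= 30 percentage points.
const divergentCases = (cases: RatedCase[]) =>
  cases.filter((c) => Math.abs(c.aiScore - c.humanScore) >= 0.3);
```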

Viewing Results in Langfuse

Batch Traces

Each batch evaluation creates a trace named evals/automated that contains:
  • The batch configuration as input.
  • Summary statistics as output.
  • Individual g-eval-* generations nested inside for each metric evaluation.

Inspecting a Score

To understand why a specific response received a low score:
  1. Find the batch trace in Langfuse.
  2. Locate the g-eval-{MetricName} generation for that test case.
  3. Read the evaluator’s rationale — it explains which criteria the response failed and why.
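
The same inspection can be done programmatically via the Langfuse JS SDK’s public fetch API. In this sketch, the exact generation name depends on how {MetricName} is rendered; g-eval-AffidavitFaithfulness is illustrative:

```typescript
import { Langfuse } from "langfuse";

// Requires LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY in the environment.
const langfuse = new Langfuse();

// 1. Find recent batch traces.
const { data: traces } = await langfuse.fetchTraces({ name: "evals/automated" });
if (!traces.length) throw new Error("no batch traces found");

// 2. Pull the g-eval-* generations nested in the newest batch trace.
const { data: generations } = await langfuse.fetchObservations({
  traceId: traces[0].id,
  type: "GENERATION",
});

// 3. Read the evaluator's rationale for a given metric.
const affidavit = generations.find(
  (g) => g.name === "g-eval-AffidavitFaithfulness",
);
console.log(affidavit?.output);
```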

Practical Workflows

Regression Testing

After updating an agent prompt, run a batch evaluation on a standard scenario. Compare metric scores to the previous batch. If Affidavit Faithfulness drops, the prompt change may need revision.
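
The comparison can be as simple as diffing per-metric averages between the two batch summaries. A hypothetical helper (the summary shape is an assumption, not the API’s documented response):

```typescript
// Map of metric name -> average score for one batch (assumed shape).
type BatchSummary = Record<string, number>;

// Flag metrics whose average dropped by more than `tolerance` between runs.
function findRegressions(prev: BatchSummary, curr: BatchSummary, tolerance = 0.05) {
  return Object.keys(prev)
    .filter((m) => m in curr && curr[m] < prev[m] - tolerance)
    .map((m) => ({ metric: m, before: prev[m], after: curr[m] }));
}
```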

Tuning the Evaluator

If human raters consistently disagree with automated scores on a specific metric, the evaluation criteria may be too strict or missing context. Inspect the g-eval-* generation to see exactly what the evaluator was told, and adjust the metric definition accordingly.

Scaling Quality Across Scenarios

Seed a new scenario with 5-10 human ratings. Once Human Correlation reaches the reliability bar (>= 0.7), run automated evals on the rest to identify weak spots without manual review.