
Documentation Index

Fetch the complete documentation index at: https://docs.litigationlabs.io/llms.txt

Use this file to discover all available pages before exploring further.

Overview

The Evaluation System is the central nervous system for monitoring and improving agent behavior. It bridges the gap between raw courtroom transcripts and actionable prompt engineering by correlating human expertise with automated “LLM-as-Judge” scoring, enabling a continuous feedback loop.

The Evaluation Engine

G-Eval Framework

Automated scoring relies on G-Eval, a framework that uses high-reasoning models (GPT-4o) to evaluate agent outputs against specific criteria; a minimal scoring sketch follows the list below.
  • Criteria-Driven: Each agent type (Witness, Judge, Opposing Counsel) has unique metrics like “Affidavit Faithfulness,” “Judicial Demeanor,” or “Strategic Quality.”
  • Reasoning-First: The evaluator provides a written rationale explaining why a response succeeded or failed.
  • Probabilistic Scoring: Results are mapped to a 0-1 scale, with defined thresholds for “Passing” grades.
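
To make the mechanics concrete, here is a minimal sketch of a single-criterion G-Eval call, assuming an OpenAI-style chat client. The prompt wording, the JSON reply shape, and the 0.7 passing threshold are illustrative assumptions, not the production implementation.

```python
# Minimal single-criterion G-Eval sketch. The prompt wording, JSON shape,
# and 0.7 passing threshold are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

def g_eval(criterion: str, context: str, response: str) -> dict:
    """Grade one agent response against one criterion; return rationale + score."""
    prompt = (
        f"Criterion: {criterion}\n\n"
        f"Context:\n{context}\n\n"
        f"Agent response:\n{response}\n\n"
        "Write a short rationale first, then a score from 0.0 to 1.0.\n"
        'Reply as JSON: {"rationale": "...", "score": 0.0}'
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force parseable output
    )
    result = json.loads(completion.choices[0].message.content)
    result["passed"] = result["score"] >= 0.7  # assumed passing threshold
    return result
```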

Human Feedback Loop

Trial teams and scenario designers rate agent responses through the Model Evaluation Feedback tool; an illustrative data shape follows the steps below.
  1. Batching: Responses are pulled from completed CaseSim sessions in slices (e.g., 10 at a time).
  2. Scoring: Users provide a 1-5 star rating and optional feedback on specific interactions.
  3. Automatic Embedding: Upon batch completion, rated responses are automatically converted into vector embeddings (tagged with source: human) to serve as future RAG exemplars.
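
Illustratively, one rating and its batch might be modeled like this; the field names are assumptions, not the tool's actual schema.

```python
# Hypothetical shapes for the human feedback loop; field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class HumanRating:
    session_id: str     # completed CaseSim session the response came from
    response_id: str    # the specific agent turn being rated
    stars: int          # 1-5 star rating
    feedback: str = ""  # optional free-text comment

@dataclass
class FeedbackBatch:
    slice_size: int = 10  # responses are pulled in slices, e.g. 10 at a time
    ratings: list[HumanRating] = field(default_factory=list)

    def is_complete(self) -> bool:
        # batch completion is what triggers automatic embedding (source: human)
        return len(self.ratings) >= self.slice_size
```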

Automated Evaluation Batches

  • Continuous Monitoring: A nightly cron job (currently triggered ad-hoc only, not yet on a schedule) evaluates recent sessions to detect performance drift.
  • On-Demand Testing: Admins can trigger ad-hoc batches via the Dashboard to test new scenarios or prompt changes.
  • Human Correlation: When automated evaluation batches include human-rated cases, the system calculates the Pearson correlation to verify that automated scores track human judgment (see the sketch after this list).
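
The correlation itself is standard Pearson r. A sketch, assuming automated scores on a 0-1 scale and human ratings as 1-5 stars:

```python
# Human-correlation check: Pearson r between automated and human scores.
# Assumes automated scores in [0, 1] and human ratings as 1-5 stars.
import numpy as np

def human_correlation(auto_scores: list[float], stars: list[int]) -> float:
    auto = np.asarray(auto_scores, dtype=float)
    human = (np.asarray(stars, dtype=float) - 1) / 4  # rescale 1-5 stars to 0-1
    return float(np.corrcoef(auto, human)[0, 1])      # Pearson correlation
```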

Dashboard Metrics

  • Overall Pass Rate: Percentage of agent responses meeting the quality threshold.
  • Average Score: Aggregate grade (0-100%) of agent performance.
  • Human Correlation: Alignment between AI and human scoring. High correlation (>=0.7) means automated metrics are reliable.
  • Divergent Cases: Outliers where AI and human scores differ by >=30%.
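
A sketch of how these figures could be derived from one batch. The 30% divergence cutoff mirrors the list above; the 0.7 pass threshold is an assumption.

```python
# Possible derivation of the dashboard figures from one batch.
# The 0.7 pass threshold is an assumption; the 30-point divergence
# cutoff mirrors the metric list above.
def dashboard_metrics(auto_scores: list[float], stars: list[int]):
    n = len(auto_scores)
    pass_rate = sum(s >= 0.7 for s in auto_scores) / n      # Overall Pass Rate
    avg_score = 100 * sum(auto_scores) / n                  # Average Score, 0-100%
    human_pct = [100 * (s - 1) / 4 for s in stars]          # 1-5 stars -> 0-100%
    divergent = [i for i in range(n)
                 if abs(100 * auto_scores[i] - human_pct[i]) >= 30]
    return pass_rate, avg_score, divergent
```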

Metric Definitions

Witness Metrics

  • Affidavit Faithfulness: Validates that the witness response contains ONLY facts from their sworn affidavit.
  • Behavioral Compliance: Measures alignment with the witness’s configured profile (e.g., Hostile vs. Cooperative).
  • Response Authenticity: Evaluates if the output sounds like a real human witness.
  • Witness Answer Relevancy: Assesses whether the witness directly addresses the question asked.
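
Illustratively, these metrics could be expressed as G-Eval criteria strings like the following; the exact production wording is not documented on this page.

```python
# Illustrative criterion wording; the production G-Eval prompts may differ.
WITNESS_CRITERIA = {
    "Affidavit Faithfulness":
        "The response contains ONLY facts from the witness's sworn affidavit; "
        "any fact not in the affidavit is a failure.",
    "Behavioral Compliance":
        "Tone and cooperativeness match the configured profile "
        "(e.g., Hostile vs. Cooperative).",
    "Response Authenticity":
        "The output sounds like a real human witness, not an AI assistant.",
    "Witness Answer Relevancy":
        "The answer directly addresses the question asked.",
}
```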

Judge Metrics

  • Ruling Correctness: Checks if the sustain/overrule decision is legally sound under the Federal Rules of Evidence (FRE).
  • Rule Citation: Evaluates whether the judge cites the correct rule number.
  • Judicial Demeanor: Tracks professionalism, impartiality, and courtroom decorum.
  • Judge Response Format: Ensures the ruling follows the required JSON schema (an illustrative payload appears after this list).
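
For illustration, a ruling payload satisfying such a schema might look like this; the real schema is defined in the codebase and may differ.

```python
# Hypothetical ruling payload for "Judge Response Format"; the actual
# JSON schema is not documented on this page.
EXAMPLE_RULING = {
    "ruling": "sustained",  # or "overruled"
    "rule": "FRE 802",      # rule citation supporting the ruling
    "explanation": "The question calls for hearsay not covered by an exception.",
}
```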

Opposing Counsel Metrics

  • OC Task Completion: Measures if the agent successfully made a decision or asked a valid question.
  • Strategic Quality: Evaluates whether the action advances the case theory.
  • Pedagogical Value: Checks whether the action creates a learning opportunity for the student.
  • OC Fact Grounding: Ensures the OC only references facts already established in the transcript.

The Improvement Loop

  1. Validation: A batch is reviewed on the dashboard. If human correlation is strong (>=0.40), it is deemed safe for RAG.
  2. Embedding: Clicking “Generate RAG Embeddings” encodes the Context + Response + Feedback into a 1536-dimensional vector.
  3. Retrieval: When an agent generates a new response, the system retrieves the most similar positive (4-5 star) and negative (1-2 star) examples, as sketched below.
  4. Steering: These examples are injected into the agent’s system prompt as “Do” and “Don’t” guidance.
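
A sketch of steps 2-3, assuming OpenAI's text-embedding-3-small model (which produces 1536-dimensional vectors) and a simple in-memory cosine-similarity search; the production embedding model and vector store are assumptions.

```python
# Embed-and-retrieve sketch for the improvement loop. Assumes OpenAI's
# text-embedding-3-small (1536 dimensions) and an in-memory store;
# the production model and vector store are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed_example(context: str, response: str, feedback: str) -> np.ndarray:
    """Step 2: encode Context + Response + Feedback into one vector."""
    text = f"Context: {context}\nResponse: {response}\nFeedback: {feedback}"
    out = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.asarray(out.data[0].embedding)  # 1536-dimensional

def retrieve(query: np.ndarray, store: list[tuple[np.ndarray, int]],
             positive: bool, k: int = 3) -> list[tuple[np.ndarray, int]]:
    """Step 3: nearest rated examples; 4-5 stars = positive, 1-2 = negative."""
    lo, hi = (4, 5) if positive else (1, 2)
    pool = [(vec, stars) for vec, stars in store if lo <= stars <= hi]
    sims = [float(query @ vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
            for vec, _ in pool]
    ranked = sorted(zip(sims, pool), key=lambda p: p[0], reverse=True)
    return [item for _, item in ranked[:k]]
```

The retrieved positives then become the "Do" examples and the negatives the "Don't" examples injected in step 4.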