
Documentation Index

Fetch the complete documentation index at: https://docs.litigationlabs.io/llms.txt

Use this file to discover all available pages before exploring further.

Overview

The Evaluation System is the central nervous system for monitoring and improving agent behavior. It bridges the gap between raw courtroom transcripts and actionable prompt engineering by correlating human expertise with automated “LLM-as-Judge” scoring, enabling a continuous feedback loop.

The Evaluation Engine

G-Eval Framework

Automated scoring relies on G-Eval, a framework that uses high-reasoning models (GPT-4o) to evaluate agent outputs against specific criteria; a minimal scoring sketch follows the list below.
  • Criteria-Driven: Each agent type (Witness, Judge, Opposing Counsel) has unique metrics like “Affidavit Faithfulness,” “Judicial Demeanor,” or “Strategic Quality.”
  • Reasoning-First: The evaluator provides a written rationale explaining why a response succeeded or failed.
  • Probabilistic Scoring: Results are mapped to a 0-1 scale, with defined thresholds for “Passing” grades.
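
To make the mechanics concrete, here is a minimal sketch of a single-criterion G-Eval call, assuming an OpenAI-style chat client. The prompt wording, the JSON reply shape, and the 0.7 passing threshold are illustrative assumptions, not the production implementation.

```python
# Minimal single-criterion G-Eval sketch. The prompt wording, JSON shape,
# and 0.7 passing threshold are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

def g_eval(criterion: str, context: str, response: str) -> dict:
    """Grade one agent response against one criterion; return rationale + score."""
    prompt = (
        f"Criterion: {criterion}\n\n"
        f"Context:\n{context}\n\n"
        f"Agent response:\n{response}\n\n"
        "Write a short rationale first, then a score from 0.0 to 1.0.\n"
        'Reply as JSON: {"rationale": "...", "score": 0.0}'
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force parseable output
    )
    result = json.loads(completion.choices[0].message.content)
    result["passed"] = result["score"] >= 0.7  # assumed passing threshold
    return result
```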

Human Feedback Loop

Trial teams and scenario designers rate agent responses through the Model Evaluation Feedback tool; an illustrative data shape follows the steps below.
  1. Batching: Responses are pulled from completed CaseSim sessions in slices (e.g., 10 at a time).
  2. Scoring: Users provide a 1-5 star rating and optional feedback on specific interactions.
  3. Automatic Embedding: Upon batch completion, rated responses are automatically converted into vector embeddings (tagged with source: human) to serve as future RAG exemplars.
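
Illustratively, one rating and its batch might be modeled like this; the field names are assumptions, not the tool's actual schema.

```python
# Hypothetical shapes for the human feedback loop; field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class HumanRating:
    session_id: str     # completed CaseSim session the response came from
    response_id: str    # the specific agent turn being rated
    stars: int          # 1-5 star rating
    feedback: str = ""  # optional free-text comment

@dataclass
class FeedbackBatch:
    slice_size: int = 10  # responses are pulled in slices, e.g. 10 at a time
    ratings: list[HumanRating] = field(default_factory=list)

    def is_complete(self) -> bool:
        # batch completion is what triggers automatic embedding (source: human)
        return len(self.ratings) >= self.slice_size
```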

Automated Evaluation Batches

  • Continuous Monitoring: A nightly cron job (currently triggered ad-hoc only, not yet on a schedule) evaluates recent sessions to detect performance drift.
  • On-Demand Testing: Admins can trigger ad-hoc batches via the Dashboard to test new scenarios or prompt changes.
  • Human Correlation: When automated evaluation batches include human-rated cases, the system calculates the Pearson correlation to verify that automated scores track human judgment (see the sketch after this list).
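
The correlation itself is standard Pearson r. A sketch, assuming automated scores on a 0-1 scale and human ratings as 1-5 stars:

```python
# Human-correlation check: Pearson r between automated and human scores.
# Assumes automated scores in [0, 1] and human ratings as 1-5 stars.
import numpy as np

def human_correlation(auto_scores: list[float], stars: list[int]) -> float:
    auto = np.asarray(auto_scores, dtype=float)
    human = (np.asarray(stars, dtype=float) - 1) / 4  # rescale 1-5 stars to 0-1
    return float(np.corrcoef(auto, human)[0, 1])      # Pearson correlation
```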

Dashboard Metrics

  • Overall Pass Rate: Percentage of agent responses meeting the quality threshold.
  • Average Score: Aggregate grade (0-100%) of agent performance.
  • Human Correlation: Alignment between AI and human scoring. High correlation (>=0.7) means automated metrics are reliable.
  • Divergent Cases: Outliers where AI and human scores differ by >=30%.
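
A sketch of how these figures could be derived from one batch. The 30% divergence cutoff mirrors the list above; the 0.7 pass threshold is an assumption.

```python
# Possible derivation of the dashboard figures from one batch.
# The 0.7 pass threshold is an assumption; the 30-point divergence
# cutoff mirrors the metric list above.
def dashboard_metrics(auto_scores: list[float], stars: list[int]):
    n = len(auto_scores)
    pass_rate = sum(s >= 0.7 for s in auto_scores) / n      # Overall Pass Rate
    avg_score = 100 * sum(auto_scores) / n                  # Average Score, 0-100%
    human_pct = [100 * (s - 1) / 4 for s in stars]          # 1-5 stars -> 0-100%
    divergent = [i for i in range(n)
                 if abs(100 * auto_scores[i] - human_pct[i]) >= 30]
    return pass_rate, avg_score, divergent
```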

Metric Definitions

Witness Metrics

  • Affidavit Faithfulness: Validates that the witness response contains ONLY facts from their sworn affidavit.
  • Behavioral Compliance: Measures alignment with the witness’s configured profile (e.g., Hostile vs. Cooperative).
  • Response Authenticity: Evaluates if the output sounds like a real human witness.
  • Witness Answer Relevancy: Assesses whether the witness directly addresses the question asked.
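
Illustratively, these metrics could be expressed as G-Eval criteria strings like the following; the exact production wording is not documented on this page.

```python
# Illustrative criterion wording; the production G-Eval prompts may differ.
WITNESS_CRITERIA = {
    "Affidavit Faithfulness":
        "The response contains ONLY facts from the witness's sworn affidavit; "
        "any fact not in the affidavit is a failure.",
    "Behavioral Compliance":
        "Tone and cooperativeness match the configured profile "
        "(e.g., Hostile vs. Cooperative).",
    "Response Authenticity":
        "The output sounds like a real human witness, not an AI assistant.",
    "Witness Answer Relevancy":
        "The answer directly addresses the question asked.",
}
```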

Judge Metrics

  • Ruling Correctness: Checks if the sustain/overrule decision is legally sound under the Federal Rules of Evidence (FRE).
  • Rule Citation: Evaluates whether the judge cites the correct rule number.
  • Judicial Demeanor: Tracks professionalism, impartiality, and courtroom decorum.
  • Judge Response Format: Ensures the ruling follows the required JSON schema (an illustrative payload appears after this list).
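
For illustration, a ruling payload satisfying such a schema might look like this; the real schema is defined in the codebase and may differ.

```python
# Hypothetical ruling payload for "Judge Response Format"; the actual
# JSON schema is not documented on this page.
EXAMPLE_RULING = {
    "ruling": "sustained",  # or "overruled"
    "rule": "FRE 802",      # rule citation supporting the ruling
    "explanation": "The question calls for hearsay not covered by an exception.",
}
```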

Opposing Counsel Metrics

  • OC Task Completion: Measures if the agent successfully made a decision or asked a valid question.
  • Strategic Quality: Evaluates whether the action advances the case theory.
  • Pedagogical Value: Checks whether the action creates a learning opportunity for the student.
  • OC Fact Grounding: Ensures the OC only references facts already established in the transcript.

The Improvement Loop

  1. Validation: A batch is reviewed on the dashboard. If human correlation is strong (>=0.40), it is deemed safe for RAG.
  2. Embedding: Clicking “Generate RAG Embeddings” encodes the Context + Response + Feedback into a 1536-dimensional vector.
  3. Retrieval: When an agent generates a new response, the system retrieves the most similar positive (4-5 star) and negative (1-2 star) examples, as sketched below.
  4. Steering: These examples are injected into the agent’s system prompt as “Do” and “Don’t” guidance.
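
A sketch of steps 2-3, assuming OpenAI's text-embedding-3-small model (which produces 1536-dimensional vectors) and a simple in-memory cosine-similarity search; the production embedding model and vector store are assumptions.

```python
# Embed-and-retrieve sketch for the improvement loop. Assumes OpenAI's
# text-embedding-3-small (1536 dimensions) and an in-memory store;
# the production model and vector store are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed_example(context: str, response: str, feedback: str) -> np.ndarray:
    """Step 2: encode Context + Response + Feedback into one vector."""
    text = f"Context: {context}\nResponse: {response}\nFeedback: {feedback}"
    out = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.asarray(out.data[0].embedding)  # 1536-dimensional

def retrieve(query: np.ndarray, store: list[tuple[np.ndarray, int]],
             positive: bool, k: int = 3) -> list[tuple[np.ndarray, int]]:
    """Step 3: nearest rated examples; 4-5 stars = positive, 1-2 = negative."""
    lo, hi = (4, 5) if positive else (1, 2)
    pool = [(vec, stars) for vec, stars in store if lo <= stars <= hi]
    sims = [float(query @ vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
            for vec, _ in pool]
    ranked = sorted(zip(sims, pool), key=lambda p: p[0], reverse=True)
    return [item for _, item in ranked[:k]]
```

The retrieved positives then become the "Do" examples and the negatives the "Don't" examples injected in step 4.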