# Evaluation

> How LitigationLabs uses G-Eval automated scoring to measure agent quality, with every evaluation traced through Langfuse.

## Overview

The evaluation system is the central mechanism for monitoring and improving agent behavior. It uses **G-Eval** — an LLM-as-judge framework — to score every agent response against domain-specific criteria like legal correctness, behavioral compliance, and pedagogical value. Every evaluation call is traced in Langfuse under the `evals/automated` trace name.

## How G-Eval Works

G-Eval sends the agent's response, its context, and a set of evaluation criteria to an LLM evaluator. The evaluator model is selected with the `evaluatorModel` parameter (see Configuration for the default). The evaluator returns:

* A **score** (0 to 1) for each metric.
* A **rationale** explaining why it scored the response that way.
* A **pass/fail** determination based on a predefined threshold.

Each metric evaluation is traced as a generation in Langfuse with the name `g-eval-{MetricName}`, so you can inspect the evaluator's reasoning for any score.
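
As a rough illustration of the three outputs listed above, the sketch below models a single metric result. The class name, field names, and the metric identifier `RulingCorrectness` are hypothetical, and the rule that a score equal to the threshold passes is an assumption; the source only says pass/fail is based on a predefined threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricResult:
    """Hypothetical shape of one G-Eval metric result."""

    metric: str        # metric identifier, e.g. "RulingCorrectness" (assumed format)
    score: float       # evaluator score between 0 and 1
    rationale: str     # evaluator's explanation for the score
    threshold: float   # predefined pass threshold for this metric

    @property
    def passed(self) -> bool:
        # Assumption: a score at or above the threshold counts as a pass.
        return self.score >= self.threshold

    @property
    def generation_name(self) -> str:
        # Mirrors the `g-eval-{MetricName}` naming used for Langfuse generations.
        return f"g-eval-{self.metric}"


result = MetricResult(
    metric="RulingCorrectness",
    score=0.72,
    rationale="The ruling matches the cited rule.",
    threshold=0.7,
)
print(result.generation_name, result.passed)
```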

## Agent-Specific Metrics

### Witness Metrics

| Metric                       | Threshold | What It Measures                                                                                                           |
| :--------------------------- | :-------- | :------------------------------------------------------------------------------------------------------------------------- |
| **Affidavit Faithfulness**   | 0.7       | Does the response contain only facts from the sworn affidavit? Penalizes fabrications, embellishments, and contradictions. |
| **Behavioral Compliance**    | 0.6       | Does the response match the witness's configured personality — cooperativeness, verbosity, memory quality?                 |
| **Response Authenticity**    | 0.6       | Does the output sound like a real human witness, not AI? Checks for first-person perspective and natural speech patterns.  |
| **Witness Answer Relevancy** | 0.6       | Does the witness directly address the question without unnecessary tangents?                                               |

### Judge Metrics

| Metric                    | Threshold | What It Measures                                                                    |
| :------------------------ | :-------- | :---------------------------------------------------------------------------------- |
| **Ruling Correctness**    | 0.7       | Is the sustain/overrule decision legally sound under the Federal Rules of Evidence? |
| **Rule Citation**         | 0.5       | Does the judge cite the correct FRE rule number and apply it properly?              |
| **Judicial Demeanor**     | 0.5       | Is the tone professional, impartial, and appropriate for a courtroom?               |
| **Judge Response Format** | 0.5       | Does the output follow the required JSON schema for the orchestrator?               |

### Opposing Counsel Metrics

| Metric                 | Threshold | What It Measures                                                                          |
| :--------------------- | :-------- | :---------------------------------------------------------------------------------------- |
| **OC Task Completion** | 0.6       | Did the agent make a clear decision (objection vs. no objection) or ask a valid question? |
| **Strategic Quality**  | 0.5       | Does the action advance the case theory, or is it a wasted move?                          |
| **Pedagogical Value**  | 0.5       | Does the action teach the student something? Includes intentional errors.                 |
| **OC Fact Grounding**  | 0.6       | Does the agent only reference facts already established in the transcript?                |
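
The thresholds in the three tables above can be written down as data and used to check a set of scores. This is only a sketch: the dictionary keys reuse the display names from the tables, the `failing_metrics` helper is hypothetical, and treating a score equal to the threshold as a pass is an assumption.

```python
# Thresholds copied from the metric tables above, keyed by agent type.
THRESHOLDS = {
    "witness": {
        "Affidavit Faithfulness": 0.7,
        "Behavioral Compliance": 0.6,
        "Response Authenticity": 0.6,
        "Witness Answer Relevancy": 0.6,
    },
    "judge": {
        "Ruling Correctness": 0.7,
        "Rule Citation": 0.5,
        "Judicial Demeanor": 0.5,
        "Judge Response Format": 0.5,
    },
    "opposing_counsel": {
        "OC Task Completion": 0.6,
        "Strategic Quality": 0.5,
        "Pedagogical Value": 0.5,
        "OC Fact Grounding": 0.6,
    },
}


def failing_metrics(agent_type: str, scores: dict[str, float]) -> list[str]:
    """Return the metrics whose score falls below the table threshold."""
    thresholds = THRESHOLDS[agent_type]
    # Assumption: a score equal to the threshold passes.
    return [name for name, score in scores.items() if score < thresholds[name]]


print(failing_metrics("judge", {"Ruling Correctness": 0.65, "Rule Citation": 0.5}))
```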

## Running Evaluation Batches

Batches can be triggered via the Eval Dashboard or the API. The system supports both ad-hoc runs for testing prompt changes and scheduled monitoring for detecting performance drift.

```
POST /api/evals/automated
```

### Configuration

| Parameter           | Description                                                      |
| :------------------ | :--------------------------------------------------------------- |
| `agentTypes`        | Which agents to evaluate: `witness`, `judge`, `opposing_counsel` |
| `daysBack`          | Only include sessions from the last N days                       |
| `maxTestCases`      | Cap on total evaluations to run                                  |
| `evaluatorModel`    | Model for the evaluator (default: `openai/gpt-4o-mini`)          |
| `includeHumanRated` | Include cases that have human ratings for correlation analysis   |
| `sessionIds`        | Limit to specific sessions                                       |
| `scenarioIds`       | Limit to specific scenarios                                      |
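
A minimal sketch of preparing a batch request with these parameters, using only the Python standard library. Several things here are assumptions not stated in the source: that the endpoint accepts a JSON body, the value types shown (list, integer, boolean), the placeholder base URL, and the absence of authentication headers. The request is built but not sent.

```python
import json
import urllib.request


def build_batch_request(base_url: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for an evaluation batch."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/evals/automated",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )


# Parameter names come from the table above; the values are illustrative.
config = {
    "agentTypes": ["witness", "judge"],
    "daysBack": 7,
    "maxTestCases": 50,
    "evaluatorModel": "openai/gpt-4o-mini",
    "includeHumanRated": True,
}

# "https://example.invalid" is a placeholder, not the real host.
request = build_batch_request("https://example.invalid", config)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; any authentication the deployment requires would need to be added to the headers first.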

### What Happens

1. **Dataset building:** The system extracts test cases from session transcripts. Each test case is one agent response with its triggering context, the preceding conversation, and (for witnesses) the relevant affidavit.

2. **Metric evaluation:** Each test case is scored against its agent-specific metric suite (four metrics per agent type in the tables above). Every metric call is traced individually in Langfuse.

3. **Results storage:** Individual results are stored in the `automatedEvalResults` collection. Batch summaries go into `automatedEvalBatches`.

4. **Summary calculation:** The system computes per-agent averages, per-metric pass rates, and an overall pass rate across all evaluations.
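
The summary step can be sketched as a small aggregation over individual results. The record fields (`agentType`, `metric`, `score`, `passed`) and the output keys are hypothetical; the source does not document the schema of `automatedEvalResults` or `automatedEvalBatches`.

```python
from collections import defaultdict


def summarize(results: list[dict]) -> dict:
    """Compute per-agent averages, per-metric pass rates, and overall pass rate."""
    if not results:
        return {"agentAverages": {}, "metricPassRates": {}, "overallPassRate": 0.0}

    scores_by_agent = defaultdict(list)
    passes_by_metric = defaultdict(list)
    for r in results:
        scores_by_agent[r["agentType"]].append(r["score"])
        passes_by_metric[r["metric"]].append(r["passed"])

    return {
        "agentAverages": {a: sum(s) / len(s) for a, s in scores_by_agent.items()},
        "metricPassRates": {m: sum(p) / len(p) for m, p in passes_by_metric.items()},
        "overallPassRate": sum(r["passed"] for r in results) / len(results),
    }


sample = [
    {"agentType": "witness", "metric": "Affidavit Faithfulness", "score": 0.75, "passed": True},
    {"agentType": "witness", "metric": "Affidavit Faithfulness", "score": 0.25, "passed": False},
    {"agentType": "judge", "metric": "Rule Citation", "score": 1.0, "passed": True},
    {"agentType": "judge", "metric": "Rule Citation", "score": 0.5, "passed": True},
]
summary = summarize(sample)
print(summary)
```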

## Dashboard Metrics

The Eval Dashboard surfaces these top-level indicators:

| Metric                | Interpretation                                                                                  |
| :-------------------- | :---------------------------------------------------------------------------------------------- |
| **Overall Pass Rate** | Percentage of agent responses meeting the quality threshold across all metrics.                 |
| **Average Score**     | Aggregate grade (0-100%) of agent performance.                                                  |
| **Human Correlation** | Correlation between automated and human scores; >= 0.7 means automated metrics are reliable.    |
| **Divergent Cases**   | Cases where automated and human scores differ by >= 30%; the most valuable inputs for tuning.   |
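
To make the last two rows concrete, here is a sketch of how human correlation and divergent cases could be computed. The source does not say which correlation statistic the dashboard uses; Pearson correlation is an assumption here. The sketch also assumes both scores are on a 0 to 1 scale, so a 30% gap is read as an absolute difference of 0.30. Function names and case IDs are hypothetical.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def divergent_cases(pairs: list[tuple[str, float, float]], gap: float = 0.30) -> list[str]:
    """Return IDs of cases where automated and human scores differ by at least `gap`."""
    return [case_id for case_id, ai, human in pairs if abs(ai - human) >= gap]


# (case_id, automated_score, human_score); values are illustrative.
rated = [("case-a", 0.9, 0.4), ("case-b", 0.8, 0.7), ("case-c", 0.3, 0.35)]
r = pearson([ai for _, ai, _ in rated], [human for _, _, human in rated])
print(round(r, 3), divergent_cases(rated))
```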

Trend charts track performance over time. A sudden dip in "Rule Citation" for the Judge agent might indicate a regression from a model update or a particularly tricky new scenario.

## Viewing Results in Langfuse

### Batch Traces

Each batch evaluation creates a trace named `evals/automated` that contains:

* The batch configuration as input.
* Summary statistics as output.
* Individual `g-eval-*` generations nested inside for each metric evaluation.

### Inspecting a Score

To understand why a specific response received a low score:

1. Find the batch trace in Langfuse.
2. Locate the `g-eval-{MetricName}` generation for that test case.
3. Read the evaluator's rationale — it explains which criteria the response failed and why.
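
If the observations of a batch trace have been exported to plain dictionaries, the lookup in step 2 amounts to a filter on the generation name. The record shape below (`name`, `output`) and the metric identifier `RuleCitation` are hypothetical stand-ins, not the Langfuse API or its export format.

```python
def find_metric_generations(observations: list[dict], metric_name: str) -> list[dict]:
    """Return the observations named `g-eval-{metric_name}`."""
    target = f"g-eval-{metric_name}"
    return [obs for obs in observations if obs.get("name") == target]


# Illustrative export of one batch trace's nested generations.
observations = [
    {"name": "g-eval-RuleCitation", "output": {"score": 0.4, "rationale": "Cited the wrong rule."}},
    {"name": "g-eval-JudicialDemeanor", "output": {"score": 0.9, "rationale": "Professional tone."}},
]

for generation in find_metric_generations(observations, "RuleCitation"):
    print(generation["output"]["rationale"])
```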

## Practical Workflows

### Regression Testing

After updating an agent prompt, run a batch evaluation on a standard scenario. Compare metric scores to the previous batch. If Affidavit Faithfulness drops, the prompt change may need revision.
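
A comparison between two batches can be sketched as a per-metric diff of average scores. The input shape (metric name mapped to average score), the function name, and the 0.05 tolerance are assumptions chosen for illustration; pick a tolerance that matches the noise you see between identical runs.

```python
def find_regressions(
    previous: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, tuple[float, float]]:
    """Return metrics whose average score dropped by more than `tolerance`."""
    regressions = {}
    for metric, old_score in previous.items():
        new_score = current.get(metric)
        # Metrics missing from the current batch are skipped, not flagged.
        if new_score is not None and old_score - new_score > tolerance:
            regressions[metric] = (old_score, new_score)
    return regressions


before = {"Affidavit Faithfulness": 0.82, "Behavioral Compliance": 0.70}
after = {"Affidavit Faithfulness": 0.71, "Behavioral Compliance": 0.69}
print(find_regressions(before, after))
```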

### Tuning the Evaluator

If human raters consistently disagree with automated scores on a specific metric, the evaluation criteria may be too strict or missing context. Inspect the `g-eval-*` generation to see exactly what the evaluator was told, and adjust the metric definition accordingly.

### Scaling Quality Across Scenarios

Seed a new scenario with 5-10 human ratings. Once Human Correlation for that scenario reaches 0.7 or higher, run automated evals on the remaining sessions to identify weak spots without manual review.
