
# Annotations

> How human feedback and ratings are captured, correlated with automated scores, and used to improve agent quality.

## Overview

Annotations are the human side of the evaluation system. Team members rate agent responses on a 1-to-5 scale and can leave written feedback. Those ratings serve as the ground truth for validating automated evaluations and for improving agents through retrieval-augmented generation (RAG).

## How Annotations Work

### Rating Agent Responses

Through the Eval Dashboard, reviewers rate agent responses from completed CaseSim sessions:

1. **Batching** — Responses are pulled from completed sessions in groups (e.g., 10 at a time). Each response shows the question that prompted it, the agent's answer, and the surrounding context.

2. **Scoring** — The reviewer assigns a 1-5 star rating:
   * **5 (Excellent)** — Response is legally accurate, behaviorally appropriate, and well-crafted.
   * **4 (Good)** — Solid response with minor issues.
   * **3 (Average)** — Acceptable but room for improvement.
   * **2 (Poor)** — Notable problems with accuracy, tone, or relevance.
   * **1 (Bad)** — Fundamentally wrong or inappropriate.

3. **Feedback** — Optional written feedback explaining the rating. This feedback becomes part of the embedding when the rating is used for RAG.
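
The rating scale above can be modeled as a small validated record. This is a minimal sketch, not the actual schema: the class name `Annotation`, the `response_id` field, and the snake_case spellings of `humanRating` and `humanFeedback` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Labels for the 1-5 star scale described above.
RATING_LABELS = {5: "Excellent", 4: "Good", 3: "Average", 2: "Poor", 1: "Bad"}


@dataclass(frozen=True)
class Annotation:
    """One reviewer rating for one agent response (hypothetical shape)."""

    response_id: str                      # assumed identifier for the rated response
    human_rating: int                     # 1-5 stars
    human_feedback: Optional[str] = None  # written feedback is optional

    def __post_init__(self) -> None:
        # Reject anything outside the documented 1-5 scale.
        if self.human_rating not in RATING_LABELS:
            raise ValueError("human_rating must be an integer from 1 to 5")

    @property
    def label(self) -> str:
        """Human-readable label for the star rating."""
        return RATING_LABELS[self.human_rating]
```

For example, `Annotation("resp-1", 4, "Minor citation slip").label` evaluates to `"Good"`, while a rating of 6 raises `ValueError`.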

### What Gets Rated

Any agent response can be annotated:

* **Witness answers** — Was the testimony faithful to the affidavit? Did the witness stay in character?
* **Judge rulings** — Was the ruling legally correct? Were the right FRE rules cited?
* **OCA actions** — Was the objection strategically sound? Was the question well-crafted?

## Correlation with Automated Scores

Human ratings are the benchmark for validating the G-Eval automated scoring system. When an evaluation batch includes responses that have both human and automated ratings, the system calculates correlation metrics:

| Metric                   | What It Tells You                                                                                             |
| :----------------------- | :------------------------------------------------------------------------------------------------------------ |
| **Pearson correlation**  | How closely automated scores track with human ratings on a linear scale. >= 0.7 is strong.                    |
| **Spearman correlation** | How well the ranking order matches — do humans and the evaluator agree on which responses are best and worst? |
| **Mean difference**      | The average gap between normalized human (1-5 mapped to 0-1) and automated scores. Smaller is better.         |
| **Sample size**          | How many paired ratings were available. At least 5 are needed for meaningful correlation.                     |
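
The four metrics in the table can be computed with the standard library alone. The sketch below is illustrative and rests on two assumptions that the text does not confirm: the 1-5 to 0-1 mapping is linear, `(stars - 1) / 4`, and "mean difference" is the mean absolute gap. All function names are hypothetical.

```python
import math
from statistics import mean


def normalize_rating(stars: int) -> float:
    """Map a 1-5 star rating onto 0-1 (assumed linear mapping)."""
    return (stars - 1) / 4


def pearson(xs, ys) -> float:
    """Pearson correlation; returns NaN if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


def correlation_report(human_stars, auto_scores):
    """Compare paired human star ratings with automated 0-1 scores."""
    if len(human_stars) != len(auto_scores) or not human_stars:
        raise ValueError("need a non-empty list of paired ratings")
    human = [normalize_rating(s) for s in human_stars]
    return {
        "pearson": pearson(human, auto_scores),
        "spearman": spearman(human, auto_scores),
        # Assumption: mean absolute gap between the two normalized scores.
        "mean_difference": mean([abs(h - a) for h, a in zip(human, auto_scores)]),
        "sample_size": len(human),
    }
```

Spearman is computed here as Pearson on ranks, which handles tied star ratings (common on a 5-point scale) by giving them averaged ranks.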

### When Correlation Is Low

If correlation drops below 0.4, investigate:

* **Small sample size** — Fewer than 5 paired ratings can produce unreliable correlation numbers.
* **Subjective metrics** — Metrics like "Strategic Quality" are harder for LLMs to evaluate consistently. Consider simplifying the criteria.
* **Criteria mismatch** — The automated evaluator may be applying different standards than the human reviewers. Inspect the `g-eval-*` generations in Langfuse to compare the evaluator's reasoning with human feedback.
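
The checklist above can be expressed as a small triage helper. This is only a sketch of the order in which the checks might be applied; the function name and the wording of the returned hints are hypothetical, while the thresholds (0.4 and 5) come from this page.

```python
from typing import List

LOW_CORRELATION = 0.4    # below this, investigate
MIN_PAIRED_RATINGS = 5   # fewer pairs give unreliable correlation numbers


def triage_low_correlation(correlation: float, sample_size: int) -> List[str]:
    """Return the checks worth running for a batch, in suggested order."""
    if correlation >= LOW_CORRELATION:
        return []  # nothing to investigate
    if sample_size < MIN_PAIRED_RATINGS:
        # The number itself is not trustworthy yet, so stop here.
        return ["small sample size: collect more paired ratings"]
    return [
        "subjective metrics: consider simplifying the criteria",
        "criteria mismatch: compare evaluator reasoning with human feedback",
    ]
```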

## The RAG Improvement Loop

Human-rated responses power the continuous improvement system:

1. **Validation** — When a batch's human correlation meets the reliability threshold (>= 0.40), the ratings are considered reliable enough for RAG.

2. **Embedding** — Clicking "Generate RAG Embeddings" in the Eval Dashboard encodes each rated response into a 1536-dimensional vector. The embedding combines the context, the response, and the reviewer's feedback.

3. **Retrieval** — When an agent generates a new response during a CaseSim session, the system searches for the most similar past responses in the embedding database.

4. **Steering** — Highly rated examples (4-5 stars) are injected into the agent's prompt as positive guidance ("responses like this are excellent"). Low-rated examples (1-2 stars) are injected as negative guidance ("avoid responses like this"). The feedback text is included so the agent understands *why* each example was good or bad.
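
The steering step can be sketched as a function that turns retrieved examples into prompt guidance. The names `RatedExample` and `build_guidance` and the exact prompt wording are hypothetical. Skipping 3-star examples is also an assumption, since only the 4-5 and 1-2 bands are described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RatedExample:
    response: str
    rating: int          # 1-5 stars
    feedback: str = ""   # reviewer's written feedback, if any


def build_guidance(examples: List[RatedExample]) -> str:
    """Format retrieved examples as positive and negative prompt guidance."""
    positive = [e for e in examples if e.rating >= 4]
    negative = [e for e in examples if e.rating <= 2]
    # Assumption: 3-star examples are neither positive nor negative, so skip them.
    lines: List[str] = []
    for header, group in (
        ("Responses like this are excellent:", positive),
        ("Avoid responses like this:", negative),
    ):
        if not group:
            continue
        lines.append(header)
        for example in group:
            lines.append(f"- {example.response}")
            if example.feedback:
                # Include the feedback so the agent sees why it was rated this way.
                lines.append(f"  Reviewer feedback: {example.feedback}")
    return "\n".join(lines)
```

An empty string is returned when no example falls into either band, which lets the caller leave the agent's prompt unchanged.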

### What Gets Embedded

Each embedding combines multiple fields for semantic richness:

```
Context: [the question or event that prompted the response]
Response: [the agent's actual response]
Feedback: [the reviewer's written feedback]
Agent: [witness | judge | opposing_counsel]
Phase: [courtroom phase]
```
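
Assembling that text before it is sent to an embedding model might look like the following. This is a minimal sketch: the function name is hypothetical, and leaving the `Feedback:` line empty when no feedback was written is an assumption.

```python
AGENT_TYPES = ("witness", "judge", "opposing_counsel")


def build_embedding_text(context: str, response: str, feedback: str,
                         agent: str, phase: str) -> str:
    """Combine the rated fields into the single string that gets embedded."""
    if agent not in AGENT_TYPES:
        raise ValueError(f"unknown agent type: {agent!r}")
    return "\n".join([
        f"Context: {context}",
        f"Response: {response}",
        f"Feedback: {feedback}",  # assumption: empty when no feedback was written
        f"Agent: {agent}",
        f"Phase: {phase}",
    ])
```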

### Metadata Filtering

RAG retrieval uses metadata to ensure relevant examples:

* **Agent type** — A witness agent only gets witness examples, never judge examples.
* **Phase** — Cross-examination responses are compared against other cross-examination examples.
* **Similarity threshold** — If no examples are similar enough, the system skips RAG injection rather than using irrelevant guidance.
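
The three filters can be combined in a single retrieval function. The sketch below assumes cosine similarity, an in-memory list of dictionaries, and placeholder values for `threshold` and `top_k`. None of these are confirmed by this page, and a real deployment would more likely query a vector database.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_examples(query_vector: Sequence[float], store: List[Dict],
                      agent: str, phase: str,
                      threshold: float = 0.75,  # placeholder value
                      top_k: int = 3) -> List[Dict]:
    """Return the most similar stored examples for the same agent and phase."""
    # Metadata filter: same agent type and same courtroom phase only.
    candidates = [e for e in store if e["agent"] == agent and e["phase"] == phase]
    scored = [(cosine_similarity(query_vector, e["vector"]), e) for e in candidates]
    # Similarity threshold: drop anything that is not close enough.
    scored = [pair for pair in scored if pair[0] >= threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:top_k]]
```

An empty result means nothing passed the filters, in which case the caller skips RAG injection instead of falling back to weaker matches.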

## Visibility in Langfuse

Human ratings are stored in Payload CMS alongside the evaluation results. While they aren't pushed as annotations directly to individual Langfuse spans, you can cross-reference them:

* The evaluation batch trace in Langfuse includes the `batchRunId`.
* The same `batchRunId` links to results in Payload that contain the `humanRating` and `humanFeedback` fields.
* This lets you correlate Langfuse's automated generation data with human judgment stored in the eval system.
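
Once both sides have been exported, the cross-reference is a join on `batchRunId`. The sketch below works on plain dictionaries and makes no API calls. The flat record shapes are assumptions: the field may be nested differently in a real Langfuse trace or Payload document.

```python
from typing import Dict, List


def join_by_batch_run_id(traces: List[Dict], results: List[Dict]) -> List[Dict]:
    """Attach Payload eval results to each Langfuse trace with the same batchRunId."""
    by_batch: Dict[str, List[Dict]] = {}
    for result in results:
        by_batch.setdefault(result["batchRunId"], []).append(result)
    return [
        {"trace": trace, "results": by_batch.get(trace["batchRunId"], [])}
        for trace in traces
    ]
```

Each joined entry then exposes `humanRating` and `humanFeedback` next to the trace they belong to.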