

Overview

Annotations are the human side of the evaluation system. When team members rate agent responses — scoring them 1 to 5 and leaving feedback — those ratings serve as the ground truth for validating automated evaluations and for driving agent improvement through retrieval-augmented generation (RAG).

How Annotations Work

Rating Agent Responses

Through the Eval Dashboard, reviewers rate agent responses from completed CaseSim sessions:
  1. Batching — Responses are pulled from completed sessions in groups (e.g., 10 at a time). Each response shows the question that prompted it, the agent’s answer, and the surrounding context.
  2. Scoring — The reviewer assigns a 1-5 star rating:
    • 5 (Excellent) — Response is legally accurate, behaviorally appropriate, and well-crafted.
    • 4 (Good) — Solid response with minor issues.
    • 3 (Average) — Acceptable, with room for improvement.
    • 2 (Poor) — Notable problems with accuracy, tone, or relevance.
    • 1 (Bad) — Fundamentally wrong or inappropriate.
  3. Feedback — Optional written feedback explaining the rating. This feedback becomes part of the embedding when the rating is used for RAG.
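The exact storage schema isn't documented here, but a minimal sketch of what a single annotation record might capture looks like this (field names are illustrative, not the actual Payload collection fields):

```typescript
// Illustrative shape of one human annotation; field names are assumptions,
// not the actual Payload collection schema.
type AgentType = 'witness' | 'judge' | 'opposing_counsel';

interface Annotation {
  responseId: string;          // the agent response being rated
  sessionId: string;           // the completed CaseSim session it came from
  agentType: AgentType;        // which agent produced the response
  phase: string;               // courtroom phase, e.g. cross-examination
  rating: 1 | 2 | 3 | 4 | 5;   // 1 = Bad ... 5 = Excellent
  feedback?: string;           // optional written explanation of the rating
  createdAt: string;           // ISO timestamp
}
```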

What Gets Rated

Any agent response can be annotated:
  • Witness answers — Was the testimony faithful to the affidavit? Did the witness stay in character?
  • Judge rulings — Was the ruling legally correct? Were the right FRE rules cited?
  • OCA (opposing counsel agent) actions — Was the objection strategically sound? Was the question well-crafted?

Correlation with Automated Scores

Human ratings are the benchmark for validating the G-Eval automated scoring system. When an evaluation batch includes responses that have both human and automated ratings, the system calculates correlation metrics:
  • Pearson correlation: How closely the automated scores track human ratings on a linear scale. >= 0.7 is strong.
  • Spearman correlation: How well the ranking order matches, i.e. whether humans and the evaluator agree on which responses are best and worst.
  • Mean difference: The average gap between normalized human ratings (1-5 mapped to 0-1) and automated scores. Smaller is better.
  • Sample size: How many paired ratings were available. At least 5 are needed for meaningful correlation.
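For reference, both correlation measures can be computed directly from the paired scores. This is a self-contained sketch, not the dashboard's actual implementation (which may handle tied ranks and edge cases differently):

```typescript
// Pearson and Spearman correlation over paired (human, automated) scores.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mx = x.reduce((a, b) => a + b, 0) / n;
  const my = y.reduce((a, b) => a + b, 0) / n;
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (x[i] - mx) * (y[i] - my);
    dx += (x[i] - mx) ** 2;
    dy += (y[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

// Spearman is Pearson computed over rank-transformed values.
function ranks(values: number[]): number[] {
  const sorted = values.map((v, i) => ({ v, i })).sort((a, b) => a.v - b.v);
  const r = new Array<number>(values.length);
  sorted.forEach((item, rank) => { r[item.i] = rank + 1; });
  return r; // ties get arbitrary order here; real implementations average tied ranks
}

function spearman(x: number[], y: number[]): number {
  return pearson(ranks(x), ranks(y));
}

// Example: human ratings normalized to 0-1 vs automated G-Eval scores.
const human = [1.0, 0.75, 0.5, 0.25, 1.0]; // (5, 4, 3, 2, 5) mapped to 0-1
const auto  = [0.9, 0.7, 0.6, 0.3, 0.85];
console.log(pearson(human, auto), spearman(human, auto));
```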

When Correlation Is Low

If correlation drops below 0.4, investigate:
  • Small sample size — Fewer than 5 paired ratings can produce unreliable correlation numbers.
  • Subjective metrics — Metrics like “Strategic Quality” are harder for LLMs to evaluate consistently. Consider simplifying the criteria.
  • Criteria mismatch — The automated evaluator may be applying different standards than the human reviewers. Inspect the g-eval-* generations in Langfuse to compare the evaluator’s reasoning with human feedback.
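One way to pull the evaluator's generations for comparison is through the Langfuse public API. The sketch below assumes the standard observations list endpoint with Basic auth using your project keys; exact query parameters and response fields may differ by Langfuse version:

```typescript
// List generations via the Langfuse public API and keep the g-eval-* ones,
// so their reasoning can be compared with human feedback.
type Observation = { name?: string; output?: unknown };

const auth = Buffer.from(
  `${process.env.LANGFUSE_PUBLIC_KEY}:${process.env.LANGFUSE_SECRET_KEY}`,
).toString('base64');

const res = await fetch(
  'https://cloud.langfuse.com/api/public/observations?type=GENERATION&limit=100',
  { headers: { Authorization: `Basic ${auth}` } },
);
const { data } = await res.json();

// Keep only the automated evaluator's generations.
const gEval = (data as Observation[]).filter((obs) => obs.name?.startsWith('g-eval-'));
for (const obs of gEval) {
  console.log(obs.name, obs.output); // the evaluator's reasoning and score output
}
```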

The RAG Improvement Loop

Human-rated responses power the continuous improvement system:
  1. Validation — When a batch shows adequate correlation between human and automated ratings (>= 0.40), the human ratings are considered reliable enough to use for RAG.
  2. Embedding — Clicking “Generate RAG Embeddings” in the Eval Dashboard encodes each rated response into a 1536-dimensional vector. The embedding combines the context, the response, and the reviewer’s feedback.
  3. Retrieval — When an agent generates a new response during a CaseSim session, the system searches for the most similar past responses in the embedding database.
  4. Steering — Highly-rated examples (4-5 stars) are injected into the agent’s prompt as positive guidance (“responses like this are excellent”). Low-rated examples (1-2 stars) are injected as negative guidance (“avoid responses like this”). The feedback text is included so the agent understands why each example was good or bad.
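Putting steps 3 and 4 together, prompt steering from retrieved examples might look roughly like the following. The function and field names are illustrative, and the exact guidance wording is an assumption:

```typescript
// Illustrative prompt steering from retrieved, human-rated examples.
interface RetrievedExample {
  context: string;
  response: string;
  feedback?: string;
  rating: number; // 1-5 human rating
}

function buildSteeringBlock(examples: RetrievedExample[]): string {
  const positive = examples.filter((e) => e.rating >= 4); // 4-5 stars
  const negative = examples.filter((e) => e.rating <= 2); // 1-2 stars

  const render = (e: RetrievedExample) =>
    `Context: ${e.context}\nResponse: ${e.response}` +
    (e.feedback ? `\nReviewer feedback: ${e.feedback}` : '');

  const parts: string[] = [];
  if (positive.length) {
    parts.push('Responses like these were rated excellent:\n' + positive.map(render).join('\n---\n'));
  }
  if (negative.length) {
    parts.push('Avoid responses like these, which were rated poorly:\n' + negative.map(render).join('\n---\n'));
  }
  return parts.join('\n\n');
}
```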

What Gets Embedded

Each embedding combines multiple fields for semantic richness:
Context: [the question or event that prompted the response]
Response: [the agent's actual response]
Feedback: [the reviewer's written feedback]
Agent: [witness | judge | opposing_counsel]
Phase: [courtroom phase]
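Assembling that text and generating the vector might look like this. The embedding model shown is an assumption; the documentation only states that the result is a 1536-dimensional vector:

```typescript
import OpenAI from 'openai';

// Build the multi-field text described above and embed it.
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

interface RatedResponse {
  context: string;
  response: string;
  feedback?: string;
  agent: 'witness' | 'judge' | 'opposing_counsel';
  phase: string;
}

function embeddingText(r: RatedResponse): string {
  return [
    `Context: ${r.context}`,
    `Response: ${r.response}`,
    `Feedback: ${r.feedback ?? ''}`,
    `Agent: ${r.agent}`,
    `Phase: ${r.phase}`,
  ].join('\n');
}

async function embed(r: RatedResponse): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small', // assumed model; produces 1536 dimensions
    input: embeddingText(r),
  });
  return res.data[0].embedding;
}
```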

Metadata Filtering

RAG retrieval uses metadata to ensure relevant examples:
  • Agent type — A witness agent only gets witness examples, never judge examples.
  • Phase — Cross-examination responses are compared against other cross-examination examples.
  • Similarity threshold — If no examples are similar enough, the system skips RAG injection rather than using irrelevant guidance.
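A simplified retrieval sketch that applies these filters over an in-memory store (the storage backend and the exact similarity threshold are assumptions):

```typescript
// Metadata-filtered similarity search; illustrative only.
interface StoredExample {
  vector: number[];
  agent: 'witness' | 'judge' | 'opposing_counsel';
  phase: string;
  rating: number;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] ** 2;
    nb += b[i] ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function retrieve(
  query: number[],
  store: StoredExample[],
  agent: StoredExample['agent'],
  phase: string,
  minSimilarity = 0.75, // assumed threshold
  topK = 3,
): StoredExample[] {
  return store
    .filter((e) => e.agent === agent && e.phase === phase) // metadata filter first
    .map((e) => ({ e, sim: cosine(query, e.vector) }))
    .filter((x) => x.sim >= minSimilarity) // if nothing clears the bar, RAG injection is skipped
    .sort((a, b) => b.sim - a.sim)
    .slice(0, topK)
    .map((x) => x.e);
}
```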

Visibility in Langfuse

Human ratings are stored in Payload CMS alongside the evaluation results. While they aren’t pushed as annotations directly to individual Langfuse spans, you can cross-reference them:
  • The evaluation batch trace in Langfuse includes the batchRunId.
  • The same batchRunId links to results in Payload that contain the humanRating and humanFeedback fields.
  • This lets you correlate Langfuse’s automated generation data with human judgment stored in the eval system.
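For example, once you have the batchRunId from the Langfuse trace, you could query Payload's REST API for the matching results. The collection slug and host below are placeholders; only the batchRunId, humanRating, and humanFeedback fields come from the description above:

```typescript
// Cross-reference a Langfuse evaluation batch with its Payload results.
// "eval-results" and the host are hypothetical; adjust to your deployment.
const batchRunId = 'batch_123'; // taken from the evaluation batch trace in Langfuse

const res = await fetch(
  `https://your-payload-host/api/eval-results?where[batchRunId][equals]=${batchRunId}`,
  // Authentication depends on your Payload access control setup (API key or JWT).
);
const { docs } = await res.json();

for (const doc of docs) {
  console.log(doc.humanRating, doc.humanFeedback);
}
```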