
Overview

The Embedding Atlas is an interactive visualization tool that displays all evaluation ratings as points in a 2D semantic space. Each point represents an agent response (witness, judge, or opposing counsel) that has been rated either by a human evaluator or by automated LLM evaluation.

What Are Embeddings?

Embeddings are high-dimensional vector representations of text that capture semantic meaning. When we evaluate an agent response, we create an embedding that encodes:
  • The agent’s response (the actual message text)
  • Context (preceding conversation/question that prompted the response)
  • Feedback (human or automated reasoning about quality)
  • Metadata (agent type, courtroom phase)
These embeddings use OpenAI’s text-embedding-ada-002 model, producing 1536-dimensional vectors.
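A minimal sketch of the generation step, assuming the official openai Python SDK; the field names (response_text, context, feedback) and the concatenation format are illustrative, not the pipeline's actual schema:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_rating(response_text: str, context: str, feedback: str,
                 agent_type: str, phase: str) -> list[float]:
    """Embed one rated response. Field names and document layout here
    are assumptions for illustration, not the production schema."""
    document = (
        f"Agent: {agent_type}\n"
        f"Phase: {phase}\n"
        f"Context: {context}\n"
        f"Response: {response_text}\n"
        f"Feedback: {feedback}"
    )
    resp = client.embeddings.create(
        model="text-embedding-ada-002",
        input=document,
    )
    return resp.data[0].embedding  # 1536-dimensional vector
```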

Understanding the Axes

The visualization projects the 1536 dimensions onto a 2D plane by selecting the two dimensions with the highest variance across all embeddings. The axes do NOT have fixed semantic meanings; they represent whichever embedding dimensions contain the most variation in your current dataset.

Key insight: The absolute position on X or Y is not meaningful. What matters is the relative distances between points.
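The projection itself is simple to sketch. A minimal NumPy version of the idea described above (the Atlas may implement it differently):

```python
import numpy as np

def project_top_variance(embeddings: np.ndarray) -> np.ndarray:
    """Project an (n, 1536) embedding matrix onto the two dimensions
    with the highest variance across the dataset."""
    variances = embeddings.var(axis=0)           # variance per dimension
    top_two = np.argsort(variances)[-2:][::-1]   # indices of the two largest
    return embeddings[:, top_two]                # shape (n, 2): x and y
```

Because the selected dimensions depend on the dataset, adding or removing points can change which dimensions become the axes. This is why absolute positions are not comparable across sessions.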

Interpreting the Visualization

Clusters

  • Cluster by agent type: Different agents produce fundamentally different response styles
  • Cluster by rating: Quality issues manifest as similar semantic patterns
  • Cluster by phase: Courtroom phases elicit different response types
  • Isolated outliers: Unusual responses that don’t fit normal patterns
  • Overlapping clusters: Multiple factors contribute to semantic similarity

Filters

  • Source: Human ratings vs Automated (LLM-judged) ratings
  • Agent Type: Witness, Judge, Opposing Counsel
  • Rating Label: Excellent (5), Good (4), Average (3), Poor (2), Bad (1)
  • Phase: Courtroom phase where response occurred
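Filters are applied when fetching points from the API. A hedged sketch of querying /api/embeddings, where the query parameter names (source, agent_type, rating_label, phase) are assumptions mirroring the filter list above, not a documented interface:

```python
import requests

BASE_URL = "https://your-deployment.example"  # placeholder for your instance

# Hypothetical query parameters; the real /api/embeddings interface
# may name or encode these differently.
params = {
    "source": "human",             # or "automated"
    "agent_type": "witness",       # "witness", "judge", "opposing_counsel"
    "rating_label": 2,             # Poor
    "phase": "cross_examination",  # example phase value
}

resp = requests.get(f"{BASE_URL}/api/embeddings", params=params)
resp.raise_for_status()
points = resp.json()  # assumed: list of point records with embeddings and metadata
```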

Practical Use Cases

  1. Quality Pattern Detection — Look for clusters of low-rated responses to reveal systematic quality issues in agent prompts.
  2. Human vs Automated Agreement — Compare cluster positions between human and automated ratings to verify evaluation criteria alignment.
  3. Agent Behavior Analysis — Filter by agent type to assess whether responses are repetitive (tight clusters) or varied (spread out).
  4. Finding Outliers — Points far from all clusters may be edge cases, data quality issues, or novel scenarios (see the scoring sketch after this list).
  5. Evaluating Prompt Changes — After modifying an agent’s prompt, compare new response positions to historical data.
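For use case 4, one simple way to score outliers in the projected space is distance to the nearest cluster centroid. A sketch using scikit-learn, which is not necessarily part of the Atlas stack; the choice of k is arbitrary and should match the number of visible clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

def outlier_scores(points_2d: np.ndarray, k: int = 5) -> np.ndarray:
    """Score each projected point by its distance to the nearest of k
    cluster centroids; large distances flag candidates for review."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points_2d)
    return np.linalg.norm(points_2d - km.cluster_centers_[km.labels_], axis=1)
```

Points with the highest scores are the ones to open first; whether they are edge cases or data quality issues still takes manual reading.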

Data Pipeline

Agent Response -> Eval Rating -> Embedding Generation -> eval_embeddings DB -> /api/embeddings -> Atlas Visualization

Troubleshooting

  • “No Embeddings Yet”: Embeddings are generated only after a human or automated eval batch completes; run at least one batch to populate the Atlas.
  • Points All in One Cluster: Check for low semantic variety in the rated responses, or verify that embeddings are being generated correctly (see the sanity-check sketch below).
  • Visualization Not Loading: Check the browser console for CORS errors, DuckDB-WASM initialization failures, or memory issues.
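For the one-cluster symptom, a quick sanity check over raw vectors, assuming you can export them from the eval_embeddings table as a NumPy array:

```python
import numpy as np

def embedding_sanity_check(embeddings: np.ndarray) -> None:
    """Flag degenerate embedding sets: (near-)identical vectors collapse
    every point into a single cluster in the Atlas."""
    total_var = embeddings.var(axis=0).sum()
    n_unique = len(np.unique(embeddings, axis=0))
    print(f"total variance: {total_var:.6f}, "
          f"unique vectors: {n_unique}/{len(embeddings)}")
    if total_var < 1e-6 or n_unique <= 1:
        print("WARNING: embeddings are (nearly) identical; check generation.")
```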