LitigationLabs internal – do not distribute unless authorized.

Abstract

We present the design and evaluation methodology for a multi-agent adversarial dialogue system that simulates courtroom trial proceedings for legal pedagogy. The system coordinates three specialized language model agents — an opposing counsel, a judicial officer, and a witness — within a turn-based game loop that requires long-horizon consistency, domain-grounded generation, and pedagogically calibrated difficulty. We introduce a hybrid scoring architecture that combines keyword-based term overlap, embedding-based semantic similarity, and context-aware pronoun resolution to evaluate student performance against structured factual targets (elicits). We describe a dual-purpose retrieval-augmented generation (RAG) pipeline that serves both as a few-shot steering mechanism for agent quality and as a per-session vector memory for cross-examination planning. We detail our evaluation framework, which implements G-Eval (Liu et al., 2023) with 13 domain-specific metrics across three agent types, validated through Pearson and Spearman correlation analysis against human expert ratings. We find that the interplay between intentional agent errors (pedagogical traps) and faithful simulation creates a novel tension in evaluation design: metrics must reward both correctness and deliberate incorrectness depending on the pedagogical context. We discuss the implications of this dual objective for LLM-as-judge reliability in educational domains.

1 Introduction

Trial advocacy is a skill learned through repetition against adversarial opponents. Law students in the United States spend years studying the Federal Rules of Evidence (FRE), yet courtroom experience remains scarce — moot court competitions are infrequent, and supervised clinical rotations are limited. The gap between doctrinal knowledge and procedural fluency is well-documented in legal education literature. We address this gap with an interactive multi-agent simulation in which a student (the “player”) conducts direct and cross-examination of AI witnesses while an AI opposing counsel raises objections and an AI judge rules on them. The system must satisfy several constraints simultaneously:
  1. Factual grounding. Witnesses must testify only to facts contained in their sworn affidavit. Fabrication is a failure mode, not a feature.
  2. Adversarial fidelity. Opposing counsel must behave as a competent attorney — not a trivially beatable strawman — while occasionally introducing deliberate errors to test the student’s recognition skills.
  3. Long-horizon consistency. A trial session may span 50–100 turns. Witnesses must not contradict their prior testimony. Opposing counsel must not re-ask questions already answered.
  4. Measurable learning outcomes. Student performance must be quantified against structured factual targets, not merely tracked as token-level engagement.
These requirements place the system at the intersection of multi-agent coordination, constrained generation, retrieval-augmented memory, and domain-specific evaluation — areas where large language models excel individually but whose composition remains underexplored.

1.1 Contributions

We make the following contributions:
  • A multi-agent orchestration architecture that coordinates three LLM agents within a structured turn protocol, using mode-dependent prompt dispatch to support four distinct operational regimes.
  • A three-tier scoring system for measuring semantic entailment between free-form witness testimony and structured factual targets, with polarity-aware weighting for adversarial examination dynamics.
  • A context compression scheme that replaces raw transcript windows with semantically structured testimony state, enabling arbitrarily long trial sessions within fixed context budgets.
  • A closed-loop evaluation pipeline in which human expert ratings are embedded, retrieved via vector similarity, and injected as few-shot exemplars into agent prompts — creating a self-improving feedback loop.
  • A domain-specific evaluation framework with 13 G-Eval metrics, correlation-validated against human expert judgment using both Pearson and Spearman rank statistics.

2 System Architecture

2.1 Overview

The system is structured as a turn-based game loop in which each player utterance triggers a sequential pipeline of agent calls. The orchestrator (lib/courtroom/orchestrator.ts) manages three primary flows:
  1. Player Examination Flow (handleTurn): Player asks a question → OCA evaluates for objectionability → Judge rules (if objection raised) → Witness answers (if not sustained) → Score computed.
  2. OCA Examination Flow (handleOCAExaminationTurn): OCA generates a question → Player evaluates for objectionability → Judge rules → Witness answers → Score computed.
  3. Objection Resolution Flow (handlePlayerObjection): Player objects to OCA’s question → Judge rules → Score adjusted.
All three flows share a common post-processing pipeline: testimony state compression, vector memory ingestion, and session persistence.

2.2 Agent Specification

Each agent is parameterized by an AgentConfig stored in a CMS-backed database, enabling hot-swappable prompt templates, model selection, and behavioral parameters without code deployment:
| Agent | Model | Key Parameters | Output Format |
| --- | --- | --- | --- |
| Opposing Counsel (OCA) | Configurable (default: GPT-4.1 via gateway) | intentionalErrorRate, objectionTypes[], temperature | Structured JSON (objection or question) |
| Judge | Configurable | rulingStyle, strictness, sustainBias | JSON {ruling, reason} |
| Witness | Configurable (per-witness override) | cooperativeness, verbosity, memoryQuality | Free-form natural language |

2.3 OCA Mode Dispatch

The opposing counsel agent operates in four modes, determined by a policy function determineOCAMode() that examines the current trial phase, the witness’s party alignment, and the player’s side:
| Mode | Phase Context | OCA Behavior |
| --- | --- | --- |
| objection_user_direct | Player conducts direct examination | OCA monitors for FRE violations |
| objection_user_cross | Player conducts cross-examination | OCA monitors for scope/relevance violations |
| oc_direct | OCA conducts direct examination of own witness | OCA asks questions, player monitors |
| oc_cross | OCA cross-examines player's witness | OCA asks impeachment questions |
This dispatch mechanism is functionally a policy selector over a discrete action space, where the state is the tuple (phase, witness_side, player_side). The selected mode determines the system prompt template, output schema, and post-processing pipeline.
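The policy selector can be sketched as a pure function over the state tuple. This is an illustrative reimplementation, not the production determineOCAMode(); the type names (TrialPhase, Side, OCAMode) are assumptions:

```typescript
type TrialPhase = "direct" | "cross";
type Side = "plaintiff" | "defense";
type OCAMode =
  | "objection_user_direct"
  | "objection_user_cross"
  | "oc_direct"
  | "oc_cross";

function determineOCAMode(
  phase: TrialPhase,
  witnessSide: Side,
  playerSide: Side,
): OCAMode {
  const playerOwnsWitness = witnessSide === playerSide;
  if (phase === "direct") {
    // Whoever sponsors the witness conducts direct; the other side monitors.
    return playerOwnsWitness ? "objection_user_direct" : "oc_direct";
  }
  // On cross, the non-sponsoring side questions the witness.
  return playerOwnsWitness ? "oc_cross" : "objection_user_cross";
}
```

The two booleans (phase, witness ownership) fully determine the mode, which is why the dispatch can remain a deterministic policy rather than an LLM decision.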

2.4 Intentional Error System

A distinctive design requirement is that OCA must sometimes produce incorrect outputs on purpose. The intentionalErrorRate parameter (default: 30%) controls how often OCA generates:
  • Defective questions (in examination mode): Leading questions on direct, hearsay-laden prompts, or questions assuming facts not in evidence.
  • Incorrect objections (in objection mode): Objecting to perfectly proper questions on spurious grounds.
This is analogous to an exploration strategy in curriculum learning — the agent deliberately introduces noise to force the student to develop error-detection skills. The system tags these outputs with is_intentionally_defective or is_intentionally_incorrect flags for downstream scoring and evaluation filtering.

3 Scoring and Semantic Matching

3.1 Problem Formulation

The core scoring problem is: given a free-form witness answer $a$ and a set of structured factual targets (elicits) $E = \{e_1, e_2, \ldots, e_n\}$, determine which elicits are semantically entailed by the answer. Each elicit $e_i$ has:
  • A natural language label $\ell_i$ (e.g., “The ship was traveling at 22.5 knots”)
  • A signed weight $w_i \in \mathbb{R}$ indicating polarity
  • A unique identifier for deduplication
An elicit is considered unlocked when the witness’s answer sufficiently entails its label. The scoring function must handle paraphrase, partial entailment, and pronominal reference — the witness rarely states facts in the exact language of the elicit.

3.2 Three-Tier Matching Architecture

We implement three progressively more powerful matching strategies, selectable at runtime via configuration:

Tier 1: Keyword Matching

The baseline approach extracts key terms from both the answer and elicit label after stop-word removal, then computes a coverage ratio:
$$\text{score}_{\text{kw}}(a, e_i) = \frac{|\text{terms}(a) \cap \text{terms}(\ell_i)| + 0.5 \cdot |\text{fuzzy}(a, \ell_i)|}{|\text{terms}(\ell_i)|}$$
where $\text{fuzzy}(a, \ell_i)$ counts substring-matched terms with partial credit. The default threshold is $\tau_{\text{kw}} = 0.30$ (30% of key terms must match).
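A minimal sketch of the Tier 1 score, assuming a toy stop-word list and a substring-based fuzzy matcher; the production term extractor is richer:

```typescript
// Toy stop-word list for illustration only.
const STOP_WORDS = new Set(["the", "a", "an", "was", "at", "of", "is", "to"]);

function terms(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9.]+/) // keep "22.5" intact
    .filter((t) => t.length > 1 && !STOP_WORDS.has(t));
}

function keywordScore(answer: string, elicitLabel: string): number {
  const answerTerms = new Set(terms(answer));
  const labelTerms = terms(elicitLabel);
  if (labelTerms.length === 0) return 0;
  let exact = 0;
  let fuzzy = 0;
  for (const t of labelTerms) {
    if (answerTerms.has(t)) {
      exact += 1; // full credit for exact term overlap
    } else if (Array.from(answerTerms).some((a) => a.includes(t) || t.includes(a))) {
      fuzzy += 1; // half credit for substring matches
    }
  }
  return (exact + 0.5 * fuzzy) / labelTerms.length;
}
```

For example, "The vessel was traveling at 22.5 knots" scores 0.75 against the label "The ship was traveling at 22.5 knots" (three of four label terms match exactly), clearing $\tau_{\text{kw}}$ despite the ship/vessel substitution.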

Tier 2: Semantic Embedding Matching

The production approach generates embeddings for both the answer and all candidate elicits using text-embedding-3-small (1536 dimensions), then computes cosine similarity:
$$\text{score}_{\text{sem}}(a, e_i) = \frac{\mathbf{v}_a \cdot \mathbf{v}_{e_i}}{\|\mathbf{v}_a\| \, \|\mathbf{v}_{e_i}\|}$$
The system uses a hybrid decision rule: an elicit is matched if either the semantic score exceeds $\tau_{\text{sem}} = 0.40$ or the keyword score exceeds $\tau_{\text{kw}} = 0.30$. This OR-gate design ensures that lexically similar but semantically distant matches (proper nouns, dates) and semantically equivalent but lexically different matches (paraphrases) are both captured. Elicit embeddings are cached permanently in memory (they are static per scenario), while answer embeddings use a 60-second TTL cache to avoid redundant API calls within a turn. A “strong match” threshold at $\tau_{\text{strong}} = 0.60$ provides a confidence signal used in downstream reporting.
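The OR-gate decision rule can be sketched as follows; the raw vectors here stand in for text-embedding-3-small outputs, and the function names are illustrative:

```typescript
// Cosine similarity between two dense vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const TAU_SEM = 0.40;
const TAU_KW = 0.30;

// Either signal alone is sufficient: paraphrases pass on semantics,
// proper nouns and dates pass on lexical overlap.
function elicitMatched(semScore: number, kwScore: number): boolean {
  return semScore >= TAU_SEM || kwScore >= TAU_KW;
}
```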

Tier 3: Context-Aware Matching with Pronoun Resolution

The most advanced mode addresses a specific failure mode: short, pronominal witness answers like “Yes, that’s correct” or “He did” that are semantically empty without the preceding question context. When advancedMatching.enabled = true, the system:
  1. Extracts the $k$ most recent question-answer exchanges from the transcript.
  2. Calls an LLM to resolve pronouns and references in the answer, producing an enriched version (e.g., “Yes, that’s correct” → “The ship was traveling at 22.5 knots at the time of the collision”).
  3. Runs semantic matching on the enriched text.
This is functionally a coreference resolution step, implemented as an LLM call rather than a traditional NLP pipeline, which handles the domain-specific reference patterns common in courtroom testimony.

3.3 Polarity-Aware Scoring

Elicit weights carry sign information that encodes which trial side benefits from the fact being established:
$$w_i > 0 \implies \text{benefits the witness's side (active during direct)}$$
$$w_i < 0 \implies \text{benefits the opposing side (active during cross)}$$
The scoring function filters elicits by polarity before matching:
$$E_{\text{active}} = \begin{cases} \{e_i \in E : w_i \geq 0\} & \text{if direct examination} \\ \{e_i \in E : w_i < 0\} & \text{if cross-examination} \end{cases}$$
The absolute value $|w_i|$ determines point value. This design reflects a core trial advocacy principle: on direct examination, an attorney seeks to establish favorable facts (positive elicits), while on cross-examination, the goal is to elicit unfavorable admissions (negative elicits).
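The polarity filter reduces to a one-line predicate; the Elicit shape below is illustrative, not the production schema:

```typescript
interface Elicit {
  id: string;
  label: string;
  weight: number; // signed: >= 0 favors the witness's side, < 0 the opponent's
}

// Restrict matching to the elicits active for the current examination type.
function activeElicits(elicits: Elicit[], examType: "direct" | "cross"): Elicit[] {
  return elicits.filter((e) =>
    examType === "direct" ? e.weight >= 0 : e.weight < 0,
  );
}

// Point value is the magnitude of the signed weight.
function pointValue(e: Elicit): number {
  return Math.abs(e.weight);
}
```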

3.4 Objection Scoring

Player objections to OCA questions are scored on a discrete scale:
| Outcome | Points | Condition |
| --- | --- | --- |
| Correct objection (sustained, question was defective) | +2 | wasSustained && wasDefective |
| Correct type bonus | +1 | Player identified exact defect type |
| Incorrect objection (proper question) | −1 | !wasDefective |
| Missed objection (defective question passed) | −1 | wasDefective && !objected |
| Overruled on defective question | 0 | Learning moment, no penalty |
This reward structure creates an asymmetric incentive: objecting carries risk (potential -1) but higher reward (+2 or +3), while passing is safe only when the question is proper. The 0-point “overruled on defective” case is a deliberate design choice — it acknowledges that the student recognized an issue but couldn’t articulate a winning objection, which has pedagogical value.
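The table reduces to a small pure function; the outcome field names mirror the conditions above but are otherwise illustrative:

```typescript
interface ObjectionOutcome {
  objected: boolean;
  wasDefective: boolean;
  wasSustained: boolean;
  correctTypeIdentified: boolean;
}

function scoreObjection(o: ObjectionOutcome): number {
  if (!o.objected) {
    // Passing is safe only when the question was proper.
    return o.wasDefective ? -1 : 0;
  }
  if (!o.wasDefective) return -1; // objected to a proper question
  if (o.wasSustained) {
    // +2 base, +1 bonus for identifying the exact defect type.
    return 2 + (o.correctTypeIdentified ? 1 : 0);
  }
  return 0; // overruled on a defective question: learning moment, no penalty
}
```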

4 Context Compression and Agent Memory

4.1 The Context Window Problem

A full trial transcript can reach 50,000+ tokens over 100 turns. Naively including the full transcript in each agent prompt would either exceed context limits or consume budget that should be allocated to reasoning. We address this with a testimony state abstraction that replaces the raw transcript.

4.2 Testimony State

The TestimonyState object is a structured summary maintained incrementally after each exchange:
TestimonyState := {
  playerEstablishedFacts:  EstablishedFact[]   // Facts proven by player
  ocaEstablishedFacts:     EstablishedFact[]    // Facts proven by OCA
  questionsAsked:          QuestionRecord[]     // Compressed question log
  witnessRecords:          Map<WitnessId, WitnessTestimonyRecord>
  admissions:              string[]             // Key contradictions
}
Each EstablishedFact is derived from witness testimony through a fact extraction pipeline:
  1. Confirmatory answer detection: Short answers (“Yes”, “That’s correct”) are identified via regex pattern matching and enriched with the question premise to produce a complete factual statement (e.g., "Witness confirmed: the ship was traveling at 22.5 knots").
  2. Substantive answer decomposition: Longer answers are split into individual sentences, filtered for non-fact statements (“I don’t know”, “I’m not sure”), and prefixed with witness attribution.
  3. Elicit linking: Extracted facts are matched against scenario elicits using the same keyword/semantic pipeline described in Section 3, and tagged with the matched elicit ID and weight.
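Step 1 (confirmatory answer detection) can be sketched with a simple regex family; both the patterns and the premise-stripping rules here are simplified assumptions about the production pipeline:

```typescript
// Illustrative confirmatory-answer patterns; the production regex family is larger.
const CONFIRMATORY = /^(yes|yeah|correct|that'?s (correct|right)|i (do|did|was))\b/i;

// Returns an enriched fact statement, or null if the answer is not confirmatory.
function enrichConfirmatoryAnswer(question: string, answer: string): string | null {
  if (!CONFIRMATORY.test(answer.trim())) return null;
  // Derive the premise from the question by stripping interrogative framing.
  const premise = question
    .replace(/^(isn'?t it true that|is it true that|did you|do you agree that)\s*/i, "")
    .replace(/\?\s*$/, "");
  return `Witness confirmed: ${premise}`;
}
```

So the exchange “Isn't it true that the ship was traveling at 22.5 knots?” / “Yes, that's correct.” yields the complete fact “Witness confirmed: the ship was traveling at 22.5 knots”.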

4.3 Question Deduplication

A critical failure mode in long sessions is the OCA asking semantically identical questions. We implement a two-layer deduplication system.

Layer 1: Lexical Similarity (Jaccard). Questions are normalized (stop-word removal, word sorting for order independence) and compared using Jaccard similarity over word sets:
$$J(q_1, q_2) = \frac{|W_{q_1} \cap W_{q_2}|}{|W_{q_1} \cup W_{q_2}|}$$

Layer 2: Topic Categorization. Questions are classified into predefined topic categories (weather, speed, collision, observation, time, safety, etc.) using keyword matching. Topic overlap is computed as:
$$T(q_1, q_2) = \frac{|\text{topics}(q_1) \cap \text{topics}(q_2)|}{\max(|\text{topics}(q_1)|, |\text{topics}(q_2)|)}$$

Combined Score. The final similarity uses a weighted combination:
$$\text{sim}(q_1, q_2) = 0.6 \cdot J(q_1, q_2) + 0.4 \cdot T(q_1, q_2)$$
A question is blocked if $\text{sim} \geq 0.65$ against any previously asked question for the same witness. The 60/40 weighting reflects our empirical observation that lexical overlap is a stronger signal than topic co-occurrence for detecting paraphrased questions, while topic overlap catches cases where different words address the same subject.
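The combined score can be sketched as follows, with a toy topic lexicon (the production category set is larger) and without the stop-word normalization step:

```typescript
// Toy topic lexicon for illustration only.
const TOPIC_KEYWORDS: Record<string, string[]> = {
  weather: ["weather", "fog", "rain", "visibility"],
  speed: ["speed", "knots", "fast", "slow"],
  collision: ["collision", "impact", "struck", "hit"],
};

function wordSet(q: string): Set<string> {
  return new Set(
    q.toLowerCase().split(/[^a-z0-9.]+/).filter((w) => w.length > 1),
  );
}

function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = Array.from(a).filter((w) => b.has(w)).length;
  const union = new Set(Array.from(a).concat(Array.from(b))).size;
  return union === 0 ? 0 : inter / union;
}

function topics(q: string): Set<string> {
  const words = wordSet(q);
  return new Set(
    Object.keys(TOPIC_KEYWORDS).filter((t) =>
      TOPIC_KEYWORDS[t].some((k) => words.has(k)),
    ),
  );
}

function questionSimilarity(q1: string, q2: string): number {
  const t1 = topics(q1), t2 = topics(q2);
  const maxTopics = Math.max(t1.size, t2.size);
  const topicOverlap =
    maxTopics === 0 ? 0 : Array.from(t1).filter((t) => t2.has(t)).length / maxTopics;
  // 60/40 weighting of lexical vs. topic signals.
  return 0.6 * jaccard(wordSet(q1), wordSet(q2)) + 0.4 * topicOverlap;
}

const isDuplicate = (q1: string, q2: string) => questionSimilarity(q1, q2) >= 0.65;
```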

4.4 Agent-Specific Memory Projections

The testimony state is not presented uniformly to all agents. Instead, each agent receives a projection tailored to its role:
  • OCA receives: topics already covered, questions already asked (as a “do not repeat” list), elicits remaining to target, and an examination plan derived from unestablished elicits or rebuttal items.
  • Witness receives: their own prior statements (for consistency), confirmed facts, and denied facts — constrained to the current witness’s record only.
  • Judge receives: prior rulings in the session, for consistency in judicial temperament.
This selective projection is analogous to an attention mask — each agent sees only the subset of accumulated state relevant to its generation task, preventing cross-contamination of role-specific information.

5 Retrieval-Augmented Generation

5.1 Dual-Purpose RAG Architecture

The system maintains two independent vector stores, both using pgvector with HNSW indexing (cosine distance) on 1536-dimensional embeddings:
| Store | Table | Purpose | Index |
| --- | --- | --- | --- |
| Evaluation RAG | eval_embeddings | Few-shot quality steering | HNSW cosine |
| Case Memory | case_memory_chunks | Per-session testimony recall | HNSW cosine |

5.2 Evaluation RAG: The Quality Feedback Loop

The evaluation RAG pipeline implements a closed-loop system in which human quality judgments are recycled as agent steering signals.

Ingestion:
  1. Human raters evaluate agent responses on a 1–5 star scale with optional textual feedback.
  2. Upon batch completion, each rated response is embedded using text-embedding-3-small.
  3. Embeddings are stored with metadata: agent type, rating, response text, context, scenario, phase, and pedagogical flags (isIntentionallyDefective, isIntentionallyIncorrect).
Retrieval:
  1. Before generating an agent response, the system embeds the current conversation context.
  2. Vector similarity search retrieves the $k$ most similar previously rated examples, split into positive (rating $\geq 4$) and negative (rating $\leq 2$) sets.
  3. Examples are formatted as “GOOD EXAMPLES (aim for this quality)” and “EXAMPLES TO AVOID” sections and injected into the agent’s system prompt.
Pedagogical Filtering: A critical design decision is that the retrieval pipeline excludes intentionally defective/incorrect examples by default. When OCA is generating a pedagogical trap turn, the filter is inverted — the system retrieves only examples tagged as intentionally defective, providing few-shot guidance for what a “good bad question” looks like. Without this filter, the RAG pipeline would conflate genuine quality failures with deliberate pedagogical errors.

Configuration (from the CMS global config):
  • examplesPerType: Number of positive/negative examples to retrieve (default: 2)
  • minRatingsForRetrieval: Minimum total embeddings before RAG activates (default: 10)
  • similarityThreshold: Cosine similarity floor for retrieval (default: 0.70)

5.3 Case Memory: Per-Session Vector Recall

Each witness answer is ingested into a per-session vector memory table:
  1. Atomic claim extraction: The answer is decomposed into individual factual statements (currently a single-claim approach; LLM-based decomposition is planned).
  2. Embedding generation: Each claim is embedded using text-embedding-3-small.
  3. Storage: Claims are stored with session ID, witness ID, phase, turn number, and linked elicit/fact IDs.
This memory is queried at phase transitions (direct → cross) to generate cross-examination outlines — structured plans that identify contradictions, impeachment opportunities, and rebuttal items from direct testimony. Rebuttal items are classified by FRE 611(b) scope (within_direct_scope vs. credibility) and tracked for coverage during cross-examination.

6 Evaluation Framework

6.1 Design Philosophy

Evaluating a multi-agent pedagogical system presents a challenge absent from standard LLM benchmarking: the system must sometimes be deliberately wrong. A witness who fabricates facts has failed. An opposing counsel who raises a spurious objection may have succeeded — if the pedagogical intent was to test the student’s ability to recognize bad objections. This dual objective means that a single quality metric is insufficient; evaluation must be conditioned on the agent’s intended behavior.

6.2 G-Eval Implementation

We implement G-Eval (Liu et al., 2023) as our automated evaluation framework. For each agent response, an evaluator LLM (GPT-4o-mini, $T = 0$ for reproducibility) is prompted with:
  1. Criteria definition: A natural language description of what constitutes quality for this metric.
  2. Evaluation steps: An ordered checklist the evaluator should follow (optional, metric-dependent).
  3. Evaluation data: The input, actual output, and any reference context (affidavit, established facts).
  4. Scoring instructions: Return a JSON object with score (0–1 continuous) and reason (1–3 sentence rationale).
Scores are clamped to $[0, 1]$ with a configurable pass/fail threshold per metric.

6.3 Metric Taxonomy

We define 13 metrics across three agent types:

6.3.1 Witness Metrics (4)

| Metric | Criteria | Threshold | Params |
| --- | --- | --- | --- |
| AffidavitFaithfulness | Response contains ONLY facts traceable to the affidavit. Penalizes fabrication, contradiction, and over-confidence on uncovered topics. | 0.70 | input, output, retrieval_context |
| BehavioralCompliance | Response matches the configured personality profile along three axes: cooperativeness (hostile→eager), verbosity (terse→verbose), memory quality (poor→excellent). Conditioned on examination type (direct vs. cross). | 0.60 | input, output |
| ResponseAuthenticity | Response sounds like a real human witness — natural hesitations, appropriate emotion, first-person perspective. Penalizes robotic language, perfectly rehearsed answers, meta-commentary. | 0.60 | input, output |
| WitnessAnswerRelevancy | Response directly addresses the question asked. Accommodates appropriate evasiveness for hostile witnesses. | 0.60 | input, output |
The BehavioralCompliance metric is noteworthy because it is dynamically constructed: the evaluator prompt is parameterized by the witness’s configured cooperativeness, verbosity, and memory quality levels, producing a different evaluation rubric for each witness profile. This is necessary because a terse answer from a “verbose” witness is a failure, while the same answer from a “terse” witness is correct behavior.

The AffidavitFaithfulness metric uses the witness’s affidavit as retrieval_context, enabling the evaluator to perform a claim-by-claim verification against the source document. The evaluation steps explicitly instruct the evaluator to:
  1. List all factual claims in the response
  2. Check each claim against the affidavit
  3. Flag unverifiable claims
  4. Check for contradictions
  5. Evaluate appropriate use of “I don’t know”

6.3.2 Judge Metrics (4)

| Metric | Criteria | Threshold | Params |
| --- | --- | --- | --- |
| RulingCorrectness | Ruling is legally correct under the FRE. Evaluator is provided a comprehensive FRE reference covering hearsay (801–802), leading questions (611), relevance (401–402), foundation (602, 901), speculation (701), character evidence (404), scope of cross (611(b)), and best evidence (1002). | 0.70 | input, output |
| RuleCitation | Ruling cites the correct FRE rule number for the objection type. | 0.50 | input, output |
| JudicialDemeanor | Response is impartial, decisive, professional. No favoritism, lecturing, sarcasm, or character-breaking. | 0.60 | output |
| JudgeResponseFormat | Response follows the JSON schema {ruling, reason}. | 0.50 | output |
The RulingCorrectness metric embeds a full FRE reference document in the evaluation prompt, providing the evaluator with ground-truth legal rules. The evaluation prompt also includes the specific objection type and questioned text, enabling the evaluator to determine the correct ruling before scoring the agent’s output. The evaluation steps guide the evaluator through a legal reasoning chain: identify objection → recall applicable rule → analyze the question → determine correct ruling → compare to agent’s ruling → assess reasoning quality.

6.3.3 Opposing Counsel Metrics (5)

| Metric | Criteria | Threshold | Params |
| --- | --- | --- | --- |
| OCATaskCompletion | Proper execution of the mode-appropriate action (objection decision or question formulation). Conditioned on the shouldBeIncorrect flag for intentional error cases. | 0.60 | input, output |
| OCAStrategicQuality | Action is strategically sound from an adversarial perspective — worth making, well-timed, professionally credible. | 0.50 | input, output, context |
| OCAPedagogicalValue | Action provides a learning opportunity. For intentional errors: the defect should be recognizable but not obvious. For correct actions: demonstrates proper technique. | 0.50 | input, output |
| OCAResponseFormat | Response follows the expected JSON schema for the current mode. | 0.50 | output |
| OCAFactGrounding | Response references only established facts. Does not assume unestablished facts or reference unadmitted evidence. | 0.60 | output |
The OCA metrics demonstrate the pedagogical-correctness tension: OCATaskCompletion explicitly changes its evaluation criteria based on the shouldBeIncorrect flag. When the flag is true, the evaluator is instructed to score based on whether the agent successfully executed an incorrect objection, not whether it made a correct one. This is a departure from standard LLM evaluation, where correctness is uniformly desirable.

6.4 Human Evaluation Pipeline

Human evaluation follows a batched workflow:
  1. Session Selection: Completed sessions are identified as evaluation candidates. Sessions are filtered by recency (daysBack) and optionally by scenario or agent type.
  2. Dataset Construction: The DatasetBuilder extracts evaluation test cases from raw transcripts. For each agent response in the transcript, it constructs an EvalTestCase with the preceding question/objection as input, the agent’s response as actualOutput, and a 5-event context window as context. Witness test cases include the affidavit as retrievalContext.
  3. Rating Collection: Human raters evaluate responses on a 1–5 star scale with optional textual feedback. Ratings are stored with full provenance: session ID, transcript index, agent type, phase, and witness ID.
  4. Embedding & RAG Ingestion: Upon batch completion, rated responses are embedded and stored in the eval_embeddings table for future RAG retrieval. Responses tagged as intentionally defective/incorrect are flagged for pedagogical filtering.

6.5 Correlation Analysis

We validate automated metrics against human ratings using two complementary statistics.

Pearson correlation measures linear agreement:
$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n\sum y_i^2 - \left(\sum y_i\right)^2\right)}}$$
where $x_i$ is the automated score and $y_i$ is the normalized human rating (mapped from $[1, 5]$ to $[0, 1]$ via $y_i = (h_i - 1)/4$).

Spearman rank correlation measures monotonic agreement without assuming linearity. We compute ranks for both automated and human scores, then apply Pearson correlation to the rank vectors. This captures cases where automated and human scores agree on relative ordering even if the absolute scale differs.

We report both statistics because they capture different failure modes:
  • High Pearson, low Spearman → scores are linearly related but with rank inversions at the extremes
  • Low Pearson, high Spearman → scores agree on ordering but with non-linear scaling
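Both statistics can be computed directly from the paired scores. This sketch uses naive rank assignment (production implementations typically average tied ranks):

```typescript
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const sx = x.reduce((a, b) => a + b, 0);
  const sy = y.reduce((a, b) => a + b, 0);
  const sxy = x.reduce((a, xi, i) => a + xi * y[i], 0);
  const sxx = x.reduce((a, xi) => a + xi * xi, 0);
  const syy = y.reduce((a, yi) => a + yi * yi, 0);
  const denom = Math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
  return denom === 0 ? 0 : (n * sxy - sx * sy) / denom;
}

// 1-based ranks; ties get arbitrary order in this sketch.
function ranks(v: number[]): number[] {
  const order = v.map((val, i) => [val, i] as const).sort((a, b) => a[0] - b[0]);
  const r: number[] = new Array(v.length).fill(0);
  order.forEach(([, origIdx], rank) => { r[origIdx] = rank + 1; });
  return r;
}

// Spearman = Pearson applied to the rank vectors.
function spearman(x: number[], y: number[]): number {
  return pearson(ranks(x), ranks(y));
}

const normalizeStars = (h: number) => (h - 1) / 4; // map [1, 5] -> [0, 1]
```

A monotonic but nonlinear pairing (e.g., $y = x^2$) illustrates the difference: Spearman reports perfect rank agreement while Pearson falls below 1.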
Outlier Detection: We identify divergent cases where $|x_i - y_i| \geq 0.30$ (on the normalized scale). Each outlier is analyzed for possible explanations:
  • Automated score higher than human → possible metric leniency
  • Human score higher than automated → possible metric stringency or human appreciation of factors not captured by rubrics
Interpretation Thresholds:
| Pearson $r$ | Interpretation | Action |
| --- | --- | --- |
| $\geq 0.70$ | Strong correlation | Automated metrics are reliable; safe for RAG ingestion |
| $\geq 0.40$ | Moderate correlation | Metrics partially capture human preferences; threshold tuning recommended |
| $< 0.40$ | Weak correlation | Metrics need significant revision; review domain-specific criteria |
Batch Regression Detection: The system supports comparing consecutive batch runs to detect metric regression. For each metric, a change of more than 5% is flagged as an improvement or regression, enabling continuous monitoring of agent quality over prompt iterations.

6.6 The Feedback Loop

The evaluation pipeline implements a closed-loop improvement cycle:
Human Ratings → Embedding → RAG Store → Agent Prompt Injection → New Responses → Human Ratings → ...
The cycle operates as follows:
  1. Validation: A batch is reviewed. If $r_{\text{Pearson}} \geq 0.40$, the batch is deemed reliable for RAG ingestion.
  2. Embedding: Rated responses are encoded into 1536-dimensional vectors with full metadata.
  3. Retrieval: When an agent generates a new response, the system retrieves the $k$ most similar positive (4–5 star) and negative (1–2 star) examples.
  4. Steering: Examples are injected into the system prompt as structured “Do” and “Don’t” guidance sections.
This is functionally a few-shot selection mechanism where the shots are dynamically chosen based on semantic similarity to the current context, rather than statically defined. The approach has the advantage of automatically adapting to new scenarios and question types as the example corpus grows.

6.7 Continuous Monitoring

A nightly cron job (2:00 AM UTC) runs automated evaluation on the 10 most recent completed sessions, storing results and computing correlation against any available human ratings. This provides:
  • Continuous regression detection for prompt changes
  • Longitudinal quality tracking per metric
  • Early warning for model degradation (e.g., after provider model updates)

7 Vector Memory and Cross-Examination Planning

7.1 Memory Chunking Pipeline

Each witness answer is processed through a memory ingestion pipeline that operates asynchronously (fire-and-forget) to avoid blocking the main turn loop:
  1. Claim Extraction: Atomic factual claims are identified from the answer.
  2. Embedding: Each claim is embedded using text-embedding-3-small.
  3. Storage: Claims are stored in case_memory_chunks with session scoping, witness attribution, phase, turn number, and linked elicit IDs.
  4. Confidence Scoring: Each claim receives a confidence score (0–1) based on extraction quality.
The HNSW index on this table uses cosine distance operations (the <=> operator), with the similarity conversion $\text{sim} = 1 - \text{dist}$.

7.2 Cross-Examination Outline Generation

At the phase transition from direct to cross-examination, the system generates a structured cross-examination outline:
  1. Memory Query: All testimony chunks from the direct examination phase for the current witness are retrieved.
  2. Similarity Search: Key claims are compared against the witness’s affidavit and prior session memory to identify inconsistencies.
  3. LLM Analysis: An LLM processes the direct testimony and produces RebuttalItem objects, each classified by:
    • Scope: within_direct_scope (addressable under FRE 611(b)) or credibility (always permissible)
    • Type: inconsistent_testimony, contradiction_with_prior_session, contradicts_evidence, prior_inconsistent_statement, impeachment_opportunity, bias_indicator
  4. Coverage Tracking: During cross-examination, each player question is matched against rebuttal items using semantic similarity. Covered items earn bonus points and are marked with the covering party and turn number.

8 Scenario Generation Pipeline

The system includes an LLM-powered pipeline for generating trial scenarios from uploaded legal documents:
  1. Content Extraction: Documents (PDF, DOCX) are processed through a multi-path extractor:
    • Text-based PDFs → MuPDF native extraction
    • Scanned PDFs → Smart OCR fallback (if words-per-page ratio is below threshold, pages are sent as base64 images to a vision model, with client-side Tesseract.js as a secondary path)
    • Word documents → Mammoth extraction
  2. Fact Extraction: An LLM processes the raw text to identify key facts, parties, claims, and legal theories.
  3. Scenario Generation: A second LLM call produces the full scenario structure: witnesses with affidavits and behavioral profiles, elicits with weights and categories, case theories, exhibits, and learning objectives.
The pipeline runs with elevated resource limits (3008 MB RAM, 300-second timeout) due to the computational intensity of document processing and multi-stage LLM inference.

9 Observability

All LLM calls are instrumented with OpenTelemetry spans via Langfuse:
  • Route-level tracing: withLangfuseTrace() wraps API handlers with trace context, recording session ID, user ID, and input.
  • Call-level telemetry: buildTelemetryConfig() tags each streamText/generateText call with a function ID that maps to the generation name in Langfuse (e.g., courtroom/turn-stream, g-eval-AffidavitFaithfulness).
  • Agent debug logging: An optional debug mode (?agentDebug=true) embeds agent-level debug events directly into the response stream, enabling real-time inspection of OCA decisions, scoring computations, and memory operations.

10 Limitations and Future Work

10.1 Current Limitations

  1. Atomic claim extraction: The memory chunking pipeline currently treats the full witness answer as a single claim. LLM-based decomposition into truly atomic statements would improve cross-examination outline quality and rebuttal coverage tracking.
  2. Static topic categories: Question deduplication uses a fixed set of topic keywords, which may not generalize across diverse legal domains (e.g., criminal law vs. maritime law). An embedding-based topic classification would be more robust.
  3. Evaluation sample size: Correlation analysis requires $n \geq 20$ paired ratings for reliable statistics. Early-stage deployments may not have sufficient human ratings to validate automated metrics.
  4. Context enrichment latency: The advanced pronoun resolution step adds an LLM call to the scoring pipeline, increasing turn latency by 500–1500ms. This is currently opt-in.
  5. Single-judge evaluation: G-Eval uses a single evaluator model (GPT-4o-mini). Multi-judge ensembles with majority voting could improve reliability, particularly for subjective metrics like ResponseAuthenticity.

10.2 Future Directions

  1. Reward modeling: Replace G-Eval with trained reward models fine-tuned on accumulated human ratings, reducing per-evaluation cost and improving domain specificity.
  2. Adaptive difficulty: Use student performance trajectories to automatically adjust intentionalErrorRate, witness cooperativeness, and scenario complexity — implementing a form of intelligent tutoring system within the multi-agent loop.
  3. Multi-session memory: Extend vector memory across sessions to enable longitudinal student skill tracking and personalized scenario recommendations.
  4. Adversarial robustness: Evaluate susceptibility to prompt injection through player questions designed to break agent role adherence, and develop guardrails.

References

  • Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv:2303.16634.
  • Abbasian Chaleshtari, M., et al. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. arXiv:2506.06941.
  • Federal Rules of Evidence, 28 U.S.C. (2024).

LitigationLabs, 2026. Internal Technical Report.