Multi-Agent Adversarial Dialogue for Pedagogical Legal Simulation
System design and evaluation methodology for a multi-agent courtroom simulation with hybrid scoring, RAG-based quality steering, and G-Eval validation.
We present the design and evaluation methodology for a multi-agent adversarial dialogue system that simulates courtroom trial proceedings for legal pedagogy. The system coordinates three specialized language model agents — an opposing counsel, a judicial officer, and a witness — within a turn-based game loop that requires long-horizon consistency, domain-grounded generation, and pedagogically calibrated difficulty. We introduce a hybrid scoring architecture that combines keyword-based term overlap, embedding-based semantic similarity, and context-aware pronoun resolution to evaluate student performance against structured factual targets (elicits). We describe a dual-purpose retrieval-augmented generation (RAG) pipeline that serves both as a few-shot steering mechanism for agent quality and as a per-session vector memory for cross-examination planning. We detail our evaluation framework, which implements G-Eval (Liu et al., 2023) with 13 domain-specific metrics across three agent types, validated through Pearson and Spearman correlation analysis against human expert ratings. We find that the interplay between intentional agent errors (pedagogical traps) and faithful simulation creates a novel tension in evaluation design: metrics must reward both correctness and deliberate incorrectness depending on the pedagogical context. We discuss the implications of this dual objective for LLM-as-judge reliability in educational domains.
Trial advocacy is a skill learned through repetition against adversarial opponents. Law students in the United States spend years studying the Federal Rules of Evidence (FRE), yet courtroom experience remains scarce — moot court competitions are infrequent, and supervised clinical rotations are limited. The gap between doctrinal knowledge and procedural fluency is well-documented in the legal education literature.

We address this gap with an interactive multi-agent simulation in which a student (the “player”) conducts direct and cross-examination of AI witnesses while an AI opposing counsel raises objections and an AI judge rules on them. The system must satisfy several constraints simultaneously:
Factual grounding. Witnesses must testify only to facts contained in their sworn affidavit. Fabrication is a failure mode, not a feature.
Adversarial fidelity. Opposing counsel must behave as a competent attorney — not a trivially beatable strawman — while occasionally introducing deliberate errors to test the student’s recognition skills.
Long-horizon consistency. A trial session may span 50–100 turns. Witnesses must not contradict their prior testimony. Opposing counsel must not re-ask questions already answered.
Measurable learning outcomes. Student performance must be quantified against structured factual targets, not merely tracked as token-level engagement.
These requirements place the system at the intersection of multi-agent coordination, constrained generation, retrieval-augmented memory, and domain-specific evaluation — areas where large language models excel individually but whose composition remains underexplored. Our contributions are as follows:
A multi-agent orchestration architecture that coordinates three LLM agents within a structured turn protocol, using mode-dependent prompt dispatch to support four distinct operational regimes.
A three-tier scoring system for measuring semantic entailment between free-form witness testimony and structured factual targets, with polarity-aware weighting for adversarial examination dynamics.
A context compression scheme that replaces raw transcript windows with semantically structured testimony state, enabling arbitrarily long trial sessions within fixed context budgets.
A closed-loop evaluation pipeline in which human expert ratings are embedded, retrieved via vector similarity, and injected as few-shot exemplars into agent prompts — creating a self-improving feedback loop.
A domain-specific evaluation framework with 13 G-Eval metrics, correlation-validated against human expert judgment using both Pearson and Spearman rank statistics.
The system is structured as a turn-based game loop in which each player utterance triggers a sequential pipeline of agent calls. The orchestrator (lib/courtroom/orchestrator.ts) manages three primary flows:
Player Examination Flow (handleTurn): Player asks a question → OCA evaluates for objectionability → Judge rules (if objection raised) → Witness answers (if not sustained) → Score computed.
OCA Examination Flow (handleOCAExaminationTurn): OCA generates a question → Player evaluates for objectionability → Judge rules → Witness answers → Score computed.
Objection Resolution Flow (handlePlayerObjection): Player objects to OCA’s question → Judge rules → Score adjusted.
All three flows share a common post-processing pipeline: testimony state compression, vector memory ingestion, and session persistence.
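A minimal sketch of the player-examination flow under these conventions; the agent interfaces and helpers below are illustrative stand-ins, not the orchestrator's actual API:

```ts
interface Event { role: "player" | "judge" | "witness"; text: string }
interface Session { testimonyState: unknown; activeElicits: unknown[] }

declare const oca: {
  evaluateQuestion(q: string, state: unknown): Promise<{ raised: boolean; ground?: string }>;
};
declare const judge: {
  rule(objection: unknown, q: string): Promise<{ ruling: "sustained" | "overruled"; reason: string }>;
};
declare const witness: { answer(q: string, state: unknown): Promise<string> };
declare function scoreAgainstElicits(answer: string, elicits: unknown[]): Promise<number>;
declare function postProcess(session: Session, events: Event[]): Promise<void>; // compression + memory + persistence

async function handleTurn(session: Session, playerQuestion: string) {
  const events: Event[] = [{ role: "player", text: playerQuestion }];

  // 1. OCA screens the question for objectionable defects.
  const objection = await oca.evaluateQuestion(playerQuestion, session.testimonyState);

  // 2. The judge rules only if an objection was raised.
  let sustained = false;
  if (objection.raised) {
    const ruling = await judge.rule(objection, playerQuestion);
    events.push({ role: "judge", text: ruling.reason });
    sustained = ruling.ruling === "sustained";
  }

  // 3. The witness answers unless the objection was sustained; the answer is scored.
  let score = 0;
  if (!sustained) {
    const answer = await witness.answer(playerQuestion, session.testimonyState);
    events.push({ role: "witness", text: answer });
    score = await scoreAgainstElicits(answer, session.activeElicits);
  }

  // 4. Shared post-processing: state compression, memory ingestion, persistence.
  await postProcess(session, events);
  return { events, score };
}
```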
Each agent is parameterized by an AgentConfig stored in a CMS-backed database, enabling hot-swappable prompt templates, model selection, and behavioral parameters without code deployment:
| Agent | Model | Key Parameters | Output Format |
| --- | --- | --- | --- |
| Opposing Counsel (OCA) | Configurable (default: GPT-4.1 via gateway) | intentionalErrorRate, objectionTypes[], temperature | Mode-dependent JSON schema |
The opposing counsel agent operates in four modes, determined by a policy function determineOCAMode() that examines the current trial phase, the witness’s party alignment, and the player’s side:
| Mode | Phase Context | OCA Behavior |
| --- | --- | --- |
| objection_user_direct | Player conducts direct examination | OCA monitors for FRE violations |
| objection_user_cross | Player conducts cross-examination | OCA monitors for scope/relevance violations |
| oc_direct | OCA conducts direct examination of own witness | OCA asks questions; player monitors |
| oc_cross | OCA cross-examines player's witness | OCA asks impeachment questions |
This dispatch mechanism is functionally a policy selector over a discrete action space, where the state is the tuple (phase, witness_side, player_side). The selected mode determines the system prompt template, output schema, and post-processing pipeline.
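A sketch of the dispatch as a pure policy function over that state tuple (the side and phase encodings here are assumed):

```ts
type Side = "plaintiff" | "defense";
type Phase = "direct" | "cross";
type OCAMode = "objection_user_direct" | "objection_user_cross" | "oc_direct" | "oc_cross";

function determineOCAMode(phase: Phase, witnessSide: Side, playerSide: Side): OCAMode {
  // On direct, the witness's own side examines; on cross, the opposing side does.
  const opposing: Side = witnessSide === "plaintiff" ? "defense" : "plaintiff";
  const examiner: Side = phase === "direct" ? witnessSide : opposing;

  if (examiner === playerSide) {
    // Player asks; OCA monitors and may object.
    return phase === "direct" ? "objection_user_direct" : "objection_user_cross";
  }
  // OCA asks; player monitors and may object.
  return phase === "direct" ? "oc_direct" : "oc_cross";
}
```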
A distinctive design requirement is that OCA must sometimes produce incorrect outputs on purpose. The intentionalErrorRate parameter (default: 30%) controls how often OCA generates:
Defective questions (in examination mode): Leading questions on direct, hearsay-laden prompts, or questions assuming facts not in evidence.
Incorrect objections (in objection mode): Objecting to perfectly proper questions on spurious grounds.
This is analogous to an exploration strategy in curriculum learning — the agent deliberately introduces noise to force the student to develop error-detection skills. The system tags these outputs with is_intentionally_defective or is_intentionally_incorrect flags for downstream scoring and evaluation filtering.
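A sketch of how a trap turn might be sampled and tagged; intentionalErrorRate comes from the AgentConfig above, while the surrounding shapes are assumed:

```ts
interface OCAOutput { text: string; is_intentionally_defective: boolean }

function maybeInjectDefect(
  config: { intentionalErrorRate: number }, // default 0.30
  generate: (defective: boolean) => string  // prompt template varies with the flag
): OCAOutput {
  // Bernoulli draw: with probability intentionalErrorRate, instruct the model
  // to produce a defective question (or a spurious objection, depending on mode).
  const defective = Math.random() < config.intentionalErrorRate;
  // Tag the output for downstream scoring and evaluation filtering.
  return { text: generate(defective), is_intentionally_defective: defective };
}
```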
The core scoring problem is: given a free-form witness answer $a$ and a set of structured factual targets (elicits) $E = \{e_1, e_2, \ldots, e_n\}$, determine which elicits are semantically entailed by the answer. Each elicit $e_i$ has:
A natural language label $\ell_i$ (e.g., “The ship was traveling at 22.5 knots”)
A signed weight $w_i \in \mathbb{R}$ indicating polarity
A unique identifier for deduplication
An elicit is considered unlocked when the witness’s answer sufficiently entails its label. The scoring function must handle paraphrase, partial entailment, and pronominal reference — the witness rarely states facts in the exact language of the elicit.
The baseline approach extracts key terms from both the answer and elicit label after stop-word removal, then computes a coverage ratio:

$$\text{score}_{\text{kw}}(a, e_i) = \frac{|\text{terms}(a) \cap \text{terms}(\ell_i)| + 0.5 \cdot |\text{fuzzy}(a, \ell_i)|}{|\text{terms}(\ell_i)|}$$

where $\text{fuzzy}(a, \ell_i)$ counts substring-matched terms with partial credit. The default threshold is $\tau_{\text{kw}} = 0.30$ (30% of key terms must match).
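A minimal sketch of this Tier 1 computation, assuming a toy stop-word list and a simple substring rule for fuzzy matching:

```ts
// Tier 1: keyword coverage ratio (sketch; the stop-word list and fuzzy rule
// are simplified assumptions, not the production implementation).
const STOP_WORDS = new Set(["the", "a", "an", "was", "is", "at", "of", "to", "and"]);

function terms(text: string): Set<string> {
  return new Set(
    text.toLowerCase().split(/\W+/).filter((w) => w.length > 2 && !STOP_WORDS.has(w))
  );
}

function keywordScore(answer: string, elicitLabel: string): number {
  const a = terms(answer);
  const e = terms(elicitLabel);
  if (e.size === 0) return 0;

  let exact = 0;
  let fuzzy = 0;
  for (const t of e) {
    if (a.has(t)) exact++;
    // Substring matches ("traveling" vs. "travel") earn 0.5 partial credit.
    else if ([...a].some((w) => w.includes(t) || t.includes(w))) fuzzy++;
  }
  return (exact + 0.5 * fuzzy) / e.size; // matched at Tier 1 if >= 0.30
}
```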
The production approach generates embeddings for both the answer and all candidate elicits using text-embedding-3-small (1536 dimensions), then computes cosine similarity:

$$\text{score}_{\text{sem}}(a, e_i) = \frac{v_a \cdot v_{e_i}}{\lVert v_a \rVert \cdot \lVert v_{e_i} \rVert}$$

The system uses a hybrid decision rule: an elicit is matched if either the semantic score exceeds $\tau_{\text{sem}} = 0.40$ or the keyword score exceeds $\tau_{\text{kw}} = 0.30$. This OR-gate design ensures that lexically similar but semantically distant matches (proper nouns, dates) and semantically equivalent but lexically different matches (paraphrases) are both captured.

Elicit embeddings are cached permanently in memory (they are static per scenario), while answer embeddings use a 60-second TTL cache to avoid redundant API calls within a turn. A “strong match” threshold at $\tau_{\text{strong}} = 0.60$ provides a confidence signal used in downstream reporting.
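The OR-gate decision rule itself is compact; a sketch, assuming an embed() helper that wraps the cached embedding calls and the keywordScore() from Tier 1:

```ts
// Tier 2: hybrid semantic + keyword matching (sketch; embed() is an assumed
// wrapper around text-embedding-3-small with the caching policy above).
const TAU_SEM = 0.40;
const TAU_KW = 0.30;
const TAU_STRONG = 0.60;

declare function embed(text: string): Promise<number[]>;
declare function keywordScore(answer: string, label: string): number;

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function matchElicit(answer: string, elicitLabel: string) {
  const [va, ve] = await Promise.all([embed(answer), embed(elicitLabel)]);
  const sem = cosineSimilarity(va, ve);
  return {
    matched: sem >= TAU_SEM || keywordScore(answer, elicitLabel) >= TAU_KW, // OR-gate
    strong: sem >= TAU_STRONG, // confidence signal for downstream reporting
  };
}
```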
Tier 3: Context-Aware Matching with Pronoun Resolution
The most advanced mode addresses a specific failure mode: short, pronominal witness answers like “Yes, that’s correct” or “He did” that are semantically empty without the preceding question context. When advancedMatching.enabled = true, the system:
Extracts the k most recent question-answer exchanges from the transcript.
Calls an LLM to resolve pronouns and references in the answer, producing an enriched version (e.g., “Yes, that’s correct” → “The ship was traveling at 22.5 knots at the time of the collision”).
Runs semantic matching on the enriched text.
This is functionally a coreference resolution step, implemented as an LLM call rather than a traditional NLP pipeline, which handles the domain-specific reference patterns common in courtroom testimony.
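A sketch of the enrichment call, using the AI SDK's generateText (which the system already uses for agent generation); the prompt wording and model choice here are assumptions:

```ts
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Resolve pronouns/references in a short answer using recent Q&A context.
async function enrichAnswer(
  recentExchanges: { q: string; a: string }[],
  answer: string
): Promise<string> {
  const context = recentExchanges.map((x) => `Q: ${x.q}\nA: ${x.a}`).join("\n");
  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    prompt:
      `Rewrite the final answer as a standalone factual statement, resolving ` +
      `all pronouns and references using the preceding exchanges.\n\n` +
      `${context}\n\nFinal answer: "${answer}"\n\nStandalone statement:`,
  });
  return text.trim();
}
```

The enriched text then flows through the same Tier 2 matching as any other answer.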
Elicit weights carry sign information that encodes which trial side benefits from the fact being established:

$$w_i > 0 \implies \text{benefits the witness's side (active during direct)}$$
$$w_i < 0 \implies \text{benefits the opposing side (active during cross)}$$

The scoring function filters elicits by polarity before matching:

$$E_{\text{active}} = \begin{cases} \{e_i \in E : w_i \ge 0\} & \text{if direct examination} \\ \{e_i \in E : w_i < 0\} & \text{if cross-examination} \end{cases}$$

The absolute value $|w_i|$ determines point value. This design reflects a core trial advocacy principle: on direct examination, an attorney seeks to establish favorable facts (positive elicits), while on cross-examination, the goal is to elicit unfavorable admissions (negative elicits).
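The polarity filter reduces to a sign test over weights; a sketch with an assumed Elicit shape:

```ts
interface Elicit { id: string; label: string; weight: number }

// Positive weights are active on direct, negative weights on cross;
// |weight| is the point value when the elicit is unlocked.
function activeElicits(all: Elicit[], phase: "direct" | "cross"): Elicit[] {
  return phase === "direct"
    ? all.filter((e) => e.weight >= 0)
    : all.filter((e) => e.weight < 0);
}
```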
Player objections to OCA questions are scored on a discrete scale:
| Outcome | Points | Condition |
| --- | --- | --- |
| Correct objection (sustained, question was defective) | +2 | wasSustained && wasDefective |
| Correct type bonus | +1 | Player identified exact defect type |
| Incorrect objection (proper question) | -1 | !wasDefective |
| Missed objection (defective question passed) | -1 | wasDefective && !objected |
| Overruled on defective question | 0 | Learning moment, no penalty |
This reward structure creates an asymmetric incentive: objecting carries risk (potential -1) but higher reward (+2 or +3), while passing is safe only when the question is proper. The 0-point “overruled on defective” case is a deliberate design choice — it acknowledges that the student recognized an issue but couldn’t articulate a winning objection, which has pedagogical value.
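The full table collapses to a small pure function; a sketch with an assumed outcome record:

```ts
interface ObjectionOutcome {
  objected: boolean;
  wasSustained: boolean;
  wasDefective: boolean;
  correctTypeIdentified: boolean;
}

function scoreObjection(o: ObjectionOutcome): number {
  if (o.objected) {
    if (o.wasSustained && o.wasDefective) {
      return 2 + (o.correctTypeIdentified ? 1 : 0); // correct objection + type bonus
    }
    if (!o.wasDefective) return -1; // objected to a proper question
    return 0; // overruled on a defective question: learning moment, no penalty
  }
  return o.wasDefective ? -1 : 0; // missed a defective question vs. correctly passing
}
```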
A full trial transcript can reach 50,000+ tokens over 100 turns. Naively including the full transcript in each agent prompt would either exceed context limits or consume budget that should be allocated to reasoning. We address this with a testimony state abstraction that replaces the raw transcript.
The TestimonyState object is a structured summary maintained incrementally after each exchange:
```
TestimonyState := {
  playerEstablishedFacts: EstablishedFact[]   // Facts proven by player
  ocaEstablishedFacts:    EstablishedFact[]   // Facts proven by OCA
  questionsAsked:         QuestionRecord[]    // Compressed question log
  witnessRecords:         Map<WitnessId, WitnessTestimonyRecord>
  admissions:             string[]            // Key contradictions
}
```
Each EstablishedFact is derived from witness testimony through a fact extraction pipeline:
Confirmatory answer detection: Short answers (“Yes”, “That’s correct”) are identified via regex pattern matching and enriched with the question premise to produce a complete factual statement (e.g., "Witness confirmed: the ship was traveling at 22.5 knots").
Substantive answer decomposition: Longer answers are split into individual sentences, filtered for non-fact statements (“I don’t know”, “I’m not sure”), and prefixed with witness attribution.
Elicit linking: Extracted facts are matched against scenario elicits using the same keyword/semantic pipeline described in Section 3, and tagged with the matched elicit ID and weight.
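A sketch of steps 1–2, with illustrative regexes (the production patterns are more extensive):

```ts
const CONFIRMATORY = /^(yes|yeah|correct|that'?s (right|correct)|i (do|did))\b/i;
const NON_FACT = /^(i don'?t know|i'?m not sure|i can'?t recall)/i;

function extractFacts(question: string, answer: string, witnessName: string): string[] {
  if (CONFIRMATORY.test(answer.trim())) {
    // Enrich a bare confirmation with the question's premise.
    const premise = question
      .replace(/^(isn'?t it true that|would you agree that)\s*/i, "")
      .replace(/\?$/, "");
    return [`Witness confirmed: ${premise}`];
  }
  // Decompose substantive answers into sentences, dropping non-facts.
  return answer
    .split(/(?<=[.!?])\s+/)
    .filter((s) => s.length > 0 && !NON_FACT.test(s))
    .map((s) => `${witnessName} testified: ${s}`);
}
```

Step 3 (elicit linking) then runs each extracted fact through the keyword/semantic matcher described in Section 3.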
A critical failure mode in long sessions is the OCA asking semantically identical questions. We implement a two-layer deduplication system.

Layer 1: Lexical Similarity (Jaccard). Questions are normalized (stop-word removal, word sorting for order independence) and compared using Jaccard similarity over word sets:

$$J(q_1, q_2) = \frac{|W_{q_1} \cap W_{q_2}|}{|W_{q_1} \cup W_{q_2}|}$$

Layer 2: Topic Categorization. Questions are classified into predefined topic categories (weather, speed, collision, observation, time, safety, etc.) using keyword matching. Topic overlap is computed:

$$T(q_1, q_2) = \frac{|\text{topics}(q_1) \cap \text{topics}(q_2)|}{\max(|\text{topics}(q_1)|, |\text{topics}(q_2)|)}$$

Combined Score. The final similarity uses a weighted combination:

$$\text{sim}(q_1, q_2) = 0.6 \cdot J(q_1, q_2) + 0.4 \cdot T(q_1, q_2)$$

A question is blocked if $\text{sim} \ge 0.65$ against any previously asked question for the same witness. The 60/40 weighting reflects our empirical observation that lexical overlap is a stronger signal than topic co-occurrence for detecting paraphrased questions, while topic overlap catches cases where different words address the same subject.
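A sketch of both layers, assuming a toy topic-keyword map:

```ts
const TOPIC_KEYWORDS: Record<string, string[]> = {
  weather: ["weather", "rain", "fog", "visibility"],
  speed: ["speed", "knots", "fast", "slow"],
  collision: ["collision", "impact", "struck"],
};

const normalize = (q: string): Set<string> =>
  new Set(q.toLowerCase().split(/\W+/).filter((w) => w.length > 2));

function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = [...a].filter((w) => b.has(w)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : inter / union;
}

function topicsOf(words: Set<string>): Set<string> {
  return new Set(
    Object.entries(TOPIC_KEYWORDS)
      .filter(([, kws]) => kws.some((k) => words.has(k)))
      .map(([topic]) => topic)
  );
}

function topicOverlap(t1: Set<string>, t2: Set<string>): number {
  const denom = Math.max(t1.size, t2.size);
  return denom === 0 ? 0 : [...t1].filter((t) => t2.has(t)).length / denom;
}

function isDuplicate(q1: string, q2: string): boolean {
  const w1 = normalize(q1), w2 = normalize(q2);
  const sim = 0.6 * jaccard(w1, w2) + 0.4 * topicOverlap(topicsOf(w1), topicsOf(w2));
  return sim >= 0.65; // block threshold
}
```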
The testimony state is not presented uniformly to all agents. Instead, each agent receives a projection tailored to its role:
OCA receives: topics already covered, questions already asked (as a “do not repeat” list), elicits remaining to target, and an examination plan derived from unestablished elicits or rebuttal items.
Witness receives: their own prior statements (for consistency), confirmed facts, and denied facts — constrained to the current witness’s record only.
Judge receives: prior rulings in the session, for consistency in judicial temperament.
This selective projection is analogous to an attention mask — each agent sees only the subset of accumulated state relevant to its generation task, preventing cross-contamination of role-specific information.
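A sketch of the projection dispatch, with simplified state shapes (the actual TestimonyState fields are richer, and prior rulings may be tracked separately):

```ts
interface TestimonyStateView {
  questionsAsked: { witnessId: string; text: string; topics: string[] }[];
  witnessRecords: Map<string, { statements: string[]; confirmed: string[]; denied: string[] }>;
  priorRulings: { objectionType: string; ruling: "sustained" | "overruled" }[];
}

function projectFor(role: "oca" | "witness" | "judge", s: TestimonyStateView, witnessId: string) {
  switch (role) {
    case "oca": {
      const asked = s.questionsAsked.filter((q) => q.witnessId === witnessId);
      return {
        topicsCovered: [...new Set(asked.flatMap((q) => q.topics))],
        doNotRepeat: asked.map((q) => q.text),
        // examination plan from unestablished elicits omitted for brevity
      };
    }
    case "witness": {
      // Constrained to the current witness's own record, for consistency.
      const rec = s.witnessRecords.get(witnessId);
      return {
        priorStatements: rec?.statements ?? [],
        confirmedFacts: rec?.confirmed ?? [],
        deniedFacts: rec?.denied ?? [],
      };
    }
    case "judge":
      // Prior rulings only, for consistent judicial temperament.
      return { priorRulings: s.priorRulings };
  }
}
```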
The evaluation RAG pipeline implements a closed-loop system in which human quality judgments are recycled as agent steering signals.

Ingestion:
Human raters evaluate agent responses on a 1–5 star scale with optional textual feedback.
Upon batch completion, each rated response is embedded using text-embedding-3-small.
Embeddings are stored with metadata: agent type, rating, response text, context, scenario, phase, and pedagogical flags (isIntentionallyDefective, isIntentionallyIncorrect).
Retrieval:
Before generating an agent response, the system embeds the current conversation context.
Vector similarity search retrieves the k most similar previously-rated examples, split into positive (rating ≥4) and negative (rating ≤2) sets.
Examples are formatted as “GOOD EXAMPLES (aim for this quality)” and “EXAMPLES TO AVOID” sections and injected into the agent’s system prompt.
Pedagogical Filtering: A critical design decision is that the retrieval pipeline excludes intentionally defective/incorrect examples by default. When OCA is generating a pedagogical trap turn, the filter is inverted — the system retrieves only examples tagged as intentionally defective, providing few-shot guidance for what a “good bad question” looks like. Without this filter, the RAG pipeline would conflate genuine quality failures with deliberate pedagogical errors. (A sketch of the retrieval step appears after the configuration list below.)

Configuration (from CMS global):
examplesPerType: Number of positive/negative examples to retrieve (default: 2)
minRatingsForRetrieval: Minimum total embeddings before RAG activates (default: 10)
similarityThreshold: Cosine similarity floor for retrieval (default: 0.70)
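A sketch of retrieval and prompt injection, assuming a searchEmbeddings() helper over the embeddings store; the thresholds mirror the CMS defaults above:

```ts
interface RatedExample { responseText: string; rating: number; similarity: number }

declare function searchEmbeddings(
  queryEmbedding: number[],
  opts: {
    agentType: string;
    minSimilarity: number;
    limit: number;
    intentionallyDefective: boolean; // pedagogical filter
  }
): Promise<RatedExample[]>;

async function buildSteeringSection(
  contextEmbedding: number[],
  agentType: string,
  isTrapTurn: boolean // inverts the pedagogical filter on trap turns
): Promise<string> {
  const hits = await searchEmbeddings(contextEmbedding, {
    agentType,
    minSimilarity: 0.70, // similarityThreshold
    limit: 20,
    intentionallyDefective: isTrapTurn,
  });
  const good = hits.filter((h) => h.rating >= 4).slice(0, 2); // examplesPerType
  const bad = hits.filter((h) => h.rating <= 2).slice(0, 2);

  return [
    "GOOD EXAMPLES (aim for this quality):",
    ...good.map((e) => `- ${e.responseText}`),
    "EXAMPLES TO AVOID:",
    ...bad.map((e) => `- ${e.responseText}`),
  ].join("\n");
}
```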
Each witness answer is ingested into a per-session vector memory table:
Atomic claim extraction: The answer is decomposed into individual factual statements (currently a single-claim approach; LLM-based decomposition is planned).
Embedding generation: Each claim is embedded using text-embedding-3-small.
Storage: Claims are stored with session ID, witness ID, phase, turn number, and linked elicit/fact IDs.
This memory is queried at phase transitions (direct → cross) to generate cross-examination outlines — structured plans that identify contradictions, impeachment opportunities, and rebuttal items from direct testimony. Rebuttal items are classified by FRE 611(b) scope (within_direct_scope vs. credibility) and tracked for coverage during cross-examination.
Evaluating a multi-agent pedagogical system presents a challenge absent from standard LLM benchmarking: the system must sometimes be deliberately wrong. A witness who fabricates facts has failed. An opposing counsel who raises a spurious objection may have succeeded — if the pedagogical intent was to test the student’s ability to recognize bad objections. This dual objective means that a single quality metric is insufficient; evaluation must be conditioned on the agent’s intended behavior.
We implement G-Eval (Liu et al., 2023) as our automated evaluation framework. For each agent response, an evaluator LLM (GPT-4o-mini, T=0 for reproducibility) is prompted with:
Criteria definition: A natural language description of what constitutes quality for this metric.
Evaluation steps: An ordered checklist the evaluator should follow (optional, metric-dependent).
Evaluation data: The input, actual output, and any reference context (affidavit, established facts).
Scoring instructions: Return a JSON object with score (0–1 continuous) and reason (1–3 sentence rationale).
Scores are clamped to [0,1] with a configurable pass/fail threshold per metric.
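A sketch of a single metric evaluation using the AI SDK's generateObject with a Zod schema; the prompt scaffold follows the four components above, and the specific wording is an assumption:

```ts
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const GEvalResult = z.object({
  score: z.number().min(0).max(1),
  reason: z.string(),
});

async function runGEval(
  metric: { name: string; criteria: string; steps?: string[]; threshold: number },
  data: { input: string; actualOutput: string; retrievalContext?: string }
) {
  const { object } = await generateObject({
    model: openai("gpt-4o-mini"),
    temperature: 0, // reproducibility
    schema: GEvalResult,
    prompt: [
      `Criteria: ${metric.criteria}`,
      metric.steps
        ? `Evaluation steps:\n${metric.steps.map((s, i) => `${i + 1}. ${s}`).join("\n")}`
        : "",
      `Input: ${data.input}`,
      `Actual output: ${data.actualOutput}`,
      data.retrievalContext ? `Reference context: ${data.retrievalContext}` : "",
      `Return a JSON object with "score" (0-1, continuous) and "reason" (1-3 sentences).`,
    ].filter(Boolean).join("\n\n"),
  });
  const score = Math.min(1, Math.max(0, object.score)); // clamp to [0, 1]
  return { metric: metric.name, score, reason: object.reason, passed: score >= metric.threshold };
}
```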
The four witness metrics:

| Metric | Criteria | Threshold | Evaluation Params |
| --- | --- | --- | --- |
| AffidavitFaithfulness | Response contains ONLY facts traceable to the affidavit. Penalizes fabrication, contradiction, and over-confidence on uncovered topics. | 0.70 | input, output, retrieval_context |
| BehavioralCompliance | Response matches configured personality profile along three axes: cooperativeness (hostile→eager), verbosity (terse→verbose), memory quality (poor→excellent). Conditioned on examination type (direct vs. cross). | 0.60 | input, output |
| ResponseAuthenticity | Response sounds like a real human witness — natural hesitations, appropriate emotion, first-person perspective. Penalizes robotic language, perfect rehearsed answers, meta-commentary. | 0.60 | input, output |
| WitnessAnswerRelevancy | Response directly addresses the question asked. Accommodates appropriate evasiveness for hostile witnesses. | 0.60 | input, output |
The BehavioralCompliance metric is noteworthy because it is dynamically constructed: the evaluator prompt is parameterized by the witness's configured cooperativeness, verbosity, and memory quality levels, producing a different evaluation rubric for each witness profile. This is necessary because a terse answer from a “verbose” witness is a failure, while the same answer from a “terse” witness is correct behavior.

The AffidavitFaithfulness metric uses the witness's affidavit as retrieval_context, enabling the evaluator to perform a claim-by-claim verification against the source document. The evaluation steps explicitly instruct the evaluator to check each claim in the response against the affidavit before assigning a score.
The four judge metrics:

| Metric | Criteria | Threshold | Evaluation Params |
| --- | --- | --- | --- |
| RulingCorrectness | Ruling is legally correct under the FRE. The evaluator is provided a comprehensive FRE reference covering hearsay (801–802), leading questions (611), relevance (401–402), foundation (602, 901), speculation (701), character evidence (404), scope of cross (611(b)), and best evidence (1002). | 0.70 | input, output |
| RuleCitation | Ruling cites the correct FRE rule number for the objection type. | 0.50 | input, output |
| JudicialDemeanor | Response is impartial, decisive, professional. No favoritism, lecturing, sarcasm, or character-breaking. | 0.60 | output |
| JudgeResponseFormat | Response follows the JSON schema {ruling, reason}. | 0.50 | output |
The RulingCorrectness metric embeds a full FRE reference document in the evaluation prompt, providing the evaluator with ground-truth legal rules. The evaluation prompt also includes the specific objection type and questioned text, enabling the evaluator to determine the correct ruling before scoring the agent’s output. The evaluation steps guide the evaluator through a legal reasoning chain: identify objection → recall applicable rule → analyze the question → determine correct ruling → compare to agent’s ruling → assess reasoning quality.
The five OCA metrics:

| Metric | Criteria | Threshold | Evaluation Params |
| --- | --- | --- | --- |
| OCATaskCompletion | Proper execution of the mode-appropriate action (objection decision or question formulation). Conditioned on the shouldBeIncorrect flag for intentional-error cases. | 0.60 | input, output |
| OCAStrategicQuality | Action is strategically sound from an adversarial perspective — worth making, well-timed, professionally credible. | 0.50 | input, output, context |
| OCAPedagogicalValue | Action provides a learning opportunity. For intentional errors: the defect should be recognizable but not obvious. For correct actions: demonstrates proper technique. | 0.50 | input, output |
| OCAResponseFormat | Response follows the expected JSON schema for the current mode. | 0.50 | output |
| OCAFactGrounding | Response references only established facts. Does not assume unestablished facts or reference unadmitted evidence. | 0.60 | output |
The OCA metrics demonstrate the pedagogical-correctness tension: OCATaskCompletion explicitly changes its evaluation criteria based on the shouldBeIncorrect flag. When the flag is true, the evaluator is instructed to score based on whether the agent successfully executed an incorrect objection, not whether it made a correct one. This is a departure from standard LLM evaluation, where correctness is uniformly desirable.
Session Selection: Completed sessions are identified as evaluation candidates. Sessions are filtered by recency (daysBack) and optionally by scenario or agent type.
Dataset Construction: The DatasetBuilder extracts evaluation test cases from raw transcripts. For each agent response in the transcript, it constructs an EvalTestCase with the preceding question/objection as input, the agent’s response as actualOutput, and a 5-event context window as context. Witness test cases include the affidavit as retrievalContext.
Rating Collection: Human raters evaluate responses on a 1–5 star scale with optional textual feedback. Ratings are stored with full provenance: session ID, transcript index, agent type, phase, and witness ID.
Embedding & RAG Ingestion: Upon batch completion, rated responses are embedded and stored in the eval_embeddings table for future RAG retrieval. Responses tagged as intentionally defective/incorrect are flagged for pedagogical filtering.
We validate automated metrics against human ratings using two complementary statistics.

Pearson correlation measures linear agreement:

$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - (\sum x_i)^2\right)\left(n\sum y_i^2 - (\sum y_i)^2\right)}}$$

where $x_i$ is the automated score and $y_i$ is the normalized human rating (mapped from $[1,5]$ to $[0,1]$ via $y_i = (h_i - 1)/4$).

Spearman rank correlation measures monotonic agreement without assuming linearity. We compute ranks for both automated and human scores, then apply Pearson correlation to the rank vectors. This captures cases where automated and human scores agree on relative ordering even if the absolute scale differs.

We report both statistics because they capture different failure modes (a sketch of both computations follows the list below):
High Pearson, low Spearman → scores are linearly related but with rank inversions at the extremes
Low Pearson, high Spearman → scores agree on ordering but with non-linear scaling
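Both statistics are straightforward to compute; a sketch (ties in the rank computation are broken by sort order rather than averaged, a simplification):

```ts
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const sx = x.reduce((a, b) => a + b, 0);
  const sy = y.reduce((a, b) => a + b, 0);
  const sxy = x.reduce((a, xi, i) => a + xi * y[i], 0);
  const sx2 = x.reduce((a, xi) => a + xi * xi, 0);
  const sy2 = y.reduce((a, yi) => a + yi * yi, 0);
  const denom = Math.sqrt((n * sx2 - sx * sx) * (n * sy2 - sy * sy));
  return denom === 0 ? 0 : (n * sxy - sx * sy) / denom;
}

function ranks(v: number[]): number[] {
  const r = new Array<number>(v.length);
  v.map((val, i) => ({ val, i }))
    .sort((a, b) => a.val - b.val)
    .forEach(({ i }, rank) => { r[i] = rank + 1; });
  return r;
}

// Spearman = Pearson applied to rank vectors.
const spearman = (x: number[], y: number[]) => pearson(ranks(x), ranks(y));

// Human ratings are normalized from [1, 5] to [0, 1] before comparison.
const normalizeHuman = (h: number) => (h - 1) / 4;
```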
Outlier Detection: We identify divergent cases where $|x_i - y_i| \ge 0.30$ (on the normalized scale). Each outlier is analyzed for possible explanations:
Automated score higher than human → possible metric leniency
Human score higher than automated → possible metric stringency or human appreciation of factors not captured by rubrics
Interpretation Thresholds:
| Pearson $r$ | Interpretation | Action |
| --- | --- | --- |
| $\ge 0.70$ | Strong correlation | Automated metrics are reliable; safe for RAG ingestion |
| $\ge 0.40$ | Moderate correlation | Metrics partially capture human preferences; threshold tuning recommended |
| $< 0.40$ | Weak correlation | Metrics need significant revision; review domain-specific criteria |
Batch Regression Detection: The system supports comparing consecutive batch runs to detect metric regression. For each metric, a change of >5% is flagged as improvement or regression, enabling continuous monitoring of agent quality over prompt iterations.
The evaluation pipeline implements a closed-loop improvement cycle:
Human Ratings → Embedding → RAG Store → Agent Prompt Injection → New Responses → Human Ratings → ...
The cycle operates as follows:
Validation: A batch is reviewed. If $r_{\text{Pearson}} \ge 0.40$, the batch is deemed reliable for RAG ingestion.
Embedding: Rated responses are encoded into 1536-dimensional vectors with full metadata.
Retrieval: When an agent generates a new response, the system retrieves the k most similar positive (4–5 star) and negative (1–2 star) examples.
Steering: Examples are injected into the system prompt as structured “Do” and “Don’t” guidance sections.
This is functionally a few-shot selection mechanism where the shots are dynamically chosen based on semantic similarity to the current context, rather than statically defined. The approach has the advantage of automatically adapting to new scenarios and question types as the example corpus grows.
A nightly cron job (2:00 AM UTC) runs automated evaluation on the 10 most recent completed sessions, storing results and computing correlation against any available human ratings. This provides:
Continuous regression detection for prompt changes
Longitudinal quality tracking per metric
Early warning for model degradation (e.g., after provider model updates)
Each witness answer is processed through a memory ingestion pipeline that operates asynchronously (fire-and-forget) to avoid blocking the main turn loop:
Claim Extraction: Atomic factual claims are identified from the answer.
Embedding: Each claim is embedded using text-embedding-3-small.
Storage: Claims are stored in case_memory_chunks with session scoping, witness attribution, phase, turn number, and linked elicit IDs.
Confidence Scoring: Each claim receives a confidence score (0–1) based on extraction quality.
The HNSW index on this table uses cosine-distance operations (the <=> operator), with the similarity conversion $\text{sim} = 1 - \text{dist}$.
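An illustrative pgvector query over case_memory_chunks (column names beyond those described above are assumptions):

```ts
import postgres from "postgres";

const sql = postgres(process.env.DATABASE_URL!);

// Retrieve the k most similar testimony claims within one session.
// <=> is pgvector's cosine-distance operator; similarity = 1 - distance.
async function recallTestimony(sessionId: string, queryEmbedding: number[], k = 5) {
  const vec = `[${queryEmbedding.join(",")}]`;
  return sql`
    SELECT claim_text, witness_id, turn_number,
           1 - (embedding <=> ${vec}::vector) AS similarity
    FROM case_memory_chunks
    WHERE session_id = ${sessionId}
    ORDER BY embedding <=> ${vec}::vector
    LIMIT ${k}
  `;
}
```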
Coverage Tracking: During cross-examination, each player question is matched against rebuttal items using semantic similarity. Covered items earn bonus points and are marked with the covering party and turn number.
The system includes an LLM-powered pipeline for generating trial scenarios from uploaded legal documents:
Content Extraction: Documents (PDF, DOCX) are processed through a multi-path extractor:
Text-based PDFs → MuPDF native extraction
Scanned PDFs → Smart OCR fallback (if words-per-page ratio is below threshold, pages are sent as base64 images to a vision model, with client-side Tesseract.js as a secondary path)
Word documents → Mammoth extraction
Fact Extraction: An LLM processes the raw text to identify key facts, parties, claims, and legal theories.
Scenario Generation: A second LLM call produces the full scenario structure: witnesses with affidavits and behavioral profiles, elicits with weights and categories, case theories, exhibits, and learning objectives.
The pipeline runs with elevated resource limits (3008 MB RAM, 300-second timeout) due to the computational intensity of document processing and multi-stage LLM inference.
All LLM calls are instrumented with OpenTelemetry spans via Langfuse:
Route-level tracing: withLangfuseTrace() wraps API handlers with trace context, recording session ID, user ID, and input.
Call-level telemetry: buildTelemetryConfig() tags each streamText/generateText call with a function ID that maps to the generation name in Langfuse (e.g., courtroom/turn-stream, g-eval-AffidavitFaithfulness).
Agent debug logging: An optional debug mode (?agentDebug=true) embeds agent-level debug events directly into the response stream, enabling real-time inspection of OCA decisions, scoring computations, and memory operations.
Atomic claim extraction: The memory chunking pipeline currently treats the full witness answer as a single claim. LLM-based decomposition into truly atomic statements would improve cross-examination outline quality and rebuttal coverage tracking.
Static topic categories: Question deduplication uses a fixed set of topic keywords, which may not generalize across diverse legal domains (e.g., criminal law vs. maritime law). An embedding-based topic classification would be more robust.
Evaluation sample size: Correlation analysis requires n≥20 paired ratings for reliable statistics. Early-stage deployments may not have sufficient human ratings to validate automated metrics.
Context enrichment latency: The advanced pronoun resolution step adds an LLM call to the scoring pipeline, increasing turn latency by 500–1500ms. This is currently opt-in.
Single-judge evaluation: G-Eval uses a single evaluator model (GPT-4o-mini). Multi-judge ensembles with majority voting could improve reliability, particularly for subjective metrics like ResponseAuthenticity.
Reward modeling: Replace G-Eval with trained reward models fine-tuned on accumulated human ratings, reducing per-evaluation cost and improving domain specificity.
Adaptive difficulty: Use student performance trajectories to automatically adjust intentionalErrorRate, witness cooperativeness, and scenario complexity — implementing a form of intelligent tutoring system within the multi-agent loop.
Multi-session memory: Extend vector memory across sessions to enable longitudinal student skill tracking and personalized scenario recommendations.
Adversarial robustness: Evaluate susceptibility to prompt injection through player questions designed to break agent role adherence, and develop guardrails.
Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv:2303.16634.
Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., & Farajtabar, M. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. arXiv:2506.06941.