Abstract
We present the design and evaluation methodology for a multi-agent adversarial dialogue system that simulates courtroom trial proceedings for legal pedagogy. The system coordinates three specialized language model agents (an opposing counsel, a judicial officer, and a witness) within a turn-based game loop that requires long-horizon consistency, domain-grounded generation, and pedagogically calibrated difficulty. We introduce a hybrid scoring architecture that combines keyword-based term overlap, embedding-based semantic similarity, and context-aware pronoun resolution to evaluate student performance against structured factual targets (elicits). We describe a dual-purpose retrieval-augmented generation (RAG) pipeline that serves both as a few-shot steering mechanism for agent quality and as a per-session vector memory for cross-examination planning. We detail our evaluation framework, which implements G-Eval (Liu et al., 2023) with 13 domain-specific metrics across three agent types, validated through Pearson and Spearman correlation analysis against human expert ratings. We find that the interplay between intentional agent errors (pedagogical traps) and faithful simulation creates a novel tension in evaluation design: metrics must reward both correctness and deliberate incorrectness depending on the pedagogical context. We discuss the implications of this dual objective for LLM-as-judge reliability in educational domains.
1 Introduction
Trial advocacy is a skill learned through repetition against adversarial opponents. Law students in the United States spend years studying the Federal Rules of Evidence (FRE), yet courtroom experience remains scarce: moot court competitions are infrequent, and supervised clinical rotations are limited. The gap between doctrinal knowledge and procedural fluency is well-documented in legal education literature. We address this gap with an interactive multi-agent simulation in which a student (the “player”) conducts direct and cross-examination of AI witnesses while an AI opposing counsel raises objections and an AI judge rules on them. The system must satisfy several constraints simultaneously:
- Factual grounding. Witnesses must testify only to facts contained in their sworn affidavit. Fabrication is a failure mode, not a feature.
- Adversarial fidelity. Opposing counsel must behave as a competent attorney — not a trivially beatable strawman — while occasionally introducing deliberate errors to test the student’s recognition skills.
- Long-horizon consistency. A trial session may span 50–100 turns. Witnesses must not contradict their prior testimony. Opposing counsel must not re-ask questions already answered.
- Measurable learning outcomes. Student performance must be quantified against structured factual targets, not merely tracked as token-level engagement.
1.1 Contributions
We make the following contributions:
- A multi-agent orchestration architecture that coordinates three LLM agents within a structured turn protocol, using mode-dependent prompt dispatch to support four distinct operational regimes.
- A three-tier scoring system for measuring semantic entailment between free-form witness testimony and structured factual targets, with polarity-aware weighting for adversarial examination dynamics.
- A context compression scheme that replaces raw transcript windows with semantically structured testimony state, enabling arbitrarily long trial sessions within fixed context budgets.
- A closed-loop evaluation pipeline in which human expert ratings are embedded, retrieved via vector similarity, and injected as few-shot exemplars into agent prompts — creating a self-improving feedback loop.
- A domain-specific evaluation framework with 13 G-Eval metrics, correlation-validated against human expert judgment using both Pearson and Spearman rank statistics.
2 System Architecture
2.1 Overview
The system is structured as a turn-based game loop in which each player utterance triggers a sequential pipeline of agent calls. The orchestrator (lib/courtroom/orchestrator.ts) manages three primary flows:
- Player Examination Flow (handleTurn): Player asks a question → OCA evaluates for objectionability → Judge rules (if objection raised) → Witness answers (if not sustained) → Score computed.
- OCA Examination Flow (handleOCAExaminationTurn): OCA generates a question → Player evaluates for objectionability → Judge rules → Witness answers → Score computed.
- Objection Resolution Flow (handlePlayerObjection): Player objects to OCA’s question → Judge rules → Score adjusted.
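The Player Examination Flow can be sketched as a pure function that returns which agent calls run for one player question. This is a minimal illustration, not the orchestrator's actual code: the type names, `planPlayerTurn`, and the choice to still score a turn after a sustained objection are assumptions.

```typescript
// Hypothetical names; only the step order comes from the flow description above.
type Ruling = "sustained" | "overruled";
type TurnStep = "oca_review" | "judge_ruling" | "witness_answer" | "score";

interface OcaReview {
  objects: boolean; // did opposing counsel raise an objection?
}

// Returns the ordered agent calls for one player question.
// `ruling` is only consulted when OCA objects.
function planPlayerTurn(review: OcaReview, ruling?: Ruling): TurnStep[] {
  const steps: TurnStep[] = ["oca_review"];
  if (review.objects) {
    if (ruling === undefined) {
      throw new Error("An objection requires a ruling before the turn can proceed");
    }
    steps.push("judge_ruling");
    if (ruling === "sustained") {
      // Sustained objection: the witness never answers.
      // Assumption: the turn is still scored.
      steps.push("score");
      return steps;
    }
  }
  steps.push("witness_answer", "score");
  return steps;
}
```

For example, `planPlayerTurn({ objects: true }, "overruled")` yields all four steps in order, while a sustained objection skips the witness call.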
2.2 Agent Specification
Each agent is parameterized by an AgentConfig stored in a CMS-backed database, enabling hot-swappable prompt templates, model selection, and behavioral parameters without code deployment:
| Agent | Model | Key Parameters | Output Format |
|---|---|---|---|
| Opposing Counsel (OCA) | Configurable (default: GPT-4.1 via gateway) | intentionalErrorRate, objectionTypes[], temperature | Structured JSON (objection or question) |
| Judge | Configurable | rulingStyle, strictness, sustainBias | JSON {ruling, reason} |
| Witness | Configurable (per-witness override) | cooperativeness, verbosity, memoryQuality | Free-form natural language |
2.3 OCA Mode Dispatch
The opposing counsel agent operates in four modes, determined by a policy function determineOCAMode() that examines the current trial phase, the witness’s party alignment, and the player’s side:
| Mode | Phase Context | OCA Behavior |
|---|---|---|
| objection_user_direct | Player conducts direct examination | OCA monitors for FRE violations |
| objection_user_cross | Player conducts cross-examination | OCA monitors for scope/relevance violations |
| oc_direct | OCA conducts direct examination of own witness | OCA asks questions, player monitors |
| oc_cross | OCA cross-examines player’s witness | OCA asks impeachment questions |
The mode is a function of the tuple (phase, witness_side, player_side). The selected mode determines the system prompt template, output schema, and post-processing pipeline.
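One way the mode dispatch could work is sketched below. It assumes that the party who called a witness conducts direct examination and the other party conducts cross; the `TrialPhase` and `Side` value sets are hypothetical, and the body is an illustration rather than the system's actual determineOCAMode().

```typescript
// Hypothetical value sets for the three inputs.
type TrialPhase = "direct" | "cross";
type Side = "plaintiff" | "defense";
type OCAMode =
  | "objection_user_direct"
  | "objection_user_cross"
  | "oc_direct"
  | "oc_cross";

// Assumption: the calling party examines on direct, the opposing party on cross.
function determineOCAMode(phase: TrialPhase, witnessSide: Side, playerSide: Side): OCAMode {
  const playerCalledWitness = witnessSide === playerSide;
  if (phase === "direct") {
    // Player examines own witness, or OCA examines its own witness.
    return playerCalledWitness ? "objection_user_direct" : "oc_direct";
  }
  // On cross, the roles swap: the non-calling party asks the questions.
  return playerCalledWitness ? "oc_cross" : "objection_user_cross";
}
```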
2.4 Intentional Error System
A distinctive design requirement is that OCA must sometimes produce incorrect outputs on purpose. The intentionalErrorRate parameter (default: 30%) controls how often OCA generates:
- Defective questions (in examination mode): Leading questions on direct, hearsay-laden prompts, or questions assuming facts not in evidence.
- Incorrect objections (in objection mode): Objecting to perfectly proper questions on spurious grounds.
Such outputs are tagged with is_intentionally_defective or is_intentionally_incorrect flags for downstream scoring and evaluation filtering.
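A minimal sketch of how the error rate might gate trap generation. It assumes a single random draw per OCA action; `OcaActivity`, `decideIntentionalError`, and the injectable `rng` parameter are hypothetical, while the flag names and the 30% default come from the text above.

```typescript
type OcaActivity = "examination" | "objection";

interface OcaErrorDecision {
  is_intentionally_defective: boolean; // defective question (examination mode)
  is_intentionally_incorrect: boolean; // spurious objection (objection mode)
}

// `rng` is injectable so the decision can be tested deterministically.
function decideIntentionalError(
  activity: OcaActivity,
  intentionalErrorRate: number = 0.3,
  rng: () => number = Math.random,
): OcaErrorDecision {
  if (intentionalErrorRate < 0 || intentionalErrorRate > 1) {
    throw new RangeError("intentionalErrorRate must be within [0, 1]");
  }
  const inject = rng() < intentionalErrorRate;
  return {
    is_intentionally_defective: inject && activity === "examination",
    is_intentionally_incorrect: inject && activity === "objection",
  };
}
```

Carrying the flags on the output lets scoring and evaluation tell a deliberate trap from a genuine model mistake.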
3 Scoring and Semantic Matching
3.1 Problem Formulation
The core scoring problem is: given a free-form witness answer and a set of structured factual targets (elicits), determine which elicits are semantically entailed by the answer. Each elicit has:
- A natural language label (e.g., “The ship was traveling at 22.5 knots”)
- A signed weight indicating polarity
- A unique identifier for deduplication
3.2 Three-Tier Matching Architecture
We implement three progressively more powerful matching strategies, selectable at runtime via configuration:
Tier 1: Keyword Matching
The baseline approach extracts key terms from both the answer and the elicit label after stop-word removal, then computes a coverage ratio: the number of matched key terms divided by the total number of key terms, where the matched count includes substring-matched terms with partial credit. The default threshold is 0.3 (30% of key terms must match).
Tier 2: Semantic Embedding Matching
The production approach generates embeddings for both the answer and all candidate elicits using text-embedding-3-small (1536 dimensions), then computes cosine similarity:
$$\mathrm{sim}(\mathbf{a}, \mathbf{e}) = \frac{\mathbf{a} \cdot \mathbf{e}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{e} \rVert}$$
where $\mathbf{a}$ and $\mathbf{e}$ are the answer and elicit embedding vectors.
The system uses a hybrid decision rule: an elicit is matched if either the semantic score exceeds the semantic threshold or the keyword score exceeds the keyword threshold. This OR-gate design ensures that lexically similar but semantically distant matches (proper nouns, dates) and semantically equivalent but lexically different matches (paraphrases) are both captured.
Elicit embeddings are cached permanently in memory (they are static per scenario), while answer embeddings use a 60-second TTL cache to avoid redundant API calls within a turn.
A separate “strong match” threshold provides a confidence signal used in downstream reporting.
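The first two tiers and the OR-gate can be sketched as follows. This is an illustration under assumptions: the stop-word list, the 0.5 partial credit for substring matches, the use of the elicit's key terms as the denominator, and the caller-supplied `semanticThreshold` are hypothetical. Embeddings are taken as plain number arrays so the example stays self-contained.

```typescript
// Illustrative stop-word list (assumption).
const MATCH_STOP_WORDS = ["the", "a", "an", "was", "is", "at", "of", "to", "in", "and", "it"];

// Lowercased key terms; decimals such as "22.5" stay as one token.
function keyTerms(text: string): string[] {
  const tokens: string[] = text.toLowerCase().match(/[a-z0-9]+(?:\.[0-9]+)?/g) || [];
  return tokens.filter((t) => MATCH_STOP_WORDS.indexOf(t) === -1);
}

// Fraction of elicit key terms covered by the answer.
// Exact matches earn 1; substring matches earn 0.5 (assumed partial credit).
function keywordCoverage(answer: string, elicitLabel: string): number {
  const answerTerms = keyTerms(answer);
  const elicitTerms = keyTerms(elicitLabel);
  if (elicitTerms.length === 0) return 0;
  let credit = 0;
  for (const term of elicitTerms) {
    if (answerTerms.indexOf(term) !== -1) {
      credit += 1;
    } else if (answerTerms.some((a) => a.indexOf(term) !== -1 || term.indexOf(a) !== -1)) {
      credit += 0.5;
    }
  }
  return credit / elicitTerms.length;
}

// Standard cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// OR-gate: either signal alone is enough to count the elicit as matched.
function isElicitMatched(
  semanticScore: number,
  keywordScore: number,
  semanticThreshold: number,
  keywordThreshold: number = 0.3,
): boolean {
  return semanticScore > semanticThreshold || keywordScore > keywordThreshold;
}
```

With these definitions, "The ship was going 22.5 knots" covers three of the four key terms in "The ship was traveling at 22.5 knots", so the keyword path alone clears the 0.3 threshold even when the semantic score is low.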
Tier 3: Context-Aware Matching with Pronoun Resolution
The most advanced mode addresses a specific failure mode: short, pronominal witness answers like “Yes, that’s correct” or “He did” that are semantically empty without the preceding question context. When advancedMatching.enabled = true, the system:
- Extracts the most recent question-answer exchanges from the transcript.
- Calls an LLM to resolve pronouns and references in the answer, producing an enriched version (e.g., “Yes, that’s correct” → “The ship was traveling at 22.5 knots at the time of the collision”).
- Runs semantic matching on the enriched text.
3.3 Polarity-Aware Scoring
Elicit weights carry sign information that encodes which trial side benefits from the fact being established: positive weights mark favorable facts and negative weights mark unfavorable admissions. The scoring function filters elicits by polarity before matching: direct examination considers only positive-weight elicits, and cross-examination considers only negative-weight elicits. The absolute value $|w|$ of the weight determines the point value. This design reflects a core trial advocacy principle: on direct examination, an attorney seeks to establish favorable facts (positive elicits), while on cross-examination, the goal is to elicit unfavorable admissions (negative elicits).
3.4 Objection Scoring
Player objections to OCA questions are scored on a discrete scale:
| Outcome | Points | Condition |
|---|---|---|
| Correct objection (sustained, question was defective) | +2 | wasSustained && wasDefective |
| Correct type bonus | +1 | Player identified exact defect type |
| Incorrect objection (proper question) | -1 | !wasDefective |
| Missed objection (defective question passed) | -1 | wasDefective && !objected |
| Overruled on defective question | 0 | Learning moment, no penalty |
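The polarity filter of Section 3.3 and the objection table above can be sketched together. The type and function names are hypothetical. The point values follow the table, with two assumptions: staying silent on a proper question scores 0 (the table does not list that case), and the type bonus applies only to sustained objections.

```typescript
interface Elicit {
  id: string;
  label: string;
  weight: number; // sign encodes polarity, magnitude encodes points
}
type ExamType = "direct" | "cross";

// Direct examination targets positive elicits; cross targets negative ones.
function candidateElicits(elicits: Elicit[], exam: ExamType): Elicit[] {
  return elicits.filter((e) => (exam === "direct" ? e.weight > 0 : e.weight < 0));
}

// Points awarded for establishing an elicit.
function elicitPoints(elicit: Elicit): number {
  return Math.abs(elicit.weight);
}

interface ObjectionOutcome {
  objected: boolean;
  wasDefective: boolean;
  wasSustained: boolean;
  identifiedExactType: boolean;
}

function scoreObjection(o: ObjectionOutcome): number {
  if (!o.objected) {
    // Missed objection costs a point; silence on a proper question is neutral (assumption).
    return o.wasDefective ? -1 : 0;
  }
  if (!o.wasDefective) return -1; // objected to a proper question
  if (!o.wasSustained) return 0; // overruled on a defective question: no penalty
  return 2 + (o.identifiedExactType ? 1 : 0); // correct objection plus type bonus
}
```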
4 Context Compression and Agent Memory
4.1 The Context Window Problem
A full trial transcript can reach 50,000+ tokens over 100 turns. Naively including the full transcript in each agent prompt would either exceed context limits or consume budget that should be allocated to reasoning. We address this with a testimony state abstraction that replaces the raw transcript.
4.2 Testimony State
The TestimonyState object is a structured summary maintained incrementally after each exchange.
Each EstablishedFact is derived from witness testimony through a fact extraction pipeline:
- Confirmatory answer detection: Short answers (“Yes”, “That’s correct”) are identified via regex pattern matching and enriched with the question premise to produce a complete factual statement (e.g., "Witness confirmed: the ship was traveling at 22.5 knots").
- Substantive answer decomposition: Longer answers are split into individual sentences, filtered to remove non-fact statements (“I don’t know”, “I’m not sure”), and prefixed with witness attribution.
- Elicit linking: Extracted facts are matched against scenario elicits using the same keyword/semantic pipeline described in Section 3, and tagged with the matched elicit ID and weight.
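The first two pipeline steps can be sketched as below. The original TestimonyState definition is not reproduced here, so the interface fields, the regular expressions, and the "Witness stated:" prefix are assumptions made for illustration; elicit linking is omitted. The sketch also uses the question text itself as the premise, which a real pipeline would presumably rephrase.

```typescript
// Hypothetical shapes; the real TestimonyState may hold more fields.
interface EstablishedFact {
  statement: string;
  witnessId: string;
  turn: number;
  elicitId?: string; // set later by elicit linking (not shown)
}
interface TestimonyState {
  facts: EstablishedFact[];
  questionsAsked: string[];
}

const CONFIRMATORY_ANSWER = /^(yes|yeah|correct|right|(yes,?\s*)?that['’]?s (correct|right))[.!]?$/i;
const NON_FACT_STATEMENT = /\b(i don['’]?t know|i['’]?m not sure|i (can['’]?t|cannot|don['’]?t) recall)\b/i;

function extractFactStatements(question: string, answer: string): string[] {
  const trimmed = answer.trim();
  if (CONFIRMATORY_ANSWER.test(trimmed)) {
    // Enrich a bare confirmation with the premise of the question.
    const premise = question.trim().replace(/\?+$/, "");
    return ["Witness confirmed: " + premise];
  }
  // Split on sentence-ending punctuation followed by whitespace.
  return trimmed
    .replace(/([.!?])\s+/g, "$1\n")
    .split("\n")
    .map((s) => s.trim())
    .filter((s) => s.length > 0 && !NON_FACT_STATEMENT.test(s))
    .map((s) => "Witness stated: " + s);
}

// Returns a new state; the previous state is left untouched.
function recordExchange(
  state: TestimonyState,
  witnessId: string,
  turn: number,
  question: string,
  answer: string,
): TestimonyState {
  const newFacts = extractFactStatements(question, answer).map((statement) => ({
    statement,
    witnessId,
    turn,
  }));
  return {
    facts: state.facts.concat(newFacts),
    questionsAsked: state.questionsAsked.concat([question]),
  };
}
```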
4.3 Question Deduplication
A critical failure mode in long sessions is the OCA asking semantically identical questions. We implement a two-layer deduplication system:
Layer 1: Lexical Similarity (Jaccard)
Questions are normalized (stop-word removal, word sorting for order independence) and compared using Jaccard similarity over word sets:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where $A$ and $B$ are the normalized word sets of the two questions.
Layer 2: Topic Categorization
Questions are classified into predefined topic categories (weather, speed, collision, observation, time, safety, etc.) using keyword matching, and a topic overlap score is computed between the category sets of the two questions.
Combined Score
The final similarity is a weighted combination of 60% lexical similarity and 40% topic overlap. A question is blocked if its combined score against any previously asked question for the same witness exceeds the blocking threshold. The 60/40 weighting reflects our empirical observation that lexical overlap is a stronger signal than topic co-occurrence for detecting paraphrased questions, while topic overlap catches cases where different words address the same subject.
4.4 Agent-Specific Memory Projections
The testimony state is not presented uniformly to all agents. Instead, each agent receives a projection tailored to its role:
- OCA receives: topics already covered, questions already asked (as a “do not repeat” list), elicits remaining to target, and an examination plan derived from unestablished elicits or rebuttal items.
- Witness receives: their own prior statements (for consistency), confirmed facts, and denied facts — constrained to the current witness’s record only.
- Judge receives: prior rulings in the session, for consistency in judicial temperament.
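The OCA projection's "do not repeat" list depends on the deduplication score of Section 4.3, which can be sketched as follows. The 60/40 weights come from the text; the stop-word list, the topic keyword table, the use of Jaccard for topic overlap, and the 0.7 blocking threshold are assumptions chosen for illustration.

```typescript
// Illustrative vocabulary (assumption).
const QUESTION_STOP_WORDS = ["the", "a", "an", "was", "were", "is", "did", "do", "you", "how", "what", "of", "to", "in", "at", "on"];
const TOPIC_KEYWORDS: { [topic: string]: string[] } = {
  weather: ["fog", "rain", "visibility", "weather"],
  speed: ["speed", "fast", "knots"],
  time: ["time", "when", "hour"],
};

// Normalized, de-duplicated, sorted content words of a question.
function uniqueWords(question: string): string[] {
  const words: string[] = question.toLowerCase().match(/[a-z0-9]+/g) || [];
  const kept = words.filter((w) => QUESTION_STOP_WORDS.indexOf(w) === -1);
  return kept.filter((w, i) => kept.indexOf(w) === i).sort();
}

// Topic categories whose keywords appear in the question.
function topicsOf(question: string): string[] {
  const words = uniqueWords(question);
  return Object.keys(TOPIC_KEYWORDS).filter((topic) =>
    TOPIC_KEYWORDS[topic].some((k) => words.indexOf(k) !== -1),
  );
}

// Jaccard similarity over two duplicate-free lists; two empty lists score 0.
function jaccard(a: string[], b: string[]): number {
  if (a.length === 0 && b.length === 0) return 0;
  const shared = a.filter((x) => b.indexOf(x) !== -1).length;
  return shared / (a.length + b.length - shared);
}

// 60% lexical similarity plus 40% topic overlap.
function questionSimilarity(a: string, b: string): number {
  const lexical = jaccard(uniqueWords(a), uniqueWords(b));
  const topical = jaccard(topicsOf(a), topicsOf(b));
  return 0.6 * lexical + 0.4 * topical;
}

// True when the candidate is too close to any question already asked.
function isRepeatQuestion(candidate: string, asked: string[], threshold: number = 0.7): boolean {
  return asked.some((q) => questionSimilarity(candidate, q) > threshold);
}
```

Because word order is discarded, "How fast was the ship going?" and "Was the ship going fast?" reduce to the same word set and are treated as the same question.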
5 Retrieval-Augmented Generation
5.1 Dual-Purpose RAG Architecture
The system maintains two independent vector stores, both using pgvector with HNSW indexing (cosine distance) on 1536-dimensional embeddings:
| Store | Table | Purpose | Index |
|---|---|---|---|
| Evaluation RAG | eval_embeddings | Few-shot quality steering | HNSW cosine |
| Case Memory | case_memory_chunks | Per-session testimony recall | HNSW cosine |
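Both stores are searched by cosine distance. The sketch below shows how a lookup against case_memory_chunks might be expressed with pgvector's `<=>` cosine-distance operator, converting distance to similarity as 1 minus distance. The column names (`claim_text`, `turn_number`, `session_id`, `witness_id`, `embedding`) and helper names are hypothetical; only the table name and the operator come from the document and from pgvector.

```typescript
// pgvector accepts vectors as text literals of the form "[v1,v2,...]".
function toVectorLiteral(embedding: number[]): string {
  return "[" + embedding.join(",") + "]";
}

// Cosine distance is 1 - cosine similarity, so the conversion is symmetric.
function distanceToSimilarity(cosineDistance: number): number {
  return 1 - cosineDistance;
}

// Parameterized query text; the column names are assumptions for illustration.
// $1 = query vector literal, $2 = session id, $3 = witness id, $4 = row limit.
const CASE_MEMORY_QUERY = [
  "SELECT claim_text, turn_number, 1 - (embedding <=> $1::vector) AS similarity",
  "FROM case_memory_chunks",
  "WHERE session_id = $2 AND witness_id = $3",
  "ORDER BY embedding <=> $1::vector",
  "LIMIT $4",
].join("\n");

// Bundles the query text with its positional parameters.
function buildCaseMemoryQuery(
  queryEmbedding: number[],
  sessionId: string,
  witnessId: string,
  limit: number,
): { text: string; values: (string | number)[] } {
  return {
    text: CASE_MEMORY_QUERY,
    values: [toVectorLiteral(queryEmbedding), sessionId, witnessId, limit],
  };
}
```

Ordering by the raw distance expression, rather than by the computed similarity, is what allows an HNSW index to serve the query.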
5.2 Evaluation RAG: The Quality Feedback Loop
The evaluation RAG pipeline implements a closed-loop system in which human quality judgments are recycled as agent steering signals:
Ingestion:
- Human raters evaluate agent responses on a 1–5 star scale with optional textual feedback.
- Upon batch completion, each rated response is embedded using text-embedding-3-small.
- Embeddings are stored with metadata: agent type, rating, response text, context, scenario, phase, and pedagogical flags (isIntentionallyDefective, isIntentionallyIncorrect).
Retrieval:
- Before generating an agent response, the system embeds the current conversation context.
- Vector similarity search retrieves the most similar previously-rated examples, split into positive (rating ≥ 4) and negative (rating ≤ 2) sets.
- Examples are formatted as “GOOD EXAMPLES (aim for this quality)” and “EXAMPLES TO AVOID” sections and injected into the agent’s system prompt.
Configuration:
- examplesPerType: Number of positive/negative examples to retrieve (default: 2)
- minRatingsForRetrieval: Minimum total embeddings before RAG activates (default: 10)
- similarityThreshold: Cosine similarity floor for retrieval (default: 0.70)
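The retrieval step can be sketched as a filter over candidates that already carry a similarity score. The three parameter names and defaults come from the configuration above, and the positive/negative split follows the 4–5 star and 1–2 star convention of Section 6.6. The `RatedExample` shape, the function names, and the exact prompt layout are assumptions.

```typescript
interface RatedExample {
  responseText: string;
  rating: number; // 1-5 stars
  similarity: number; // cosine similarity to the current context
}
interface EvalRagConfig {
  examplesPerType: number;
  minRatingsForRetrieval: number;
  similarityThreshold: number;
}
interface SelectedExamples {
  positive: RatedExample[];
  negative: RatedExample[];
}

const DEFAULT_EVAL_RAG_CONFIG: EvalRagConfig = {
  examplesPerType: 2,
  minRatingsForRetrieval: 10,
  similarityThreshold: 0.7,
};

function selectExamples(
  candidates: RatedExample[],
  totalEmbeddings: number,
  config: EvalRagConfig = DEFAULT_EVAL_RAG_CONFIG,
): SelectedExamples {
  // RAG stays inactive until enough rated responses have been embedded.
  if (totalEmbeddings < config.minRatingsForRetrieval) {
    return { positive: [], negative: [] };
  }
  const ranked = candidates
    .filter((c) => c.similarity >= config.similarityThreshold)
    .sort((a, b) => b.similarity - a.similarity);
  return {
    positive: ranked.filter((c) => c.rating >= 4).slice(0, config.examplesPerType),
    negative: ranked.filter((c) => c.rating <= 2).slice(0, config.examplesPerType),
  };
}

// Renders the two sections injected into the agent's system prompt.
function formatExamples(selected: SelectedExamples): string {
  const sections: string[] = [];
  if (selected.positive.length > 0) {
    sections.push(
      "GOOD EXAMPLES (aim for this quality)\n" +
        selected.positive.map((e) => "- " + e.responseText).join("\n"),
    );
  }
  if (selected.negative.length > 0) {
    sections.push(
      "EXAMPLES TO AVOID\n" + selected.negative.map((e) => "- " + e.responseText).join("\n"),
    );
  }
  return sections.join("\n\n");
}
```

Three-star responses are deliberately left out of both sets in this sketch, since they carry no clear steering signal.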
5.3 Case Memory: Per-Session Vector Recall
Each witness answer is ingested into a per-session vector memory table:
- Atomic claim extraction: The answer is decomposed into individual factual statements (currently a single-claim approach; LLM-based decomposition is planned).
- Embedding generation: Each claim is embedded using text-embedding-3-small.
- Storage: Claims are stored with session ID, witness ID, phase, turn number, and linked elicit/fact IDs.
Rebuttal items derived from this memory are classified by scope (within_direct_scope vs. credibility) and tracked for coverage during cross-examination.
6 Evaluation Framework
6.1 Design Philosophy
Evaluating a multi-agent pedagogical system presents a challenge absent from standard LLM benchmarking: the system must sometimes be deliberately wrong. A witness who fabricates facts has failed. An opposing counsel who raises a spurious objection may have succeeded, if the pedagogical intent was to test the student’s ability to recognize bad objections. This dual objective means that a single quality metric is insufficient; evaluation must be conditioned on the agent’s intended behavior.
6.2 G-Eval Implementation
We implement G-Eval (Liu et al., 2023) as our automated evaluation framework. For each agent response, an evaluator LLM (GPT-4o-mini, for reproducibility) is prompted with:
- Criteria definition: A natural language description of what constitutes quality for this metric.
- Evaluation steps: An ordered checklist the evaluator should follow (optional, metric-dependent).
- Evaluation data: The input, actual output, and any reference context (affidavit, established facts).
- Scoring instructions: Return a JSON object with score (0–1 continuous) and reason (1–3 sentence rationale).
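The four prompt components and the JSON contract can be sketched as a prompt builder plus a strict parser. The interface and function names are hypothetical, the prompt wording is illustrative, and treating `score >= threshold` as a pass is an assumption about how the per-metric thresholds are applied.

```typescript
interface GEvalMetric {
  name: string;
  criteria: string;
  steps?: string[]; // optional ordered checklist
  threshold: number;
}
interface GEvalCase {
  input: string;
  actualOutput: string;
  retrievalContext?: string; // e.g. the witness affidavit
}
interface GEvalResult {
  score: number;
  reason: string;
  passed: boolean;
}

function buildGEvalPrompt(metric: GEvalMetric, testCase: GEvalCase): string {
  const sections: string[] = ["Criteria: " + metric.criteria];
  if (metric.steps && metric.steps.length > 0) {
    const numbered = metric.steps.map((s, i) => i + 1 + ". " + s).join("\n");
    sections.push("Evaluation steps:\n" + numbered);
  }
  sections.push("Input: " + testCase.input, "Actual output: " + testCase.actualOutput);
  if (testCase.retrievalContext) {
    sections.push("Reference context: " + testCase.retrievalContext);
  }
  sections.push('Return a JSON object with "score" (0 to 1) and "reason" (1-3 sentences).');
  return sections.join("\n\n");
}

// Rejects malformed evaluator output instead of silently coercing it.
function parseGEvalResult(raw: string, metric: GEvalMetric): GEvalResult {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Evaluator output is not a JSON object");
  }
  const { score, reason } = parsed as { score?: unknown; reason?: unknown };
  if (typeof score !== "number" || isNaN(score) || score < 0 || score > 1) {
    throw new Error("score must be a number in [0, 1]");
  }
  if (typeof reason !== "string") {
    throw new Error("reason must be a string");
  }
  return { score, reason, passed: score >= metric.threshold };
}
```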
6.3 Metric Taxonomy
We define 13 metrics across three agent types:
6.3.1 Witness Metrics (4)
| Metric | Criteria | Threshold | Params |
|---|---|---|---|
| AffidavitFaithfulness | Response contains ONLY facts traceable to the affidavit. Penalizes fabrication, contradiction, and over-confidence on uncovered topics. | 0.70 | input, output, retrieval_context |
| BehavioralCompliance | Response matches configured personality profile along three axes: cooperativeness (hostile→eager), verbosity (terse→verbose), memory quality (poor→excellent). Conditioned on examination type (direct vs. cross). | 0.60 | input, output |
| ResponseAuthenticity | Response sounds like a real human witness — natural hesitations, appropriate emotion, first-person perspective. Penalizes robotic language, perfect rehearsed answers, meta-commentary. | 0.60 | input, output |
| WitnessAnswerRelevancy | Response directly addresses the question asked. Accommodates appropriate evasiveness for hostile witnesses. | 0.60 | input, output |
AffidavitFaithfulness receives the witness affidavit as retrieval_context, enabling the evaluator to perform a claim-by-claim verification against the source document. The evaluation steps explicitly instruct the evaluator to:
- List all factual claims in the response
- Check each claim against the affidavit
- Flag unverifiable claims
- Check for contradictions
- Evaluate appropriate use of “I don’t know”
6.3.2 Judge Metrics (4)
| Metric | Criteria | Threshold | Params |
|---|---|---|---|
| RulingCorrectness | Ruling is legally correct under the FRE. Evaluator is provided a comprehensive FRE reference covering hearsay (801–802), leading questions (611), relevance (401–402), foundation (602, 901), speculation (701), character evidence (404), scope of cross (611(b)), and best evidence (1002). | 0.70 | input, output |
| RuleCitation | Ruling cites the correct FRE rule number for the objection type. | 0.50 | input, output |
| JudicialDemeanor | Response is impartial, decisive, professional. No favoritism, lecturing, sarcasm, or character-breaking. | 0.60 | output |
| JudgeResponseFormat | Response follows JSON schema {ruling, reason}. | 0.50 | output |
6.3.3 Opposing Counsel Metrics (5)
| Metric | Criteria | Threshold | Params |
|---|---|---|---|
| OCATaskCompletion | Proper execution of mode-appropriate action (objection decision or question formulation). Conditioned on shouldBeIncorrect flag for intentional error cases. | 0.60 | input, output |
| OCAStrategicQuality | Action is strategically sound from an adversarial perspective — worth making, well-timed, professionally credible. | 0.50 | input, output, context |
| OCAPedagogicalValue | Action provides learning opportunity. For intentional errors: the defect should be recognizable but not obvious. For correct actions: demonstrates proper technique. | 0.50 | input, output |
| OCAResponseFormat | Response follows expected JSON schema for the current mode. | 0.50 | output |
| OCAFactGrounding | Response references only established facts. Does not assume unestablished facts or reference unadmitted evidence. | 0.60 | output |
OCATaskCompletion explicitly changes its evaluation criteria based on the shouldBeIncorrect flag. When the flag is true, the evaluator is instructed to score based on whether the agent successfully executed an incorrect objection, not whether it made a correct one. This is a departure from standard LLM evaluation, where correctness is uniformly desirable.
6.4 Human Evaluation Pipeline
Human evaluation follows a batched workflow:
- Session Selection: Completed sessions are identified as evaluation candidates. Sessions are filtered by recency (daysBack) and optionally by scenario or agent type.
- Dataset Construction: The DatasetBuilder extracts evaluation test cases from raw transcripts. For each agent response in the transcript, it constructs an EvalTestCase with the preceding question/objection as input, the agent’s response as actualOutput, and a 5-event context window as context. Witness test cases include the affidavit as retrievalContext.
- Rating Collection: Human raters evaluate responses on a 1–5 star scale with optional textual feedback. Ratings are stored with full provenance: session ID, transcript index, agent type, phase, and witness ID.
- Embedding & RAG Ingestion: Upon batch completion, rated responses are embedded and stored in the eval_embeddings table for future RAG retrieval. Responses tagged as intentionally defective/incorrect are flagged for pedagogical filtering.
6.5 Correlation Analysis
We validate automated metrics against human ratings using two complementary statistics:
Pearson Correlation measures linear agreement:
$$r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}}$$
where $x_i$ is the automated score and $y_i$ is the normalized human rating (mapped from the 1–5 star scale to the 0–1 range of the automated score).
Spearman Rank Correlation measures monotonic agreement without assuming linearity. We compute ranks for both automated and human scores, then apply Pearson correlation to the rank vectors. This captures cases where automated and human scores agree on relative ordering even if the absolute scale differs.
We report both statistics because they capture different failure modes:
- High Pearson, low Spearman → scores are linearly related but with rank inversions at the extremes
- Low Pearson, high Spearman → scores agree on ordering but with non-linear scaling
- Automated score higher than human → possible metric leniency
- Human score higher than automated → possible metric stringency or human appreciation of factors not captured by rubrics
| Pearson | Interpretation | Action |
|---|---|---|
|  | Strong correlation | Automated metrics are reliable; safe for RAG ingestion |
|  | Moderate correlation | Metrics partially capture human preferences; threshold tuning recommended |
|  | Weak correlation | Metrics need significant revision; review domain-specific criteria |
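Both statistics are short enough to sketch directly. Spearman is computed as Pearson over rank vectors, as described in Section 6.5, with tied values given their average rank. The function names are hypothetical, and the `(stars - 1) / 4` normalization is an assumption: it is the linear map that sends 1 to 0 and 5 to 1.

```typescript
// Assumed linear normalization of a 1-5 star rating onto [0, 1].
function normalizeRating(stars: number): number {
  return (stars - 1) / 4;
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Pearson correlation; returns NaN when either series has zero variance.
function pearson(x: number[], y: number[]): number {
  if (x.length !== y.length || x.length < 2) {
    throw new Error("Need two equal-length series with at least two points");
  }
  const mx = mean(x);
  const my = mean(y);
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < x.length; i++) {
    cov += (x[i] - mx) * (y[i] - my);
    varX += (x[i] - mx) * (x[i] - mx);
    varY += (y[i] - my) * (y[i] - my);
  }
  if (varX === 0 || varY === 0) return NaN;
  return cov / Math.sqrt(varX * varY);
}

// 1-based ranks; tied values share the average of their positions.
function averageRanks(values: number[]): number[] {
  const order = values.map((v, i) => ({ v, i })).sort((a, b) => a.v - b.v);
  const ranks: number[] = values.map(() => 0);
  let start = 0;
  while (start < order.length) {
    let end = start;
    while (end + 1 < order.length && order[end + 1].v === order[start].v) end++;
    const avgRank = (start + end) / 2 + 1;
    for (let k = start; k <= end; k++) ranks[order[k].i] = avgRank;
    start = end + 1;
  }
  return ranks;
}

function spearman(x: number[], y: number[]): number {
  return pearson(averageRanks(x), averageRanks(y));
}
```

A monotonic but non-linear relationship, such as automated scores paired with their cubes, yields a Spearman value of 1 and a Pearson value below 1, which is the "low Pearson, high Spearman" case.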
6.6 The Feedback Loop
The evaluation pipeline implements a closed-loop improvement cycle:
- Validation: A batch is reviewed. If its correlation with human ratings is strong (Section 6.5), the batch is deemed reliable for RAG ingestion.
- Embedding: Rated responses are encoded into 1536-dimensional vectors with full metadata.
- Retrieval: When an agent generates a new response, the system retrieves the most similar positive (4–5 star) and negative (1–2 star) examples.
- Steering: Examples are injected into the system prompt as structured “Do” and “Don’t” guidance sections.
6.7 Continuous Monitoring
A nightly cron job (2:00 AM UTC) runs automated evaluation on the 10 most recent completed sessions, storing results and computing correlation against any available human ratings. This provides:
- Continuous regression detection for prompt changes
- Longitudinal quality tracking per metric
- Early warning for model degradation (e.g., after provider model updates)
7 Vector Memory and Cross-Examination Planning
7.1 Memory Chunking Pipeline
Each witness answer is processed through a memory ingestion pipeline that operates asynchronously (fire-and-forget) to avoid blocking the main turn loop:
- Claim Extraction: Atomic factual claims are identified from the answer.
- Embedding: Each claim is embedded using text-embedding-3-small.
- Storage: Claims are stored in case_memory_chunks with session scoping, witness attribution, phase, turn number, and linked elicit IDs.
- Confidence Scoring: Each claim receives a confidence score (0–1) based on extraction quality.
Retrieval uses pgvector cosine distance (the <=> operator), with the similarity conversion: similarity = 1 − distance.
7.2 Cross-Examination Outline Generation
At the phase transition from direct to cross-examination, the system generates a structured cross-examination outline:
- Memory Query: All testimony chunks from the direct examination phase for the current witness are retrieved.
- Similarity Search: Key claims are compared against the witness’s affidavit and prior session memory to identify inconsistencies.
- LLM Analysis: An LLM processes the direct testimony and produces RebuttalItem objects, each classified by:
  - Scope: within_direct_scope (addressable under FRE 611(b)) or credibility (always permissible)
  - Type: inconsistent_testimony, contradiction_with_prior_session, contradicts_evidence, prior_inconsistent_statement, impeachment_opportunity, bias_indicator
- Coverage Tracking: During cross-examination, each player question is matched against rebuttal items using semantic similarity. Covered items earn bonus points and are marked with the covering party and turn number.
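The RebuttalItem classification and the coverage-tracking step can be sketched as follows. The scope and type values come from the list above; the remaining field names, the injected `similarity` function (standing in for embedding similarity), and the 0.75 coverage threshold are assumptions.

```typescript
type RebuttalScope = "within_direct_scope" | "credibility";
type RebuttalType =
  | "inconsistent_testimony"
  | "contradiction_with_prior_session"
  | "contradicts_evidence"
  | "prior_inconsistent_statement"
  | "impeachment_opportunity"
  | "bias_indicator";

interface RebuttalItem {
  id: string;
  description: string;
  scope: RebuttalScope;
  type: RebuttalType;
  coveredBy?: string; // covering party, once covered
  coveredAtTurn?: number;
}

// Marks every uncovered item that the question addresses.
// `similarity` is injected so any scoring backend can be plugged in.
function markCovered(
  items: RebuttalItem[],
  question: string,
  party: string,
  turn: number,
  similarity: (a: string, b: string) => number,
  threshold: number = 0.75,
): RebuttalItem[] {
  return items.map((item) => {
    if (item.coveredBy !== undefined) return item; // keep the first coverage record
    if (similarity(question, item.description) < threshold) return item;
    return { ...item, coveredBy: party, coveredAtTurn: turn };
  });
}
```

Returning a new array keeps the update free of side effects, so the same outline can be scored against several questions independently.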
8 Scenario Generation Pipeline
The system includes an LLM-powered pipeline for generating trial scenarios from uploaded legal documents:
- Content Extraction: Documents (PDF, DOCX) are processed through a multi-path extractor:
  - Text-based PDFs → MuPDF native extraction
  - Scanned PDFs → Smart OCR fallback (if words-per-page ratio is below threshold, pages are sent as base64 images to a vision model, with client-side Tesseract.js as a secondary path)
  - Word documents → Mammoth extraction
- Fact Extraction: An LLM processes the raw text to identify key facts, parties, claims, and legal theories.
- Scenario Generation: A second LLM call produces the full scenario structure: witnesses with affidavits and behavioral profiles, elicits with weights and categories, case theories, exhibits, and learning objectives.
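The extractor routing in the first step can be sketched as a small decision function. The words-per-page threshold of 50 and the `visionAvailable` switch that selects the Tesseract.js path are assumptions; the document states only that a threshold exists and that Tesseract.js is a secondary path.

```typescript
type ExtractionPath = "mupdf_text" | "vision_ocr" | "tesseract_ocr" | "mammoth";

interface UploadedDocument {
  kind: "pdf" | "docx";
  pageCount: number;
  extractedWordCount: number; // words recovered by native text extraction
}

function chooseExtractionPath(
  doc: UploadedDocument,
  visionAvailable: boolean = true,
  minWordsPerPage: number = 50, // hypothetical threshold
): ExtractionPath {
  if (doc.kind === "docx") return "mammoth";
  const wordsPerPage = doc.pageCount > 0 ? doc.extractedWordCount / doc.pageCount : 0;
  // Enough native text: treat the PDF as text-based.
  if (wordsPerPage >= minWordsPerPage) return "mupdf_text";
  // Sparse text suggests a scan: fall back to OCR.
  return visionAvailable ? "vision_ocr" : "tesseract_ocr";
}
```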
9 Observability
All LLM calls are instrumented with OpenTelemetry spans via Langfuse:
- Route-level tracing: withLangfuseTrace() wraps API handlers with trace context, recording session ID, user ID, and input.
- Call-level telemetry: buildTelemetryConfig() tags each streamText/generateText call with a function ID that maps to the generation name in Langfuse (e.g., courtroom/turn-stream, g-eval-AffidavitFaithfulness).
- Agent debug logging: An optional debug mode (?agentDebug=true) embeds agent-level debug events directly into the response stream, enabling real-time inspection of OCA decisions, scoring computations, and memory operations.
10 Limitations and Future Work
10.1 Current Limitations
- Atomic claim extraction: The memory chunking pipeline currently treats the full witness answer as a single claim. LLM-based decomposition into truly atomic statements would improve cross-examination outline quality and rebuttal coverage tracking.
- Static topic categories: Question deduplication uses a fixed set of topic keywords, which may not generalize across diverse legal domains (e.g., criminal law vs. maritime law). An embedding-based topic classification would be more robust.
- Evaluation sample size: Correlation analysis requires paired ratings for reliable statistics. Early-stage deployments may not have sufficient human ratings to validate automated metrics.
- Context enrichment latency: The advanced pronoun resolution step adds an LLM call to the scoring pipeline, increasing turn latency by 500–1500ms. This is currently opt-in.
- Single-judge evaluation: G-Eval uses a single evaluator model (GPT-4o-mini). Multi-judge ensembles with majority voting could improve reliability, particularly for subjective metrics like ResponseAuthenticity.
10.2 Future Directions
- Reward modeling: Replace G-Eval with trained reward models fine-tuned on accumulated human ratings, reducing per-evaluation cost and improving domain specificity.
- Adaptive difficulty: Use student performance trajectories to automatically adjust intentionalErrorRate, witness cooperativeness, and scenario complexity, implementing a form of intelligent tutoring system within the multi-agent loop.
- Multi-session memory: Extend vector memory across sessions to enable longitudinal student skill tracking and personalized scenario recommendations.
- Adversarial robustness: Evaluate susceptibility to prompt injection through player questions designed to break agent role adherence, and develop guardrails.
References
- Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv:2303.16634.
- Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., & Farajtabar, M. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. arXiv:2506.06941.
- Federal Rules of Evidence, 28 U.S.C. (2024).
LitigationLabs, 2026. Internal Technical Report.