Multi-Agent Adversarial Dialogue for Pedagogical Legal Simulation
System design and evaluation methodology for a multi-agent courtroom simulation with hybrid scoring, RAG-based quality steering, and G-Eval validation.
We present the design and evaluation methodology for a multi-agent adversarial dialogue system that simulates courtroom trial proceedings for legal pedagogy. The system coordinates three specialized language model agents — an opposing counsel, a judicial officer, and a witness — within a turn-based game loop that requires long-horizon consistency, domain-grounded generation, and pedagogically calibrated difficulty. We introduce a hybrid scoring architecture that combines keyword-based term overlap, embedding-based semantic similarity, and context-aware pronoun resolution to evaluate student performance against structured factual targets (elicits). We describe a dual-purpose retrieval-augmented generation (RAG) pipeline that serves both as a few-shot steering mechanism for agent quality and as a per-session vector memory for cross-examination planning. We detail our evaluation framework, which implements G-Eval (Liu et al., 2023) with 13 domain-specific metrics across three agent types, validated through Pearson and Spearman correlation analysis against human expert ratings. We find that the interplay between intentional agent errors (pedagogical traps) and faithful simulation creates a novel tension in evaluation design: metrics must reward both correctness and deliberate incorrectness depending on the pedagogical context. We discuss the implications of this dual objective for LLM-as-judge reliability in educational domains.
Trial advocacy is a skill learned through repetition against adversarial opponents. Law students in the United States spend years studying the Federal Rules of Evidence (FRE), yet courtroom experience remains scarce — moot court competitions are infrequent, and supervised clinical rotations are limited. The gap between doctrinal knowledge and procedural fluency is well-documented in the legal education literature.

We address this gap with an interactive multi-agent simulation in which a student (the “player”) conducts direct and cross-examination of AI witnesses while an AI opposing counsel raises objections and an AI judge rules on them. The system must satisfy several constraints simultaneously:
Factual grounding. Witnesses must testify only to facts contained in their sworn affidavit. Fabrication is a failure mode, not a feature.
Adversarial fidelity. Opposing counsel must behave as a competent attorney — not a trivially beatable strawman — while occasionally introducing deliberate errors to test the student’s recognition skills.
Long-horizon consistency. A trial session may span 50–100 turns. Witnesses must not contradict their prior testimony. Opposing counsel must not re-ask questions already answered.
Measurable learning outcomes. Student performance must be quantified against structured factual targets, not merely tracked as token-level engagement.
These requirements place the system at the intersection of multi-agent coordination, constrained generation, retrieval-augmented memory, and domain-specific evaluation — areas where large language models excel individually but whose composition remains underexplored. Our contributions are as follows:
A multi-agent orchestration architecture that coordinates three LLM agents within a structured turn protocol, using mode-dependent prompt dispatch to support four distinct operational regimes.
A three-tier scoring system for measuring semantic entailment between free-form witness testimony and structured factual targets, with polarity-aware weighting for adversarial examination dynamics.
A context compression scheme that replaces raw transcript windows with semantically structured testimony state, enabling arbitrarily long trial sessions within fixed context budgets.
A closed-loop evaluation pipeline in which human expert ratings are embedded, retrieved via vector similarity, and injected as few-shot exemplars into agent prompts — creating a self-improving feedback loop.
A domain-specific evaluation framework with 13 G-Eval metrics, correlation-validated against human expert judgment using both Pearson and Spearman rank statistics.
The system is structured as a turn-based game loop in which each player utterance triggers a sequential pipeline of agent calls. The orchestrator (lib/courtroom/orchestrator.ts) manages three primary flows:
Player Examination Flow (handleTurn): Player asks a question → OCA evaluates for objectionability → Judge rules (if objection raised) → Witness answers (if not sustained) → Score computed.
OCA Examination Flow (handleOCAExaminationTurn): OCA generates a question → Player evaluates for objectionability → Judge rules → Witness answers → Score computed.
Objection Resolution Flow (handlePlayerObjection): Player objects to OCA’s question → Judge rules → Score adjusted.
All three flows share a common post-processing pipeline: testimony state compression, vector memory ingestion, and session persistence.
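A minimal sketch of the player-examination flow under these conventions; the agent interfaces and helpers below are illustrative stand-ins, not the orchestrator's actual API:

```ts
interface Event { role: "player" | "judge" | "witness"; text: string }
interface Session { testimonyState: unknown; activeElicits: unknown[] }

declare const oca: {
  evaluateQuestion(q: string, state: unknown): Promise<{ raised: boolean; ground?: string }>;
};
declare const judge: {
  rule(objection: unknown, q: string): Promise<{ ruling: "sustained" | "overruled"; reason: string }>;
};
declare const witness: { answer(q: string, state: unknown): Promise<string> };
declare function scoreAgainstElicits(answer: string, elicits: unknown[]): Promise<number>;
declare function postProcess(session: Session, events: Event[]): Promise<void>; // compression + memory + persistence

async function handleTurn(session: Session, playerQuestion: string) {
  const events: Event[] = [{ role: "player", text: playerQuestion }];

  // 1. OCA screens the question for objectionable defects.
  const objection = await oca.evaluateQuestion(playerQuestion, session.testimonyState);

  // 2. The judge rules only if an objection was raised.
  let sustained = false;
  if (objection.raised) {
    const ruling = await judge.rule(objection, playerQuestion);
    events.push({ role: "judge", text: ruling.reason });
    sustained = ruling.ruling === "sustained";
  }

  // 3. The witness answers unless the objection was sustained; the answer is scored.
  let score = 0;
  if (!sustained) {
    const answer = await witness.answer(playerQuestion, session.testimonyState);
    events.push({ role: "witness", text: answer });
    score = await scoreAgainstElicits(answer, session.activeElicits);
  }

  // 4. Shared post-processing: state compression, memory ingestion, persistence.
  await postProcess(session, events);
  return { events, score };
}
```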
Each agent is parameterized by an AgentConfig stored in a CMS-backed database, enabling hot-swappable prompt templates, model selection, and behavioral parameters without code deployment:
| Agent | Model | Key Parameters | Output Format |
| --- | --- | --- | --- |
| Opposing Counsel (OCA) | Configurable (default: GPT-4.1 via gateway) | intentionalErrorRate, objectionTypes[], temperature | Mode-dependent JSON schema |
The opposing counsel agent operates in four modes, determined by a policy function determineOCAMode() that examines the current trial phase, the witness’s party alignment, and the player’s side:
| Mode | Phase Context | OCA Behavior |
| --- | --- | --- |
| objection_user_direct | Player conducts direct examination | OCA monitors for FRE violations |
| objection_user_cross | Player conducts cross-examination | OCA monitors for scope/relevance violations |
| oc_direct | OCA conducts direct examination of own witness | OCA asks questions; player monitors |
| oc_cross | OCA cross-examines player's witness | OCA asks impeachment questions |
This dispatch mechanism is functionally a policy selector over a discrete action space, where the state is the tuple (phase, witness_side, player_side). The selected mode determines the system prompt template, output schema, and post-processing pipeline.
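A sketch of the dispatch as a pure policy function over that state tuple (the side and phase encodings here are assumed):

```ts
type Side = "plaintiff" | "defense";
type Phase = "direct" | "cross";
type OCAMode = "objection_user_direct" | "objection_user_cross" | "oc_direct" | "oc_cross";

function determineOCAMode(phase: Phase, witnessSide: Side, playerSide: Side): OCAMode {
  // On direct, the witness's own side examines; on cross, the opposing side does.
  const opposing: Side = witnessSide === "plaintiff" ? "defense" : "plaintiff";
  const examiner: Side = phase === "direct" ? witnessSide : opposing;

  if (examiner === playerSide) {
    // Player asks; OCA monitors and may object.
    return phase === "direct" ? "objection_user_direct" : "objection_user_cross";
  }
  // OCA asks; player monitors and may object.
  return phase === "direct" ? "oc_direct" : "oc_cross";
}
```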
A distinctive design requirement is that OCA must sometimes produce incorrect outputs on purpose. The intentionalErrorRate parameter (default: 30%) controls how often OCA generates:
Defective questions (in examination mode): Leading questions on direct, hearsay-laden prompts, or questions assuming facts not in evidence.
Incorrect objections (in objection mode): Objecting to perfectly proper questions on spurious grounds.
This is analogous to an exploration strategy in curriculum learning — the agent deliberately introduces noise to force the student to develop error-detection skills. The system tags these outputs with is_intentionally_defective or is_intentionally_incorrect flags for downstream scoring and evaluation filtering.
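A sketch of how a trap turn might be sampled and tagged; intentionalErrorRate comes from the AgentConfig above, while the surrounding shapes are assumed:

```ts
interface OCAOutput { text: string; is_intentionally_defective: boolean }

function maybeInjectDefect(
  config: { intentionalErrorRate: number }, // default 0.30
  generate: (defective: boolean) => string  // prompt template varies with the flag
): OCAOutput {
  // Bernoulli draw: with probability intentionalErrorRate, instruct the model
  // to produce a defective question (or a spurious objection, depending on mode).
  const defective = Math.random() < config.intentionalErrorRate;
  // Tag the output for downstream scoring and evaluation filtering.
  return { text: generate(defective), is_intentionally_defective: defective };
}
```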
The core scoring problem is: given a free-form witness answer $a$ and a set of structured factual targets (elicits) $E = \{e_1, e_2, \ldots, e_n\}$, determine which elicits are semantically entailed by the answer. Each elicit $e_i$ has:
A natural language label $\ell_i$ (e.g., “The ship was traveling at 22.5 knots”)
A signed weight $w_i \in \mathbb{R}$ indicating polarity
A unique identifier for deduplication
An elicit is considered unlocked when the witness’s answer sufficiently entails its label. The scoring function must handle paraphrase, partial entailment, and pronominal reference — the witness rarely states facts in the exact language of the elicit.
The baseline approach extracts key terms from both the answer and elicit label after stop-word removal, then computes a coverage ratio:

$$\text{score}_{\text{kw}}(a, e_i) = \frac{|\text{terms}(a) \cap \text{terms}(\ell_i)| + 0.5 \cdot |\text{fuzzy}(a, \ell_i)|}{|\text{terms}(\ell_i)|}$$

where $\text{fuzzy}(a, \ell_i)$ counts substring-matched terms with partial credit. The default threshold is $\tau_{\text{kw}} = 0.30$ (30% of key terms must match).
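A minimal sketch of this Tier 1 computation, assuming a toy stop-word list and a simple substring rule for fuzzy matching:

```ts
// Tier 1: keyword coverage ratio (sketch; the stop-word list and fuzzy rule
// are simplified assumptions, not the production implementation).
const STOP_WORDS = new Set(["the", "a", "an", "was", "is", "at", "of", "to", "and"]);

function terms(text: string): Set<string> {
  return new Set(
    text.toLowerCase().split(/\W+/).filter((w) => w.length > 2 && !STOP_WORDS.has(w))
  );
}

function keywordScore(answer: string, elicitLabel: string): number {
  const a = terms(answer);
  const e = terms(elicitLabel);
  if (e.size === 0) return 0;

  let exact = 0;
  let fuzzy = 0;
  for (const t of e) {
    if (a.has(t)) exact++;
    // Substring matches ("traveling" vs. "travel") earn 0.5 partial credit.
    else if ([...a].some((w) => w.includes(t) || t.includes(w))) fuzzy++;
  }
  return (exact + 0.5 * fuzzy) / e.size; // matched at Tier 1 if >= 0.30
}
```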
The production approach generates embeddings for both the answer and all candidate elicits using text-embedding-3-small (1536 dimensions), then computes cosine similarity:

$$\text{score}_{\text{sem}}(a, e_i) = \frac{v_a \cdot v_{e_i}}{\lVert v_a \rVert \cdot \lVert v_{e_i} \rVert}$$

The system uses a hybrid decision rule: an elicit is matched if either the semantic score exceeds $\tau_{\text{sem}} = 0.40$ or the keyword score exceeds $\tau_{\text{kw}} = 0.30$. This OR-gate design ensures that lexically similar but semantically distant matches (proper nouns, dates) and semantically equivalent but lexically different matches (paraphrases) are both captured.

Elicit embeddings are cached permanently in memory (they are static per scenario), while answer embeddings use a 60-second TTL cache to avoid redundant API calls within a turn. A “strong match” threshold at $\tau_{\text{strong}} = 0.60$ provides a confidence signal used in downstream reporting.
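The OR-gate decision rule itself is compact; a sketch, assuming an embed() helper that wraps the cached embedding calls and the keywordScore() from Tier 1:

```ts
// Tier 2: hybrid semantic + keyword matching (sketch; embed() is an assumed
// wrapper around text-embedding-3-small with the caching policy above).
const TAU_SEM = 0.40;
const TAU_KW = 0.30;
const TAU_STRONG = 0.60;

declare function embed(text: string): Promise<number[]>;
declare function keywordScore(answer: string, label: string): number;

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function matchElicit(answer: string, elicitLabel: string) {
  const [va, ve] = await Promise.all([embed(answer), embed(elicitLabel)]);
  const sem = cosineSimilarity(va, ve);
  return {
    matched: sem >= TAU_SEM || keywordScore(answer, elicitLabel) >= TAU_KW, // OR-gate
    strong: sem >= TAU_STRONG, // confidence signal for downstream reporting
  };
}
```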
Tier 3: Context-Aware Matching with Pronoun Resolution
The most advanced mode addresses a specific failure mode: short, pronominal witness answers like “Yes, that’s correct” or “He did” that are semantically empty without the preceding question context. When advancedMatching.enabled = true, the system:
Extracts the k most recent question-answer exchanges from the transcript.
Calls an LLM to resolve pronouns and references in the answer, producing an enriched version (e.g., “Yes, that’s correct” → “The ship was traveling at 22.5 knots at the time of the collision”).
Runs semantic matching on the enriched text.
This is functionally a coreference resolution step, implemented as an LLM call rather than a traditional NLP pipeline, which handles the domain-specific reference patterns common in courtroom testimony.
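A sketch of the enrichment call, using the AI SDK's generateText (which the system already uses for agent generation); the prompt wording and model choice here are assumptions:

```ts
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Resolve pronouns/references in a short answer using recent Q&A context.
async function enrichAnswer(
  recentExchanges: { q: string; a: string }[],
  answer: string
): Promise<string> {
  const context = recentExchanges.map((x) => `Q: ${x.q}\nA: ${x.a}`).join("\n");
  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    prompt:
      `Rewrite the final answer as a standalone factual statement, resolving ` +
      `all pronouns and references using the preceding exchanges.\n\n` +
      `${context}\n\nFinal answer: "${answer}"\n\nStandalone statement:`,
  });
  return text.trim();
}
```

The enriched text then flows through the same Tier 2 matching as any other answer.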
Elicit weights carry sign information that encodes which trial side benefits from the fact being established:

$$w_i > 0 \implies \text{benefits the witness's side (active during direct)}$$
$$w_i < 0 \implies \text{benefits the opposing side (active during cross)}$$

The scoring function filters elicits by polarity before matching:

$$E_{\text{active}} = \begin{cases} \{e_i \in E : w_i \ge 0\} & \text{if direct examination} \\ \{e_i \in E : w_i < 0\} & \text{if cross-examination} \end{cases}$$

The absolute value $|w_i|$ determines point value. This design reflects a core trial advocacy principle: on direct examination, an attorney seeks to establish favorable facts (positive elicits), while on cross-examination, the goal is to elicit unfavorable admissions (negative elicits).
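The polarity filter reduces to a sign test over weights; a sketch with an assumed Elicit shape:

```ts
interface Elicit { id: string; label: string; weight: number }

// Positive weights are active on direct, negative weights on cross;
// |weight| is the point value when the elicit is unlocked.
function activeElicits(all: Elicit[], phase: "direct" | "cross"): Elicit[] {
  return phase === "direct"
    ? all.filter((e) => e.weight >= 0)
    : all.filter((e) => e.weight < 0);
}
```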
Player objections to OCA questions are scored on a discrete scale:
| Outcome | Points | Condition |
| --- | --- | --- |
| Correct objection (sustained, question was defective) | +2 | wasSustained && wasDefective |
| Correct type bonus | +1 | Player identified exact defect type |
| Incorrect objection (proper question) | -1 | !wasDefective |
| Missed objection (defective question passed) | -1 | wasDefective && !objected |
| Overruled on defective question | 0 | Learning moment, no penalty |
This reward structure creates an asymmetric incentive: objecting carries risk (potential -1) but higher reward (+2 or +3), while passing is safe only when the question is proper. The 0-point “overruled on defective” case is a deliberate design choice — it acknowledges that the student recognized an issue but couldn’t articulate a winning objection, which has pedagogical value.
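The full table collapses to a small pure function; a sketch with an assumed outcome record:

```ts
interface ObjectionOutcome {
  objected: boolean;
  wasSustained: boolean;
  wasDefective: boolean;
  correctTypeIdentified: boolean;
}

function scoreObjection(o: ObjectionOutcome): number {
  if (o.objected) {
    if (o.wasSustained && o.wasDefective) {
      return 2 + (o.correctTypeIdentified ? 1 : 0); // correct objection + type bonus
    }
    if (!o.wasDefective) return -1; // objected to a proper question
    return 0; // overruled on a defective question: learning moment, no penalty
  }
  return o.wasDefective ? -1 : 0; // missed a defective question vs. correctly passing
}
```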
A full trial transcript can reach 50,000+ tokens over 100 turns. Naively including the full transcript in each agent prompt would either exceed context limits or consume budget that should be allocated to reasoning. We address this with a testimony state abstraction that replaces the raw transcript.
The TestimonyState object is a structured summary maintained incrementally after each exchange:
```
TestimonyState := {
  playerEstablishedFacts: EstablishedFact[]   // Facts proven by player
  ocaEstablishedFacts:    EstablishedFact[]   // Facts proven by OCA
  questionsAsked:         QuestionRecord[]    // Compressed question log
  witnessRecords:         Map<WitnessId, WitnessTestimonyRecord>
  admissions:             string[]            // Key contradictions
}
```
Each EstablishedFact is derived from witness testimony through a fact extraction pipeline:
Confirmatory answer detection: Short answers (“Yes”, “That’s correct”) are identified via regex pattern matching and enriched with the question premise to produce a complete factual statement (e.g., "Witness confirmed: the ship was traveling at 22.5 knots").
Substantive answer decomposition: Longer answers are split into individual sentences, filtered for non-fact statements (“I don’t know”, “I’m not sure”), and prefixed with witness attribution.
Elicit linking: Extracted facts are matched against scenario elicits using the same keyword/semantic pipeline described in Section 3, and tagged with the matched elicit ID and weight.
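A sketch of steps 1–2, with illustrative regexes (the production patterns are more extensive):

```ts
const CONFIRMATORY = /^(yes|yeah|correct|that'?s (right|correct)|i (do|did))\b/i;
const NON_FACT = /^(i don'?t know|i'?m not sure|i can'?t recall)/i;

function extractFacts(question: string, answer: string, witnessName: string): string[] {
  if (CONFIRMATORY.test(answer.trim())) {
    // Enrich a bare confirmation with the question's premise.
    const premise = question
      .replace(/^(isn'?t it true that|would you agree that)\s*/i, "")
      .replace(/\?$/, "");
    return [`Witness confirmed: ${premise}`];
  }
  // Decompose substantive answers into sentences, dropping non-facts.
  return answer
    .split(/(?<=[.!?])\s+/)
    .filter((s) => s.length > 0 && !NON_FACT.test(s))
    .map((s) => `${witnessName} testified: ${s}`);
}
```

Step 3 (elicit linking) then runs each extracted fact through the keyword/semantic matcher described in Section 3.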
A critical failure mode in long sessions is the OCA asking semantically identical questions. We implement a two-layer deduplication system.

Layer 1: Lexical Similarity (Jaccard). Questions are normalized (stop-word removal, word sorting for order independence) and compared using Jaccard similarity over word sets:

$$J(q_1, q_2) = \frac{|W_{q_1} \cap W_{q_2}|}{|W_{q_1} \cup W_{q_2}|}$$

Layer 2: Topic Categorization. Questions are classified into predefined topic categories (weather, speed, collision, observation, time, safety, etc.) using keyword matching. Topic overlap is computed:

$$T(q_1, q_2) = \frac{|\text{topics}(q_1) \cap \text{topics}(q_2)|}{\max(|\text{topics}(q_1)|, |\text{topics}(q_2)|)}$$

Combined Score. The final similarity uses a weighted combination:

$$\text{sim}(q_1, q_2) = 0.6 \cdot J(q_1, q_2) + 0.4 \cdot T(q_1, q_2)$$

A question is blocked if $\text{sim} \ge 0.65$ against any previously asked question for the same witness. The 60/40 weighting reflects our empirical observation that lexical overlap is a stronger signal than topic co-occurrence for detecting paraphrased questions, while topic overlap catches cases where different words address the same subject.
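A sketch of both layers, assuming a toy topic-keyword map:

```ts
const TOPIC_KEYWORDS: Record<string, string[]> = {
  weather: ["weather", "rain", "fog", "visibility"],
  speed: ["speed", "knots", "fast", "slow"],
  collision: ["collision", "impact", "struck"],
};

const normalize = (q: string): Set<string> =>
  new Set(q.toLowerCase().split(/\W+/).filter((w) => w.length > 2));

function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = [...a].filter((w) => b.has(w)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : inter / union;
}

function topicsOf(words: Set<string>): Set<string> {
  return new Set(
    Object.entries(TOPIC_KEYWORDS)
      .filter(([, kws]) => kws.some((k) => words.has(k)))
      .map(([topic]) => topic)
  );
}

function topicOverlap(t1: Set<string>, t2: Set<string>): number {
  const denom = Math.max(t1.size, t2.size);
  return denom === 0 ? 0 : [...t1].filter((t) => t2.has(t)).length / denom;
}

function isDuplicate(q1: string, q2: string): boolean {
  const w1 = normalize(q1), w2 = normalize(q2);
  const sim = 0.6 * jaccard(w1, w2) + 0.4 * topicOverlap(topicsOf(w1), topicsOf(w2));
  return sim >= 0.65; // block threshold
}
```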
The testimony state is not presented uniformly to all agents. Instead, each agent receives a projection tailored to its role:
OCA receives: topics already covered, questions already asked (as a “do not repeat” list), elicits remaining to target, and an examination plan derived from unestablished elicits or rebuttal items.
Witness receives: their own prior statements (for consistency), confirmed facts, and denied facts — constrained to the current witness’s record only.
Judge receives: prior rulings in the session, for consistency in judicial temperament.
This selective projection is analogous to an attention mask — each agent sees only the subset of accumulated state relevant to its generation task, preventing cross-contamination of role-specific information.
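A sketch of the projection dispatch, with simplified state shapes (the actual TestimonyState fields are richer, and prior rulings may be tracked separately):

```ts
interface TestimonyStateView {
  questionsAsked: { witnessId: string; text: string; topics: string[] }[];
  witnessRecords: Map<string, { statements: string[]; confirmed: string[]; denied: string[] }>;
  priorRulings: { objectionType: string; ruling: "sustained" | "overruled" }[];
}

function projectFor(role: "oca" | "witness" | "judge", s: TestimonyStateView, witnessId: string) {
  switch (role) {
    case "oca": {
      const asked = s.questionsAsked.filter((q) => q.witnessId === witnessId);
      return {
        topicsCovered: [...new Set(asked.flatMap((q) => q.topics))],
        doNotRepeat: asked.map((q) => q.text),
        // examination plan from unestablished elicits omitted for brevity
      };
    }
    case "witness": {
      // Constrained to the current witness's own record, for consistency.
      const rec = s.witnessRecords.get(witnessId);
      return {
        priorStatements: rec?.statements ?? [],
        confirmedFacts: rec?.confirmed ?? [],
        deniedFacts: rec?.denied ?? [],
      };
    }
    case "judge":
      // Prior rulings only, for consistent judicial temperament.
      return { priorRulings: s.priorRulings };
  }
}
```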
The evaluation RAG pipeline implements a closed-loop system in which human quality judgments are recycled as agent steering signals.

Ingestion:
Human raters evaluate agent responses on a 1–5 star scale with optional textual feedback.
Upon batch completion, each rated response is embedded using text-embedding-3-small.
Embeddings are stored with metadata: agent type, rating, response text, context, scenario, phase, and pedagogical flags (isIntentionallyDefective, isIntentionallyIncorrect).
Retrieval:
Before generating an agent response, the system embeds the current conversation context.
Vector similarity search retrieves the k most similar previously-rated examples, split into positive (rating ≥4) and negative (rating ≤2) sets.
Examples are formatted as “GOOD EXAMPLES (aim for this quality)” and “EXAMPLES TO AVOID” sections and injected into the agent’s system prompt.
Pedagogical Filtering: A critical design decision is that the retrieval pipeline excludes intentionally defective/incorrect examples by default. When OCA is generating a pedagogical trap turn, the filter is inverted — the system retrieves only examples tagged as intentionally defective, providing few-shot guidance for what a “good bad question” looks like. Without this filter, the RAG pipeline would conflate genuine quality failures with deliberate pedagogical errors. (A sketch of the retrieval step appears after the configuration list below.)

Configuration (from CMS global):
examplesPerType: Number of positive/negative examples to retrieve (default: 2)
minRatingsForRetrieval: Minimum total embeddings before RAG activates (default: 10)
similarityThreshold: Cosine similarity floor for retrieval (default: 0.70)
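A sketch of retrieval and prompt injection, assuming a searchEmbeddings() helper over the embeddings store; the thresholds mirror the CMS defaults above:

```ts
interface RatedExample { responseText: string; rating: number; similarity: number }

declare function searchEmbeddings(
  queryEmbedding: number[],
  opts: {
    agentType: string;
    minSimilarity: number;
    limit: number;
    intentionallyDefective: boolean; // pedagogical filter
  }
): Promise<RatedExample[]>;

async function buildSteeringSection(
  contextEmbedding: number[],
  agentType: string,
  isTrapTurn: boolean // inverts the pedagogical filter on trap turns
): Promise<string> {
  const hits = await searchEmbeddings(contextEmbedding, {
    agentType,
    minSimilarity: 0.70, // similarityThreshold
    limit: 20,
    intentionallyDefective: isTrapTurn,
  });
  const good = hits.filter((h) => h.rating >= 4).slice(0, 2); // examplesPerType
  const bad = hits.filter((h) => h.rating <= 2).slice(0, 2);

  return [
    "GOOD EXAMPLES (aim for this quality):",
    ...good.map((e) => `- ${e.responseText}`),
    "EXAMPLES TO AVOID:",
    ...bad.map((e) => `- ${e.responseText}`),
  ].join("\n");
}
```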
Each witness answer is ingested into a per-session vector memory table:
Atomic claim extraction: The answer is decomposed into individual factual statements (currently a single-claim approach; LLM-based decomposition is planned).
Embedding generation: Each claim is embedded using text-embedding-3-small.
Storage: Claims are stored with session ID, witness ID, phase, turn number, and linked elicit/fact IDs.
This memory is queried at phase transitions (direct → cross) to generate cross-examination outlines — structured plans that identify contradictions, impeachment opportunities, and rebuttal items from direct testimony. Rebuttal items are classified by FRE 611(b) scope (within_direct_scope vs. credibility) and tracked for coverage during cross-examination.
Evaluating a multi-agent pedagogical system presents a challenge absent from standard LLM benchmarking: the system must sometimes be deliberately wrong. A witness who fabricates facts has failed. An opposing counsel who raises a spurious objection may have succeeded — if the pedagogical intent was to test the student’s ability to recognize bad objections. This dual objective means that a single quality metric is insufficient; evaluation must be conditioned on the agent’s intended behavior.
We implement G-Eval (Liu et al., 2023) as our automated evaluation framework. For each agent response, an evaluator LLM (GPT-4o-mini, T=0 for reproducibility) is prompted with:
Criteria definition: A natural language description of what constitutes quality for this metric.
Evaluation steps: An ordered checklist the evaluator should follow (optional, metric-dependent).
Evaluation data: The input, actual output, and any reference context (affidavit, established facts).
Scoring instructions: Return a JSON object with score (0–1 continuous) and reason (1–3 sentence rationale).
Scores are clamped to [0,1] with a configurable pass/fail threshold per metric.
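A sketch of a single metric evaluation using the AI SDK's generateObject with a Zod schema; the prompt scaffold follows the four components above, and the specific wording is an assumption:

```ts
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const GEvalResult = z.object({
  score: z.number().min(0).max(1),
  reason: z.string(),
});

async function runGEval(
  metric: { name: string; criteria: string; steps?: string[]; threshold: number },
  data: { input: string; actualOutput: string; retrievalContext?: string }
) {
  const { object } = await generateObject({
    model: openai("gpt-4o-mini"),
    temperature: 0, // reproducibility
    schema: GEvalResult,
    prompt: [
      `Criteria: ${metric.criteria}`,
      metric.steps
        ? `Evaluation steps:\n${metric.steps.map((s, i) => `${i + 1}. ${s}`).join("\n")}`
        : "",
      `Input: ${data.input}`,
      `Actual output: ${data.actualOutput}`,
      data.retrievalContext ? `Reference context: ${data.retrievalContext}` : "",
      `Return a JSON object with "score" (0-1, continuous) and "reason" (1-3 sentences).`,
    ].filter(Boolean).join("\n\n"),
  });
  const score = Math.min(1, Math.max(0, object.score)); // clamp to [0, 1]
  return { metric: metric.name, score, reason: object.reason, passed: score >= metric.threshold };
}
```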
The four witness metrics:

| Metric | Criteria | Threshold | Evaluation Params |
| --- | --- | --- | --- |
| AffidavitFaithfulness | Response contains ONLY facts traceable to the affidavit. Penalizes fabrication, contradiction, and over-confidence on uncovered topics. | 0.70 | input, output, retrieval_context |
| BehavioralCompliance | Response matches configured personality profile along three axes: cooperativeness (hostile→eager), verbosity (terse→verbose), memory quality (poor→excellent). Conditioned on examination type (direct vs. cross). | 0.60 | input, output |
| ResponseAuthenticity | Response sounds like a real human witness — natural hesitations, appropriate emotion, first-person perspective. Penalizes robotic language, perfect rehearsed answers, meta-commentary. | 0.60 | input, output |
| WitnessAnswerRelevancy | Response directly addresses the question asked. Accommodates appropriate evasiveness for hostile witnesses. | 0.60 | input, output |
The BehavioralCompliance metric is noteworthy because it is dynamically constructed: the evaluator prompt is parameterized by the witness's configured cooperativeness, verbosity, and memory quality levels, producing a different evaluation rubric for each witness profile. This is necessary because a terse answer from a “verbose” witness is a failure, while the same answer from a “terse” witness is correct behavior.

The AffidavitFaithfulness metric uses the witness's affidavit as retrieval_context, enabling the evaluator to perform a claim-by-claim verification against the source document. The evaluation steps explicitly instruct the evaluator to check each claim in the response against the affidavit before assigning a score.
The four judge metrics:

| Metric | Criteria | Threshold | Evaluation Params |
| --- | --- | --- | --- |
| RulingCorrectness | Ruling is legally correct under the FRE. The evaluator is provided a comprehensive FRE reference covering hearsay (801–802), leading questions (611), relevance (401–402), foundation (602, 901), speculation (701), character evidence (404), scope of cross (611(b)), and best evidence (1002). | 0.70 | input, output |
| RuleCitation | Ruling cites the correct FRE rule number for the objection type. | 0.50 | input, output |
| JudicialDemeanor | Response is impartial, decisive, professional. No favoritism, lecturing, sarcasm, or character-breaking. | 0.60 | output |
| JudgeResponseFormat | Response follows the JSON schema {ruling, reason}. | 0.50 | output |
The RulingCorrectness metric embeds a full FRE reference document in the evaluation prompt, providing the evaluator with ground-truth legal rules. The evaluation prompt also includes the specific objection type and questioned text, enabling the evaluator to determine the correct ruling before scoring the agent’s output. The evaluation steps guide the evaluator through a legal reasoning chain: identify objection → recall applicable rule → analyze the question → determine correct ruling → compare to agent’s ruling → assess reasoning quality.
The five OCA metrics:

| Metric | Criteria | Threshold | Evaluation Params |
| --- | --- | --- | --- |
| OCATaskCompletion | Proper execution of the mode-appropriate action (objection decision or question formulation). Conditioned on the shouldBeIncorrect flag for intentional-error cases. | 0.60 | input, output |
| OCAStrategicQuality | Action is strategically sound from an adversarial perspective — worth making, well-timed, professionally credible. | 0.50 | input, output, context |
| OCAPedagogicalValue | Action provides a learning opportunity. For intentional errors: the defect should be recognizable but not obvious. For correct actions: demonstrates proper technique. | 0.50 | input, output |
| OCAResponseFormat | Response follows the expected JSON schema for the current mode. | 0.50 | output |
| OCAFactGrounding | Response references only established facts. Does not assume unestablished facts or reference unadmitted evidence. | 0.60 | output |
The OCA metrics demonstrate the pedagogical-correctness tension: OCATaskCompletion explicitly changes its evaluation criteria based on the shouldBeIncorrect flag. When the flag is true, the evaluator is instructed to score based on whether the agent successfully executed an incorrect objection, not whether it made a correct one. This is a departure from standard LLM evaluation, where correctness is uniformly desirable.
Session Selection: Completed sessions are identified as evaluation candidates. Sessions are filtered by recency (daysBack) and optionally by scenario or agent type.
Dataset Construction: The DatasetBuilder extracts evaluation test cases from raw transcripts. For each agent response in the transcript, it constructs an EvalTestCase with the preceding question/objection as input, the agent’s response as actualOutput, and a 5-event context window as context. Witness test cases include the affidavit as retrievalContext.
Rating Collection: Human raters evaluate responses on a 1–5 star scale with optional textual feedback. Ratings are stored with full provenance: session ID, transcript index, agent type, phase, and witness ID.
Embedding & RAG Ingestion: Upon batch completion, rated responses are embedded and stored in the eval_embeddings table for future RAG retrieval. Responses tagged as intentionally defective/incorrect are flagged for pedagogical filtering.
We validate automated metrics against human ratings using two complementary statistics.

Pearson correlation measures linear agreement:

$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - (\sum x_i)^2\right)\left(n\sum y_i^2 - (\sum y_i)^2\right)}}$$

where $x_i$ is the automated score and $y_i$ is the normalized human rating (mapped from $[1,5]$ to $[0,1]$ via $y_i = (h_i - 1)/4$).

Spearman rank correlation measures monotonic agreement without assuming linearity. We compute ranks for both automated and human scores, then apply Pearson correlation to the rank vectors. This captures cases where automated and human scores agree on relative ordering even if the absolute scale differs.

We report both statistics because they capture different failure modes (a sketch of both computations follows the list below):
High Pearson, low Spearman → scores are linearly related but with rank inversions at the extremes
Low Pearson, high Spearman → scores agree on ordering but with non-linear scaling
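Both statistics are straightforward to compute; a sketch (ties in the rank computation are broken by sort order rather than averaged, a simplification):

```ts
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const sx = x.reduce((a, b) => a + b, 0);
  const sy = y.reduce((a, b) => a + b, 0);
  const sxy = x.reduce((a, xi, i) => a + xi * y[i], 0);
  const sx2 = x.reduce((a, xi) => a + xi * xi, 0);
  const sy2 = y.reduce((a, yi) => a + yi * yi, 0);
  const denom = Math.sqrt((n * sx2 - sx * sx) * (n * sy2 - sy * sy));
  return denom === 0 ? 0 : (n * sxy - sx * sy) / denom;
}

function ranks(v: number[]): number[] {
  const r = new Array<number>(v.length);
  v.map((val, i) => ({ val, i }))
    .sort((a, b) => a.val - b.val)
    .forEach(({ i }, rank) => { r[i] = rank + 1; });
  return r;
}

// Spearman = Pearson applied to rank vectors.
const spearman = (x: number[], y: number[]) => pearson(ranks(x), ranks(y));

// Human ratings are normalized from [1, 5] to [0, 1] before comparison.
const normalizeHuman = (h: number) => (h - 1) / 4;
```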
Outlier Detection: We identify divergent cases where $|x_i - y_i| \ge 0.30$ (on the normalized scale). Each outlier is analyzed for possible explanations:
Automated score higher than human → possible metric leniency
Human score higher than automated → possible metric stringency or human appreciation of factors not captured by rubrics
Interpretation Thresholds:
| Pearson $r$ | Interpretation | Action |
| --- | --- | --- |
| $\ge 0.70$ | Strong correlation | Automated metrics are reliable; safe for RAG ingestion |
| $\ge 0.40$ | Moderate correlation | Metrics partially capture human preferences; threshold tuning recommended |
| $< 0.40$ | Weak correlation | Metrics need significant revision; review domain-specific criteria |
Batch Regression Detection: The system supports comparing consecutive batch runs to detect metric regression. For each metric, a change of >5% is flagged as improvement or regression, enabling continuous monitoring of agent quality over prompt iterations.
The evaluation pipeline implements a closed-loop improvement cycle:
Human Ratings → Embedding → RAG Store → Agent Prompt Injection → New Responses → Human Ratings → ...
The cycle operates as follows:
Validation: A batch is reviewed. If $r_{\text{Pearson}} \ge 0.40$, the batch is deemed reliable for RAG ingestion.
Embedding: Rated responses are encoded into 1536-dimensional vectors with full metadata.
Retrieval: When an agent generates a new response, the system retrieves the k most similar positive (4–5 star) and negative (1–2 star) examples.
Steering: Examples are injected into the system prompt as structured “Do” and “Don’t” guidance sections.
This is functionally a few-shot selection mechanism where the shots are dynamically chosen based on semantic similarity to the current context, rather than statically defined. The approach has the advantage of automatically adapting to new scenarios and question types as the example corpus grows.
A nightly cron job (2:00 AM UTC) runs automated evaluation on the 10 most recent completed sessions, storing results and computing correlation against any available human ratings. This provides:
Continuous regression detection for prompt changes
Longitudinal quality tracking per metric
Early warning for model degradation (e.g., after provider model updates)
Each witness answer is processed through a memory ingestion pipeline that operates asynchronously (fire-and-forget) to avoid blocking the main turn loop:
Claim Extraction: Atomic factual claims are identified from the answer.
Embedding: Each claim is embedded using text-embedding-3-small.
Storage: Claims are stored in case_memory_chunks with session scoping, witness attribution, phase, turn number, and linked elicit IDs.
Confidence Scoring: Each claim receives a confidence score (0–1) based on extraction quality.
The HNSW index on this table uses cosine-distance operations (the <=> operator), with the similarity conversion $\text{sim} = 1 - \text{dist}$.
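An illustrative pgvector query over case_memory_chunks (column names beyond those described above are assumptions):

```ts
import postgres from "postgres";

const sql = postgres(process.env.DATABASE_URL!);

// Retrieve the k most similar testimony claims within one session.
// <=> is pgvector's cosine-distance operator; similarity = 1 - distance.
async function recallTestimony(sessionId: string, queryEmbedding: number[], k = 5) {
  const vec = `[${queryEmbedding.join(",")}]`;
  return sql`
    SELECT claim_text, witness_id, turn_number,
           1 - (embedding <=> ${vec}::vector) AS similarity
    FROM case_memory_chunks
    WHERE session_id = ${sessionId}
    ORDER BY embedding <=> ${vec}::vector
    LIMIT ${k}
  `;
}
```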
Coverage Tracking: During cross-examination, each player question is matched against rebuttal items using semantic similarity. Covered items earn bonus points and are marked with the covering party and turn number.
The system includes an LLM-powered pipeline for generating trial scenarios from uploaded legal documents:
Content Extraction: Documents (PDF, DOCX) are processed through a multi-path extractor:
Text-based PDFs → MuPDF native extraction
Scanned PDFs → Smart OCR fallback (if words-per-page ratio is below threshold, pages are sent as base64 images to a vision model, with client-side Tesseract.js as a secondary path)
Word documents → Mammoth extraction
Fact Extraction: An LLM processes the raw text to identify key facts, parties, claims, and legal theories.
Scenario Generation: A second LLM call produces the full scenario structure: witnesses with affidavits and behavioral profiles, elicits with weights and categories, case theories, exhibits, and learning objectives.
The pipeline runs with elevated resource limits (3008 MB RAM, 300-second timeout) due to the computational intensity of document processing and multi-stage LLM inference.
All LLM calls are instrumented with OpenTelemetry spans via Langfuse:
Route-level tracing: withLangfuseTrace() wraps API handlers with trace context, recording session ID, user ID, and input.
Call-level telemetry: buildTelemetryConfig() tags each streamText/generateText call with a function ID that maps to the generation name in Langfuse (e.g., courtroom/turn-stream, g-eval-AffidavitFaithfulness).
Agent debug logging: An optional debug mode (?agentDebug=true) embeds agent-level debug events directly into the response stream, enabling real-time inspection of OCA decisions, scoring computations, and memory operations.
Atomic claim extraction: The memory chunking pipeline currently treats the full witness answer as a single claim. LLM-based decomposition into truly atomic statements would improve cross-examination outline quality and rebuttal coverage tracking.
Static topic categories: Question deduplication uses a fixed set of topic keywords, which may not generalize across diverse legal domains (e.g., criminal law vs. maritime law). An embedding-based topic classification would be more robust.
Evaluation sample size: Correlation analysis requires n≥20 paired ratings for reliable statistics. Early-stage deployments may not have sufficient human ratings to validate automated metrics.
Context enrichment latency: The advanced pronoun resolution step adds an LLM call to the scoring pipeline, increasing turn latency by 500–1500ms. This is currently opt-in.
Single-judge evaluation: G-Eval uses a single evaluator model (GPT-4o-mini). Multi-judge ensembles with majority voting could improve reliability, particularly for subjective metrics like ResponseAuthenticity.
Reward modeling: Replace G-Eval with trained reward models fine-tuned on accumulated human ratings, reducing per-evaluation cost and improving domain specificity.
Adaptive difficulty: Use student performance trajectories to automatically adjust intentionalErrorRate, witness cooperativeness, and scenario complexity — implementing a form of intelligent tutoring system within the multi-agent loop.
Multi-session memory: Extend vector memory across sessions to enable longitudinal student skill tracking and personalized scenario recommendations.
Adversarial robustness: Evaluate susceptibility to prompt injection through player questions designed to break agent role adherence, and develop guardrails.
Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv:2303.16634.
Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., & Farajtabar, M. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. arXiv:2506.06941.