Overview

Langfuse captures the full prompt, messages, and model response for every agent call in CaseSim. While LitigationLabs doesn’t use Langfuse’s Playground feature directly, the generation data in Langfuse gives you everything you need to test and iterate on prompts.

Using Generations as a Playground

Every generation span in Langfuse contains:
  • The system prompt — The full instructions the agent received.
  • The message history — The conversation context sent to the model.
  • The model response — What the agent produced.
  • The model and parameters — Which model, temperature, and token settings were used.
This means any generation can serve as a starting point for experimentation.
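
If you want to pull that data programmatically rather than copy it from the UI, the sketch below reads a single generation through the Langfuse public API. It is a minimal sketch, assuming a Node environment with LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and optionally LANGFUSE_HOST set; the observation ID comes from the generation's detail view, and the response field names (input, output, model, modelParameters) follow the public API's observation schema.

// Sketch: fetch one generation from the Langfuse public API so its prompt,
// messages, and parameters can be reused as a playground starting point.
const LANGFUSE_HOST = process.env.LANGFUSE_HOST ?? "https://cloud.langfuse.com";
const auth = Buffer.from(
  `${process.env.LANGFUSE_PUBLIC_KEY}:${process.env.LANGFUSE_SECRET_KEY}`,
).toString("base64");

async function getGeneration(observationId: string) {
  const res = await fetch(`${LANGFUSE_HOST}/api/public/observations/${observationId}`, {
    headers: { Authorization: `Basic ${auth}` },
  });
  if (!res.ok) throw new Error(`Langfuse API error: ${res.status}`);
  const generation = await res.json();
  return {
    messages: generation.input,              // system prompt + message history sent to the model
    output: generation.output,               // what the agent produced
    model: generation.model,                 // model name
    parameters: generation.modelParameters,  // temperature, max tokens, etc.
  };
}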

Replaying a Prompt

  1. Open a trace in Langfuse and find the generation you want to test.
  2. Copy the system prompt and messages from the generation details.
  3. Paste them into any LLM playground (OpenAI Playground, Anthropic Console, or Langfuse’s built-in playground if enabled).
  4. Modify the prompt and re-run to see how the output changes.
  5. When satisfied, update the agent config in Payload to deploy the new prompt.
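
If you would rather script steps 3 and 4 than paste into a web playground, the sketch below replays a copied generation against the OpenAI chat completions API with an edited system prompt. The model name, temperature, and the example edit are illustrative; use whatever provider and settings the original generation recorded.

// Sketch: replay a generation's messages with an edited system prompt.
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Paste the values copied from the Langfuse generation details here.
const copiedSystemPrompt = "...";
const copiedMessages: { role: "user" | "assistant"; content: string }[] = [];
// Illustrative edit; replace with the change you want to test.
const editedSystemPrompt = copiedSystemPrompt + "\nAlways cite the relevant affidavit paragraph.";

async function replay(systemPrompt: string) {
  const completion = await client.chat.completions.create({
    model: "gpt-4o",   // use the model recorded on the generation
    temperature: 0.7,  // use the recorded parameters
    messages: [{ role: "system", content: systemPrompt }, ...copiedMessages],
  });
  return completion.choices[0].message.content;
}

const original = await replay(copiedSystemPrompt);
const revised = await replay(editedSystemPrompt);
console.log({ original, revised });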

Comparing Across Sessions

To evaluate a prompt change:
  1. Note the generation outputs for a specific scenario before the change.
  2. Apply the prompt update via the Prompt Editor.
  3. Run the same scenario again.
  4. Compare the new generations against the old ones in Langfuse.
Filter by trace name (e.g., courtroom/turn-stream) and date range to isolate before/after data.
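
The same filtering can be done through the Langfuse public API. The sketch below lists traces by name within a time window, once for the period before the prompt change and once for the period after; the trace name, timestamps, and page size are illustrative, and fromTimestamp/toTimestamp are the public API's time-window filters.

// Sketch: list before/after trace sets by name and date range from Langfuse.
const LANGFUSE_HOST = process.env.LANGFUSE_HOST ?? "https://cloud.langfuse.com";
const auth = Buffer.from(
  `${process.env.LANGFUSE_PUBLIC_KEY}:${process.env.LANGFUSE_SECRET_KEY}`,
).toString("base64");

async function listTraces(fromTimestamp: string, toTimestamp: string) {
  const params = new URLSearchParams({
    name: "courtroom/turn-stream",
    fromTimestamp,
    toTimestamp,
    limit: "50",
  });
  const res = await fetch(`${LANGFUSE_HOST}/api/public/traces?${params}`, {
    headers: { Authorization: `Basic ${auth}` },
  });
  const { data } = await res.json();
  return data;
}

// Illustrative time windows on either side of the prompt change.
const before = await listTraces("2024-06-01T00:00:00Z", "2024-06-03T00:00:00Z");
const after = await listTraces("2024-06-03T00:00:00Z", "2024-06-05T00:00:00Z");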

Testing with Evaluations

For more structured testing, use the evaluation system:
  1. Run an automated eval batch against sessions that used the old prompt.
  2. Update the prompt and run new sessions.
  3. Run another eval batch against the new sessions.
  4. Compare metric scores (Affidavit Faithfulness, Ruling Correctness, etc.) between the two batches.
This gives you quantitative evidence that a prompt change improved or degraded agent quality.
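
The comparison itself can be a small script. The sketch below averages each metric across two batches and prints the delta; the { metric, score } record shape is an assumption about what a batch result looks like, so adapt it to the actual eval output.

// Sketch: compare per-metric averages between two eval batches.
type EvalScore = { metric: string; score: number };

function averageByMetric(scores: EvalScore[]): Map<string, number> {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const { metric, score } of scores) {
    const t = totals.get(metric) ?? { sum: 0, count: 0 };
    totals.set(metric, { sum: t.sum + score, count: t.count + 1 });
  }
  return new Map([...totals].map(([m, t]) => [m, t.sum / t.count] as [string, number]));
}

function compareBatches(before: EvalScore[], after: EvalScore[]) {
  const oldAvgs = averageByMetric(before);
  const newAvgs = averageByMetric(after);
  for (const [metric, oldAvg] of oldAvgs) {
    const newAvg = newAvgs.get(metric);
    if (newAvg === undefined) continue;
    console.log(`${metric}: ${oldAvg.toFixed(2)} -> ${newAvg.toFixed(2)} (${(newAvg - oldAvg).toFixed(2)})`);
  }
}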

Single Response Testing

The evaluation API also supports testing individual responses:
PUT /api/evals/automated
Send a single agent response with its context and receive G-Eval scores back immediately. This is useful for spot-checking a new prompt against a known-good or known-bad example without running a full batch.
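
A minimal sketch of calling the endpoint from a script is shown below. The host, the request body fields (sessionId, agentResponse, context), and the shape of the returned scores are assumptions about the payload; check the evaluation API reference for the exact schema and add whatever authentication the API requires.

// Sketch: spot-check one agent response through the automated eval endpoint.
const API_BASE = process.env.CASESIM_API_URL ?? "http://localhost:3000"; // wherever the CaseSim API is served (illustrative)
const candidateResponse = "..."; // the output produced by the new prompt
const scenarioContext = "...";   // the affidavit / scenario context the agent saw

const res = await fetch(`${API_BASE}/api/evals/automated`, {
  method: "PUT",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    sessionId: "session-123",         // hypothetical field names; confirm against the API reference
    agentResponse: candidateResponse,
    context: scenarioContext,
  }),
});
const scores = await res.json(); // G-Eval scores for this single response
console.log(scores);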