Resampling & Replay¶

AgentLens provides four ways to re-run agent behavior, each at a different level of fidelity. They form a spectrum from cheap/fast (resample a single API call) to expensive/thorough (replay a full trajectory with tool execution).

For CLI flags and syntax, see the CLI Reference.

Overview¶

The spectrum¶

Cheapest / fastest                                    Most thorough
─────────────────────────────────────────────────────────────────────

  Turn resample     Intervention     Session resample   Turn replay
  (API only)        (edit + API)     (full session)     (branch mid-session)

  Same request,     Modified input,  Re-run a whole     Resume from any
  new response.     new response.    session N times.   turn with full
  No tools run.     No tools run.    Tools execute.     tool execution.

When to use what¶

I want to...	Method	Command
Check if the model would say the same thing again	Turn resample	`harness resample`
See what happens if the model had different thinking	Intervention	`harness resample-edit`
See what happens if a tool returned something different	Intervention	`harness resample-edit`
Compare N complete trajectories for the same task	Session resample	`harness resample-session`
Branch from a specific point and let the agent continue	Turn replay	`harness replay`
Test a prompt injection at a specific turn	Turn replay	`harness replay --prompt`

Comparison table¶

Property	Turn resample	Intervention	Session resample	Turn replay
Tools execute	No	No	Yes	Yes
Filesystem reset	No	No	Yes (fork point)	Yes (git worktree)
Parallel	Yes	Yes	Yes	Yes
Creates new run	No	No	No (appends replicates)	Yes
Editable inputs	No	Yes	No	Prompt only
Requires	`capture_api_requests`	`capture_api_requests`	`fork_from` session	`transcript.jsonl` and `.shadow_git/`

Turn-level resampling¶

Send the exact same API request again N times and save each response. This is stateless — no tools execute, no files change. Use it to quickly check how much variance exists at a specific decision point.

What you get: N alternative responses to the same context. Useful for measuring how deterministic the model is at a given turn — does it always pick the same tool? Always hedge the same way?

Requires: capture_api_requests: true in the original experiment config.

Discovering requests¶

$ harness resample runs/my-run --session 1 --list-requests

Session 1: 12 captured requests

    1  |  15 messages  |  claude-sonnet-4  |  Explore the project...
    2  |  17 messages  |  claude-sonnet-4  |  [tool_result for toolu_01H...]
   ...

Running¶

# Resample request 5 ten times
harness resample runs/my-run --session 1 --request 5 --count 10

# From a replicate session
harness resample runs/my-run --session 2 --replicate 3 --request 5 --count 5

Output¶

session_01/resamples/request_005/
├── sample_01.json
├── sample_02.json
└── ...
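
One way to quantify that variance is to tally which tool each sample calls first. The sketch below is illustrative and not part of the harness. It assumes each sample_NN.json holds a Messages-API-style response body: a top-level content list in which tool calls are blocks of type tool_use with a name field. Check the actual file layout before relying on it.

```python
import json
from collections import Counter
from pathlib import Path


def first_tool_name(response: dict) -> str:
    """Return the name of the first tool_use block, or "(no tool)"."""
    # Assumption: the response has a "content" list of typed blocks.
    for block in response.get("content", []):
        if isinstance(block, dict) and block.get("type") == "tool_use":
            return block.get("name", "(unnamed)")
    return "(no tool)"


def tool_choice_counts(resample_dir: str) -> Counter:
    """Tally the first tool chosen across every sample_*.json in a directory."""
    counts: Counter = Counter()
    for path in sorted(Path(resample_dir).glob("sample_*.json")):
        with path.open(encoding="utf-8") as f:
            counts[first_tool_name(json.load(f))] += 1
    return counts
```

Calling tool_choice_counts on a request_005 directory returns a Counter keyed by tool name. A single dominant key suggests the decision is stable at that turn; a spread suggests it is not.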

Intervention testing¶

Edit the conversation inputs — thinking blocks, text, tool results, or system prompt — then resample. This lets you test counterfactuals: "What would the model do differently if it had seen X instead of Y?"

Like turn-level resampling, this is stateless — no tools execute. But the input is modified before sending, so you can study causal effects.

What you can edit:

Thinking blocks — change the model's internal reasoning
Text responses — alter what the model said in prior turns
Tool results — change what a tool returned (e.g., different file contents)
System prompt — modify instructions

From the CLI¶

Three-step workflow: dump the request, edit it, then resample.

# 1. Dump the request to a file
harness resample-edit runs/my-run --session 1 --request 5 --dump > edit.json

# 2. Edit edit.json (change thinking, text, tool results, system prompt...)

# 3. Resample with the modified request
harness resample-edit runs/my-run --session 1 --request 5 \
  --input edit.json --label "removed hedging" --count 5

For scriptable interventions, pipe through jq:

harness resample-edit runs/my-run --session 1 --request 5 --dump \
  | jq '.messages[-1].content[0].thinking = "Be more direct."' \
  | harness resample-edit runs/my-run --session 1 --request 5 \
      --input - --label "direct thinking" --count 10

Batch across multiple requests:

for req in 3 5 7 9; do
  harness resample-edit runs/my-run --session 1 --request "$req" --dump \
    | jq '.messages[-1].content[0].thinking = "Skip exploration."' \
    | harness resample-edit runs/my-run --session 1 --request "$req" \
        --input - --label "skip-exploration" --count 5
done
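
Tool-result edits are awkward to express in jq because the block has to be located by its tool call ID. A minimal Python sketch of that edit follows. It assumes the dumped request uses the Messages API shape: a messages list whose content is a list of blocks, with tool results as blocks of type tool_result carrying a tool_use_id. The function name replace_tool_result is hypothetical and not provided by the harness.

```python
import copy


def replace_tool_result(request: dict, tool_use_id: str, new_text: str) -> dict:
    """Return a copy of the request with one tool_result's content replaced."""
    edited = copy.deepcopy(request)  # never mutate the dumped original
    matches = 0
    for message in edited.get("messages", []):
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content carries no tool results
        for block in content:
            if (
                isinstance(block, dict)
                and block.get("type") == "tool_result"
                and block.get("tool_use_id") == tool_use_id
            ):
                block["content"] = [{"type": "text", "text": new_text}]
                matches += 1
    if matches == 0:
        raise KeyError(f"no tool_result found for {tool_use_id}")
    return edited
```

A script built around this function could read the dump from stdin and write the edited request to stdout, so that it slots into the same pipeline position as jq above.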

From the web UI¶

1. Open a session's API captures.
2. Click "Edit & Resample" on any request.
3. Modify thinking blocks, text, tool results, or system prompts.
4. Resample with the modified input.

Output¶

Both CLI and web UI produce the same structure:

session_01/resamples/request_005_v01/
├── variant.json     # label + metadata
├── request.json     # modified request body
├── sample_01.json   # response to modified input
└── ...

CLI-created variants appear in the web UI and vice versa.
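
To get an overview of the variants recorded for one request, the directories can be enumerated directly. This sketch assumes variant directories follow the request_NNN_vNN pattern shown above and that variant.json stores the label under a "label" key. The key name is a guess, since the page only says the file holds "label + metadata".

```python
import json
from pathlib import Path


def list_variants(resamples_dir: str, request: int) -> list[tuple[str, str, int]]:
    """Return (directory name, label, sample count) for each variant of a request."""
    rows = []
    for variant_dir in sorted(Path(resamples_dir).glob(f"request_{request:03d}_v*")):
        meta_path = variant_dir / "variant.json"
        if not meta_path.is_file():
            continue  # not a variant directory
        with meta_path.open(encoding="utf-8") as f:
            meta = json.load(f)
        label = meta.get("label", "(unlabeled)")  # assumed key name
        sample_count = len(list(variant_dir.glob("sample_*.json")))
        rows.append((variant_dir.name, label, sample_count))
    return rows
```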

Session-level resampling¶

Re-run a full session N times from scratch. Unlike turn-level methods, this executes tools — each replicate is a complete agent session with real file reads, writes, and tool calls.

What you get: N complete trajectories for the same task, each potentially diverging from the first tool call onward. Useful for studying how much the agent's overall approach varies.

Requires: The session must have a fork_from target (or be in forked mode).

harness resample-session runs/my-run --session 2 --count 5

Each replicate runs in its own git worktree, so all 5 execute in parallel. New directories are appended with auto-incrementing replicate numbers (session_02_r01, session_02_r02, ...), and run_meta.json is updated. The source working directory is never modified.
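
The naming scheme can be illustrated with a small helper. How the harness actually picks the next number is not documented here, so this sketch simply assumes "highest existing replicate number plus one" and the zero-padded session_NN_rNN pattern from the example.

```python
import re


def next_replicate_name(existing: list[str], session: int) -> str:
    """Return the next session_NN_rNN directory name for a session."""
    pattern = re.compile(rf"^session_{session:02d}_r(\d+)$")
    used = []
    for name in existing:
        match = pattern.match(name)
        if match:
            used.append(int(match.group(1)))
    next_index = max(used, default=0) + 1  # assumed policy: max + 1
    return f"session_{session:02d}_r{next_index:02d}"
```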

Turn-level replay¶

Branch execution from any API turn with full tool execution and filesystem reset. This is the highest-fidelity method — the agent sees the exact same conversation context and filesystem state up to the branch point, then generates a fresh response that may diverge.

What you get: A new independent run where the agent resumed from a specific point. The agent can take completely different actions from that point forward, using real tools on a real filesystem.

Key difference from resampling: Resampling gives you alternative responses. Replay gives you alternative trajectories — the agent continues running with full tool execution, potentially for many more turns.

Requires: transcript.jsonl and .shadow_git/ in the source run.

How it works¶

For replay from turn N:

1. Transcript truncation: the original transcript is cut after turn N-1's assistant messages, before the tool results.
2. Filesystem reset via git worktrees: each replicate gets its own worktree, checked out from the source shadow git at the filesystem state as of turn N. Worktrees share the git object store (space efficient) but are fully isolated.
3. Tool result injection: the original tool results from turn N-1 are sent to the SDK, so the agent sees the exact same context.
4. Fresh response: the agent generates a new response (the branch point) and continues with full tool execution.
5. Parallel execution: when count > 1, all replicates run concurrently. Each operates in its own worktree, so there is no contention. The source working directory is never modified.
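
The truncation and injection steps can be sketched on a simplified transcript. The entry shape below (a "turn" number and a "kind" of "user", "assistant", or "tool_result", with the initial prompt at turn 0) is invented for illustration and is not the real transcript.jsonl schema.

```python
def split_for_replay(entries: list[dict], n: int) -> tuple[list[dict], list[dict]]:
    """Split a transcript for replay from turn n.

    Returns (kept, injected): kept is the truncated transcript, ending with
    turn n-1's assistant messages; injected is turn n-1's tool results, which
    are re-sent so the agent sees the same context before responding afresh.
    """
    if n < 1:
        raise ValueError("turns are numbered from 1")
    kept = [
        e for e in entries
        if e["turn"] < n - 1 or (e["turn"] == n - 1 and e["kind"] != "tool_result")
    ]
    injected = [
        e for e in entries
        if e["turn"] == n - 1 and e["kind"] == "tool_result"
    ]
    return kept, injected
```

Under this model, replaying from turn 1 keeps only the initial prompt and injects nothing, which matches the idea of re-running the session from scratch.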

Discovering turns¶

$ harness replay runs/my-run --session 1 --list-turns

Turns in session 1 (12 total):

  Turn 1: Read  (1 results)
  Turn 2: Read, Grep  (2 results)  [_step_1_3]
  Turn 3: Edit, Write  (2 results)  [_step_1_5]
  Turn 4: Bash  (1 results)  [_step_1_7]
  ...
  Turn 12: (no tools)

Bracketed tags (e.g. [_step_1_3]) indicate shadow git snapshots — turns where file writes were detected. The replay resets the filesystem to the nearest snapshot at or before the target turn.
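
The "nearest snapshot at or before the target turn" rule amounts to a sorted lookup. In this sketch the mapping from turn number to tag is assumed to have been parsed from the --list-turns output already; what the harness does when no snapshot precedes the target turn is not specified here, so the function returns None in that case.

```python
from bisect import bisect_right
from typing import Optional


def nearest_snapshot(snapshots: dict[int, str], target_turn: int) -> Optional[str]:
    """Return the tag of the latest snapshot at or before target_turn."""
    turns = sorted(snapshots)
    idx = bisect_right(turns, target_turn)  # count of snapshot turns <= target
    if idx == 0:
        return None  # no snapshot at or before the target turn
    return snapshots[turns[idx - 1]]


# Tags taken from the example listing above.
tags = {2: "_step_1_3", 3: "_step_1_5", 4: "_step_1_7"}
```

With these tags, a replay from turn 5 would resolve to _step_1_7, and a replay from turn 3 to _step_1_5.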

Running¶

# Replay from turn 5, three times (runs in parallel)
harness replay runs/my-run --session 1 --turn 5 --count 3

# Replay with an additional prompt after tool results
harness replay runs/my-run --session 1 --turn 5 --prompt "Try a different approach"

# Replay from turn 1 (re-run from scratch with same config)
harness replay runs/my-run --session 1 --turn 1 --count 2

# Replay session 1 turn 5, then continue with sessions 2..end
harness replay runs/my-run --session 1 --turn 5 --continue-sessions

When --continue-sessions is enabled, each replicate runs the replayed session first, then continues with the remaining sessions (N+1 through the last) from the original config, where N is the number of the replayed session.
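
As a small illustration of the session ordering, the helper below lists which sessions a single replicate would run. The function and its parameters are hypothetical; it only restates the rule described above.

```python
def sessions_to_run(replayed: int, total: int, continue_sessions: bool) -> list[int]:
    """Return the session numbers one replicate runs, in order."""
    if not 1 <= replayed <= total:
        raise ValueError("replayed session must be between 1 and total")
    if not continue_sessions:
        return [replayed]  # only the replayed session
    return list(range(replayed, total + 1))  # replayed session, then the rest
```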

Output¶

Each replay creates a new independent run directory:

runs/replay_my-run_s1_t5_r01_2026-03-16T00-00-00/
├── config.yaml                          # frozen config from source
├── run_meta.json                        # standard run metadata + replay fields
├── replay_meta.json                     # full provenance (source run, session, turn, etc.)
├── .shadow_git/                         # fresh shadow git for this replay
└── session_01/
    ├── trajectory.json                  # ATIF trajectory (from turn 5 onward)
    ├── transcript.jsonl                 # Claude Code transcript
    ├── uuid_map.json                    # turn correlation map
    ├── session_diff.patch               # file changes during replay
    └── source_transcript_truncated.jsonl # truncated source for reference
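
When sorting or grouping many replay runs, the directory name can be parsed as a convenience. The pattern below is inferred from the single example name above and may not cover every case; replay_meta.json remains the authoritative provenance record.

```python
import re
from datetime import datetime

# Assumed pattern: replay_<source>_s<session>_t<turn>_r<replicate>_<timestamp>
_REPLAY_DIR = re.compile(
    r"^replay_(?P<source>.+)_s(?P<session>\d+)_t(?P<turn>\d+)"
    r"_r(?P<replicate>\d+)_(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2})$"
)


def parse_replay_dir(name: str) -> dict:
    """Split a replay run directory name into its components."""
    match = _REPLAY_DIR.match(name)
    if match is None:
        raise ValueError(f"not a replay directory name: {name!r}")
    return {
        "source": match["source"],
        "session": int(match["session"]),
        "turn": int(match["turn"]),
        "replicate": int(match["replicate"]),
        "timestamp": datetime.strptime(match["timestamp"], "%Y-%m-%dT%H-%M-%S"),
    }
```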

Technical notes¶

UUID map¶

Each session generates a uuid_map.json that correlates entries across the three data formats (transcript, ATIF trajectory, raw API dumps). The primary join key is tool_call_id. The replay system uses this to find shadow git tags for filesystem reset.
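
The idea of joining on tool_call_id can be sketched independently of the actual uuid_map.json schema, which is not shown here. The record shapes below are placeholders: each record only needs a tool_call_id field, and everything else in it is carried along untouched.

```python
from collections import defaultdict


def join_on_tool_call_id(**sources: list[dict]) -> dict[str, dict[str, dict]]:
    """Group records from several named sources by their tool_call_id."""
    joined: dict[str, dict[str, dict]] = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            call_id = record.get("tool_call_id")
            if call_id is not None:  # skip entries that are not tool calls
                joined[call_id][source_name] = record
    return dict(joined)
```

Passing the formats as keyword arguments, for example transcript=... and trajectory=..., yields one entry per tool call, with whichever formats recorded that call.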

Thinking signatures¶

When resampling, the harness automatically strips thinking block signatures from the request. Signatures are response-specific and would cause errors if replayed verbatim.
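
For requests prepared outside the harness, an equivalent transformation might look like the sketch below. It assumes the Messages API request shape, where thinking blocks are content blocks of type thinking with a signature field. The harness's own implementation is not shown on this page and may differ.

```python
import copy


def strip_thinking_signatures(request: dict) -> dict:
    """Return a copy of the request with signatures removed from thinking blocks."""
    cleaned = copy.deepcopy(request)  # leave the captured request untouched
    for message in cleaned.get("messages", []):
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content has no thinking blocks
        for block in content:
            if isinstance(block, dict) and block.get("type") == "thinking":
                block.pop("signature", None)
    return cleaned
```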