When should AI agents use snapshot/restore vs keeping error records?

They address different failure categories. For lightweight failures (tool returned an error, agent state is intact), keep the error record so the model learns from it — following Manus's recommendation. For destructive operations (tool executed with real side effects but wrong outcome), use snapshot/restore to roll back to a stable state. The criterion: can the model self-correct by seeing the error? If yes, keep it. If no, restore.

What does ContextChef's snapshot capture and what does it exclude?

Snapshot captures everything dynamic that determines the next compile() output: conversation history, dynamic (task) state, Janitor's token count cursor, Memory's current key-value pairs, and Pruner's current tool state. It deliberately excludes the agent's static skeleton: module configuration and VFS filesystem content. This boundary ensures predictability: after restore(), you know exactly what rolled back and what stayed.

ContextChef (6): Snapshot & Restore — Capture Everything That Determines the Next Compile

10 Mar 2026

AI Agent ContextChef Context Engineering

Manus makes a counterintuitive recommendation in their blog: keep error records, and don’t clean up failed tool calls. When the model sees a failed operation and its error output, it implicitly updates its internal judgment, reducing the chance of repeating the same mistake. Error recovery is built on this, and Manus considers error recovery one of the clearest indicators of genuinely agentic behavior.

That’s correct, but it solves only one class of problem: lightweight failures, where a tool returned an error and the model needs to see it to adjust its strategy.

There’s another class: destructive operations — the tool executed, the side effects are real, but the outcome is wrong. “Show the model the error” doesn’t fix this. You need to roll back the entire context state and restart from a stable point.

Design Angle: Capture Everything That Determines the Next compile(), Nothing More

Snapshot’s design angle is that chef.snapshot() should capture all dynamic state that determines what the next compile() will produce — no more, no less.

What it captures: conversation history, dynamic (task) state, Janitor’s token count cursor, Memory’s current key-value pairs, and Pruner’s current tool state. Given a fixed configuration, these fully determine what compile() will produce.

What it doesn’t capture: static module configuration (Janitor/VFS config, etc.) and VFS filesystem content. These are the agent’s static skeleton — they should remain unchanged after restore().

The value of this boundary is predictability: after restore(), you know exactly what went back to its previous state (history, dynamic state, the Janitor token cursor, memory, and Pruner’s tool state) and what didn’t change (module configuration and VFS content). No ambiguity, no hidden side effects.
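To make the boundary concrete, here is a minimal sketch of a class that snapshots only its dynamic fields and leaves its configuration alone. `SketchChef`, `SketchSnapshot`, and every field name are hypothetical and chosen for illustration; they are not ContextChef’s real types or internals.

```typescript
// Hypothetical shapes for illustration; not ContextChef's actual types.
interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
}

interface SketchSnapshot {
  readonly label: string;
  readonly history: readonly Message[];
  readonly dynamicState: Readonly<Record<string, string>>;
  readonly janitorTokenCursor: number;
  readonly memory: Readonly<Record<string, string>>;
  readonly activeTools: readonly string[];
}

class SketchChef {
  // Dynamic state: everything here is captured and rolled back.
  history: Message[] = [];
  dynamicState: Record<string, string> = {};
  janitorTokenCursor = 0;
  memory: Record<string, string> = {};
  activeTools: string[] = [];

  // Static skeleton: never captured, never touched by restore().
  constructor(readonly config: { maxTokens: number }) {}

  snapshot(label: string): SketchSnapshot {
    // Copy every dynamic field so later mutations cannot leak into the snapshot.
    return {
      label,
      history: this.history.map((m) => ({ ...m })),
      dynamicState: { ...this.dynamicState },
      janitorTokenCursor: this.janitorTokenCursor,
      memory: { ...this.memory },
      activeTools: [...this.activeTools],
    };
  }

  restore(snap: SketchSnapshot): void {
    // Copy back out, so the instance never shares references with the snapshot.
    this.history = snap.history.map((m) => ({ ...m }));
    this.dynamicState = { ...snap.dynamicState };
    this.janitorTokenCursor = snap.janitorTokenCursor;
    this.memory = { ...snap.memory };
    this.activeTools = [...snap.activeTools];
  }
}
```

Because `config` never appears in `SketchSnapshot`, there is no code path by which restore() could change it, which is the kind of guarantee the boundary is meant to give.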

The Value of Immutable Snapshots

ChefSnapshot is a read-only object. This design choice lets you safely keep multiple snapshots without them interfering with each other:

const snap1 = chef.snapshot("phase 1 complete");
// ... execute phase 2 ...
const snap2 = chef.snapshot("phase 2 complete");
// ... phase 3 fails ...
chef.restore(snap2); // back to phase 2
// or
chef.restore(snap1); // back to phase 1, retry all of phase 2

If snapshots were mutable, you’d need to constantly worry about accidentally modifying one after a restore. Immutable snapshots eliminate that concern entirely — a snapshot is always the state at the moment it was created; restore just brings the instance back to that state without touching the snapshot itself.
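One way to get this kind of guarantee in TypeScript is to copy the state and then freeze the copy recursively. The sketch below is an assumption about how immutability could be enforced, not a description of how ChefSnapshot is implemented; `deepFreeze` is a hypothetical helper.

```typescript
// Hypothetical helper: recursively freeze a plain object/array graph.
// After this call, any attempt to add, remove, or overwrite a property fails.
function deepFreeze<T>(root: T): T {
  const visit = (value: unknown): void => {
    if (typeof value !== "object" || value === null || Object.isFrozen(value)) {
      return; // primitives and already-frozen nodes need no work
    }
    Object.freeze(value);
    const record = value as Record<string, unknown>;
    for (const key of Object.keys(record)) {
      visit(record[key]); // freeze nested objects and array elements too
    }
  };
  visit(root);
  return root;
}
```

Freezing only a copy matters: the live instance keeps its own mutable state, while the snapshot stays fixed at the moment it was taken.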

Branch Exploration: Testing Two Paths with One Instance

Another high-value use of Snapshot is strategy comparison. When you need to compare two prompt strategies or processing paths, you don’t need two ContextChef instances — a single instance can repeatedly return to the same starting point via snapshot/restore:

Take a snapshot at the stable point, run strategy A to completion, record results; restore to the stable point, run strategy B to completion, record results; pick the winner and continue. After each restore, the instance is at a perfectly consistent starting point — the variable between the two tests is controlled, and context differences can’t contaminate the comparison.
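The loop described above can be sketched as follows. This assumes only that the instance exposes snapshot(label) and restore(snapshot), as in the earlier example; `Restorable`, `Branch`, and `pickBestBranch` are hypothetical names, and the branches are synchronous to keep the sketch short, whereas real strategy runs would be asynchronous.

```typescript
// Hypothetical interface: anything with snapshot/restore fits.
interface Restorable<S> {
  snapshot(label: string): S;
  restore(snap: S): void;
}

// A branch runs one strategy against the shared instance and returns a score
// (higher is better). How the score is computed is up to the caller.
interface Branch {
  name: string;
  run: () => number;
}

function pickBestBranch<S>(chef: Restorable<S>, branches: Branch[]): string | undefined {
  const start = chef.snapshot("stable point");
  let best: { name: string; score: number; end: S } | undefined;

  for (const branch of branches) {
    chef.restore(start); // every branch begins from the same state
    const score = branch.run();
    if (best === undefined || score > best.score) {
      // Keep the winner's end state so it does not have to be re-run later.
      best = { name: branch.name, score, end: chef.snapshot(`${branch.name} done`) };
    }
  }

  if (best !== undefined) {
    chef.restore(best.end); // continue from where the winning branch finished
  }
  return best?.name;
}
```

Snapshotting each candidate’s end state relies on snapshots being immutable: the stored end state of an earlier branch is unaffected by whatever later branches do to the instance.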

This pattern complements Anthropic’s sub-agent architecture: sub-agents suit large-scale parallel exploration with clean context windows; Snapshot suits lightweight branch comparison within a single agent instance.

Relationship with Keeping Error Records

The two approaches target different failure categories and don’t conflict:

Tool call returned an error, agent state wasn’t damaged → keep the error record, let the model learn from failure
Tool call produced a destructive side effect, state needs to be reset → restore() to a stable snapshot

The criterion: can the model adjust on its own by seeing the error? If yes, keep it. If no, restore. In practice, read-only operations and retryable queries don’t need snapshots; data writes, environment mutations, and other high-risk operations warrant a snapshot before execution.
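The decision rule above can be sketched as a small wrapper around tool execution. Everything here is hypothetical: the `Risk` labels, the `ToolCall` and `AgentContext` shapes, and `executeWithRecovery` are illustration only, and the `ok` flag stands in for whatever check tells you the outcome was wrong.

```typescript
type Risk = "read-only" | "destructive";

interface ToolCall {
  name: string;
  risk: Risk;
  run: () => { ok: boolean; output: string };
}

// Hypothetical context: snapshot/restore plus a way to append to history.
interface AgentContext<S> {
  snapshot(label: string): S;
  restore(snap: S): void;
  appendToHistory(entry: string): void;
}

function executeWithRecovery<S>(
  ctx: AgentContext<S>,
  call: ToolCall,
): "ok" | "error-kept" | "restored" {
  // Only high-risk operations pay for a snapshot before execution.
  const guard =
    call.risk === "destructive" ? { snap: ctx.snapshot(`before ${call.name}`) } : undefined;

  const result = call.run();
  if (result.ok) {
    ctx.appendToHistory(`${call.name}: ${result.output}`);
    return "ok";
  }

  if (guard !== undefined) {
    // Wrong outcome from a destructive call: roll the context back.
    ctx.restore(guard.snap);
    return "restored";
  }

  // Lightweight failure: keep the error so the model can adjust.
  ctx.appendToHistory(`[error] ${call.name}: ${result.output}`);
  return "error-kept";
}
```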

Final post: The Provider Adapter layer — why “write three sets of prompts” is a trap you should escape.