ContextChef (5): Core Memory — Zero-Cost Reads, Structured Writes
9 Mar 2026
5 min read
The agent tells the user “I remember you mentioned you prefer TypeScript strict mode.” The user thinks this product actually remembers things. Next session, the agent asks again.
This isn’t a memory problem — it’s a persistence problem. The information existed; it just didn’t survive the session boundary.
Memory Is Not RAG
The first instinct for solving memory is a vector database: store conversations, retrieve relevant chunks, inject them into context. Letta (the successor to MemGPT) specifically wrote "RAG is not Agent Memory" in their Context Engineering guide to push back on this thinking.
The distinction is access pattern: RAG access is probabilistic — the current query must be semantically close enough to the stored memory to retrieve it. But a user’s programming language preferences, project conventions, and the AI’s persona shouldn’t depend on “this turn’s conversation happening to be semantically similar to them.” These things should be present every time, unconditionally injected.
Letta calls this class of information memory blocks: reserved portions of the context window with fixed size limits, automatically injected by the system rather than retrieved on demand. Anthropic’s Claude Code uses structured note-taking — the agent maintains a persistent notes file, reads it after every context reset, and continues where it left off. Their example is Claude playing Pokémon: after thousands of game steps, the agent maintained maps of explored regions, level-up progress, and effective combat strategies through its notes. Without them, it couldn’t sustain any long-horizon strategy.
Design Angle: Zero-Cost Reads, Structured Writes
Core Memory’s design is built around one principle: reads should be zero-cost; writes should be structured.
Zero-cost reads means core memory content is automatically injected into context on every compile() call. Developers don't need to manually fetch memories or concatenate them into messages in every agent loop; the library handles it. Two benefits follow: you can't cause amnesia by forgetting to inject memory, and the injection position is fixed (after the system prompt, before conversation history), so it doesn't vary with how different developers assemble their messages.
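The injection step can be pictured as a small sketch. The function and type names below are illustrative assumptions, not ContextChef's actual API; only the XML entry format and the fixed injection position come from the article.

```typescript
// Hypothetical sketch of zero-cost reads: every compile() call injects
// core memory at a fixed position, after the system prompt and before
// the conversation history. Names and signatures are assumptions.

type Message = { role: "system" | "user" | "assistant"; content: string };

// Render key-value memory entries into the XML format the article describes.
function renderCoreMemory(entries: Record<string, string>): string {
  const body = Object.entries(entries)
    .map(([key, value]) => `  <entry key="${key}">${value}</entry>`)
    .join("\n");
  return `<core_memory>\n${body}\n</core_memory>`;
}

function compile(
  systemPrompt: string,
  memory: Record<string, string>,
  history: Message[]
): Message[] {
  // Fixed injection position: the developer never decides where memory goes.
  return [
    { role: "system", content: systemPrompt },
    { role: "system", content: renderCoreMemory(memory) },
    ...history,
  ];
}
```

Because the position is hard-coded inside compile(), two developers assembling messages differently still produce the same memory placement.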
Structured writes means memory isn’t stored as freeform text appended to messages — it’s stored as key-value pairs with clear semantics. Two benefits: memory entries can be precisely manipulated programmatically (overwrite, delete, query) without parsing from long text; the injected XML format (<core_memory><entry key="...">...</entry></core_memory>) is more reliably parsed by LLMs than freeform text.
Pluggable storage backends come as an additional benefit of this design. InMemoryStore is for testing and rapid prototyping; VFSMemoryStore is for production persistence; you can implement a custom MemoryStore interface to connect to Redis or a database. Switching backends requires no changes to business logic — the read/write interface is uniform, and the storage implementation is isolated behind it.
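The backend-swapping idea can be sketched as follows. The MemoryStore, InMemoryStore, and VFSMemoryStore names come from the article; the method set and its signatures are assumptions I'm making for illustration.

```typescript
// Sketch of the pluggable-backend design: business logic depends only
// on a narrow interface, and each backend lives behind it. The method
// shapes here are illustrative, not ContextChef's real contract.

interface MemoryStore {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string): Promise<void>;
  delete(key: string): Promise<void>;
  entries(): Promise<Record<string, string>>;
}

// Testing / prototyping backend: a plain in-process Map.
class InMemoryStore implements MemoryStore {
  private data = new Map<string, string>();
  async get(key: string) { return this.data.get(key); }
  async set(key: string, value: string) { this.data.set(key, value); }
  async delete(key: string) { this.data.delete(key); }
  async entries() { return Object.fromEntries(this.data); }
}

// Business logic never names a concrete store, so swapping
// InMemoryStore for a persistent backend changes one constructor call.
async function rememberPreference(store: MemoryStore) {
  await store.set("preference", "TypeScript strict mode");
  return store.entries();
}
```

A Redis or database backend would implement the same four methods; rememberPreference stays untouched.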
Narrowing the Model’s Decision Surface, Not Eliminating It
One of the easiest traps in memory system design is letting the model decide whether information should go into “core memory” versus “archival memory.”
MemGPT addressed this by exposing the two tiers as differently-named tools — core_memory_replace and archival_memory_insert — and using tool descriptions to guide the model toward the right classification. It also set a hard character limit on core memory: once full, the model can’t write more, forcing it to think about whether each piece of information truly needs to be always-present. This is an effective mitigation, but it’s still fundamentally using prompt design and tool descriptions to guide LLM classification. The model can still pick the wrong tier, and these errors are extremely difficult to reproduce and debug.
To be fair, ContextChef doesn't fundamentally solve this problem either: in both inline mode and tool mode, the model still decides whether and what to write. What ContextChef does is narrow the model's decision surface: the tier is fixed by the call path, so the only decision left to the model is whether and what to write, never which tier. This turns a hard-to-observe classification decision ("core or archive?") into predictable behavior: whatever gets written in always goes into core. The trade-off is flexibility: if you need the model to autonomously decide between tiers, this design isn't for you.
Concretely: whatever the model writes via <update_core_memory> XML tags or the core_memory_update tool always goes to core tier. Whatever a developer writes via chef.memory().set() defaults to core as well. The interface design locks in the tier, leaving the model no choice in the matter.
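The "tier fixed by the call path" idea can be made concrete with a sketch. The tool and method names mirror the article; the Write shape and these function signatures are hypothetical.

```typescript
// Illustrative sketch: every write surface hard-codes the core tier,
// so neither the developer path nor the model path ever passes a tier.

type Tier = "core" | "archival";

interface Write { tier: Tier; key: string; value: string }

// Developer path, mirroring chef.memory().set(): tier is not a parameter.
function developerSet(key: string, value: string): Write {
  return { tier: "core", key, value };
}

// Model path (tool mode), mirroring core_memory_update: the tier lives
// in the tool's name, not in the arguments the model produces.
function coreMemoryUpdate(args: { key: string; value: string }): Write {
  return { tier: "core", key: args.key, value: args.value };
}
```

The model cannot mis-classify a write into the wrong tier because no write surface accepts a tier argument in the first place.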
Inline vs. Tool: Two Write Protocols for Different Integration Needs
Core Memory has two write modes — not to provide options for their own sake, but because two different integration scenarios have genuinely different constraints.
Inline mode is for latency-sensitive scenarios: the model embeds XML tags in its normal response text to update memory, requiring no extra tool call round-trips. The developer calls extractAndApply() after receiving the response to parse and write, then strips the tags before displaying to the user. Simple flow, but the model’s output mixes content and operations, requiring post-processing.
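A minimal sketch of that post-processing step, assuming a key attribute on the tag (the article names the tag and the extractAndApply() call; the exact attribute format is my assumption):

```typescript
// Hypothetical inline-mode post-processor: parse <update_core_memory>
// tags out of the model's response, apply the writes, then strip the
// tags so the user never sees them.

function extractAndApply(
  response: string,
  memory: Map<string, string>
): string {
  const tag =
    /<update_core_memory key="([^"]+)">([\s\S]*?)<\/update_core_memory>/g;
  // Apply every embedded write to the store...
  for (const m of response.matchAll(tag)) {
    memory.set(m[1], m[2].trim());
  }
  // ...then return the user-visible text with the tags removed.
  return response.replace(tag, "").trim();
}
```

One response can carry both the reply and the memory write, which is exactly why no extra round-trip is needed.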
Tool mode is for tool-call-heavy scenarios with strict response format requirements: memory updates happen through dedicated tools (core_memory_update, core_memory_delete), structurally identical to regular tool calls, requiring no post-processing — but each update costs an extra round-trip.
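The tool-mode path might look like the sketch below. The tool names come from the article; the argument schema and the dispatcher are illustrative assumptions.

```typescript
// Sketch of tool-mode writes: memory updates arrive as ordinary tool
// calls, dispatched like any other tool. Argument shapes are assumed.

type ToolCall =
  | { name: "core_memory_update"; args: { key: string; value: string } }
  | { name: "core_memory_delete"; args: { key: string } };

function dispatch(call: ToolCall, memory: Map<string, string>): string {
  switch (call.name) {
    case "core_memory_update":
      memory.set(call.args.key, call.args.value);
      return `updated ${call.args.key}`;
    case "core_memory_delete":
      memory.delete(call.args.key);
      return `deleted ${call.args.key}`;
  }
}
```

The structural sameness with regular tool calls is the point: no response post-processing, at the cost of one extra model round-trip per update.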
In both modes, reading is identical — compile() auto-injects, the model sees it directly, no tool call required. Choosing a write mode doesn’t affect reading, and you can switch based on your project’s latency and format requirements.
Next: Snapshot & Restore — Manus says keep error records, but sometimes you genuinely need to roll back.